Dude, Where’s My Byte?

Preface

At Enova, PostgreSQL can be found at the heart of a large number of software deployments. This comes as no surprise given the power, popularity and support of the database, combined with the ubiquitous need for relational storage in FinTech. My journey with PostgreSQL began as an intern at Enova as I explored the storm that is Software Engineering, with its flavor-of-the-month JavaScript frameworks and Ruby libraries. PostgreSQL was one of the more established technologies in the stack, but it was also, unsurprisingly, one of the less well-understood by engineers given how difficult it is to master despite being easy to learn.

As often happens in the industry, my learning of PostgreSQL was not a sit-down, formal affair of the theoretical kind. It was a series of need-to-know, Stack Overflow-driven quests. I consider this blog post a structured compendium of what I learned through experience, meant to help a newcomer quickly gain ground in understanding this amazing piece of software, garner some mechanical sympathy for their installations, and maybe even appreciate the beauty of its grand design.


The Plot

We are going to follow the journey of a single byte of data through the internals of PostgreSQL, from birth to death, to understand what the database is doing with our data. For the movie buffs out there, think Lord of War, but for a byte of data! Along the way we will make stops at crucial milestones that represent the heart of the PostgreSQL database, such as:

  • File Structure (Heap/Block/Tuples/Pages)
  • Caches
  • ACID
  • MVCC & Vacuum

Act 1: A Byte is Born

Databases are cheaper than you think

In PostgreSQL, not unlike most other storage technologies, a database is simply a logical partition of a larger hunk of data – a cluster – and it is the cluster that the software process runs over. In other words, a cluster is a collection of databases and is consequently the level at which the discussion of storage begins. Thus, the next time we type:

postgres=# create database bar;
CREATE DATABASE

we should view this as simply creating a logical partition for the storage of some set of data, and not as some expensive initialization operation. That being said, it is also common practice in the industry to run larger databases in clusters of their own for many administrative reasons that are beyond the scope of this post.

Spoiler alert: Dude, your byte is in PGDATA

PGDATA is an environment variable that denotes the data directory to be used during the initialization of a cluster. This can be seen when we try to initialize a new PostgreSQL cluster using the initdb command as shown below:

clm-C02V90ERHTDG:~$ initdb
initdb: no data directory specified
You must identify the directory where the data for this database system
will reside.  Do this with either the invocation option -D or the 
environment variable PGDATA.

Creating a home for our byte

Let's go ahead and create a cluster by specifying a location for its data directory. This way, we know roughly where to start looking for our byte of data.

initdb -D /usr/local/var/iblog2019
The files belonging to this database system will be owned by user "srangarajan".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /usr/local/var/iblog2019 ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
    pg_ctl -D /usr/local/var/iblog2019 -l logfile start

And so, cluster creation is the expensive initialization operation, not creation of a database. As we can see, amongst a lot of other setup, the data directory is created by the initdb command.

Exploring our home

If we take a peek inside what has been created inside the data directory, we will find many files and folders. The ones relevant to this blog post are:

  • base: contains per-database system catalogs and user data (the default location)
  • global: contains cluster-wide data
  • pg_tblspc: contains filesystem symlinks to defined tablespaces

And so, the base directory under the cluster’s data directory is likely where our byte is going to end up by default. If we look inside it, we will see that there are already files and folders in there. This should be surprising because we haven’t done anything with our cluster yet – not even started it up! Well, it’s time to start the cluster up and see what’s going on inside:

clm-C02V90ERHTDG:~$ pg_ctl -D /usr/local/var/iblog2019/ -l /usr/local/var/iblog2019/server.log start
waiting for server to start.... done
server started

We used the pg_ctl command to start up the cluster of databases by pointing to the data directory of our cluster. Additionally, we specified a location for a server log file. As mentioned above, note that we didn’t start up a database, we started a cluster. In PostgreSQL, much like other SQL databases, the software process runs over the cluster, not a single database. To confirm this, let’s connect to a database on our cluster which is running on the default port (5432) and see what other databases are available in the cluster:
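Below is a reconstruction of that check (the original output is not reproduced here): we connect with psql to the default postgres database and list the databases along with their oids. template1 is always oid 1; which of the two remaining oids belongs to template0 versus postgres is an assumption on this installation:

clm-C02V90ERHTDG:~$ psql -d postgres
psql (10.1)
Type "help" for help.

postgres=# select oid, datname from pg_database;
  oid  |  datname
-------+-----------
     1 | template1
 12557 | template0
 12558 | postgres
(3 rows)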

As we can see, there are actually 3 databases running in this cluster by default. It is interesting as a side note that the template0 and template1 databases are used as the templates for new database creation, the difference between them being factors such as character encoding, which is beyond the scope of this post. However, it is more interesting to note that the oids, short for object identifiers, line up with the names of folders inside our data directory:

clm-C02V90ERHTDG:~$ ls /usr/local/var/iblog2019/base/
1 12557 12558

This should be a dead giveaway that the storage mechanism for PostgreSQL data is a directory hierarchy keyed on unique identifiers for the objects in the cluster, starting with the database itself.

If we create a new test database for our post, we see it show up immediately in the data directory:
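A sketch of what that looks like – the oid 16385 is the one the rest of this post's paths refer to; a fresh installation would likely assign a different value:

postgres=# create database test;
CREATE DATABASE
postgres=# select oid from pg_database where datname = 'test';
  oid
-------
 16385
(1 row)

clm-C02V90ERHTDG:~$ ls /usr/local/var/iblog2019/base/
1 12557 12558 16385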

And so, we can guess that any data we insert into our test database is going to show up under the 16385 folder under the base directory in a yet to be determined shape and form.

Creating the table for our byte

Let’s go ahead and create a table that holds exactly 1 byte of user data:
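The exact DDL is not reproduced in the post, but a reconstruction consistent with the names used below (a tablespace "special" located under the data directory, and a table byte_test with a single column of the internal "char" type) would look like:

clm-C02V90ERHTDG:~$ mkdir /usr/local/var/iblog2019/special

test=# create tablespace special location '/usr/local/var/iblog2019/special';
WARNING:  tablespace location should not be inside the data directory
CREATE TABLESPACE
test=# create table byte_test (test_byte "char") tablespace special;
CREATE TABLE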

We created a table with a single column of the internal PostgreSQL type "char", which represents exactly 1 byte of data. However, before we created the table, we created something called a tablespace and associated our table with it. A tablespace is a specially designated location on the filesystem where the data associated with database objects such as tables will live. This control over file layout is useful to administrators for tasks and optimizations such as partition and volume management, which are beyond the scope of this post. Note here that for simplicity of demonstration, the tablespace is created in the same directory as PGDATA. In practice, as PostgreSQL is already warning us, tablespace locations should not be inside the data directory.

Creating our byte

We insert a '1' into our table, which, being a single character, is exactly 1 byte of data:

test=# insert into byte_test values('1');
INSERT 0 1

By default, we would expect this data to show up somewhere under the /usr/local/var/iblog2019/base/16385 folder as discussed above. However, we modified this behavior by using a tablespace to configure exactly where our data will go, and so let’s look to see if our data has shown up in that folder:

clm-C02V90ERHTDG:~$ ls /usr/local/var/iblog2019/special/PG_10_201707211/16385
16393

We do see that a directory structure was created under our tablespace: /usr/local/var/iblog2019/special, and as a part of that hierarchy, we see the oid of our database, 16385. At the leaf of this hierarchy we see a file with the name 16393, presumably a logical identifier associated with the table. We can verify this by looking up the relfilenode of our table:

test=# select relfilenode from pg_class where relname = 'byte_test';
  relfilenode 
--------------
    16393
(1 row)

which indeed turns out to be true. Note that while in most cases the oid of a relation does match its relfilenode, certain operations such as CLUSTER and VACUUM FULL can change the relfilenode without changing the oid.

PostgreSQL exposes a built-in administration function, pg_relation_filepath, that saves us the trouble of tracking all this metadata:

test=# select pg_relation_filepath('byte_test');
            pg_relation_filepath             
---------------------------------------------
pg_tblspc/16389/PG_10_201707211/16385/16393
(1 row)

This is very convenient, but in this specific case a little confusing because we expected our data to show up in:

/usr/local/var/iblog2019/special/PG_10_201707211/16385/16393
- not -
/usr/local/var/iblog2019/pg_tblspc/16389/PG_10_201707211/16385/16393

This is because while the physical location of our data is indeed at /usr/local/var/iblog2019/special/PG_10_201707211/16385 as configured, the cluster is tracking the same in the pg_tblspc folder by means of a symlink as can be seen:

clm-C02V90ERHTDG:~$ ls -al /usr/local/var/iblog2019/pg_tblspc/16389
lrwx------  1 srangarajan  admin  32 Apr 28 10:18 /usr/local/var/iblog2019/pg_tblspc/16389 -> /usr/local/var/iblog2019/special

And unsurprisingly, 16389 is the oid of the tablespace "special" that we created, as can be confirmed:

test=# select oid from pg_tablespace where spcname = 'special';
  oid 
-------
16389
(1 row)

And thus, finally, if we print out the contents of the file, we should see our byte of data:
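Reconstructing that check against the same path as before:

clm-C02V90ERHTDG:~$ cat /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393
clm-C02V90ERHTDG:~$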

Wait, what?! It’s empty?! … Dude, where’s my byte?!

Hunting for our byte

First, let’s double check that our “1” did indeed get saved:
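Something along these lines, from a brand new psql session (a reconstruction against the table we created above):

clm-C02V90ERHTDG:~$ psql -d test
psql (10.1)
Type "help" for help.

test=# select * from byte_test;
 test_byte
-----------
 1
(1 row)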

OK – so even if we initiate a new session through our cluster to our database, we see our byte saved – so it isn't some kind of client-local cache. PostgreSQL, being the quality database it is, can be depended upon to keep our byte safe. So while there is no need to be alarmed, it is very interesting to understand exactly how and where PostgreSQL has placed our byte.

The Write Ahead Log (WAL)

The short answer to where PostgreSQL has promptly put away our byte is the WAL – Write Ahead Log. While this doesn’t mean that the byte won’t make its way into the file eventually, the WAL is the first landing location for our byte of data. This means that there exists more than 1 copy of our byte, and as we shall shortly see, even more than 2. The location of the WAL records, unsurprisingly, is in our data directory, under the pg_wal folder.

The WAL can be thought of as an append-only record of DML (data manipulation language) commands – INSERT, UPDATE and DELETE. It is a master register of such operations that, when "replayed" in order, can recreate the net effect of these operations at any point in time. The subtleties of WAL extend far beyond the scope of this post, but suffice it to say that the WAL system is at the heart of almost any relational database, not just PostgreSQL.

And so, since our operation has been recorded in the WAL, our byte is safe even if we were to shut the cluster down or kill its processes outright (the former should not be performed on production without ample supervision, and the latter almost never). In fact, let's go ahead and do just that:
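One way to simulate this rude interruption – an assumption about what was actually run here – is to kill -9 the postmaster, whose pid is the first line of the postmaster.pid file in the data directory:

clm-C02V90ERHTDG:~$ kill -9 $(head -1 /usr/local/var/iblog2019/postmaster.pid)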

And now, let us restart the database and tail the logs:
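Restarting with pg_ctl as before and tailing the server log, we would see something along these lines (log lines abridged; timestamps and WAL positions omitted):

clm-C02V90ERHTDG:~$ pg_ctl -D /usr/local/var/iblog2019/ -l /usr/local/var/iblog2019/server.log start
waiting for server to start.... done
server started
clm-C02V90ERHTDG:~$ tail /usr/local/var/iblog2019/server.log
LOG:  database system was interrupted; last known up at ...
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at ...
LOG:  redo done at ...
LOG:  database system is ready to accept connections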

As we can see, the engine complains about our rude interruption, and goes into automatic recovery using some sort of "redo" mechanism. As we might have already guessed, this mechanism is based on the WAL records. The database system performs a reconciliation of operations that were committed to the WAL records but not to the data files in the system. We can verify this by looking at the contents of our data file now:

clm-C02V90ERHTDG:~$ cat /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393
??a?  ??21

Voila! We now have some data showing up where there was nothing before. Thus, we can conclude that, all else remaining the same, the recovery process has sync'd this data into the file from the WAL records. Such an instance of synchronization is called a CHECKPOINT and, among other things, it marks the point from which the next "redo"/recovery event would start if there were an unexpected failure.

Aside from playing a key role in guaranteeing the Atomicity and Durability properties of ACID-compliant databases, the WAL also has some optimization advantages, allowing us to batch and throttle disk I/O. Also, since writing the WAL record is by default the single most important synchronous disk operation performed to record a data change, mounting the pg_wal directory on optimized hardware is both easy and one of the first optimizations a database administrator would make.

CHECKPOINTs, page_swaps and background workers

In the previous section we have seen that our byte was first written to the WAL records and sync’d back into the data file during recovery because we killed our PostgreSQL cluster. It is safe to assume that this isn’t normally how data gets into these files, since one does not simply routinely kill the database processes every so often 🙂

And so, normally, our byte would have made its way into the data file through one of the three operations detailed below:

CHECKPOINT

CHECKPOINT is an instance of synchronization between the WAL records and the data on the file system. It also happens to be a command that can be issued to force this synchronization, like so:

test=# CHECKPOINT;
CHECKPOINT

This operation results in a "flush" of dirty data pages from the in-memory shared buffers (shared between the client backend processes) to disk, marking a synchronization between the commands recorded in the WAL and the data in the data files. The journey of our byte in this case would look something like the journey of the colored dot in the picture below:

The solid colored lines represent the synchronous operations and the dotted colored one the asynchronous. Thus, our byte is first written to a shared memory buffer, which explains why we were able to open a second client connection and read it back. Synchronously, the byte is also written to the WAL record on the file system, which is why even when we killed our database and lost the copy in our shared memory, it made its way onto disk through process #1, which was recovery.

The CHECKPOINT operation causes process #2 to occur, which triggers a synchronous flush of all dirty data pages from the shared buffers into the data files, thus causing the synchronization event between the data files and the WAL. While the operational impact of this command is beyond the scope of this post, it is worth pointing out that without proper defaults, tuning and administration, the volume of dirty pages in the shared buffers can be quite large, and forcing a flush with the CHECKPOINT command is rarely advisable without that tuning in place.

As a side note, it is worth pointing out that every time a client connects to a cluster, a backend is initialized for it. A backend is a per-client process, running on the server with a dedicated amount of memory. As we can imagine, this can get very expensive with a large number of clients for many reasons, and this is the prime reason why it is highly recommended to run a connection pooler in front of PostgreSQL.

page_swap

A page swap operation is the standard OS rotation of a memory page onto disk to make room for a newly requested one. The net effect of this operation would be to resynchronize any dirty (modified) memory pages with their copies on disk and then discard them from memory before bringing in a new copy of a page on disk for rapid access. In this case, the memory page in question would be a page in the shared buffers and the resynchronization operation would occur with the pages in the data files, thus effectively flushing our byte to its file on disk. This process may be synchronous or otherwise depending on the OS and related parameters, depicted by process #3 in the diagram above.

background writer and checkpointer

The PostgreSQL software processes include two processes – the background writer and the checkpointer – which run asynchronously in the background to move data, and thus our byte, out from the shared buffers into the data files. The specifics of how they work and how they're tuned and optimized are beyond the scope of this post, but the net effect of their operation is also represented by process #3, asynchronously moving data from the shared buffers into the data files.

This also helps us understand the process snapshot of a running cluster better now since we know roughly what the checkpointer and background writer processes do:
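On this PostgreSQL 10 installation, the process listing looks roughly like the following (other ps columns elided; the exact set of auxiliary processes and their titles vary by version and configuration):

clm-C02V90ERHTDG:~$ ps aux | grep postgres
... postgres -D /usr/local/var/iblog2019
... postgres: checkpointer process
... postgres: writer process
... postgres: wal writer process
... postgres: autovacuum launcher process
... postgres: stats collector process
... postgres: bgworker: logical replication launcher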

It is also not hard to imagine with everything we’ve learned and seen so far about the WAL that that too may have some form of memory buffer before being flushed to disk, and it would be logical to assume that the wal writer process would implement something to that effect.

It is worth observing here that since the WAL is a complete record of all data manipulation operations performed against the database, it is crucial for Point In Time Recovery, or recovery of any kind for that matter. With all the WAL records since inception and sufficient time, the state of the database can be constructed by trivially replaying all the commands. However, as the size of these records increases, it becomes impractical to both maintain a “hot” backlog of the entire WAL stream and to use it as the source of recovery from scratch because of the amount of time it would take to replay all the commands. Thus, in common practice, WAL files are rotated (shipped to cold storage periodically so that new ones can take their place), and CHECKPOINTS preclude the need for replaying from scratch by allowing the system to start from the last known good CHECKPOINT instead as long as a matching backup of the database file system is also available.

It is also worth disclaiming that tinkering around with the storage system of PostgreSQL in production is serious business which none but the most skilled professionals should even attempt. Any partial loss of files (data, WAL, commit logs) can be disastrous and so sound judgement should be applied.

Understanding our byte

Now that we have understood the numerous ways our byte may end up in our data file, we can ask the more pressing question which is:

clm-C02V90ERHTDG:~$ cat /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393
??a?  ??21

"??a?  ??21" looks nothing like the "1" that we inserted into our table. While it is logical to assume that PostgreSQL is using some sort of data structure to represent our data, what exactly does that look like?

The PostgreSQL file structure

To understand the PostgreSQL file structure, we must first understand some terminologies:

  • tuple: representation of a row in the data file
  • ctid*: a stable pointer to a tuple's location (page, line pointer) in a data file; the indirection through line pointers is needed because tuples tend to get rearranged within a page
  • heap: a collection of unordered (as per the user) tuples. A heap file is basically the same as a data file with a max size of 1GB. If the size of a relation exceeds the same, PostgreSQL will automatically chunk out the storage with suffixes of …/{relfilenode}_1, …/{relfilenode}_2 and so on
  • page: an 8KB chunk of rearrangeable tuples that can be “paged” into memory (shared buffers)
  • page_header: metadata about page including checksums and WAL information
  • row_header: metadata about row including field sizes if they’re of variable length (varchar, numeric)
  • field: data in a column of the row

Knowing this, the file structure is fairly self-explanatory: a file is a series of 8KB pages which in turn are collections of internal representations of rows of relational data called tuples, all represented on disk as encoded binary data.

If we really want to "decode" the "??a?  ??21", we can use a utility called pg_filedump like so:
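pg_filedump is not bundled with PostgreSQL itself and has to be installed separately. Pointed at our data file, an invocation would look like the following – the -i and -f flags ask for interpreted item details and a formatted dump of each block (output not reproduced here):

clm-C02V90ERHTDG:~$ pg_filedump -i -f /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393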

Amongst all the metadata and header information that we now know is stored in the file, we can spot our byte of data.

Act 2: Bytamorphosis

Now that we have a fairly good understanding of where our data on disk is, let us go ahead and update this byte in place. In fact, let’s do that a few hundred times using some sort of SQL script:
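The original script is not shown, but something along the lines of the following anonymous DO block would do the trick. The loop bound of 313 is chosen purely so that the final value works out to the '<' seen in the output below; any few hundred updates would demonstrate the same behavior:

do $$
begin
  for i in 1..313 loop
    -- one UPDATE per iteration, cycling through printable ASCII characters
    update byte_test set test_byte = chr(32 + (i % 95))::"char";
  end loop;
end;
$$;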

We would correctly expect there to still only be 1 byte of data in our table, which checks out:

clm-C02V90ERHTDG:~$ psql -d test
psql (10.1)
Type "help" for help.

test=# select * from byte_test;
test_byte
-----------
<
(1 row)

Let us flush these changes to disk with a CHECKPOINT, following which we would expect the data file to contain just the latest byte. Printing out the file results in:
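Something like the following (reconstructed; the exact bytes are not the point – the volume is):

test=# CHECKPOINT;
CHECKPOINT

clm-C02V90ERHTDG:~$ cat /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393
(several kilobytes of unreadable binary output)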

Wow! OK, while the file is not human-readable, as we've seen before, we can see enough here to say that whatever happened did not simply change our byte in place, because there is now far more data in the file than is needed to represent our single byte! What's going on?

MVCC: What you want to believe is what you see

To understand this odd behavior, let us run a select statement that also returns the ctid* of the tuple along with the row:
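A sketch of that query – the exact ctid value is illustrative; the interesting part is that it is no longer (0,1), the very first slot of the very first page:

test=# select ctid, test_byte from byte_test;
 ctid  | test_byte
-------+-----------
 (0,2) | <
(1 row)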

As mentioned in the earlier section about file structure, the ctid* represents the location of a tuple within a page, because tuples tend to get rearranged for numerous reasons. Oddly enough, we seem to be shuffling tuples when we issue an UPDATE command. At first glance this makes no sense, because there is only one tuple and so there should be nothing to shuffle! However, given that we can clearly see that some sort of shuffling is going on, it leads us to the only logical conclusion – rather than overwriting the row in place, the UPDATE is somehow creating new physical versions of the same logical row.

We can confirm this hypothesis to an extent by reissuing the pg_filedump command against the data file:
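That is, re-running the same invocation as before against the same file (output again not reproduced here):

clm-C02V90ERHTDG:~$ pg_filedump -i -f /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393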

We definitely see many more copies of the same item, with new values for "Flags" such as Redirect and, mostly, Unused.

This is indeed true, because PostgreSQL uses a simple and powerful form of versioning to deal with concurrent database access called MVCC – Multi Version Concurrency Control – which causes data modification operations to duplicate data instead of overwriting it.

MVCC is at the heart of guaranteeing the Isolation property of ACID compliant databases: reads and writes between transactions should be isolated. The final state of my data as a result of all the commands in different transactions should be reachable by placing all these transactions in a queue and performing them one at a time without concurrency.

Let’s open two sessions to our database and run the following commands:
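Concretely, the commands would be along the following lines in each session (the txids shown are illustrative – what matters is that the two sessions get different ones, while xmin and xmax are the same in both):

-- session 1
test=# begin;
BEGIN
test=# select txid_current(), xmin, xmax, test_byte from byte_test;
 txid_current | xmin | xmax | test_byte
--------------+------+------+-----------
          561 |  560 |    0 | <
(1 row)

-- session 2
test=# begin;
BEGIN
test=# select txid_current(), xmin, xmax, test_byte from byte_test;
 txid_current | xmin | xmax | test_byte
--------------+------+------+-----------
          562 |  560 |    0 | <
(1 row)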


We have opened a transaction in both sessions and issued a SELECT command to select our byte of data along with some internal metadata associated with each tuple. txid_current is the transaction identifier associated with each transaction and is consequently different between sessions.

  • xmin is the txid of the transaction that created the tuples
  • xmax is the txid of the transaction that logically (not physically) deleted this row, or 0 if the row is the latest version of the logical row

These internal metadata columns are maintained by the MVCC system of PostgreSQL to restrict access to newer versions of data that concurrent transactions may not be interested in, and consequently access to older versions of data that newer transactions may not be privy to.

Let us now update our byte in one of these transactions and re-run our SELECT command:
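Sketching it out with the same illustrative txids:

-- session 1 (txid 561): update, but do not commit yet
test=# update byte_test set test_byte = 'a';
UPDATE 1
test=# select txid_current(), xmin, xmax, test_byte from byte_test;
 txid_current | xmin | xmax | test_byte
--------------+------+------+-----------
          561 |  561 |    0 | a
(1 row)

-- session 2 (txid 562)
test=# select txid_current(), xmin, xmax, test_byte from byte_test;
 txid_current | xmin | xmax | test_byte
--------------+------+------+-----------
          562 |  560 |  561 | <
(1 row)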

First and foremost we notice that these two transactions are seeing two different values of our byte – the one that updated it is seeing the updated value, and the other is seeing the “older” value. This is because of a concept called read isolation which we shall explore in detail in upcoming sections. But presently, we can explain this by the logical argument that since the other transaction’s UPDATE hasn’t been committed, it does not “make sense” for our transaction to see it just yet.

We also see that the xmin in the transaction issuing the UPDATE has now been updated to be equal to the current txid. This is because this new physical version of the logical row containing our byte of data was created by the UPDATE statement in the transaction. Meanwhile, the xmax in the other session has been set to the txid of the updating transaction since that transaction caused the previous version of the row to be logically deleted, thus hiding it from newer transactions.

Let us go ahead and commit this transaction now and see the effect of the same:
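Continuing the sketch:

-- session 1
test=# commit;
COMMIT

-- session 2
test=# select txid_current(), xmin, xmax, test_byte from byte_test;
 txid_current | xmin | xmax | test_byte
--------------+------+------+-----------
          562 |  561 |    0 | a
(1 row)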


We now see an interesting behavior – the transaction which wasn't able to see the byte 'a' while it was being updated can now see it! And the xmin of the row it now sees matches the transaction that committed the update.

This behavior of a changing reality can be summarized as only being able to read committed data. And that is exactly what this isolation level is called:

Transaction Isolation

Transaction isolation can be thought of as how a transaction interacts with its peers – both in terms of being able to see their work and in them being able to see its work. Transaction isolation levels determine the degrees to which this peering is permitted. There are four ANSI SQL standard levels:

  • Read Uncommitted
  • Read Committed
  • Repeatable Read
  • Serializable

These charts can be used to think through how these levels are different, as pertinent to PostgreSQL. In essence, they boil down to three questions:

  • Do you want to see the effect of an uncommitted concurrent transaction?
  • Do you want to see work committed by concurrent transactions while your own transaction is still running?
  • Do you want to change something that might have been changed otherwise by the time your command has executed?

PostgreSQL does not allow its users to answer the first question – "Do you want to see the effect of an uncommitted concurrent transaction?" – with a "Yes". That "Yes" answer would correspond to the Read Uncommitted level, and while PostgreSQL accepts that setting in accordance with the ANSI SQL standard, the standard allows an implementation to default behavior at that level to a higher level of safety, which is Read Committed. In other words, PostgreSQL favors safety over performance, and does not support the Read Uncommitted level in spirit.

Once again, we are going to ignore the Read Uncommitted level because it is the same as Read Committed in PostgreSQL.

In the following sections we are going to explore the 3 practically supported isolation levels in detail by taking examples of concurrent requests, highlighting the different behaviors in each mode.

Read Committed Isolation

As the name suggests, the Read Committed isolation level allows a transaction to only read data modifications committed by other transactions. In other words, we answer “No” to the first question, having no other option, but then “Yes” to the following one. Thus we are not able to see halfway work of concurrent transactions and instead are only able to see work that is committed by the same:
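A sketch of the interaction (assuming our byte currently holds 'c', as the next paragraph suggests):

-- session 1 (read committed, the default)
test=# begin;
BEGIN
test=# select test_byte from byte_test;
 test_byte
-----------
 c
(1 row)

-- session 2: update, but do not commit yet
test=# begin;
BEGIN
test=# update byte_test set test_byte = 'b';
UPDATE 1

-- session 1: the uncommitted change is invisible
test=# select test_byte from byte_test;
 test_byte
-----------
 c
(1 row)

-- session 2: commit
test=# commit;
COMMIT

-- session 1: the committed change is now visible
test=# select test_byte from byte_test;
 test_byte
-----------
 b
(1 row)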

However, this does mean that the non-updating transaction sees a moving target of data – that is, it may perform non-repeatable reads – in this specific case that read being our byte ‘c’, which was modified to ‘b’ by a concurrent updating transaction and can thus no longer be re-read by our non-updating transaction. Now, let’s answer “Yes” to the final question:
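That is, both transactions now try to update the same row (again, a sketch):

-- session 1
test=# begin;
BEGIN
test=# update byte_test set test_byte = 'c';
UPDATE 1

-- session 2
test=# begin;
BEGIN
test=# update byte_test set test_byte = 'd';
... (hangs, waiting on session 1's row lock)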

It looks like we are not able to update in parallel because of PostgreSQL's internal locking mechanism, which only allows a single modification of any given row at a time. The specifics of locking levels and their interactions are beyond the scope of this post. Once we commit the transaction that is currently holding the lock, having performed its update successfully:
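Continuing the sketch:

-- session 1
test=# commit;
COMMIT

-- session 2: the UPDATE that was waiting now completes
UPDATE 1
test=# commit;
COMMIT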

We can see that the blocked transaction's update and commit go through successfully, having been queued up behind the other one. It is interesting to note that internally PostgreSQL holds a queue of such transactions and their lock targets on a first-come, first-served basis. And so, we see that we are able to successfully change data that was changed in parallel with our own change.

Whether or not answering "Yes" to the final question is safe is a judgement call left to the engineer designing the system. Read Committed happens to be the most common isolation level used in enterprise production systems and is the PostgreSQL default because it sits in the sweet spot between just enough safety and simplicity.

Repeatable Read Isolation

The Repeatable Read isolation level prevents the non-repeatable read that we saw happen in the Read Committed isolation level when a concurrent transaction committed its work partway through another transaction. In other words, we answer “No” to the first question, and “No” to the following one as well. Thus we are not able to see halfway work of concurrent transactions, and neither are we able to see work that is committed by the same:
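A sketch, with illustrative values:

-- session 1 (repeatable read)
test=# begin transaction isolation level repeatable read;
BEGIN
test=# select test_byte from byte_test;
 test_byte
-----------
 b
(1 row)

-- session 2: update and commit
test=# update byte_test set test_byte = 'c';
UPDATE 1

-- session 1: still sees the snapshot taken at the start of its transaction
test=# select test_byte from byte_test;
 test_byte
-----------
 b
(1 row)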

This means that we see a consistent snapshot of the data from the start of the transaction – that is, there are no non-repeatable reads. Now, let’s answer “No” to the final question as well:
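Once again both transactions try to update the same row, one of them at Repeatable Read (a sketch):

-- session 1 (repeatable read)
test=# begin transaction isolation level repeatable read;
BEGIN
test=# select test_byte from byte_test;
 test_byte
-----------
 c
(1 row)

-- session 2: update, but do not commit yet
test=# begin;
BEGIN
test=# update byte_test set test_byte = 'd';
UPDATE 1

-- session 1: try to update the same row
test=# update byte_test set test_byte = 'e';
... (hangs, waiting on session 2's row lock)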

Once again it looks like we are not able to update in parallel because of PostgreSQL's internal locking mechanism, which only allows a single modification of any given row at a time. Now, let's observe what happens once we commit the transaction that is currently holding the lock, having performed its update successfully:
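Continuing:

-- session 2
test=# commit;
COMMIT

-- session 1: the UPDATE that was waiting now returns with an error
ERROR:  could not serialize access due to concurrent update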

We see that as soon as the lock is released, the blocked transaction's update fails immediately – even before it attempts to commit. This is because we answered "No" to the final question of "Do you want to change something that might have been changed otherwise by the time your command has executed?".

While this behavior can be very useful to avoid unintended overwrites due to unexpected races, it does cause an increase in complexity on the application side to reason through the situation and retry if the change “makes sense”. Once again, whether or not this level is acceptable is a judgement call left to the engineer designing the system.

Serializable

The third and final isolation level is Serializable – the strictest isolation level by ANSI standards. This level tries to emulate the serialization of parallel transactions. That is, Serializable attempts to guarantee that the net effect of concurrent transactions would be the same no matter which order the transactions ended up committing in. Serializable works very similarly to the Repeatable Read isolation level for most use-cases. The difference comes from Serializable being able to detect data dependencies between transactions. Before we dig into that, let's confirm that the behavior is the same as Repeatable Read by answering "No" again to the first and second questions, thus not being able to see the halfway or committed work of concurrent transactions:

There isn’t much to discuss about the first picture since it is a repeat of the behavior that we saw with the Repeatable Read isolation level.

And the same is true for answering “No” to the final question as well:

Once again, the same blocking behavior is exhibited until the locking transaction commits, upon which:

Exactly the same behavior as Repeatable Read.

And so, how exactly do we materialize the difference in behavior? To show the “prediction” behavior, let us try the following series of operations:
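A sketch of such a sequence – each transaction counts one value before inserting the other, creating the read/write dependency described below; the counts themselves are illustrative:

-- session A (serializable)
test=# begin transaction isolation level serializable;
BEGIN
test=# select count(*) from byte_test where test_byte = 'a';
 count
-------
     0
(1 row)
test=# insert into byte_test values ('b');
INSERT 0 1

-- session B (serializable)
test=# begin transaction isolation level serializable;
BEGIN
test=# select count(*) from byte_test where test_byte = 'b';
 count
-------
     0
(1 row)
test=# insert into byte_test values ('a');
INSERT 0 1

-- session A commits first
test=# commit;
COMMIT

-- session B commits second
test=# commit;
ERROR:  could not serialize access due to read/write dependencies among transactions
HINT:  The transaction might succeed if retried.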

In the example above, we see two transactions trying to count the number of a certain byte in a table before making a decision to insert our byte. The catch is that the byte being inserted in both cases would have changed the count that had been run before doing so. In more formal terms, there is now a read/write dependency between these transactions because what each one reads is now dependent on what the other one is trying to write.

  • When Transaction A tries to commit, since there is no other transaction that has committed that could have affected the outcome of the same, PostgreSQL allows the commit to go through.
  • When Transaction B tries to commit, as Transaction A just committed and the outcome of Transaction B might have changed based on it (we might have read a different count for byte 'b'), PostgreSQL rejects the commit – had these commits been ordered differently, we might have had a different outcome.

The same sequence, if run under Repeatable Read, would have committed just fine, because while that level guarantees that reads within concurrent transactions are repeatable, it makes no claims about the serializability of their effects.

It is important to note that what PostgreSQL is offering through the Serializable isolation level is read/write dependency detection, not race detection. In other words, while such dependencies often result from racing concurrent transactions, it is absolutely possible to have other kinds of races that are not read/write dependencies which the Serializable isolation level will not handle. In addition to other tools and features of the engine such as deadlock detection, advisory locking and pessimistic locking, it is up to the engineer to acquire and manage the right semaphores between concurrent threads to achieve the right level of safety and performance for the application.

This impressive feat of engineering at scale across thousands of concurrent transactions committing simultaneously is achieved through a completely non-blocking mechanism called predicate locking, the scope of which is well beyond that of this post.

MVCC: Duplication for alternate realities

As we have seen in the previous sections, PostgreSQL supports 3 levels of beliefs that we may choose to hold for the duration of each of our transactions, allowing us to trade off performance and simplicity for safety and complexity.

The mechanism it uses to achieve this is MVCC, and the cost we pay is duplication. Since we are free to hold a variety of beliefs across our transactions, PostgreSQL needs to track all versions of our byte as it changes so that all these beliefs can be supported, because even if the byte has effectively been updated from a ‘1’ to ‘2’, somebody somewhere in some transaction may have chosen to not look past the reality where our byte was still a ‘1’. This explains why when we made a series of UPDATEs on our single byte of data, the operations were not updating the data in place.

MVCC implements isolation as a function of visibility, using xmin and xmax, and not through any actual serialization of concurrent transactions, which would have killed concurrency. This is why we are able to get both throughput and safety, with configurability over both.

Thus, we are trading space for time by duplicating these alternate realities instead of reconciling them serially. However, does this mean that older copies of our byte are never “retired” somehow? If this is the case, will our cluster use an infinitely growing amount of space? If not, how does it know when to purge older copies? To answer these questions, we move on to our last and final section.

Act 3: Requiem for a Byte

As we have learned from the previous section, PostgreSQL causes duplication of data to deal with concurrency effectively. At some point, all transactions that believed in an older version of our byte will end. At this point, that duplicate version of our logical byte is now dead weight. It can never be referenced again because it has no more believers and it can garner no more new ones because it is now hidden from view from all new transactions. This dead weight is referred to as bloat.

Before we delve into how we deal with bloat, let us make a logical guess about what would happen when we DELETE our byte of data. Keeping transaction isolation in mind, when we commit our delete, we would still have to respect the beliefs of concurrent transactions that are living in the reality of our byte still existing. This means that MVCC would come into play again and would probably not physically delete our byte, but would instead just logically delete / "hide" it from future transactions. Eventually, this too will become bloat.

Let us confirm this behavior through a simple test:
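A sketch of the test (values illustrative):

-- session 1 (repeatable read)
test=# begin transaction isolation level repeatable read;
BEGIN
test=# select test_byte from byte_test;
 test_byte
-----------
 b
(1 row)

-- session 2: delete and commit
test=# delete from byte_test;
DELETE 1

-- session 1: the deleted byte is still visible to our older snapshot
test=# select test_byte from byte_test;
 test_byte
-----------
 b
(1 row)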

Our Repeatable Read transaction is able to view our byte ‘b’ even after a transaction has deleted it and committed the same, thus confirming our hypothesis about the logical delete through MVCC.

Bloat

While we successfully did DELETE our byte, we have not removed it from our system yet. It's still in our data file. Our byte is stuck in purgatory, not visible to any newer transactions, but also not fully removed from our world. This data purgatory, composed of dead tuples sitting juxtaposed against live, visible ones, is commonly referred to as bloat.

Here is an experiment to show that this deleted byte is still living in our data file:
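A reconstruction of the experiment – insert the byte, flush, delete it, flush again, and then inspect the data file (with cat or pg_filedump as before), where the 'b' can still be spotted:

test=# insert into byte_test values ('b');
INSERT 0 1
test=# CHECKPOINT;
CHECKPOINT
test=# delete from byte_test;
DELETE 1
test=# CHECKPOINT;
CHECKPOINT

clm-C02V90ERHTDG:~$ cat /usr/local/var/iblog2019/special/PG_10_201707211/16385/16393
(unreadable binary output, with our deleted 'b' still in it)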

We create and delete our byte, making sure we flush all changes to the file system from the shared buffers using CHECKPOINTs along the way, but at the end we are still left with a byte 'b' in our data file. This bloat is dead weight that can adversely affect the performance of our system, because scans still have to wade through dead tuples; its growth can be mitigated using a setting called fillfactor, the discussion of which is beyond the scope of this post.

VACUUM

To reclaim this dead space, PostgreSQL has a feature called VACUUM. VACUUM is an operation that goes over all the dead tuples in a data file and marks them “free for use” for any new tuples that are inserted into the relation. Note that this is not equal to reclaiming disk space. That is, VACUUM (usually) does not reduce the size of a data file, it merely repossesses bloat for re-use.

VACUUM obtains a lightweight lock which does not interfere with the general operations performed on a relation and thus can be performed in parallel with regular transactional read/write/delete workloads.

VACUUM FULL is the heavyweight cousin of VACUUM that works by locking out all operations on a relation. In return, however, VACUUM FULL does not merely repossess bloat for re-use; it reclaims disk space by rewriting the entire table, compacting away all the dead space. In doing so, VACUUM FULL effectively creates a new table, with a new relfilenode, as shown below:
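A sketch of that (the new relfilenode value is illustrative):

test=# select relfilenode from pg_class where relname = 'byte_test';
 relfilenode
-------------
       16393
(1 row)

test=# vacuum full byte_test;
VACUUM
test=# select relfilenode from pg_class where relname = 'byte_test';
 relfilenode
-------------
       16398
(1 row)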

While VACUUM FULL seems great, the heavyweight lock it acquires can grind a section of operations on a cluster to a halt and hence great caution must be exercised before running the same.

PostgreSQL runs VACUUM on all tables in a cluster by default via yet another background process (autovacuum), thus making our understanding of the process snapshot of a running PostgreSQL cluster even clearer:
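In the process listing from earlier, this is the autovacuum launcher, which in turn spawns worker processes as needed:

clm-C02V90ERHTDG:~$ ps aux | grep autovacuum
... postgres: autovacuum launcher process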

Following VACUUM FULL, our byte of data is fully purged from the system, and that brings us to the end of the journey of our byte – from birth to death.

Does this mean that there is no way to get our byte back?

If we think back to our byte making its way from the client, through the shared buffers, onto our data file, there is an important stop it makes before that – the WAL! Remember, the WAL is append-only, meaning that we cannot "purge" our byte from the WAL records. And so, if we really wanted to get our byte back, we could Point In Time Recover (PITR) our cluster to a state before we committed the deleting transaction. This process may be very expensive for multiple reasons – WAL retention times, database file system backup policy and retention, WAL rotation, and the ensuing database downtime – which is why it may be overkill for our single byte of data, but it is not uncommon in industry settings to consider Point In Time Recovery of a cluster as part of a disaster recovery strategy to annul the effects of malicious or erroneous transactions.


Epilogue

And so – we’ve done it – taken a long journey through the internals of PostgreSQL by following the birth, metamorphosis and death of a single byte of data. Along the way, we made reference to a lot of nuances, judgement calls and complexities that are beyond the scope of this post, such as but not limited to:

  • Cluster layout
  • Database character encoding
  • Filesystem mounts for cluster size and throughput management
  • WAL rotation, retention and shipping
  • Backup and disaster recovery
  • Tuning CHECKPOINTs
  • Predicate locking
  • Pessimistic locking, advisory locking and deadlocks
  • Scheduling VACUUM FULL
  • Fill factor and bloat management

This is why we at Enova, like much of the rest of the industry, have an amazing team of Database Administrators answering these questions, and tinkering around with configuration to keep our PostgreSQL installations running smoothly.

While the intent of this post is to holistically inform newcomers to PostgreSQL, it also aspires to promote a sense of curiosity to peek beneath powerful abstractions in the era of DIY, BYO toolchain cloud computing. For most simple use-cases, one can be blissfully ignorant of these grisly details and use quality software like PostgreSQL effectively out of the box. For the few other complex, novel ideas that may go on to become the differentiators in our competitive software engineering landscape, the line between user and author dims rapidly. Long live novel ideas, long live curiosity!