Introduction to WAL (write-ahead logging)

Feb 23, 2017 by joonas.fi

WAL is a concept to achieve atomicity & durability - usually found in filesystems and database systems.

Both atomicity (the “A” in ACID) and durability (the “D” in ACID) are properties of ACID - the set of properties used to describe database systems.

What is atomicity & durability? Why do I want it?

When you add or insert data to, say an SQL database, when the database system responds with “OK - I saved your data”, it has to be sure that even in the face of a power loss or application crash, the updated data is still found from the file. This is known as durability.

When a database saves data, it almost always changes more than one value. If the operation fails for some reason, we don’t want the end result to be one value ending up as changed but the other value as unchanged. We want either all things to happen or none at all. This is known as atomicity. Imagine a banking system where a money transfer debits from your account but is never credited to the receiver. Not atomic.

Imagine ordering something from an online shop like Amazon, paying for it and never receiving the order.

When you ask the customer service about it, they don’t find your order. It turns out that during the order their computer crashed and the saved order was just lost. Not durable.

You would be pretty bummed about this, so competent tech companies use systems for saving data that guarantee atomicity & durability so we don’t suffer these problems.

The same goes for filesystems - you don’t want to lose all your files if your computer crashes.

Aren’t file writes atomic & durable by default?

Unfortunately, no! Yes I know I previously said that filesystems are atomic & durable, but those guarantees only apply at the filesystem level. The filesystem is protected against corruption, but the individual files are not. This is something that has to be achieved at the application level!

To begin explaining this, let’s oversimplify how a database system might write to a file. We might have a “users” table, stored in a file in a filesystem as users.txt:

id=1 username=joonas.fi registered=2017-02-01

Now, let’s do an insert into the database: INSERT INTO users (id, username, registered) VALUES (2, "hello", "2017-02-23")

After that write the file should look like this:

id=1 username=joonas.fi registered=2017-02-01
id=2 username=hello     registered=2017-02-23

The database system used pseudocode like this to write that change to the file:

fwrite("users.txt", "id=2 username=hello     registered=2017-02-23")

But, the system write call used to write to the file is not atomic (the “A” in ACID). Atomicity in short means that either the whole write call succeeds or the whole write call fails. And we don’t have that, so in the face of power loss or application crash we can easily end up with this in users.txt:

id=1 username=joonas.fi registered=2017-02-01
id=2 userna

Or even this:

id=1 username=joonas.fi registered=2017-02-01
           me=hello     registered=2017-02-23

Because nothing guarantees that the bytes written to the file are written in the order that you specified.

Writing to a file can always fail. There is no getting around this.

But I’ve heard about fsync, it fixes this right?

fsync alone does not fix this. fsync() is only a mechanism to ask the operating system to wait until all the pending writes are actually confirmed as written on the disk.

So, without fsync have this pseudocode:

fwrite("users.txt", "id=2 username=hello     registered=2017-02-23")

print("Changes were successfully written to users.txt")

This code is lying, since fwrite() can under the hood do anything it wants. Usually for performance reasons the operating system either:

  • Queues the write to the disk as a background task. It will finish soon, maybe.
  • Or even batches/buffers the write and waits for more writes to come so it can more efficiently write more data as a single operation.

This is where we can fix the above code with fsync()

fwrite("users.txt", "id=2 username=hello     registered=2017-02-23")

fsync("users.txt")

print("Changes were successfully written to users.txt")

Now the code is correct. fsync() tells the OS to wait until all the buffers are flushed to disk. When we receive the success message, we can be sure we’re not being lied to.

Okay fsync() solves durability, i.e. confirmed data written to disk is durable!

But what about atomicity? What if the fwrite() fails while we’re waiting for the fsync() to finish?

Good question! Like we explained before, nothing still guarantees that the fwrite() call succeeds or fails as a whole.

The fwrite() call can always land on the disk only partially when an application/operating system crashes or the power goes out. So with fsync() we get atomicity & durability but only if it succeeds. And it is not guaranteed to succeed and thus we can still end up with corrupted data (no atomicity).

In short, sequence of events:

  1. fwrite() start
  2. – queue write op 1
  3. – queue write op 2
  4. fwrite() returns
  5. fsync() start
  6. – wait for write op 1 to finish
  7. – wait for write op 2 to finish
  8. fsync() returns (atomicity & durability boundary only achieved here)

Remember: the power can go out in any of those states. Only in state #8 we are atomic & durable. In all the other states before we will end up with corrupted data if we’re not careful.

Ok, so how does WAL achieve atomicity & durability?

The problem boils down to:

We don’t want incomplete data in users.txt. We either want the write to fail as a whole or succeed as a whole.

That is human speak for wanting atomicity and durability.

So finally, this is where the subject, write-ahead logging, comes in!

WAL in essence: before writing the line to users.txt, we write first to write-ahead-log.txt:

wal-entry line=2 content=[id=2 username=hello     registered=2017-02-23]

Sidenote: WAL’s usually use byte offsets instead of line offsets. This is just easier to explain.

After we have fsync()ed the WAL, we can start writing to users.txt. If during that write the power goes out and we end up with:

line 1 | id=1 username=joonas.fi registered=2017-02-01
line 2 | id=2 userna

After we restart the database system, it scans the write-ahead-log.txt first and notices that according to the WAL the line 2 in users.txt is not what it should be and repairs it:

line 1 | id=1 username=joonas.fi registered=2017-02-01
line 2 | id=2 username=hello     registered=2017-02-23

It doesn’t matter if the repair operation fails because the power goes out again during the reparing fwrite(), because after database restart the repair operation will start again and eventually reach valid state.

As long as the WAL is intact, we are safe.

Okay that’s cool, but what if write to the WAL fails?

Excellent question again, champ! So we know that every fwrite() can fail mid-write, so writing the “repair entry” to WAL can fail as well and no amount of fsync() fixes that?

Correct. The trick is implementing the WAL in such a way that it is resilient to corruption. I.e. we need to be able to detect corrupted WAL entries - the ones that can occur before fsync() completes and power goes out. Corrupted WAL could look like this:

wal-entry line=2 content=[id=2 username=hello     registered=2017-02-23]
wal-entry line=3 content=[id=3 user

Ok, clearly that is corrupted. So when trying to write WAL entry for line 3 the fsync() never finished and we wound up with that.

But because the fsync() didn’t finish we didn’t confirm to the user that the change was applied - instead the user got an error (or timeout because the power went out), and she can know for sure that the operation didn’t go through. Or perhaps the application even automatically re-tries with a healthy server and the user never suffers from this error.

Remember:

We never acknowledge writes to database without fsync() to WAL.

We never write to users.txt before the WAL entry is fully fsync()ed, because only durable & atomic WAL entries can repair the tracked file.

Moving on, because we didn’t finish writing the WAL entry, the original users.txt was not even tried to be modified and thus is in entirely valid state. Only succesfully written WAL entries will be written to users.txt because only those entries can be repaired.

Broken WAL entries are discarded (either removed or ignored) and writes to users.txt correctly reported to clients as either succeeding or failing as a whole atomic operation, in a durable fashion.

Atomicity and durability. Thanks to WAL! Congratulations, you now understand one of the most fundamental aspects of a database system! :)

Recapping

With WAL:

  • The tracked file (users.txt) is effectively (after possible repair) always fully atomic & durable.
  • The log file (write-ahead-log.txt) as a whole is never atomic & durable.
    • Individual WAL entries though are atomic & durable.
    • We just need to be able to detect and deal with corrupt WAL entries.
comments powered by Disqus