What happens when databases crash?

Databases have tables and indexes stored in files. As you create rows, the database system writes to data pages in memory which is then written to data files on disk.

There is a problem though, what happens if you lose power half-way through writing to the file? As the database starts back up we have corrupted files. We quickly realize that we need something else to save us from crashes and power loss, and that is WAL (Write-ahead log) or Redo log.

As you write to tables and indexes we create a log entry in the WAL of those changes. We write the WAL to its own files and also write to the data pages in memory. It is OK not write to the actual table and index data files on disk, those can stay in memory, as long as we have a log we can always construct the table.

So WAL is much smaller than the actual data files, so we can flush them relatively faster. They are also sequential unlike data pages changes which are random. Hari from Yugabyte puts it nicely

WALs are not just smaller but also need fewer IOs. You have to write out a full sector to disk even if only one bit in it changed. WALs are append only and you only write to the end of the file. If you modify 4 rows that data may be in 4 sectors but may only need 1 WAL sector.

And even if you write multiple WAL sectors they are sequential which is faster than doing random IOs. The advantage was massive with spindle drives but even on SSDs it is faster.

The WAL can also be used for all sort of things like replications and backup and yes crash recovery.

What if we crashed while writing the WAL? Well, that is why we have transaction state. You can crash after writing WAL entries, it is fine, as long we know the transaction state belonging to each WAL entry we can discard or omit uncommitted WAL entries upon recovery.

For example if you are in the middle of a transaction and the database crashed, we consider the transaction rolled-back by default, so WAL entries flushed by this uncommitted transaction will be discarded.

But if you were able to issue a COMMIT and the WAL entry for a transaction commit makes it to disk and the client gets a success, that transaction is considered committed even if we crashed right after.

So we have established that WAL is the source of truth, as we are writing ahead of the data files, we write the changes to data pages in memory of course (for on going transactions to use the latest and greatest) but we delay flushing the data pages to disk because the WAL made it to disk.

Now we have data files on disk that are out of sync with what is in memory which is absolutely fine. As long as the database is running, we will only read from memory which has the latest, but if we did crash then its a problem.

As the database starts back up, the file is out of date we can’t just pull it on memory and have clients read them, the WAL is the source of truth, we need to REDO the changes in the WAL back on the data files, and during that process nothing is allowed to read (otherwise we get corruption). The database startup time slows down the more out of sync the data files are from the WAL (many writes has happened but the data files were not updated for a long time).

The question here can I control how data pages are flushed to disk? And the answer is yes, this is called as checkpointing, where a new WAL record is created called checkpoint record write after the entire data pages in memory are flushed to disk. This is a very delicate process, too much checkpointing sure makes start up faster, but it indeeds takes a toll on your I/O as checkpointing is IO heavy operation as we mentioned before data pages are very large.

Some database systems have also undo logs which I covered in details here.

Leave a Reply

Discover more from The AuroraTECH

Subscribe now to keep reading and get access to the full archive.

Continue reading