Implementation Details

Write-Ahead Logging

To maximize overall disk throughput, ForestDB writes all incoming mutations to a database file in a log-structured append-only manner; this is called the Write-Ahead Log (WAL). However, the index structures (i.e., ID index and Seq index) are not immediately updated in order to reduce the overall write amplification, and a batch of these pending mutations is eventually applied to the index structures later. For this, ForestDB maintains in-memory WAL ID index and WAL Seq index for indexing WAL documents that are not yet indexed by the main index structures. In this way, those documents can be efficiently retrieved by looking up in-memory WAL indexes.

WAL in-memory structure

As shown in the above figure, these in-memory WAL indexes are organized as hash tables that map from the hash value of a key (ID) or its sequence number to the offset in the database file where the (metadata, value) entry is located. Since these index structures reside only in memory, there is no further disk I/O overhead when we look up WAL entries.

WAL disk layout

There is a threshold on the total number of WAL entries, and we check the WAL size for every commit operation. The batch of main index updates for WAL entries are aggregated and written to both the ID index and Seq index if the total number of WAL entries is larger than the configured threshold. Updated nodes in the main index structures are sequentially appended at the end of a database file, and all entries in in-memory WAL indexes are removed. The upcoming mutations from an application will be appended at the end of a database file and added to the in-memory WAL indexes again.

File Layout

ForestDB supports optional block alignment in writing a document (i.e., key's metadata and value) in a database file. If block alignment is enabled, then when a document is written to a database file, ForestDB first checks the remaining space of the current block to avoid a document being written across several blocks. The document is simply appended if the remaining space is larger than the size of the document. If not, the document is written at the first byte of the next block, skipping the remaining space in the current block. When the size of a document is larger than the block size, we compare the total number of blocks to be occupied by this document if it were written at the end of the current position, versus if it were written at the beginning of the next block. Then, we choose the location that occupies fewer blocks. The following figure shows a file layout when the block align option is enabled.

Note that block alignment is a compile-time option that is disabled by default. (It can be enabled by defining the preprocessor symbol DOCIO_BLOCK_ALIGN, either in the source file option.h or on the compiler command-line.)

Block aligning for a document

Unlike documents, B+ tree nodes always exactly fit into a block, and they are always written at the beginning of a new block to avoid interleaving with documents. Since ForestDB does not allow in-place updates, a modified node is written at the end of file, invalidating the old node. If an update occurs on a leaf node, its parent node must also be modified and written in a new block, because the location of the leaf node has changed. For the same reason, all the nodes along the path from the updated leaf node to the root node must be updated and appended to the end of the file recursively. After writing all index updates, we finally write a database header block at the end of the file, which includes a pointer to the new root node.

File layout for B+ Tree

Buffer Cache

ForestDB has a separate global buffer cache for keeping frequently accessed data (e.g. index nodes of B+-trees) in memory, and also provides an option for bypassing the OS page cache through using the direct I/O flag (e.g., O_DIRECT on Linux.)

Buffer Cache Block Diagram

This global buffer cache is a shared resource among writers and readers, and consists of a global LRU list for opened database files. We maintain a separate AVL tree for each database file to keep track of the list of dirty blocks belonged to that file. All dirty blocks are sequentially written back to the corresponding file in ascending order of their offset values by traversing the AVL tree, which allows us to maximize the write throughput. Each file maintains a hash table with a key {block_id} and a value {a pointer to a cache entry in either the clean list or AVL tree} for a fast lookup in the global buffer cache.

Buffer Cache LRU

Recovery

ForestDB detects database file corruption by reading the last block of the file. If the last block is a B+ Tree index node, document or invalid database header, then the file is corrupted. In this case, ForestDB scans each block in reverse order to find the last valid database header written after index updates, and then reconstructs WAL entries normally committed until the last valid header in the file.

Recovery from crash

ForestDB also maintains a 4-byte checksum value (using CRC32 by default) for each B+-Tree index node, document, and database header to detect corruption in the interior of a database file.

A 1-byte block marker is added to the end of each block (4KB by default) to identify the block type (index node, document, or database header) as shown in the following figure.

Checksum based on CRC32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation Details

Write-Ahead Logging

File Layout

Buffer Cache

Recovery

Clone this wiki locally