
Overview

A general B+ tree implementation usually has the following drawbacks:

  • A B+ tree stores entire keys not only in its leaf nodes but also in its index (internal) nodes, including the root node. As keys get longer, the fan-out degree of each node shrinks, which increases the height of the tree. Since the number of disk accesses is proportional to the tree height (we traverse the tree from root to leaf), storing entire keys is inefficient for disk I/O. Furthermore, the entire key must be compared against the query at each node on the path from root to leaf. This makes a plain B+ tree inefficient for indexing variable-length keys.

  • If a file layout is not disk block-aligned, it will require more I/O operations than expected. Although the logical view of a file provided by the OS file system is byte-addressable, on the actual device I/O operations are performed on a per-block basis. When we try to read a node that was written across two blocks, the device driver has to read both blocks even if the node was short enough to fit in one. To avoid this additional overhead, a node should be aligned to fit into the minimum number of disk blocks, and different types of data should not be interleaved in a single block.

  • A B+ tree implementation can rely on the OS's file system cache to keep B-tree nodes in memory. However, the file system cache is shared among all processes, and it is well known that relying on the OS page cache can significantly increase latency and degrade I/O performance, especially on Solid-State Drives (SSDs).

To address these limitations, we propose ForestDB, a fast key-value storage engine based on a Hierarchical B+-Tree based Trie, or HB+-Trie, which was originally presented at the ACM SIGMOD 2011 Programming Contest by Jung-Sang Ahn, who works at Couchbase. The HB+-Trie provides efficient indexing and retrieval of a large number of variable-length keys, requiring fewer block device I/O operations than conventional tree-like structures. In addition, documents and internal index nodes are block-aligned so that the number of block I/O accesses is minimized while indexing and retrieving documents.

ForestDB supports Multi-Version Concurrency Control (MVCC) and persists database changes in an append-only manner. This allows us to have multiple concurrent readers and writers. Writers are all serialized, but can interleave with readers, which means that an arbitrary number of readers can access a given database instance concurrently without being blocked by writers. However, we usually recommend one writer and multiple readers for each database instance to avoid synchronization overhead among multiple writers.

Main Index Structure

HB+-Trie

ForestDB's main index structure is a variant of a trie (i.e., prefix tree) whose nodes are B+ trees. Each B+ tree has its own nodes, and the leaf nodes of each B+ tree hold pointers either to other B+ trees (sub-trees) or to actual documents. A root B+ tree sits at the top of the HB+-trie, and other sub-trees are created on demand as new nodes are added to the trie.

HB+-Trie Construction

As in the original trie, an HB+-trie splits a document key into fixed-length chunks. The size of each chunk is configurable, defaulting to 8 bytes. The i-th chunk is used as the key for the i-th level of B+ tree. Searching the index structure starts by looking up the root B+ tree with the first (leftmost) chunk as a key. If a pointer from a leaf node of the root B+ tree points to a target document, the lookup operation terminates. Otherwise the pointer points to the next-level B+ tree, and we continue the index traversal at that tree with the next key chunk. This repeats until the target document is found. This retrieval process is basically the same as that of the original trie.
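
The following is a minimal sketch of that chunk-by-chunk lookup loop. It is illustrative only: CHUNK_SIZE, btree_find, and the result layout are hypothetical names used for this example, not ForestDB internals.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 8            /* default chunk size in bytes */

typedef struct btree btree;     /* one B+ tree inside the HB+-trie */

/* Hypothetical result of a single B+ tree lookup: either a document
 * offset on disk or a pointer to the next-level (sub) B+ tree. */
typedef struct {
    int found;
    int is_subtree;
    btree *subtree;             /* valid when is_subtree != 0 */
    uint64_t doc_offset;        /* valid when is_subtree == 0 */
} btree_result;

/* Hypothetical per-tree lookup that uses one fixed-size chunk as the key. */
btree_result btree_find(btree *tree, const void *chunk, size_t chunk_len);

/* HB+-trie lookup: the i-th chunk of the key is used in the i-th level tree. */
int hbtrie_find(btree *root, const char *key, size_t keylen, uint64_t *doc_offset)
{
    btree *tree = root;
    size_t pos = 0;

    while (pos < keylen) {
        size_t chunk_len = (keylen - pos < CHUNK_SIZE) ? keylen - pos : CHUNK_SIZE;
        btree_result r = btree_find(tree, key + pos, chunk_len);

        if (!r.found)
            return -1;              /* key is not indexed */
        if (!r.is_subtree) {
            *doc_offset = r.doc_offset;
            return 0;               /* pointer leads directly to the document */
        }
        tree = r.subtree;           /* descend and consume the next chunk */
        pos += CHUNK_SIZE;
    }
    return -1;
}
```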

There are several benefits of the HB+-trie over typical tree-like structures. First, dealing with a long key can be fast and space-efficient. Since we do not need to compare an entire key for each node (only a chunk), a lookup operation can be faster than in tree-like structures which compare an entire key for each node on the path from a root to leaf nodes. Second, we do not store an entire key in the HB+-trie, whereas a typical B+ tree must provide appropriate space (up to key size * fan-out degree) for every node to store keys.

The HB+-trie also uses a common prefix compression scheme, as does the original trie. This scheme skips any common branch among keys sharing a common prefix, reducing unnecessary tree traversals. The tree that uses the first differing chunk (i.e., the chunk right after the common prefix) as a key stores the entire common prefix inside its header, so that the original key can be reconstructed for all searches passing through the tree. The next figure shows an example where the chunk size is one byte. The number in each triangle (i.e., n-th) denotes the chunk number used as a key in that tree, and the string represents the common prefix skipped by the tree.

Prefix Compression in HB+-Trie

We maintain two separate indexes per ForestDB instance: ID and Sequence (Seq) indexes. The ID index (i.e., primary index) uses a document ID as its index key, while the Seq index's key is a 64-bit sequence number. Each value (document) stored in the database has a sequence number, which is updated on each mutation (i.e., insert, update, delete) by incrementing a per-database sequence counter and assigning its current value. The Seq index thus provides a chronologically-ordered history of changes to the database.

The ID index is an HB+-trie, while the Seq index is a regular B+ tree. We use the same B+ tree implementation for the Seq index as for interior HB+-trie nodes. Since the key size of this B+ tree is fixed, there is no space overhead to support a variable-length key.

We can get a lot of benefit from an HB+-trie when keys are sufficiently long (at least 8 bytes) and their distribution is uniformly random (e.g., hash values). The next figure shows an example of random keys with a chunk size of 2 bytes. Since there is no common prefix, the keys can be distinguished by the first chunk and we do not need to create sub-trees to compare subsequent chunks. Supposing that the chunk size is n bits and the key distribution is (ideally) uniformly random, up to 2^n keys can be indexed by storing only their first n-bit chunks in the first B+ tree (e.g., a 2-byte chunk can distinguish up to 2^16 = 65,536 keys). This makes an HB+-trie much more efficient in terms of the space occupied by the index structure and the number of disk accesses, compared to a naive B+ tree implementation that stores entire keys in its index nodes.

HB+-Trie with a chunk size of two bytes and a random key distribution

Avoiding HB+-Trie Skew

An HB+-Trie can be skewed under specific key patterns. The figure below shows an example of a skewed HB+-Trie whose chunk size is 1 byte. If we insert a set of keys consisting of repetitions of the same pattern (and each pattern is exactly aligned to the chunk size) such as a, aa, aaa and aaaa, the HB+-Trie will be skewed as illustrated in the figure. This makes node (block) utilization very low, and the number of disk accesses on a skewed branch is significantly increased.

HB+-Trie Skew Example 1

The following figure represents another example of HB+-Trie skew. If all the keys have only two branches for each chunk, then all B+ Trees will store only two key-value pairs. This makes the overall block utilization very low so that lots of near-empty blocks will be created. As a result, both space overhead and the number of disk accesses will increase rapidly.

HB+-Trie Skew Example 2

To avoid these problems, we first define a leaf B+ Tree as a B+ Tree (other than the root) that has no child sub-trees. Unlike a normal (non-leaf) B+ Tree, a leaf B+ Tree uses as its key the entire sub-string (suffix) that follows the chunk used by its parent B+ Tree. The next figure depicts how they are organized.

HB+-Trie: Leaf B+ Tree

The root B+ Tree indexes documents using the first (1-byte) chunk as before, while leaf B+ Trees use the entire remainder of the keys, starting from the second chunk, as their keys. For example, the left leaf B+ Tree indexes documents with keys ‘aaabc’ and ‘aabb’ using their suffixes ‘aabc’ and ‘abb’, respectively. In this way, even when the HB+-Trie indexes key patterns that would otherwise trigger skew, no further sub-trees are created below a leaf B+ Tree, as shown by the right leaf B+ Tree in the figure.

However, because a leaf B+ Tree stores variable-length key suffixes, it has almost all of the problems of a typical B+ Tree described earlier. To address this, we extend a leaf B+ Tree when the total number of keys stored in it exceeds a certain threshold. The following figure shows an example of extending the left leaf B+ Tree.

HB+-Trie: Extension of a Leaf B+ Tree

For extension, we first find the common prefix among the keys stored in the target leaf B+ Tree. A new regular B+ Tree using the first differing chunk as its key is created, and the documents are re-indexed by that chunk. If several keys share the same chunk, we create a new leaf B+ Tree for that chunk, which uses the key suffixes following that chunk as its keys. As presented in the above figure, a normal B+ Tree using the third (1-byte) chunk is created, and documents with keys ‘aabb’ and ‘aac’ are indexed by their third chunks ‘b’ and ‘c’. Since ‘aaaa’ and ‘aaabc’ share the same third chunk ‘a’, we create a new leaf B+ Tree that indexes those documents using the remaining sub-strings ‘a’ and ‘bc’, respectively. In this way, we can heuristically distinguish rarely occurring skewed patterns from hot spots.
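
The first step of that extension, computing the common prefix shared by the keys in the leaf B+ Tree, can be sketched as follows. This is an illustration under assumed names, not ForestDB code; the re-indexing step that follows it is outlined in the trailing comment.

```c
#include <stddef.h>

/* Length (in bytes) of the common prefix shared by all n keys.
 * With a 1-byte chunk size, bytes and chunks coincide; for larger chunks
 * the result would be rounded down to a chunk boundary. */
size_t common_prefix_len(const char **keys, const size_t *keylens, size_t n)
{
    size_t prefix = keylens[0];
    for (size_t i = 1; i < n; i++) {
        size_t j = 0;
        size_t max = (prefix < keylens[i]) ? prefix : keylens[i];
        while (j < max && keys[0][j] == keys[i][j])
            j++;
        prefix = j;
    }
    return prefix;
}

/* Extension outline (pseudocode in comments):
 *   p = common_prefix_len(keys, keylens, n);
 *   create a regular B+ tree keyed by the chunk at position p;
 *   for each key:
 *     if its chunk at position p is unique, index the document directly;
 *     otherwise, group the keys sharing that chunk into a new leaf B+ tree
 *     keyed by the suffix that follows position p. */
```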

Features

Create, Read, Update, Delete (CRUD) Operations

The fdb_set function allows an application to perform create or update operations with a key, its metadata, and its value. The maximum sizes of a key, metadata, and value are 64KB, 64KB, and 4GB, respectively. They are all simply processed as binary objects.

ForestDB provides multiple ways to retrieve the metadata and value for a given key. fdb_get and fdb_get_byseq retrieve the metadata and value by key and by sequence number, respectively. fdb_get_byoffset allows an application to retrieve the metadata and value by their disk offset, which bypasses the main index lookup and is therefore faster.

fdb_del deletes the value for a key. In ForestDB, a delete operation simply marks a key as "deleted", and still keeps the key and its metadata in the database. A deleted key entry can be permanently purged as part of a database compaction operation, as explained below. ForestDB provides a configuration parameter that allows an application to keep deleted key entries in the database for a given period of time.
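
A minimal usage sketch of these CRUD calls is shown below. The exact function signatures and configuration structures have changed across ForestDB releases, so the shapes used here (fdb_open with a NULL config, the fdb_doc helpers, and the commit option) should be treated as approximations rather than a definitive reference; error handling is omitted.

```c
#include "libforestdb/forestdb.h"   /* header path may differ by release */

void crud_example(void)
{
    fdb_handle *db;
    fdb_doc *doc;

    /* Open (or create) a database file; NULL selects the default
     * configuration in the API variant assumed here. */
    fdb_open(&db, "./example.fdb", NULL);

    /* Create/update: key, metadata, and value are opaque binary objects. */
    fdb_doc_create(&doc, "key1", 4, "meta", 4, "value1", 6);
    fdb_set(db, doc);
    fdb_commit(db, FDB_COMMIT_NORMAL);
    fdb_doc_free(doc);

    /* Read: look up metadata and value by key through the ID index. */
    fdb_doc_create(&doc, "key1", 4, NULL, 0, NULL, 0);
    fdb_get(db, doc);                /* doc->meta and doc->body are filled in */
    fdb_doc_free(doc);

    /* Delete: marks the key as deleted; it is purged later by compaction. */
    fdb_doc_create(&doc, "key1", 4, NULL, 0, NULL, 0);
    fdb_del(db, doc);
    fdb_commit(db, FDB_COMMIT_NORMAL);
    fdb_doc_free(doc);

    fdb_close(db);
}
```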

Multi-Version Concurrency Control (MVCC)

ForestDB supports Multi-Version Concurrency Control for concurrent accesses to a database instance. With MVCC support and an append-only storage model, a writer and reader do not block each other: multiple readers and a single writer can access the same database instance concurrently without any locking overhead. This allows us to utilize disk I/O bandwidth more efficiently than traditional in-place update and locking approaches.

Sequence Index

As explained briefly in the previous section, ForestDB supports a sequence index to provide more flexible ways of retrieving the metadata and value for a key. Each ForestDB database instance maintains its own sequence counter. On every mutation (i.e., create, update, delete), the database increments its sequence counter, assigns the new value to the corresponding key, and inserts the sequence into the sequence index. The fdb_set function returns the sequence number to the application, which can later use it to retrieve the key's metadata and value by calling fdb_get_byseq.

When a commit is performed on a given database instance, the database instance's current sequence counter is persisted in the database header. A sequence index is also used to support snapshot, iterator, and rollback features that are explained below.

The primary intended purpose of the sequence index is to provide a timeline of database changes. By iterating the sequence index starting from a last-known sequence number, a client can visit every document that has changed since that point, and then save the last sequence number it saw as the starting point for the next traversal. This procedure is very useful for backup, replication, or updating external secondary indexes.
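
A sketch of that changes-feed pattern follows. The iterator signatures shown (fdb_iterator_sequence_init, the two-argument fdb_iterator_next, and 0 as an open-ended upper bound) follow the early iterator API and are assumptions; later releases split the iteration calls differently.

```c
#include "libforestdb/forestdb.h"

/* Visit every document changed since 'since_seq' and return the highest
 * sequence number seen, to be persisted as the next starting point.
 * Error handling omitted; signatures are approximate. */
fdb_seqnum_t replicate_changes(fdb_handle *db, fdb_seqnum_t since_seq)
{
    fdb_iterator *it;
    fdb_doc *doc = NULL;
    fdb_seqnum_t last_seq = since_seq;

    /* Iterate the sequence index from since_seq + 1 onward. */
    fdb_iterator_sequence_init(db, &it, since_seq + 1, 0, FDB_ITR_NONE);

    while (fdb_iterator_next(it, &doc) == FDB_RESULT_SUCCESS) {
        /* e.g. ship 'doc' to a backup, a replica, or a secondary index */
        last_seq = doc->seqnum;
        fdb_doc_free(doc);
        doc = NULL;
    }

    fdb_iterator_close(it);
    return last_seq;   /* checkpoint for the next run */
}
```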

Iterators

The fdb_iterator abstract type supports traversing a database instance with a partial or full range of keys or sequence numbers, and also can seek to a given key and traverse from that key. An iterator is created by calling fdb_iterator_init, and closed by calling fdb_iterator_close. An iterator traverses a snapshot of the database as of the moment of its creation, so the caller sees a consistent view of the database, regardless of changes made afterward.

Snapshots

An immutable point-in-time view of the database (snapshot) can be created by calling fdb_snapshot_open with a sequence number corresponding to a valid database header block. All of the read-only functions (fdb_get, fdb_get_byseq, fdb_get_byoffset, fdb_iterator_init, etc.) can be used with a snapshot. A snapshot instance can be closed and destroyed by calling the fdb_close API. If more than one reader requests the same snapshot, ForestDB does not create a separate copy for each reader; instead they share a single instance of the snapshot.
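
For example, a reader might open and use a snapshot as sketched below (approximate signatures; error handling omitted):

```c
#include "libforestdb/forestdb.h"

/* Read a key from an immutable view of the database as of the commit
 * identified by 'seqnum'. Signatures are approximate. */
void read_from_snapshot(fdb_handle *db, fdb_seqnum_t seqnum)
{
    fdb_handle *snap;
    fdb_doc *doc;

    /* 'seqnum' must correspond to a committed database header. */
    fdb_snapshot_open(db, &snap, seqnum);

    fdb_doc_create(&doc, "key1", 4, NULL, 0, NULL, 0);
    fdb_get(snap, doc);      /* served from the point-in-time view */
    fdb_doc_free(doc);

    fdb_close(snap);         /* closes and destroys the snapshot handle */
}
```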

Custom Compare Functions

Keys are ordered lexicographically (i.e. according to memcmp) in the main index by default. However, an application can use its custom compare function to override this default ordering behavior. The fdb_open_cmp function allows an application to provide its own custom compare function for the main index when the database is created.
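
A custom compare function is simply a memcmp-style callback supplied by the application. The sketch below orders keys case-insensitively; the callback signature shown here (two variable-length keys, memcmp-style return value) is an assumption, so the fdb_open_cmp declaration in the header should be checked for the exact shape.

```c
#include <string.h>
#include <strings.h>

/* Hypothetical custom comparator: order keys case-insensitively instead of
 * by raw bytes. Returns <0, 0, or >0 like memcmp. */
int cmp_nocase(void *key1, size_t keylen1, void *key2, size_t keylen2)
{
    size_t n = (keylen1 < keylen2) ? keylen1 : keylen2;
    int c = strncasecmp((const char *)key1, (const char *)key2, n);
    if (c != 0)
        return c;
    /* Equal up to the shorter length: the shorter key sorts first. */
    return (int)keylen1 - (int)keylen2;
}
```

Such a function would be passed via fdb_open_cmp when the database file is first created, and the same function must be supplied every time the file is reopened so that the on-disk ordering stays consistent.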

Write-Ahead Logging (WAL)

To maximize overall disk throughput, ForestDB writes all incoming mutations to a database file in a log-structured, append-only manner. However, the index structures (i.e., the ID index and Seq index) are not updated immediately, in order to reduce overall write amplification; a batch of pending mutations is applied to the index structures later. For this, ForestDB maintains an in-memory WAL ID index and a WAL Seq index for documents that have not yet been indexed by the main persistent index structures, so those documents can still be retrieved efficiently by looking up the in-memory WAL indexes. Refer to WAL Implementation for more details.
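
Conceptually, a read therefore consults the in-memory WAL index before falling back to the persistent HB+-trie. The sketch below illustrates that lookup order only, using hypothetical function names rather than ForestDB internals.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookups for illustration (not ForestDB's internal API). */
int wal_index_find(const char *key, size_t keylen, uint64_t *doc_offset);
int hbtrie_main_index_find(const char *key, size_t keylen, uint64_t *doc_offset);

int lookup(const char *key, size_t keylen, uint64_t *doc_offset)
{
    /* 1. Check the in-memory WAL ID index first: it covers documents already
     *    appended to the file but not yet merged into the HB+-trie. */
    if (wal_index_find(key, keylen, doc_offset) == 0)
        return 0;

    /* 2. Otherwise, fall back to the persistent main index (HB+-trie). */
    return hbtrie_main_index_find(key, keylen, doc_offset);
}
```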

Durability

ForestDB provides multiple options for durability:

  • Synchronous commit through the OS page cache (default mode)
  • Asynchronous commit through the OS page cache
  • Synchronous commit with the direct I/O option (O_DIRECT) to bypass the OS page cache
  • Asynchronous commit with the direct I/O option (O_DIRECT) to bypass the OS page cache

Since ForestDB has its own buffer cache whose capacity can be configured dynamically, it can bypass the OS page cache and write dirty pages directly to disk. Note, however, that the asynchronous options can lose data in the case of a system crash or power failure.
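
These modes are selected through the durability option in the configuration passed to fdb_open. The sketch below is hedged: the field names (durability_opt, buffercache_size) and constants (FDB_DRB_*) follow the fdb_config of that era and may differ in other releases; the remaining config fields would be filled with release-appropriate defaults.

```c
#include <string.h>
#include "libforestdb/forestdb.h"

void open_with_direct_io(fdb_handle **db)
{
    fdb_config config;
    memset(&config, 0, sizeof(config));
    /* ... remaining fields set to the defaults for the release in use ... */

    /* Synchronous commit bypassing the OS page cache via O_DIRECT.
     * Other assumed options: FDB_DRB_NONE (default, sync through the page
     * cache), FDB_DRB_ASYNC, and FDB_DRB_ODIRECT_ASYNC. */
    config.durability_opt = FDB_DRB_ODIRECT;

    /* ForestDB's own buffer cache capacity (bytes); especially relevant
     * when the OS page cache is bypassed. */
    config.buffercache_size = 256UL * 1024 * 1024;

    fdb_open(db, "./example.fdb", &config);
}
```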

Transaction

ForestDB supports transactions with ACID (Atomicity, Consistency, Isolation, Durability) properties. However, the isolation levels supported are currently limited to "read committed" and "read uncommitted". We plan to implement both "serializable" and "repeatable read" isolation levels in upcoming releases.

A transaction is started by calling fdb_begin_transaction and committed or aborted by calling fdb_end_transaction.
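
A minimal transaction sketch (approximate signatures; FDB_ISOLATION_READ_COMMITTED and FDB_COMMIT_NORMAL are the constants assumed here, and error handling is omitted):

```c
#include "libforestdb/forestdb.h"

/* Group two mutations so they become visible atomically when the
 * transaction is committed. */
void update_pair(fdb_handle *db, fdb_doc *doc1, fdb_doc *doc2)
{
    fdb_begin_transaction(db, FDB_ISOLATION_READ_COMMITTED);

    fdb_set(db, doc1);
    fdb_set(db, doc2);

    /* Commit (ends the transaction); both updates are persisted together.
     * Aborting instead would discard them. */
    fdb_end_transaction(db, FDB_COMMIT_NORMAL);
}
```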

Rollback

The rollback operation reverts the entire database state to a specific earlier point in time represented by a sequence number from one of the valid database header blocks. In other words, it rewinds (truncates) the database to the header block that has the sequence number passed to the fdb_rollback function. Note that all other writers will be blocked while the rollback operation is being executed.
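
For example, an application might keep the sequence number of a known-good commit and later revert to it as sketched below (approximate signature; fdb_rollback is assumed to take the handle by reference because the handle is re-instantiated on the reverted state):

```c
#include "libforestdb/forestdb.h"

/* Revert the database to an earlier committed state. 'stable_seq' would
 * typically be a sequence number the application saved right after a
 * known-good commit. Error handling is illustrative only. */
void revert_to_known_good(fdb_handle **db, fdb_seqnum_t stable_seq)
{
    fdb_status s = fdb_rollback(db, stable_seq);
    if (s != FDB_RESULT_SUCCESS) {
        /* The header for 'stable_seq' may no longer exist, e.g. because a
         * compaction has already removed it. */
    }
}
```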

Compaction

Since the database file is append-only, it grows monotonically and accumulates more and more obsolete data over time. Thus it must periodically be compacted. Compaction copies all of the live data to a new file, then replaces the old file with the new one.

This is performed either by calling fdb_compact manually or by having a daemon thread manage compaction automatically. Either manual (the default) or automated compaction can be chosen when the database is opened. Even if the daemon compaction mode is enabled for a given database file, manual compaction via the fdb_compact API can still be invoked by the application, as long as that database file is not being compacted by the daemon compaction thread at that moment.

When the compaction task starts, it scans the file to retrieve each (key, metadata, value) entry and then writes it into a new file to construct a new compacted HB+-Trie. While the compaction task is running, another writer thread can still write dirty items into the WAL section of the new file, which allows the compaction thread to be interleaved with the writer thread. When the compaction task is completed, the writer thread can reflect the WAL items into the new file’s HB+-Trie by flushing the in-memory WAL index entries.

If the compaction task fails due to a crash or for other reasons, then the next time the database is opened, the ForestDB recovery module opens the corrupted compaction file, reverse-scans it to retrieve the list of WAL items written by the writer thread, and finally moves those items back to the original, uncompacted file.

We recommend choosing one of the following options for interleaving write and compaction operations:

  • A single thread performs both regular writes and manual compaction operation. In other words, write and compaction operations are serialized by a single thread.

  • A separate thread is created in the application to perform compaction by invoking the fdb_compact API manually (see the sketch after this list). Note that the fdb_get_dbinfo function provides the database fragmentation ratio, which can be used to decide whether a compaction is required. With this option, an application can have a separate thread that performs regular write operations only, which can be interleaved with its compaction thread. This option provides better disk utilization than the single-threaded approach above.

  • The compaction is automatically managed and performed by the daemon thread inside the ForestDB engine, while an application writer thread performs regular write operations which can be interleaved with the ForestDB daemon compaction thread. This still provides better disk utilization than the serialized approach, while simplifying the application code. However, the first two approaches give the application more flexibility in scheduling and executing a compaction task.
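
For the second option, the application's compaction thread might poll the fragmentation ratio as sketched below. The fdb_info field names (space_used, file_size) and the 70% threshold are assumptions for illustration; the exact fields exposed by fdb_get_dbinfo should be checked in the header.

```c
#include "libforestdb/forestdb.h"

/* Compact the file when most of it is stale (overwritten or deleted) data.
 * Intended to run on a dedicated application thread; signatures and the
 * fdb_info field names are approximate. */
void maybe_compact(fdb_handle *db, const char *new_filename)
{
    fdb_info info;
    fdb_get_dbinfo(db, &info);

    /* Fragmentation ratio: fraction of the file not occupied by live data. */
    double frag = 1.0 - (double)info.space_used / (double)info.file_size;

    if (frag > 0.7)
        fdb_compact(db, new_filename);   /* copy live data into new_filename */
}
```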

If compaction is managed by a separate thread in the application or by the daemon thread inside ForestDB, then there are possible race conditions when using the fdb_snapshot_open and fdb_rollback APIs: these APIs will return an error if they are called with an earlier sequence number whose header has just been removed because the compaction thread finished compacting the database file.

Compression

The application can choose to enable compression of the values stored in the database (document bodies). ForestDB uses the Snappy library, which provides a moderate compression ratio with low CPU overhead.

Buffer Cache

ForestDB manages its own buffer cache to keep frequently accessed pages (e.g., index nodes) in memory. This also allows us to provide an option for bypassing the OS page cache by using the OS's direct I/O facility (i.e., O_DIRECT). In addition, a dedicated buffer cache lets us apply cache-management optimizations: for example, index nodes are given a higher weight than data pages, so more index nodes stay cached for faster index traversal, which yields much better read and write performance.

A file's buffer cache is shared among all handles opened on the file. This saves memory in multithreaded use cases (since each thread uses its own ForestDB handle) or when multiple snapshots are open.

Note that the buffer cache size can be configured dynamically, and the cache can be disabled completely. Please refer to Buffer Cache for implementation details.

Network Byte Ordering

ForestDB's data format is endian-safe, so that ForestDB files can be transferred and used across various platforms without any transformations. This feature will be quite useful when a database file is transferred and shared among heterogeneous mobile devices and desktop machines.
