
Overview

A general B+ tree implementation usually has the following drawbacks:

  • A B+ tree stores entire keys not only in leaf nodes but also in index (internal) nodes, including the root node. As keys get longer, the fan-out degree of each node gets smaller, which increases the height of the tree. Since the number of disk accesses is proportional to the tree height (we have to traverse from the root to a leaf), storing entire keys is inefficient for disk I/O. Furthermore, the entire key stored in each node must be compared with the query key while traversing from root to leaf. This makes such a B+ tree implementation inefficient for indexing variable-length keys.

  • If the file layout is not block-aligned, it requires more I/O operations than expected. Although the logical view of a file provided by the OS file system is byte-addressable, actual device I/O is performed on a per-block basis. When a node is written across two blocks, the device driver has to read both blocks regardless of the node size. To avoid this overhead, the node size should be a multiple of the block size, and different types of data should not be interleaved within a single block.

  • A B+ tree implementation can rely on the OS file system cache to keep B+ tree nodes in memory. However, the file system cache is a resource shared with other processes, and it is widely known that the OS page cache adds latency and degrades I/O performance, especially on Solid-State Drives (SSDs).

To address these limitations, we propose ForestDB, a fast key-value storage engine based on a hierarchical B+-tree-based trie data structure called HB+-Trie. ForestDB provides efficient indexing and retrieval of a large number of variable-length keys, in terms of block device I/O operations, over tree-like structures. In addition, documents and internal index nodes are stored in a block-aligned form so that the number of block I/O accesses is minimized while indexing documents.

ForestDB supports Multi-Version Concurrency Control (MVCC) and persists database changes in an append-only manner. This allows multiple concurrent readers and writers. Writers are serialized, but can interleave with readers, which means an arbitrary number of readers can access a given database instance concurrently without being blocked by writers. However, we generally recommend one writer and multiple readers per database instance to avoid synchronization overhead among multiple writers.

Main Index Structure

HB+-Trie

ForestDB's main index structure is a variant of a trie (i.e., prefix tree) whose nodes are B+ trees. Each B+ tree has its own nodes, and the leaf nodes of each B+ tree have pointers to either other B+ trees (sub-trees) or actual documents. There is a root B+ tree on top of the HB+-trie, and the other sub-trees are created on demand as new nodes are created in the trie.

HB+-Trie Construction

As in the original trie, HB+-trie splits a document key into fixed-length chunks. The chunk size is configurable, and the default is 4 bytes. Each chunk is used, in order, as the key for one level of B+ tree. Searching the index structure starts by retrieving the root B+ tree with the first (leftmost) chunk as the key. If a pointer from a leaf node of the root B+ tree points to the target document, the lookup terminates. Otherwise (i.e., the pointer points to the next-level B+ tree), we continue the traversal at the next-level tree with the next chunk, recursively, until the target document is found. This retrieval process is essentially the same as that of the original trie.
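
The following self-contained sketch only illustrates how a key is cut into fixed-length chunks and which chunk would be consumed at each level of the traversal; it is a toy illustration of the chunking scheme, not ForestDB's internal code, and the example key is made up. The 4-byte chunk size matches the default mentioned above.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4  /* default chunk size mentioned above */

int main(void) {
    const char *key = "user_profile_00042";   /* hypothetical document key */
    size_t keylen = strlen(key);

    /* Each 4-byte chunk keys one level of B+ tree, from the root downward. */
    for (size_t off = 0, level = 0; off < keylen; off += CHUNK_SIZE, ++level) {
        size_t len = keylen - off < CHUNK_SIZE ? keylen - off : CHUNK_SIZE;
        printf("level %zu B+ tree is searched with chunk \"%.*s\"\n",
               level, (int)len, key + off);
        /* In the real index, the traversal stops as soon as a leaf-node
         * pointer refers to the document itself rather than a sub-tree. */
    }
    return 0;
}
```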

There are several benefits of HB+-trie over typical tree-like structures. First, handling long keys is fast and space-efficient. Since we do not need to compare the entire key at each node, a lookup can be faster than in tree-like structures that compare the entire key at every node on the path from the root to a leaf. Second, HB+-trie does not store the entire key, whereas a typical B+ tree must reserve space (up to key size * fan-out degree) in every node to store keys.

Like the original trie, HB+-trie also provides a common-prefix compression scheme. This scheme skips any common branch among keys sharing a common prefix, reducing unnecessary tree traversals. The tree that uses the first differing chunk (i.e., the chunk right after the common prefix) as its key stores the entire common prefix in its header so that the original key can be reconstructed for all searches passing through the tree. The next figure shows an example where the chunk size is one byte. The number in each triangle (i.e., n-th) denotes the chunk number used as the key in that tree, and the string represents the common prefix skipped by the tree.

Prefix Compression in HB+-Trie

We maintain two separate indexes per ForestDB instance: an ID index and a Seq index. The ID index (i.e., primary index) uses a document ID as its index key, while the Seq index uses a 64-bit sequence number as its index key. The sequence number (maintained per database instance) is incremented for each mutation (i.e., insert, update, delete) and assigned to the corresponding document, which allows us to look up a document by its sequence number in the Seq index.

An HB+-trie is used as the ID index, while a B+ tree is used as the Seq index. We use the same B+ tree implementation for HB+-trie nodes and the Seq index. Since the key size of this B+ tree is fixed, there is no space overhead for supporting variable-length keys.

HB+-trie is most beneficial when keys are sufficiently long (at least 8 bytes) and their distribution is uniformly random (e.g., hash values). The next figure shows an example of random keys with a 2-byte chunk size. Since there is no common prefix, the keys can be distinguished by their first chunk alone, and we do not need to create sub-trees to compare subsequent chunks. If the chunk size is n bits and the key distribution is (ideally) uniformly random, up to 2^n keys can be indexed by storing only their first chunks in the first B+ tree; with the 2-byte chunks in the figure, that is up to 2^16 = 65,536 keys. This makes HB+-trie much more efficient, in terms of both the space occupied by the index structure and the number of disk accesses, than a naive B+ tree implementation that stores entire keys in its index nodes.

HB+-Trie with a two-byte chunk size and a random key distribution

Avoiding HB+-Trie Skew

HB+-Trie can be skewed under specific key patterns. The figure below shows an example of a skewed HB+-Trie with a chunk size of 1 byte. If we insert a set of keys that monotonically accumulate the same pattern (where each pattern is exactly aligned to the chunk size), such as a, aa, aaa, and aaaa, the HB+-Trie is skewed as illustrated in the figure. This makes node (block) utilization very low, and the number of disk accesses on a skewed branch increases significantly.

HB+-Trie Skew Example 1

The following figure shows another example of HB+-Trie skew. If all the keys have only two branches per chunk, then every B+ Tree stores only two key-value pairs. This makes overall block utilization very low, so many near-empty blocks are created. As a result, both the space overhead and the number of disk accesses increase rapidly.

HB+-Trie Skew Example 2

To avoid these problems, we first define a leaf B+ Tree as a B+ Tree that has no child sub-trees (the root B+ Tree is never treated as a leaf). Unlike a normal (non-leaf) B+ Tree, a leaf B+ Tree uses as its keys the entire remaining sub-string (postfix) right after the chunk used for its parent B+ Tree. The next figure depicts how they are organized.

HB+-Trie: Leaf B+ Tree

The root B+ Tree indexes documents using the first chunk as before, while leaf B+ Trees use the remaining sub-strings, starting from the second chunk, as their keys. For example, the left leaf B+ Tree indexes the documents with keys ‘aaabc’ and ‘aabb’ using their sub-strings ‘aabc’ and ‘abb’, respectively. In this way, even when HB+-Trie indexes key patterns that could trigger skew, no additional sub-trees are created thanks to the leaf B+ Tree, as shown in the right leaf B+ Tree in the figure.

However, a leaf B+ Tree shares almost all of the problems of a typical B+ Tree, because it stores variable-length postfixes as its keys. To address this, we extend a leaf B+ Tree when the total number of keys stored in it exceeds a certain threshold. The following figure describes an example of extending the left leaf B+ Tree.

HB+-Trie: Extension of a Leaf B+ Tree

For extension, we first find the common prefix among the keys stored in the target leaf B+ Tree. A new regular B+ Tree using the first differing chunk as its key is created, and the documents are re-indexed by that chunk. If several keys share the same chunk value, we create a new leaf B+ Tree for that chunk, and this leaf B+ Tree uses the remaining sub-strings right after the chunk as its keys. As presented in the figure above, a normal B+ Tree using the third chunk is created, and the documents with keys ‘aabb’ and ‘aac’ are indexed by their third chunks ‘b’ and ‘c’. Since ‘aaaa’ and ‘aaabc’ share the same third chunk ‘a’, we create a new leaf B+ Tree that indexes those documents using the remaining sub-strings ‘a’ and ‘bc’, respectively. In this way, we can heuristically distinguish rarely occurring skewed patterns from hot spots.
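
To make the extension decision concrete, the following self-contained sketch computes, for the example keys above (‘aaaa’, ‘aaabc’, ‘aabb’, ‘aac’) and a 1-byte chunk size, the longest common prefix and the first differing chunk. For simplicity it works on the full keys rather than the postfixes actually stored in the leaf B+ Tree, which yields the same differing chunk here; it is a toy illustration, not ForestDB's internal code.

```c
#include <stdio.h>
#include <string.h>

/* Length of the longest common prefix among a set of keys. */
static size_t common_prefix_len(const char **keys, size_t n) {
    size_t lcp = strlen(keys[0]);
    for (size_t i = 1; i < n; ++i) {
        size_t j = 0;
        while (j < lcp && keys[i][j] != '\0' && keys[i][j] == keys[0][j]) ++j;
        lcp = j;
    }
    return lcp;
}

int main(void) {
    /* The document keys from the extension example above. */
    const char *keys[] = { "aaaa", "aaabc", "aabb", "aac" };
    size_t n = sizeof(keys) / sizeof(keys[0]);
    size_t chunk_size = 1;               /* 1-byte chunks, as in the figure */

    size_t lcp = common_prefix_len(keys, n);
    size_t chunk_no = lcp / chunk_size;  /* 0-based index of the first differing chunk */

    printf("common prefix: \"%.*s\"\n", (int)lcp, keys[0]);
    printf("first differing chunk: #%zu (i.e., the %zu-th chunk)\n",
           chunk_no, chunk_no + 1);

    /* Documents are re-indexed by that chunk; keys that still collide on it
     * ('aaaa' and 'aaabc' both have 'a' there) go into a new leaf B+ Tree
     * keyed by the remainder after that chunk ('a' and 'bc'). */
    for (size_t i = 0; i < n; ++i) {
        printf("key \"%s\": chunk '%c', remainder \"%s\"\n",
               keys[i], keys[i][chunk_no], keys[i] + chunk_no + 1);
    }
    return 0;
}
```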

Features

Create, Read, Update, Delete (CRUD) Operations

The fdb_set API allows an application to perform create or update operations with a key, its metadata, and its value. The maximum key size currently supported is 3840 bytes, which is determined by the page size (4KB) and the space overhead of the B+ tree node header. We plan to remove this size limitation in the near future. There are no limits on the sizes of a key's metadata and value. A key, its metadata, and its value are all processed simply as binary objects.

ForestDB provides multiple ways of retrieving the metadata and value for a given key. The fdb_get and fdb_get_byseq APIs retrieve the metadata and value by key and by sequence number, respectively. The fdb_get_byoffset API allows an application to retrieve the metadata and value by their disk offset, which avoids the main index lookup and thus supports a faster lookup operation.

The fdb_del API supports delete operations. In ForestDB, a delete operation simply marks a key as "deleted" and still keeps the key and its metadata in the database. A deleted key entry can be permanently purged as part of a database compaction operation, which is explained below. ForestDB provides a config parameter that allows an application to keep deleted key entries in the database for a given period of time.
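
A minimal usage sketch of these CRUD APIs follows. It assumes the C API named on this page (fdb_set, fdb_get, fdb_del) together with fdb_doc_create, fdb_doc_free, and fdb_commit; handle types, config setup, and exact signatures vary between ForestDB releases, so treat it as illustrative rather than definitive, and check every returned fdb_status in real code.

```c
#include <libforestdb/forestdb.h>

/* Sketch only: error handling and database open/close are omitted. */
void crud_example(fdb_handle *db) {
    fdb_doc *doc;

    /* Create/update: key, metadata, and value are opaque binary objects. */
    fdb_doc_create(&doc, "key1", 4, "meta1", 5, "value1", 6);
    fdb_set(db, doc);
    fdb_commit(db, FDB_COMMIT_NORMAL);   /* persist the mutation */
    fdb_doc_free(doc);

    /* Read: look up metadata and value by key. */
    fdb_doc_create(&doc, "key1", 4, NULL, 0, NULL, 0);
    fdb_get(db, doc);                    /* fills doc->meta and doc->body */
    fdb_doc_free(doc);

    /* Delete: marks the key as "deleted"; the key and its metadata remain
     * in the file until they are purged by a later compaction. */
    fdb_doc_create(&doc, "key1", 4, NULL, 0, NULL, 0);
    fdb_del(db, doc);
    fdb_commit(db, FDB_COMMIT_NORMAL);
    fdb_doc_free(doc);
}
```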

Multi-Version Concurrency Control (MVCC)

ForestDB supports Multi-Version Concurrency Control for concurrent accesses to a database instance. With the MVCC support and the append-only storage model, a writer and a reader do not block each other, and multiple readers can access the same database instance concurrently without any locking overhead. This allows us to utilize the disk much better than traditional in-place update and locking approaches.

Sequence Index

As explained briefly in the previous section, ForestDB supports a sequence index to provide more flexible ways of retrieving the metadata and value for a key. Each ForestDB database instance maintains its own sequence counter, increments it for every mutation (i.e., create, update, delete) received, assigns the new number to the corresponding key, and updates the sequence index. The application receives a mutation's sequence number from ForestDB and can use it to retrieve the key's metadata and value by calling the fdb_get_byseq API.

When a commit is performed on a given database instance, its current sequence counter is persisted in the database header. The sequence index is also used to support the snapshot, iterator, and rollback features explained below.
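
A small sketch of a lookup by sequence number is shown below. It assumes the fdb_get_byseq API named above and an fdb_doc structure with a seqnum field, as in the public header; field and signature details may differ between releases.

```c
#include <libforestdb/forestdb.h>

/* Sketch: retrieve a document's key, metadata, and value by its sequence
 * number. Error handling is omitted. */
void get_by_seqnum(fdb_handle *db, fdb_seqnum_t seqnum) {
    fdb_doc *doc;

    fdb_doc_create(&doc, NULL, 0, NULL, 0, NULL, 0);
    doc->seqnum = seqnum;       /* which mutation to look up */
    fdb_get_byseq(db, doc);     /* fills doc->key, doc->meta, doc->body */
    fdb_doc_free(doc);
}
```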

Iterators

The fdb_iterator APIs support traversing a database instance over a partial or full range of keys or sequence numbers, and can also seek to a given key and traverse from there. A point-in-time view of the database is created by calling the fdb_iterator_init API and closed by calling the fdb_iterator_close API.
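
The sketch below iterates over a key range. The iterator API has changed across ForestDB releases; this follows the older style in which fdb_iterator_next returns the next document directly, so treat the exact calls and constants (FDB_ITR_NONE, FDB_RESULT_ITERATOR_FAIL) as an assumption about one particular version.

```c
#include <libforestdb/forestdb.h>

/* Sketch: traverse all keys in the range ["a", "z"]. Error handling omitted. */
void iterate_range(fdb_handle *db) {
    fdb_iterator *it;
    fdb_doc *doc;

    fdb_iterator_init(db, &it, "a", 1, "z", 1, FDB_ITR_NONE);
    while (fdb_iterator_next(it, &doc) != FDB_RESULT_ITERATOR_FAIL) {
        /* doc->key, doc->meta, and doc->body describe the current entry. */
        fdb_doc_free(doc);
    }
    fdb_iterator_close(it);
}
```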

Snapshots

A point-in-time view of the database (snapshot) can be created by calling the fdb_snapshot_open API with a sequence number taken from one of the valid database header blocks. All read-only APIs (e.g., fdb_get, fdb_get_byseq, fdb_get_byoffset, fdb_iterator, etc.) can be used on a snapshot created with the fdb_snapshot_open API. A snapshot instance can be closed and destroyed by calling the fdb_close API. If more than one reader requests the same snapshot, ForestDB does not create a separate copy for each reader, but instead lets them share a single snapshot instance.
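
The following sketch opens a snapshot, reads from it, and closes it with fdb_close as described above. Signatures follow the public header and may differ slightly between releases; the key used is the hypothetical one from the earlier CRUD sketch.

```c
#include <libforestdb/forestdb.h>

/* Sketch: open a read-only point-in-time view at a given sequence number. */
void read_from_snapshot(fdb_handle *db, fdb_seqnum_t seqnum) {
    fdb_handle *snap;
    fdb_doc *doc;

    fdb_snapshot_open(db, &snap, seqnum);   /* point-in-time view */

    fdb_doc_create(&doc, "key1", 4, NULL, 0, NULL, 0);
    fdb_get(snap, doc);                     /* any read-only API works here */
    fdb_doc_free(doc);

    fdb_close(snap);                        /* destroy the snapshot handle */
}
```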

Custom Compare Functions

Keys are ordered lexicographically in the main index by default. However, an application can provide a custom compare function to override this default ordering. The fdb_open_cmp APIs allow an application to register its own compare function for the main index when the database is created.

Write-Ahead Logging (WAL)

To maximize overall disk throughput, ForestDB writes all incoming mutations to a database file in a log-structured, append-only manner. However, the index structures (i.e., the ID index and the Seq index) are not updated immediately, in order to reduce overall write amplification; the batch of pending mutations is applied to the index structures later. For this, ForestDB maintains an in-memory WAL ID index and WAL Seq index for the WAL documents that have not yet been indexed by the main index structures. In this way, WAL documents can be efficiently retrieved by looking them up in the in-memory WAL indexes. Refer to WAL Implementation for more details.

Durability

ForestDB provides multiple options for durability:

  • Synchronous commit through the OS page cache (default mode)
  • Asynchronous commit through the OS page cache
  • Synchronous commit with the direct I/O option (O_DIRECT) to bypass the OS page cache
  • Asynchronous commit with the direct I/O option (O_DIRECT) to bypass the OS page cache

Since ForestDB has its own buffer cache, whose capacity can be configured dynamically, it can bypass the OS page cache and write dirty pages directly to disk. However, the asynchronous options can lose data in the case of a system crash or power failure.
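
A minimal sketch of selecting one of these options via the configuration structure is shown below. The durability_opt field and the FDB_DRB_* flags follow the public header of recent releases; names may differ in older ones, so treat this as an assumption.

```c
#include <libforestdb/forestdb.h>

/* Sketch: pick a durability option before opening the database. */
void configure_durability(fdb_config *config) {
    /* Default: synchronous commit through the OS page cache. */
    config->durability_opt = FDB_DRB_NONE;

    /* Alternatives:
     *   FDB_DRB_ASYNC          - asynchronous commit through the OS page cache
     *   FDB_DRB_ODIRECT        - synchronous commit with O_DIRECT
     *   FDB_DRB_ODIRECT_ASYNC  - asynchronous commit with O_DIRECT */
}
```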

Transaction

Transactions with ACID (Atomicity, Consistency, Isolation, Durability) properties are supported in ForestDB. However, the isolation levels supported at this time are "read committed" and "read uncommitted"; we plan to implement the "serializable" and "repeatable read" isolation levels in upcoming releases. A transaction is started by calling the fdb_begin_transaction API and committed or aborted by the fdb_end_transaction API.
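
The sketch below groups two mutations into one transaction at the "read committed" isolation level, using the fdb_begin_transaction and fdb_end_transaction APIs named above. Exact signatures and the FDB_ISOLATION_READ_COMMITTED / FDB_COMMIT_NORMAL constants follow the public header and may vary between releases.

```c
#include <libforestdb/forestdb.h>

/* Sketch: commit two writes atomically. Error handling omitted. */
void transactional_update(fdb_handle *db) {
    fdb_doc *doc;

    fdb_begin_transaction(db, FDB_ISOLATION_READ_COMMITTED);

    fdb_doc_create(&doc, "key1", 4, NULL, 0, "v1", 2);
    fdb_set(db, doc);
    fdb_doc_free(doc);

    fdb_doc_create(&doc, "key2", 4, NULL, 0, "v2", 2);
    fdb_set(db, doc);
    fdb_doc_free(doc);

    /* Commit both mutations atomically. */
    fdb_end_transaction(db, FDB_COMMIT_NORMAL);
}
```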

Rollback

The rollback operation reverts the database state to a specific point represented by a sequence number taken from one of the valid database header blocks. In other words, rollback rewinds the database to the header block that has the sequence number passed to the fdb_rollback API. Note that all other writers are blocked while the rollback operation is being executed.
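
A minimal sketch of the fdb_rollback call is shown below. It assumes the handle is passed by address because it may be re-instantiated by the rollback, as in the public header of some releases; verify the signature against your version.

```c
#include <libforestdb/forestdb.h>

/* Sketch: rewind the database to the header identified by seqnum. */
void rollback_to(fdb_handle **db, fdb_seqnum_t seqnum) {
    fdb_rollback(db, seqnum);   /* other writers are blocked while this runs */
}
```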

Compaction

Compaction can be performed either by calling the fdb_compact API manually or by having a daemon thread manage compaction automatically. Manual (the default) or automatic daemon compaction is chosen when the database is opened. Even if the daemon compaction mode is enabled for a given database file, manual compaction via the fdb_compact API can still be invoked by the application as long as that file is not currently being compacted by the daemon thread.

When the compaction task starts, it scans the given ForestDB file to retrieve each (key, metadata, value) entry and writes it into a new file, constructing a new, compacted HB+-Trie. While the compaction task is running, another writer thread can still write dirty items into the WAL section of the new file, which allows the compaction thread to be interleaved with the writer thread. When the compaction task is complete, the writer thread reflects the WAL items into the new file's HB+-Trie by flushing the in-memory WAL index entries.

If the compaction task fails due to a system crash or other reasons, the ForestDB recovery module opens the corrupted compaction file, reverse-scans it to retrieve the list of WAL items written by the writer thread, and moves those items back to the original, uncompacted file as part of the database open operation.

We recommend that a ForestDB user choose one of the following options for interleaving write and compaction operations:

  • A single thread performs both regular writes and manual compaction. In other words, write and compaction operations are serialized by a single thread.

  • A separate thread is created on the application side to perform compaction by invoking the fdb_compact API manually. Note that the fdb_get_dbinfo API provides database fragmentation information that the compaction thread can use to decide whether a compaction is required (see the sketch after this list). With this option, an application can have a separate thread that performs only regular write operations and is interleaved with its compaction thread. This option can provide better disk utilization than the single-thread approach above.

  • Compaction is automatically managed and performed by the daemon thread inside the ForestDB engine, while a writer thread on the application side performs regular write operations and is interleaved with the ForestDB daemon compaction thread. This option can provide better disk utilization than the serialized approach and makes an application simpler. However, the first two approaches give an application more flexibility in scheduling and executing compaction.
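
The sketch below illustrates the second option: an application-side thread checks the fragmentation of the file via fdb_get_dbinfo and triggers fdb_compact when it crosses a threshold. It assumes fdb_info exposes file_size and space_used fields, as in the public header of recent releases; field names and signatures may differ in older versions, and the 50% threshold is an arbitrary example.

```c
#include <libforestdb/forestdb.h>

/* Sketch: compact the file when more than half of it is stale data. */
void compact_if_fragmented(fdb_handle *db, const char *new_filename) {
    fdb_info info;

    fdb_get_dbinfo(db, &info);

    /* Stale (garbage) bytes as a fraction of the whole file. */
    double frag = (double)(info.file_size - info.space_used)
                  / (double)info.file_size;

    if (frag > 0.5) {
        /* Rewrite live documents into new_filename and switch over to it. */
        fdb_compact(db, new_filename);
    }
}
```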

Compression

Data compression can optionally be enabled by an application. ForestDB uses the Snappy library to compress and decompress a key's value.

Buffer Cache

ForestDB manages its own buffer cache to keep frequently accessed pages (e.g., index nodes) in memory and avoid evicting them. This also allows us to provide an option for bypassing the OS page cache by using the direct I/O flag (i.e., O_DIRECT). In addition, the dedicated buffer cache allows us to implement cache-management optimizations; for example, we give index nodes a higher weight than data pages, so that more index nodes stay cached for faster index traversal, which yields much better read and write performance. Note that the buffer cache size can be configured dynamically, and the cache can also be disabled completely. Please refer to Buffer Cache for implementation details.

Network Byte Ordering

ForestDB uses an endian-safe file format, so ForestDB files can be transferred and used across various platforms without any transformation. This feature is quite useful when a database file is transferred and shared among heterogeneous mobile devices and desktop machines.
