-
Notifications
You must be signed in to change notification settings - Fork 174
Overview
A general B+-tree implementation usually has the following drawbacks:
-
B-tree stores an entire key inside not only leaf nodes but also index (internal) nodes including root node. As the length of key gets longer, the fan-out degree of each node gets smaller, and it makes the height of entire tree longer. Since the number of disk accesses is proportional to the height of tree (because we should traverse the tree from root to leaf), storing the entire key is inefficient for disk I/O. Furthermore, we should compare the entire key in each node with query while traversing from root to leaf. This B-tree implementation is inefficient to index variable-length keys.
-
If a file layout is not block-aligned, it will require more IO operations than expected. Although the logical view of a file provided by the OS file system is byte-addressable, the actual device I/O operation is performed on a per-block basis. When we try to read a node written over two blocks, the device driver has to read the two blocks even though whatever the node size is. To avoid this additional overhead, the size of a node should be fitted into multiple size of a block, and different types of data should not be interleaved in a single block.
-
B+-tree implementation can rely on the OS file system cache to cache B-tree nodes in memory. However, the file system cache is a shared resource among other processes, and it has been widely known that the OS page cache significantly increases the latency and slows down the IO performance on Solid-State Drives (SSDs) especially.
To address these limitations, we propose ForestDB, a fast key-value storage engine that is based on Hierarchical B+-tree Trie data structure, called HB+-Trie. ForestDB provides efficient indexing and retrieval on a large number of variable-length keys over tree-like structures, in terms of block device I/O. In addition, documents and internal index nodes are stored as a block-aligned form so that we can minimize the number of block I/O accesses while indexing the documents.
The main index structure is a variant of trie (i.e., prefix tree) whose nodes are B+-trees. Each B+-tree has its own nodes where all leaf nodes of each B+-tree have pointers to other B+-trees (sub-trees) or actual documents. There is a root B+-tree on top of HB+-trie, and other sub-trees are created on-demand as new nodes are created in trie.