-
Notifications
You must be signed in to change notification settings - Fork 6.6k
External Table (Experimental)
RocksDB has a few pre-defined SST formats, such as block based, plain and (now deprecated) cuckoo. External tables allow a user to extend RocksDB by using a custom table file format. This can be useful for specific use cases that demand more performance than the pre-defined formats. It also makes it easier to experiment with new file formats, such as columnar, and evaluate them before deciding if its worth incorporating into core RocksDB.
External table support is implemented as a table factory that can be configured in ColumnFamilyOptions. When an external table factory is in use, there are several restrictions on that column family. An external table file can only contain Puts. Tombstones, merges, range deletions etc. are not supported. Sequence numbers for KVs is also not supported and each key has an implicit sequence number of 0. While its not enforced, live writes to the CF are not allowed and might result in undefined behavior if it occurs. Taken together, these restrictions mean that external table files are applicable for read-only ingestion only use cases where the files are generated using SstFileWriter and ingested into a column family. The files should be confined to the bottommost level, which can be accomplished by ingesting files withe the atomic_replace ingestion option.
The external table can support one or both of the following layouts -
- Total order seek - All the keys in the files are in sorted order, and a user can seek to the first, last, or any key in between and iterate forwards or backwards till the end of the range. To support this mode, the implementation needs to use the comparator passed in ExternalTableOptions to enforce the key ordering. The prefix_extractor in ExternalTableOptions and the ExternalTableReader interfaces can be ignored.
- Prefix seek - In this mode, the prefix_extractor is used to extract the prefix from a key. All the keys sharing the same prefix are ordered in ascending order according to the comparator. However, no specific ordering is required across prefixes. Users can scan keys by seeking to a specific key inside a prefix, and iterate forwards or backwards within the prefix. The prefix_same_as_start flag in ReadOptions will be true.
- Both - If supporting both of the above, a user can seek inside a prefix and iterate beyond the prefix. The prefix_same_as_start in ReadOptions will be false. Additionally, the total_order_seek flag can be set to true to seek to the first non-empty prefix (as determined by the key order) if the seek prefix is empty.
An external table format is implemented as a factory class - ExternalTableFactory
. There are two main methods that must be implemented - NewTableBuilder()
and NewTableReader()
. See external_table.h for the full interface.
NewTableBuilder()
should open a file for writing at the given path, construct and return an ExternalTableBuilder
object. The sequence of operations to write an external table is as follows -
- Add() is called one or more times to write all key-values to the table. Its called in increasing key order, as determined by the comparator. The input key is a user key, i.e sequence number and value type are stripped out.
- After every Add() operation, status() is called to check the current status.
- After the last key is added, Finish() is called to do whatever is necessary to ensure the data is persisted in the table file.
- If there is a failure midway for some reason, Abandon() is called instead of Finish().
- At the end, PutTableProperties() is called to write a properties block. FileSize() and status() are called to get the final size of the file and the final status. GetFileChecksum() and GetFileChecksumFuncName() may also be called to get checksum information about the whole file, but their implementation is optional.
NewTableReader()
is called when the file is loaded, typically during DB open. It should construct and return an ExternalTableReader
object which implements methods for point lookups and iterator creation. To scan multiple KVs, NewIterator()
is called to construct an ExternalTableIterator
. The scan might have a pointer to an upper bound key specified in iterate_upper_bound
in ReadOptions
. The Prepare()
method may optionally be called to allow the iterator to prefetch data upfront for a series of scans.
An external table factory can be configured for a column family as follows -
std::shared_ptr<ExternalTableFactory> my_factory = std::make_shared<MyExternalTableFactory>();
Options options;
options.table_factory = NewExternalTableFactory(my_factory);
// Disable live writes
options.disallow_memtable_writes = true;
See here for an example of the full workflow of creating, ingesting and querying an external table.
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- SST Partitioner
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc