Index Block Format
An index block contains one entry per data block, where the key is a string >= the last key in that data block and < the first key in the successive data block. The value is the BlockHandle for the data block. If kTwoLevelIndexSearch is used as the IndexType, the index block is a 2nd-level index on index partitions, i.e., each entry points to another index block that contains one entry per data block. In this case, the format will be:
[index block - 1st level]
[index block - 1st level]
...
[index block - 1st level]
[index block - 2nd level]
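The lookup implied by this layout can be sketched as two binary searches. This is a minimal in-memory illustration, not RocksDB's reader code: `Handle`, `IndexEntry`, `FindEntry`, and `FindDataBlock` are hypothetical names, and the 2nd-level entries select a partition by position here instead of reading it from the file through its block handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a block handle: an (offset, size) pair.
struct Handle {
  uint64_t offset;
  uint64_t size;
};

// One index entry: a separator key >= the last key of the block it points to.
struct IndexEntry {
  std::string separator;
  Handle handle;
};

// Returns the position of the first entry whose separator is >= target,
// or nullopt if target sorts after every separator.
inline std::optional<size_t> FindEntry(const std::vector<IndexEntry>& index,
                                       const std::string& target) {
  auto it = std::lower_bound(
      index.begin(), index.end(), target,
      [](const IndexEntry& e, const std::string& t) { return e.separator < t; });
  if (it == index.end()) return std::nullopt;
  return static_cast<size_t>(it - index.begin());
}

// Two-level lookup: the 2nd-level index picks a partition, and that
// partition (a 1st-level index block) picks the data block.
// Simplification: partitions[i] is the block that top_level[i] points to.
inline std::optional<Handle> FindDataBlock(
    const std::vector<IndexEntry>& top_level,
    const std::vector<std::vector<IndexEntry>>& partitions,
    const std::string& target) {
  assert(top_level.size() == partitions.size());
  auto p = FindEntry(top_level, target);
  if (!p) return std::nullopt;
  auto e = FindEntry(partitions[*p], target);
  if (!e) return std::nullopt;
  return partitions[*p][*e].handle;
}
```

Because each separator is >= the last key of its block and < the first key of the next one, the first separator that is not smaller than the target identifies the only block that can contain it.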
Up to RocksDB version 5.14 (BlockBasedTableOptions::format_version=2), index and data blocks share the same format: index blocks use the same <user_key,seq> key format as data blocks but store special values, <offset,size>, that point to data blocks. format_version=3 and format_version=4 offer more optimized, yet forward-incompatible, formats for index blocks.
- format_version=3 (since RocksDB 5.15): In most cases the sequence number seq is not necessary for keys in the index blocks. In such cases, this format_version skips encoding the sequence number and sets index_key_is_user_key in TableProperties, which tells the reader how to decode the index block.
- format_version=4 (since RocksDB 5.16): Changes the format of index blocks by delta-encoding the index values, which are the block handles. This saves the encoding of BlockHandle::offset for the non-head index entries in each restart interval. If used, TableProperties::index_value_is_delta_encoded is set, which tells the reader how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever shared_size is 0, which includes the first entry in each restart point. Otherwise the value is delta-size = block handle size - size of last block handle.
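To make the format_version=3 saving concrete, the sketch below shows the difference between a <user_key,seq> key and a bare user key. It assumes the usual RocksDB internal-key layout, which this page does not spell out: the user key followed by an 8-byte little-endian trailer packing the sequence number and a one-byte value type. `MakeInternalKey` and `StripTrailer` are hypothetical helper names for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed layout: user_key bytes, then 8 bytes holding (seq << 8) | type.
constexpr size_t kTrailerBytes = 8;

// Builds a <user_key,seq> style key under the assumed layout.
inline std::string MakeInternalKey(const std::string& user_key, uint64_t seq,
                                   uint8_t type) {
  const uint64_t packed = (seq << 8) | type;
  std::string out = user_key;
  for (int i = 0; i < 8; ++i) {
    // Little-endian: least significant byte first.
    out.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  return out;
}

// What an index key looks like once the sequence number is not encoded:
// only the user-key bytes remain.
inline std::string StripTrailer(const std::string& internal_key) {
  assert(internal_key.size() >= kTrailerBytes);
  return internal_key.substr(0, internal_key.size() - kTrailerBytes);
}
```

Under this assumption, every index entry that omits the sequence number is 8 bytes shorter before prefix compression is applied.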
The index format in format_version=4 would be as follows:
restart_point 0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
restart_point 1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
...
restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
where k is the key, v is the value, and the encoding of each value is shown in parentheses.
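Decoding one restart interval of this layout can be sketched as follows. The sketch works on already-parsed entries (varint parsing is omitted), and `RawEntry`, `Decoded`, and `DecodeInterval` are hypothetical names. The page does not say how the omitted offset is recovered; the code assumes data blocks are laid out back to back, each followed by a fixed 5-byte trailer, so that the next offset is the previous offset plus the previous size plus the trailer. Treat that constant as an assumption, not as a statement about the file format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Handle {
  uint64_t offset;
  uint64_t size;
};

// Already-parsed form of one index entry.
struct RawEntry {
  size_t shared_size;      // key bytes shared with the previous key
  std::string non_shared;  // remaining key bytes
  uint64_t offset;         // meaningful only when shared_size == 0
  int64_t size_or_delta;   // full size if shared_size == 0, else delta-size
};

struct Decoded {
  std::string key;
  Handle handle;
};

// Assumption: bytes between the end of one data block and the next one.
constexpr uint64_t kAssumedTrailer = 5;

inline std::vector<Decoded> DecodeInterval(const std::vector<RawEntry>& raw) {
  std::vector<Decoded> out;
  for (const RawEntry& e : raw) {
    Decoded d;
    if (e.shared_size == 0) {
      // Head entry: full key and full (offset, size) handle.
      d.key = e.non_shared;
      d.handle = {e.offset, static_cast<uint64_t>(e.size_or_delta)};
    } else {
      assert(!out.empty() && e.shared_size <= out.back().key.size());
      const Decoded& prev = out.back();
      // Key: shared prefix of the previous key plus the non-shared suffix.
      d.key = prev.key.substr(0, e.shared_size) + e.non_shared;
      // Value: size is the previous size plus delta-size.
      d.handle.size = static_cast<uint64_t>(
          static_cast<int64_t>(prev.handle.size) + e.size_or_delta);
      // Offset is derived, not stored (see the assumption above).
      d.handle.offset = prev.handle.offset + prev.handle.size + kAssumedTrailer;
    }
    out.push_back(d);
  }
  return out;
}
```

Only the head entry of the interval carries an explicit offset; every later entry stores a single small signed delta, which is where the space saving comes from.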