Tiered Storage (Experimental)
RocksDB's tiered storage feature can assign data to different types of storage media based on the temperature of the data (how hot it is) within the same DB column family. For example, the user can set the temperature of the last level to cold:
AdvancedColumnFamilyOptions.last_level_temperature = Temperature::kCold
Then the temperature information is passed to FileSystem APIs such as NewRandomAccessFile(), NewWritableFile(), etc. It is up to the user to place each file in its corresponding storage through their own FileSystem implementation, and to use the same temperature information to locate the file in that storage.
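As a rough illustration of the routing decision a custom FileSystem would make, the sketch below maps a temperature to a storage directory. It is deliberately independent of the RocksDB headers: the `Temperature` enum here is a stand-in for the RocksDB one, and the mount points and the `TierDirFor` and `PhysicalPath` helpers are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <string>

// Stand-in for RocksDB's temperature enum so the sketch compiles on its own.
// A real FileSystem implementation would receive the temperature from RocksDB.
enum class Temperature { kUnknown, kHot, kWarm, kCold };

// Hypothetical mount points, one per storage tier.
constexpr const char* kFastTierDir = "/mnt/fast";
constexpr const char* kColdTierDir = "/mnt/cold";

// Returns the directory that should hold a file of the given temperature.
// Only kCold is routed to the cold tier; everything else stays on fast storage.
std::string TierDirFor(Temperature t) {
  return t == Temperature::kCold ? kColdTierDir : kFastTierDir;
}

// Builds the physical path for a file name such as "000123.sst".
// The same function serves writes and reads, so a file is always looked up
// in the tier it was written to.
std::string PhysicalPath(const std::string& file_name, Temperature t) {
  assert(!file_name.empty());
  return TierDirFor(t) + "/" + file_name;
}
```

Using one mapping function for both creating and opening files is what keeps placement and lookup consistent; a real implementation would call something like it from its overrides of NewWritableFile() and NewRandomAccessFile().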
In general, data in the higher levels was written most recently and is more likely to be hot. Higher-level data is also much more likely to go through compaction, so keeping it on faster storage media can speed up compaction.
Currently, only the last level's temperature can be specified, which has limitations. For example, with a skewed data set, the hot data may be compacted frequently and end up in the last level. To prevent that, a per-key hot/cold data-splitting compaction is introduced.
If the data is skewed, or after a major compaction (more likely with universal compaction), recently inserted data may be compacted to the last level, which is stored in the cold storage tier. To prevent that, the user can specify the hot data time range with:
AdvancedColumnFamilyOptions.preclude_last_level_data_seconds = 259200 // 3 days
Data written in the last 3 days will then not be compacted to the last level.
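The arithmetic behind this setting can be sketched as follows. This is not RocksDB code: `HotCutoff` and `IsPrecludedFromLastLevel` are hypothetical helpers, timestamps are assumed to be Unix seconds, and the sketch assumes a write time is never in the future.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kSecondsPerDay = 24 * 60 * 60;

// 3 days expressed in seconds, matching the value used in the option above.
static_assert(3 * kSecondsPerDay == 259200, "3 days should be 259200 seconds");

// Oldest write time that still counts as hot. Clamped to 0 so the unsigned
// subtraction cannot wrap around when the window is longer than `now`.
uint64_t HotCutoff(uint64_t now, uint64_t preclude_seconds) {
  return now > preclude_seconds ? now - preclude_seconds : 0;
}

// True if data written at `write_time` falls inside the hot window and
// should therefore be kept out of the last level.
bool IsPrecludedFromLastLevel(uint64_t write_time, uint64_t now,
                              uint64_t preclude_seconds) {
  assert(write_time <= now);  // sketch assumption: no future timestamps
  return write_time >= HotCutoff(now, preclude_seconds);
}
```

For example, with `now = 1000000` and a 259200-second window, data written 100 seconds ago is inside the window, while data written 259201 seconds ago is not.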
Internally, RocksDB compaction can split hot and cold data during last-level compaction:
A per-key placement is implemented that places data older than now - preclude_last_level_data_seconds in the last level (cold tier) and all other data in the penultimate level (hot tier). RocksDB uses a key's sequence number to estimate its insertion time. Once the feature is enabled, RocksDB samples the mapping from sequence number to time and stores it with the SSTable. Based on that, compaction can estimate when each key was inserted.
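The per-key placement described above can be sketched as a lookup in a sampled sequence-number-to-time table followed by a comparison against the cutoff. This is a minimal model under stated assumptions, not RocksDB's internal implementation: `SeqnoTimeSample`, `EstimateWriteTime`, `OutputLevel`, and `ChooseOutputLevel` are hypothetical names, times are Unix seconds, and the choice to estimate a key's time from the latest sample at or before its sequence number is this sketch's own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One sampled (sequence number, wall-clock time) pair.
struct SeqnoTimeSample {
  uint64_t seqno;
  uint64_t unix_seconds;
};

// Estimates when `seqno` was written: the time of the latest sample whose
// sequence number is <= seqno, or 0 if seqno predates every sample.
// The true write time is at or after the returned value.
// `samples` must be sorted by ascending seqno.
uint64_t EstimateWriteTime(const std::vector<SeqnoTimeSample>& samples,
                           uint64_t seqno) {
  assert(std::is_sorted(samples.begin(), samples.end(),
                        [](const SeqnoTimeSample& a, const SeqnoTimeSample& b) {
                          return a.seqno < b.seqno;
                        }));
  // First sample with a sequence number strictly greater than `seqno`.
  auto it = std::upper_bound(
      samples.begin(), samples.end(), seqno,
      [](uint64_t s, const SeqnoTimeSample& sample) { return s < sample.seqno; });
  if (it == samples.begin()) return 0;  // older than every sample
  return (it - 1)->unix_seconds;
}

enum class OutputLevel { kPenultimate, kLast };

// Data estimated to be older than now - preclude_seconds goes to the last
// level (cold tier); everything else stays in the penultimate level (hot tier).
OutputLevel ChooseOutputLevel(const std::vector<SeqnoTimeSample>& samples,
                              uint64_t seqno, uint64_t now,
                              uint64_t preclude_seconds) {
  const uint64_t cutoff = now > preclude_seconds ? now - preclude_seconds : 0;
  return EstimateWriteTime(samples, seqno) < cutoff ? OutputLevel::kLast
                                                    : OutputLevel::kPenultimate;
}
```

Because the table is only sampled, the estimate is approximate: a key written just before the next sample is attributed the earlier sample's time. A denser sample set narrows that error at the cost of storing more pairs per SSTable.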