-
Notifications
You must be signed in to change notification settings - Fork 6.4k
Reducing memcpy overhead when using Iterators
In certain scenarios the user may need to Iterate over range of keys and keep them in memory to process them. A simple example could be something like this
ReadOptions ro;
Iterator* iter = db_->NewIterator(ro);
// Get the keys from the DB
std::vector<std::string> db_keys;
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
db_keys.push_back(iter->key().ToString());
}
// Process the keys (in this case we simply sort them)
auto key_comparator = [](const std::string& k1, const std::string& k2) {
if (k1.size() == k2.size()) {
return k1 < k2;
}
return k1.size() < k2.size();
};
std::sort(db_keys.begin(), db_keys.end(), key_comparator);
for (size_t i = 0; i < db_keys.size(); i++) {
// Use processed keys
}
delete iter;
In this example we simply load keys from the DB into memory, sort them using a comparator that is different from DB comparator and then use the sorted keys.
The issue with this approach is in this line
db_keys.push_back(iter->key().ToString());
If our keys are huge the cost of copying the key from RocksDB into our std::vector will be significant and we cannot escape this overhead since iter->key() Slice will be invalid the moment our Iterator move forward when iter->Next() is called.
We have introduced a new option for Iterators, ReadOptions::pin_data. When setting this option to true, RocksDB Iterator will pin the data blocks and guarantee that the Slice returned by Iterator::key() will be valid as long as the Iterator is not deleted.
ReadOptions ro;
// Tell RocksDB to keep the key Slice valid as long as
// the Iterator is not deleted
ro.pin_data = true;
Iterator* iter = db_->NewIterator(ro);
// Get the keys from the DB
std::vector<Slice> db_keys;
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
std::string is_key_pinned;
iter->GetProperty("rocksdb.iterator.is-key-pinned", &is_key_pinned);
assert(is_key_pinned == "1");
// iter->key() Slice will be valid as long as
// the Iterator is not deleted
db_keys.push_back(iter->key());
}
// Process the keys (in this case we simply sort them)
auto key_comparator = [](const Slice& k1, const Slice& k2) {
if (k1.size() == k2.size()) {
return k1.compare(k2) < 0;
}
return k1.size() < k2.size();
};
std::sort(db_keys.begin(), db_keys.end(), key_comparator);
for (size_t i = 0; i < db_keys.size(); i++) {
// Use processed keys
}
delete iter;
After setting ReadOptions::pin_data to true, now we can use Iterator::key() Slice without copying it
db_keys.push_back(iter->key());
Right now to use this feature, RocksDB must be created using BlockBased table with BlockBasedTableOptions::use_delta_encoding set to false.
To verify that the current key Slice is pinned and will be valid as long as the Iterator is not deleted,
We can check rocksdb.iterator.is-key-pinned
Iterator property and assert that it's equal to 1
std::string is_key_pinned;
iter->GetProperty("rocksdb.iterator.is-key-pinned", &is_key_pinned);
assert(is_key_pinned == "1");
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc