Accelerate hash table iterator with prefetching #1501
+193 −79
Batch Iterator
This PR introduces improvements to the hashtable iterator, implementing the advanced prefetching technique described in the blog post Unlock One Million RPS - Part 2. The changes lay the groundwork for further enhancements in use cases involving iterators; future PRs will build on this foundation to improve performance and functionality in various iterator-dependent operations.
Implementation
The core of this improvement is the new `hashtableNext` function, which implements an optimized batch iterator for hashtable traversal. It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads. Instead, this approach leverages CPU-level parallelism and cache efficiency. Here's how it works: the iterator initializes `HASHTABLE_ITER_WIDTH` threads, each starting in the INIT state. Each `hashtableNext` invocation advances the state machine for these threads in a round-robin fashion until it finds a thread in the READY state (which means an entry was found and prefetched). Key optimization: by the time a thread reaches the READY state and needs to return an entry or access a bucket, that memory is already in the cache, minimizing memory access latency. The states are:
- `INIT`
- `PREFETCH`
- `READY`
- `FINISHED`
The state machine for each thread follows this diagram:
Performance
The data below was collected by running the `keys` command on a 64-core Graviton 3 Amazon EC2 instance holding 50 million keys of 100 bytes each. The duration of the `keys *` command was taken from the output of the `info all` command.
Save command improvement
Setup:
Results