
Accelerate hash table iterator with prefetching #1501

Open · NadavGigi wants to merge 1 commit into unstable from batch_iterator

Conversation

@NadavGigi commented Jan 1, 2025

Batch Iterator

This PR introduces improvements to the hashtable iterator, implementing the advanced prefetching technique described in the blog post Unlock One Million RPS - Part 2. The changes lay the groundwork for further enhancements in use cases involving iterators; future PRs will build on this foundation to improve performance and functionality in various iterator-dependent operations.

Implementation

The core of this improvement is the new hashtableNext function, which implements an optimized batch iterator for hashtable traversal. It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads; the approach leverages CPU-level parallelism and cache efficiency. Here's how it works:

The iterator initializes HASHTABLE_ITER_WIDTH threads, each starting in the INIT state. Each hashtableNext invocation advances the state machines for these threads in a round-robin fashion until it finds a thread in the READY state (meaning an entry was found and prefetched). Key optimization: by the time a thread reaches the READY state and needs to return an entry or access a bucket, that data is already in the cache, minimizing memory access latency. The states are as follows (a simplified code sketch follows the diagram below):

INIT
  • Moves to the next bucket and prefetches it
  • Transitions to PREFETCH state
PREFETCH
  • Prefetches the entries in the current bucket
  • Brings data into the cache for future use
  • Transitions to READY state
READY
  • Searches for a filled position in the current bucket
  • Data is likely already in the cache due to the previous PREFETCH
  • If found, returns the entry and advances its position
  • If not found, moves to the next bucket in the chain, or back to INIT
FINISHED
  • Skipped, as this thread has completed its portion

The state machine for each thread follows this diagram:

       (empty bucket)
    +-------------------------+
    |                         |         (all entries in
    v     (new bucket found)  |          bucket prefetched)
+--------+                +------------+                +---------+
|  INIT  | -------------->|  PREFETCH  | -------------> |  READY  |
+--------+                +------------+                +---------+
  |   ^                          ^                           |
  |   |                          |                           |
  |   |                          |  (chained                 |
  |   |                          |   bucket)                 |
  |   |                          |                           |
  |   +--------------------------+---------------------------+
  |   (find next
  |     bucket in table)
  |         
  |  
  v   
+----------+
| FINISHED |
+----------+
(no more buckets)
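
For illustration, here is a minimal, self-contained sketch of such a round-robin state machine. Every name in it (batchIter, iterSlot, bucket, ITER_WIDTH) is a hypothetical stand-in, not the actual code in src/hashtable.c, and the 7-entry bucket layout is assumed for the example:

```c
#include <stddef.h>

#define ITER_WIDTH 8 /* hypothetical stand-in for HASHTABLE_ITER_WIDTH */

typedef enum { STATE_INIT, STATE_PREFETCH, STATE_READY, STATE_FINISHED } slotState;

typedef struct bucket {
    void *entries[7];    /* assumed bucket layout: 7 entry slots */
    struct bucket *next; /* chained bucket, if any */
} bucket;

typedef struct {
    slotState state; /* zero-initialized slots start in STATE_INIT */
    bucket *b;       /* bucket this slot is working on */
    int pos;         /* next position to scan within the bucket */
} iterSlot;

typedef struct {
    bucket **table;
    size_t num_buckets;
    size_t next_bucket;  /* next top-level bucket to hand out */
    size_t cursor;       /* round-robin position among the slots */
    size_t num_finished; /* slots that have reached FINISHED */
    iterSlot slots[ITER_WIDTH];
} batchIter;

/* Advance the per-slot state machines round-robin until one of them can
 * return an entry, or every slot has FINISHED. */
void *batchIterNext(batchIter *it) {
    while (it->num_finished < ITER_WIDTH) {
        iterSlot *s = &it->slots[it->cursor];
        it->cursor = (it->cursor + 1) % ITER_WIDTH;
        switch (s->state) {
        case STATE_INIT:
            if (it->next_bucket >= it->num_buckets) {
                s->state = STATE_FINISHED; /* no more buckets to claim */
                it->num_finished++;
            } else {
                s->b = it->table[it->next_bucket++];
                s->pos = 0;
                __builtin_prefetch(s->b); /* GCC/Clang builtin: start pulling the bucket into cache */
                s->state = STATE_PREFETCH;
            }
            break;
        case STATE_PREFETCH:
            /* Prefetch the entries themselves so READY finds them cached. */
            for (int i = 0; i < 7; i++)
                if (s->b->entries[i]) __builtin_prefetch(s->b->entries[i]);
            s->state = STATE_READY;
            break;
        case STATE_READY:
            while (s->pos < 7) {
                void *e = s->b->entries[s->pos++];
                if (e) return e; /* likely a cache hit by now */
            }
            if (s->b->next) { /* chained bucket: go prefetch it */
                s->b = s->b->next;
                s->pos = 0;
                __builtin_prefetch(s->b);
                s->state = STATE_PREFETCH;
            } else {
                s->state = STATE_INIT; /* claim a fresh top-level bucket */
            }
            break;
        case STATE_FINISHED:
            break; /* skipped: this slot has completed its portion */
        }
    }
    return NULL; /* table exhausted */
}
```

Each call either returns one already-prefetched entry or NULL once every slot has finished, roughly matching the hashtableNext behavior described above.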

Performance

The data below was collected by running the KEYS command on a 64-core Graviton 3 Amazon EC2 instance holding 50 million keys of 100 bytes each. The duration of the “KEYS *” command was taken from the output of the “INFO ALL” command.

+--------------------+------------------+-----------------------------+
| Implementation     | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| Iterator without   | 11.112279        | 4,499,529                   |
|    prefetching     |                  |                             |
| 1 Thread           | 4.341916         | 11,515,500                  |
| 2 Threads          | 3.469910         | 14,409,800                  |
| 3 Threads          | 3.387153         | 14,761,300                  |
| 4 Threads          | 3.357078         | 14,893,700                  |
| 5 Threads          | 3.421603         | 14,613,200                  |
| 6 Threads          | 3.336432         | 14,985,700                  |
| 7 Threads          | 3.439140         | 14,538,600                  |
| 8 Threads          | 3.359806         | 14,881,300                  |
+--------------------+------------------+-----------------------------+
Improvement:
Comparing the iterator without prefetching to the batch iterator (6 threads),
throughput improves by 14,985,700 / 4,499,529 ≈ 3.33, i.e. about 3.33 times faster.

SAVE command improvement

Setup:

  • 64-core Graviton 3 Amazon EC2 instance.
  • 50 million keys of 100 bytes each.
  • Valkey server running on a RAM file system.
  • CRC checksum and compression disabled.

Results

+--------------------+------------------+-----------------------------+
| Implementation     | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| Iterator without   | 28               | 1,785,700                   |
|    prefetching     |                  |                             |
| 6 Threads          | 20               | 2,500,000                   |
+--------------------+------------------+-----------------------------+
Improvements:
- Reduced SAVE time by 28.57% (8 seconds faster)
- Increased key processing rate by 40% (714,300 more keys/second)

@NadavGigi NadavGigi changed the title Improving iterator using prefetch Accelerate hash table iterator with prefetching Jan 1, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from e001ab1 to ae465ad Compare January 2, 2025 10:44
@ranshid ranshid requested a review from uriyage January 2, 2025 16:26
@madolson (Member) commented Jan 2, 2025

How does this compare to having an iterator that actually returns a batch of items? Something like:

```c
void **entries;
size_t num_entries;
entries = getBatchEntries(iterator, &num_entries);
if (entries) {
    for (size_t i = 0; i < num_entries; i++) {
        whatever(entries[i]);
    }
}
```

I generally prefer to avoid manually executing prefetching when we can just efficiently process the data, as we then give more hints to the compiler and the processor so it can efficiently do its own re-ordering and prefetching.
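
For comparison, here is a minimal sketch of a batch-returning iterator of the shape suggested above. It assumes the same hypothetical bucket layout as the earlier sketch; this getBatchEntries variant fills a caller-provided array and returns the count, and it deliberately issues no prefetch hints, leaving reordering and prefetching to the compiler and CPU:

```c
/* Hypothetical batch iterator state: scans buckets sequentially and
 * remembers where it stopped between calls. Reuses the `bucket` type
 * from the sketch above. */
typedef struct {
    bucket **table;
    size_t num_buckets;
    size_t next_bucket; /* next top-level bucket to scan */
    bucket *cur;        /* bucket currently being scanned, or NULL */
    int pos;            /* next position within cur */
} simpleIter;

/* Fill `out` with up to `max` entries; returns how many were written.
 * The plain sequential scan gives the hardware prefetcher a predictable
 * access pattern to work with. */
size_t getBatchEntries(simpleIter *it, void **out, size_t max) {
    size_t n = 0;
    while (n < max) {
        if (it->cur == NULL) {
            if (it->next_bucket >= it->num_buckets) break; /* table exhausted */
            it->cur = it->table[it->next_bucket++];
            it->pos = 0;
        }
        if (it->pos == 7) { /* end of bucket: follow the chain */
            it->cur = it->cur->next;
            it->pos = 0;
            continue;
        }
        void *e = it->cur->entries[it->pos++];
        if (e != NULL) out[n++] = e;
    }
    return n;
}
```

The caller would then process each returned batch in a tight loop, as in the snippet above.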

@madolson (Member) commented Jan 2, 2025

> It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads.

Then don't name them threads; it makes the implementation much harder to follow.

@NadavGigi NadavGigi closed this Jan 5, 2025
@NadavGigi NadavGigi reopened this Jan 5, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from 05d93e2 to 86230a2 Compare January 5, 2025 11:08