
Accelerate hash table iterator with prefetching #1501

Open · NadavGigi wants to merge 1 commit into unstable from batch_iterator

Conversation

@NadavGigi commented Jan 1, 2025

Batch Iterator

This PR introduces improvements to the hashtable iterator, implementing the advanced prefetching technique described in the blog post Unlock One Million RPS - Part 2. The changes lay the groundwork for further enhancements in use cases involving iterators; future PRs will build on this foundation to improve performance and functionality in various iterator-dependent operations.

Implementation

The core of this improvement is the new hashtableNext function, which implements an optimized batch iterator for hashtable traversal. It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads; the approach leverages CPU-level parallelism and cache efficiency. Here's how it works:

The iterator initializes HASHTABLE_ITER_WIDTH threads, each starting in the INIT state. Each hashtableNext invocation advances the state machines for these threads in a round-robin fashion until it finds a thread in the READY state (meaning an entry was found and prefetched). Key optimization: by the time a thread reaches the READY state and needs to return an entry or access a bucket, that data is already in the cache, minimizing memory access latency. The states are as follows (a simplified code sketch follows the diagram below):

INIT
  • Moves to the next bucket and prefetches it
  • Transitions to PREFETCH state
PREFETCH
  • Prefetches the entries in the current bucket
  • Brings data into the cache for future use
  • Transitions to READY state
READY
  • Searches for a filled position in the current bucket
  • Data is likely already in the cache due to the previous PREFETCH
  • If found, returns the entry and advances its position
  • If not found, moves to the next bucket in the chain, or back to INIT
FINISHED
  • Skipped, as this thread has completed its portion

The state machine for each thread follows this diagram:

       (empty bucket)
    +-------------------------+
    |                         |         (all entries in
    v     (new bucket found)  |          bucket prefetched)
+--------+                +------------+                +---------+
|  INIT  | -------------->|  PREFETCH  | -------------> |  READY  |
+--------+                +------------+                +---------+
  |   ^                          ^                           |
  |   |                          |                           |
  |   |                          |  (chained                 |
  |   |                          |   bucket)                 |
  |   |                          |                           |
  |   +--------------------------+---------------------------+
  |   (find next
  |     bucket in table)
  |         
  |  
  v   
+----------+
| FINISHED |
+----------+
(no more buckets)
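
For illustration, here is a minimal, self-contained sketch of such a round-robin state machine. Every name in it (batchIter, iterSlot, bucket, ITER_WIDTH) is a hypothetical stand-in, not the actual code in src/hashtable.c, and the 7-entry bucket layout is assumed for the example:

```c
#include <stddef.h>

#define ITER_WIDTH 8 /* hypothetical stand-in for HASHTABLE_ITER_WIDTH */

typedef enum { STATE_INIT, STATE_PREFETCH, STATE_READY, STATE_FINISHED } slotState;

typedef struct bucket {
    void *entries[7];    /* assumed bucket layout: 7 entry slots */
    struct bucket *next; /* chained bucket, if any */
} bucket;

typedef struct {
    slotState state; /* zero-initialized slots start in STATE_INIT */
    bucket *b;       /* bucket this slot is working on */
    int pos;         /* next position to scan within the bucket */
} iterSlot;

typedef struct {
    bucket **table;
    size_t num_buckets;
    size_t next_bucket;  /* next top-level bucket to hand out */
    size_t cursor;       /* round-robin position among the slots */
    size_t num_finished; /* slots that have reached FINISHED */
    iterSlot slots[ITER_WIDTH];
} batchIter;

/* Advance the per-slot state machines round-robin until one of them can
 * return an entry, or every slot has FINISHED. */
void *batchIterNext(batchIter *it) {
    while (it->num_finished < ITER_WIDTH) {
        iterSlot *s = &it->slots[it->cursor];
        it->cursor = (it->cursor + 1) % ITER_WIDTH;
        switch (s->state) {
        case STATE_INIT:
            if (it->next_bucket >= it->num_buckets) {
                s->state = STATE_FINISHED; /* no more buckets to claim */
                it->num_finished++;
            } else {
                s->b = it->table[it->next_bucket++];
                s->pos = 0;
                __builtin_prefetch(s->b); /* GCC/Clang builtin: start pulling the bucket into cache */
                s->state = STATE_PREFETCH;
            }
            break;
        case STATE_PREFETCH:
            /* Prefetch the entries themselves so READY finds them cached. */
            for (int i = 0; i < 7; i++)
                if (s->b->entries[i]) __builtin_prefetch(s->b->entries[i]);
            s->state = STATE_READY;
            break;
        case STATE_READY:
            while (s->pos < 7) {
                void *e = s->b->entries[s->pos++];
                if (e) return e; /* likely a cache hit by now */
            }
            if (s->b->next) { /* chained bucket: go prefetch it */
                s->b = s->b->next;
                s->pos = 0;
                __builtin_prefetch(s->b);
                s->state = STATE_PREFETCH;
            } else {
                s->state = STATE_INIT; /* claim a fresh top-level bucket */
            }
            break;
        case STATE_FINISHED:
            break; /* skipped: this slot has completed its portion */
        }
    }
    return NULL; /* table exhausted */
}
```

Each call either returns one already-prefetched entry or NULL once every slot has finished, roughly matching the hashtableNext behavior described above.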

Performance

The data below was collected by running the KEYS command on a 64-core Graviton 3 Amazon EC2 instance holding 50 million keys of 100 bytes each. The duration of the “KEYS *” command was taken from the output of the “INFO ALL” command.

+--------------------+------------------+-----------------------------+
| Implementation     | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| Iterator without   | 11.112279        | 4,499,529                   |
|    prefetching     |                  |                             |
| 1 Thread           | 4.341916         | 11,515,500                  |
| 2 Threads          | 3.469910         | 14,409,800                  |
| 3 Threads          | 3.387153         | 14,761,300                  |
| 4 Threads          | 3.357078         | 14,893,700                  |
| 5 Threads          | 3.421603         | 14,613,200                  |
| 6 Threads          | 3.336432         | 14,985,700                  |
| 7 Threads          | 3.439140         | 14,538,600                  |
| 8 Threads          | 3.359806         | 14,881,300                  |
+--------------------+------------------+-----------------------------+
Improvement:
Comparing the iterator without prefetching to the batch iterator (6 threads),
throughput improves by 14,985,700 / 4,499,529 ≈ 3.33, i.e. about 3.33 times faster.

SAVE command improvement

Setup:

  • 64-core Graviton 3 Amazon EC2 instance.
  • 50 million keys of 100 bytes each.
  • Valkey server running on a RAM file system.
  • CRC checksum and compression disabled.

Results

+--------------------+------------------+-----------------------------+
| Implementation     | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| Iterator without   | 28               | 1,785,700                   |
|    prefetching     |                  |                             |
| 6 Threads          | 20               | 2,500,000                   |
+--------------------+------------------+-----------------------------+
Improvements:
- Reduced SAVE time by 28.57% (8 seconds faster)
- Increased key processing rate by 40% (714,300 more keys/second)

@NadavGigi NadavGigi changed the title Improving iterator using prefetch Accelerate hash table iterator with prefetching Jan 1, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from e001ab1 to ae465ad Compare January 2, 2025 10:44
@ranshid ranshid requested a review from uriyage January 2, 2025 16:26
@madolson (Member) commented Jan 2, 2025

How does this compare to having an iterator that actually returns a batch of items? Something like:

```c
void **entries;
size_t num_entries;
entries = getBatchEntries(iterator, &num_entries);
if (entries) {
    for (size_t i = 0; i < num_entries; i++) {
        whatever(entries[i]);
    }
}
```

I generally prefer to avoid manually executing prefetching when we can just efficiently process the data, as we then give more hints to the compiler and the processor so it can efficiently do its own re-ordering and prefetching.
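
For comparison, here is a minimal sketch of a batch-returning iterator of the shape suggested above. It assumes the same hypothetical bucket layout as the earlier sketch; this getBatchEntries variant fills a caller-provided array and returns the count, and it deliberately issues no prefetch hints, leaving reordering and prefetching to the compiler and CPU:

```c
/* Hypothetical batch iterator state: scans buckets sequentially and
 * remembers where it stopped between calls. Reuses the `bucket` type
 * from the sketch above. */
typedef struct {
    bucket **table;
    size_t num_buckets;
    size_t next_bucket; /* next top-level bucket to scan */
    bucket *cur;        /* bucket currently being scanned, or NULL */
    int pos;            /* next position within cur */
} simpleIter;

/* Fill `out` with up to `max` entries; returns how many were written.
 * The plain sequential scan gives the hardware prefetcher a predictable
 * access pattern to work with. */
size_t getBatchEntries(simpleIter *it, void **out, size_t max) {
    size_t n = 0;
    while (n < max) {
        if (it->cur == NULL) {
            if (it->next_bucket >= it->num_buckets) break; /* table exhausted */
            it->cur = it->table[it->next_bucket++];
            it->pos = 0;
        }
        if (it->pos == 7) { /* end of bucket: follow the chain */
            it->cur = it->cur->next;
            it->pos = 0;
            continue;
        }
        void *e = it->cur->entries[it->pos++];
        if (e != NULL) out[n++] = e;
    }
    return n;
}
```

The caller would then process each returned batch in a tight loop, as in the snippet above.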

@madolson (Member) commented Jan 2, 2025

> It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads.

Then don't name them threads; it makes the implementation much harder to follow.

@NadavGigi NadavGigi closed this Jan 5, 2025
@NadavGigi NadavGigi reopened this Jan 5, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from 05d93e2 to 86230a2 Compare January 5, 2025 11:08