v2.1: rpc: improve latency by not blocking worker threads polling IO notifications (backport of #3242) #4412

mergify · 2025-01-11T03:21:08Z

Problem

Some RPC operations are CPU-bound and run for a significant amount of time. They block worker threads that are also used to handle IO notifications, so notifications are not polled often enough and the whole RPC server can become slow and exhibit high latency. When latency gets high enough, it can exceed request timeouts, leading to failed requests.

Summary of Changes

This PR makes some of the most CPU-expensive RPC methods use tokio::task::spawn_blocking to run their CPU-hungry code. This way the worker threads doing IO don't get blocked, and latency improves.

The methods changed so far include:

getMultipleAccounts
getProgramAccounts
getAccountInfo
getTokenAccountsByDelegate
getTokenAccountsByOwner
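
To illustrate the idea behind moving CPU-bound work off the IO worker threads, here is a minimal std-only sketch. It is not the PR's code: the PR uses tokio::task::spawn_blocking, while this model uses a plain thread and a channel so that it runs without any external crates. The names `expensive_account_scan` and `offload` are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a CPU-bound RPC handler body (hypothetical, not from the PR).
fn expensive_account_scan(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

/// Run `work` on a separate thread and return a receiver for its result,
/// so the caller (the "IO worker" in this model) stays free to keep polling.
fn offload<T, F>(work: F) -> mpsc::Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error if the receiver was dropped before we finished.
        let _ = tx.send(work());
    });
    rx
}

fn main() {
    let pending = offload(|| expensive_account_scan(1_000_000));

    // The calling thread is not blocked here; it can keep servicing other work.
    let mut polls = 0u32;
    let result = loop {
        match pending.try_recv() {
            Ok(value) => break value,
            Err(mpsc::TryRecvError::Empty) => {
                polls += 1;
                thread::yield_now();
            }
            Err(mpsc::TryRecvError::Disconnected) => panic!("worker thread died"),
        }
    };
    println!("result = {result}, polled {polls} times while waiting");
}
```

Unlike this sketch, which spawns a fresh thread per call, tokio runs spawn_blocking closures on a dedicated, size-limited pool of blocking threads.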

I'm not super familiar with RPC, so I've changed the methods that, from reading the code, seem to load or copy a lot of data. Please feel free to suggest more!

Test plan

Methodology for selection of CPU defaults

Run this blocks benchmark script while tweaking CPU params. This was run on a 48 CPU machine.

`rpc_threads`	`rpc_blocking_threads`	Average	Median	p90	p99
cpus	cpus / 2	21880	22136	22546	22572
cpus	cpus / 4	20617 ($${\color{green}-5.7\%}$$)	20627 ($${\color{green}-6.8\%}$$)	21040 ($${\color{green}-6.7\%}$$)	21149 ($${\color{green}-6.3\%}$$)
cpus	cpus / 8	21366 ($${\color{green}-2.4\%}$$)	21367 ($${\color{green}-3.8\%}$$)	21434 ($${\color{green}-4.9\%}$$)	21477 ($${\color{green}-4.9\%}$$)
cpus / 2	cpus / 2	21642 ($${\color{green}-1.1\%}$$)	21525 ($${\color{green}-2.8\%}$$)	23202 ($${\color{red}+2.9\%}$$)	23235 ($${\color{red}+2.9\%}$$)
cpus / 2	cpus / 4	20033 ($${\color{green}-8.4\%}$$)	20044 ($${\color{green}-9.4\%}$$)	20430 ($${\color{green}-9.4\%}$$)	20598 ($${\color{green}-8.7\%}$$)
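
As a rough sketch of how a `cpus / 4` default for `rpc_blocking_threads` could be derived from the machine's CPU count: the helper name `default_rpc_blocking_threads` and the floor of one thread are assumptions made for illustration, not taken from the PR.

```rust
use std::thread;

/// Hypothetical helper: derive the blocking-pool size from the CPU count.
/// The `cpus / 4` ratio comes from the benchmark table above; clamping to at
/// least 1 is an assumption so that small machines still get one thread.
fn default_rpc_blocking_threads(cpus: usize) -> usize {
    (cpus / 4).max(1)
}

fn main() {
    // Fall back to 1 if the platform cannot report its parallelism.
    let cpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!(
        "cpus = {cpus}, rpc_blocking_threads = {}",
        default_rpc_blocking_threads(cpus)
    );
}
```

On the 48 CPU machine used for the table, this helper would yield 12 blocking threads.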

Methodology

Using this script for computing metrics: https://gist.github.com/steveluscher/b4959b9601093b0009f1d7646217b030, ran each of these account-cluster-bench suites before and after this PR:

account-info
block
blocks
first-available-block
multiple-accounts
slot
supply
token-accounts-by-delegate
token-accounts-by-owner
token-supply
transaction
transaction-parsed
version

Using a command similar to this:

 % (
       bash -c 'while ! curl -s localhost:8899/health | grep -q "ok"; do echo "Waiting for validator" && sleep 1; done;' \
           ${IFS# Set this higher if you want the test to run with more blocks having been committed } \
           && sleep 15 \
           && echo "Running bench" \
           && cd accounts-cluster-bench \
           && cargo run --release -- \
               -u l \
               --identity ~/.config/solana/id.json \
               ${IFS# Optional for benches that require token accounts} \
               ${IFS# https://gist.github.com/steveluscher/19261b5321f56a89dc75804070b61dc4} \
               ${IFS# --mint UhrKsjtPJJ8ndhSdrcCbQaiw8L8a6gH1FbmtJ4XpVJR } \
               --iterations 100 \
               --num-rpc-bench-threads 100 \
               --rpc-bench supply 2>&1 \
           | grep -Po "Supply average success_time: \K(\d+)" \
           | ~/stats.sh 
   ) \
       & (
           (cd accounts-cluster-bench && cargo build --release) \
           && (
               cd validator \
                   && rm -rf test-ledger/ \
                   && cargo run --release \
                       --manifest-path=./Cargo.toml \
                       --bin solana-test-validator -- \
                           ${IFS# Put this in ~/fixtures/ } \
                           ${IFS# https://gist.github.com/steveluscher/19261b5321f56a89dc75804070b61dc4 } \
                           --account-dir ~/fixtures \
                           --quiet
               ) \
       )
Average: 34293.3
Median: 31708.5
90th Percentile: 44640
99th Percentile: 45166

Note

You can adjust the sleep 15 if you want the validator to stack up more slots before starting the bench.

Warning

When running benches that require token accounts, supply a mint, space, and actually create the token account using the fixture found here.

Results

Warning

These results are a little messed up, because what's actually happening here is that the benchmark script is spitting out averages in 3s windows. The avg/p50/p90/p99 of those numbers is what you're seeing in this table. Not correct, but directionally correct.
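
To see why statistics over per-window averages are only directionally correct, here is a small sketch with made-up latencies. The numbers and window sizes are invented for illustration and are not benchmark data.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    // Made-up latencies, grouped into two reporting windows of unequal size.
    let window_a = [10.0, 10.0, 10.0, 10.0]; // busy window, fast requests
    let window_b = [100.0]; // quiet window, one slow request

    // Mean over every individual sample: 140 / 5.
    let all: Vec<f64> = window_a.iter().chain(window_b.iter()).copied().collect();
    let true_mean = mean(&all);

    // Mean over the two window means: (10 + 100) / 2.
    let mean_of_means = mean(&[mean(&window_a), mean(&window_b)]);

    println!("true mean = {true_mean}, mean of window means = {mean_of_means}");
}
```

Each window counts equally no matter how many requests it contains, so a quiet window with a few slow requests is over-weighted. Percentiles computed over window averages are affected in the same way.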

Note

Filling in this grid would take a long time, especially if run against a mainnet RPC with production traffic. We may just choose to land this as ‘certainly better, how much we can't say exactly.’

Suite	Average	Median	p90	p99
`account-info`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`block`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`blocks`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`first-available-block`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`multiple-accounts`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`slot`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`supply`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`token-accounts-by-delegate`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`token-accounts-by-owner`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`token-supply`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`transaction`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`transaction-parsed`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)
`version`	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)	TBD ($${\color{green}-99\%}$$)

…cations (#3242)

* rpc: limit the number of blocking threads in tokio runtime

  By default tokio allows up to 512 blocking threads. We don't want that many threads, as they'd slow down other validator threads.

* rpc: make getMultipleAccounts async

  Make the function async and use tokio::task::spawn_blocking() to execute CPU-bound code in background. This prevents stalling the worker threads polling IO notifications and serving other non CPU-bound rpc methods.

* rpc: make getAccount async

* rpc: run get_filtered_program_accounts with task::spawn_blocking

  get_filtered_program_accounts can be used to retrieve _a list_ of accounts that match some filters. This is CPU bound and can block the calling thread for a significant amount of time when copying many/large accounts.

* rpc: use our custom runtime to spawn blocking tasks

  Pass the custom runtime to JsonRpcRequestProcessor and use it to spawn blocking tasks from rpc methods.

* Make `get_blocks()` and `get_block()` yieldy

  When these methods reach out to Blockstore, yield the thread

* Make `get_supply()` yieldy

  When this method reaches out to accounts_db (through a call to `calculate_non_circulating_supply()`), yield the thread.

* Make `get_first_available_block()` yieldy

  When this method reaches out to blockstore, yield the thread

* Make `get_transaction()` yieldy

  When this method reaches out to blockstore, yield the thread

* Make `get_token_supply()` yieldy

  When this method reaches out to methods on bank that do reads, yield the thread

* Make the choice of `cpus / 4` as the default for `rpc_blocking_threads`

* Encode blocks async

* Revert "Make `get_first_available_block()` yieldy"

  This blockstore method doesn't actually do expensive reads.

  This reverts commit 3bbc57f.

* Revert "Make `get_blocks()` and `get_block()` yieldy"

  Kept the `spawn_blocking` around:

  * Call to `get_rooted_block`
  * Call to `get_complete_block`

  This reverts commit 710f9c6.

* Revert "Make `get_token_supply()` yieldy"

  * Reverted the change to `interest_bearing_config`
  * Reverted moving `bank.get_account(&mint)` to the background pool

  This reverts commit 02f5c94.

* Share spawned call to `calculate_non_circulating_supply` between `get_supply` and `get_largest_accounts`

* Create a shim for `get_filtered_indexed_accounts` that sends the work to the background thread internally

* Send call to `get_largest_accounts` to the background pool

---------

Co-authored-by: Steven Luscher <[email protected]>
(cherry picked from commit c6f3e1b)

mergify · 2025-01-11T03:21:43Z

If this PR represents a change to the public RPC API:

Make sure it includes a complementary update to rpc-client/ (example)
Open a follow-up PR to update the JavaScript client @solana/web3.js (example)

Thank you for keeping the RPC clients in sync with the server API @mergify[bot].

mergify bot requested a review from a team as a code owner January 11, 2025 03:21

mergify bot assigned alessandrod Jan 11, 2025
