-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make transaction status service multi-threaded. #4032
base: master
Are you sure you want to change the base?
Conversation
Thanks for looking at this! Haven't done a proper review yet, but skimming through the code, it looks like it would parallelize well using rayon instead? |
Thanks for the feedback Alessandro. That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome. One follow-up, mostly because I am not super familiar with how we manage this at large on Agave, we should do it with a private rayon thread pool that's still capped, rather than the global rayon pool. I am worried about introducing other perf variations and resource starvation with tapping the global pool. Is that what you have in mind? |
This is a tricky one because the global rayon pool is actually 99.9% unused. But it does have a bajillion threads (num_cpus() I think?), so we should be careful to not crank it too hard. Between adding another pool and using the global one I'd vote for using the global one, and then hopefully someone makes that pool smaller. |
+1 to using the global pool |
I'll counter that I think this one may deserve a dedicated pool. I view the global pool as something to be used for startup operations, or for "occasional work". I know the incoming transactions are "bursty" so this isn't a perfect analogy, but at 10k TPS, that is 1 transaction coming in every 100us. Every 10us for 100k TPS. So, I think this work would keep the global pool pretty busy. |
Just want to make sure I follow, you're saying things get worse when running with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking at this! Haven't done a proper review yet, but skimming through the code, it looks like it would parallelize well using rayon instead?
That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome
Do you know where we're spending time ? Is it in the transaction-status-preparation, or in the actual commit to Blockstore
(blockstore.write_batch()
) ?
The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support). So this aligns with third graph showing memory activity and spikes and logs on agave nodes. This also aligns with the processing work for tx notifier happening unconditionally, and only items that involve writing to blockstore (rpc tx history, and ext metadata) being wrapped in conditional checks. In my experiments (and default config for internal tests that originally OOOM), both rpc tx history and extended metadata are enabled. I disabled them together for one experiment to get a sense of how much worse they make the situation when added, but I have not tested with rpc history enabled and ext metadata disabled. I expect that to be a no-op here bc the only place in this service it would matter in reducing work is cancelled out by tx notify support. |
All - I updated the review with using a local thread pool of 4 threads, while the global vs local discussion continues. I am still not convinced that giving TSS unrestricted access to the global thread pool is a good idea. I am worried about the case of just using all threads almost all the time (as we increase TPS). For example, with 4 threads, the queue can still have 500-1000 pending items, but it's not an issue yet (maybe it will later on the road to 1M). Yet in a global pool, they would still be processed sooner because they can. Also a 4-thread cap is a forcing function to revisit this if it's explicitly flagged in future debug as this service is falling behind and cant keep up at TPS rate X. Without it, TSS could starve other global pool users for threads, and we have to start the debug from some downstream manifestation of the issue. |
Yup, you're correct that enabling geyser on the validator will also result in this service getting spun up. I didn't see mention of geyser in the PR description, and I was acting under the assumption (possibly incorrectly) that these nodes did not have a geyser plugin enabled. |
Ah, good call. You are right. I missed this at first and just checked my running setup. The notifiers are not enabled.
|
It's the tx processing and constructing the write batch. The following data was collected running a local cluster test on my mac and filtered out to only count tx batches with 64 tx, which provided 278 data points. Sample size is small but it should be directionally correct to answer the question.
|
While your conclusion matches my assumption, I'm hesitant to draw any strong performance conclusions. Probably running thousands of threads on a dozen or so cores and getting thrashed to hell. Would be good to run on a dev server and make sure split times still hold. |
I re-ran both Linux and mac in release build (bc my earlier mac data was debug build by accident..). The conclusion holds that processing and batching dominates over the batch write step. But MBP with M3 max and 64 GB RAM outperforming devserver with AMD EPYC 7513 (64 vCores) and 512 GB RAM caught me by surprise. So I kept all the data together here for review.
|
No worries. So given your earlier comment
Have you seen the node OOM without these |
No, not OOM. I updated the wording to be more clear. If you look at the third graph in the PR description, that's with both those flags disabled and there was still a steep climb in mem used to 80-120GB that seem to have self-corrected back down. It was a one-off run. I have not looked into the reason for the spikes as the OOM was the main thing I was chasing. |
All - I have addressed all review feedback and questions. I also combined both requests for using of global rayon thread pool and a more cooperative sleep of 50 ms (friendlier service and concern mitigation for tapping global thread pool) in latest patch. with that, TSS backlog never builds to more that 4000 messages (@ 64 tx each max), consistently works that to zero, hits sleep, wakes to somewhere b/w 1k and 4k messages backlog. This is done while sustaining above 40k TPS synthetic wkld. So I think my earlier starvation concern for other users of global pool is mitigated. |
Problem
As part of an investigation into Agave OOM issues in internal private cluster tests, I found that TSS receiver channel would get severely backed up (80k+ pending msg) when the cluster is running at 40k TPS sustained (bench-TPS wkld; 80-20 FD to Agave node ratio). This would cause slow down across the system and build up memory usage until the node OOMs (crashed agave 256 GB node, and Agave tile on FD 512 GB node in my tests). This issue reproduces
more prominentlyreliably when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled. Without those flags (graph 3), OOM issue did not reproduce but had mem spikes.Summary of Changes
Original issue:
FD node failures are agave tile oom'ing.
Improved state:
tiv1 and tiv2 are agave nodes running the fix. Other nodes running same FD code as before.
original code (without tx history flags):