
perf: switch store from RwLock to Mutex 1.74x performance increase ~98.8% error rate decrease #1700

Merged
merged 1 commit on Feb 25, 2025

Conversation

KolbyML
Member

@KolbyML KolbyML commented Feb 23, 2025

What was wrong?

Currently Trin's performance is very poor: we can only handle around 2-5 megabits per second. To put that into perspective, it isn't even one megabyte per second. In order to get the state network live we need to massively increase Trin's performance, by at least 10x and ideally 100x or more; ideally Trin could saturate a computer's network card. Because of this slowness, Trin has a high error rate even under small load. Last I checked, only 90% of Trin Execution state diff bridges made it onto the network. The state is massive, and an error rate that high makes it hard to run a reliable network.

In short

  • trin performance is bad
  • trin performance isn't good enough to onboard users; the network would die under any real user load

How was it fixed?

This is the first improvement of many. Using the benchmark I wrote, I found a big bottleneck in our database code: if I commented out a few database calls I got 2x performance, which was the theoretical target of my optimizations for this specific bottleneck.

[image]

I ended up finding that `parking_lot::RwLock` blocks new readers to "ensure" fairness to queued writers; the issue is that this greatly limits Trin's performance. When I switched the reads to writes, the performance bottleneck I was trying to debug went away.

| Run | Time | Errors/Transfer Failures |
| --- | --- | --- |
| Run 1 | 15m 27s | 5539 |
| Run 2 | 15m 9s | 969 |
| Run 3 | 13m 5s | 13139 |
| Run 4 | 8m 27s | 89 |
| Run 5 | 8m 26s | 72 |
| Run 6 | 8m 11s | 76 |

[graph]

Here is a chart: the first 3 runs are before this PR, and the last 3 runs are with this PR.

As you can see, there is a speedup of 1.74x and ~98.8% fewer errors. This is significant, and it confirms my belief that in order to increase the reliability of state transfers we must increase the performance of Trin itself, as currently Trin can't handle any reasonable load before throwing errors.

For more information on the benchmark

I kinda hijacked my benchmark PR to benchmark stuff #1660

But the basic idea is that we send a range of era1 files through offers from nodeA to nodeB: all headers first, then all bodies, then all receipts. This avoids the validation/complexity issues we would hit if we sent one block at a time.

I am sending era1 files 1000 to 1010 inclusive, which is around 6.7 GB of data.

Todo in follow up PR's

This change doesn't really change our performance for small transfers like headers and state, but now that this bottleneck is fixed, others should become easier to find (it is like a game of whack-a-mole: you get rid of some and new ones take their place). Currently we can only handle around 20-30k packets sending or receiving when sending headers; for bodies and receipts our rate is significantly higher, especially with this fix. I am assuming a good chunk of those are wasted packets, because currently all talk_responses are empty, so ethereum/devp2p#229 would probably help significantly with that. But I am sure I can find another bottleneck which should give another significant performance gain; they shouldn't be too hard to find, as currently Trin's performance is unbelievably bad.

@KolbyML KolbyML self-assigned this Feb 23, 2025
@KolbyML KolbyML added enhancement New feature or request history network Issue related to portal history network state network Issue related to portal state network priority labels Feb 23, 2025
@KolbyML KolbyML force-pushed the improve-database-performance branch 2 times, most recently from 571a433 to 1ca0472 Compare February 24, 2025 18:49
@KolbyML KolbyML force-pushed the improve-database-performance branch from 1ca0472 to 6373d15 Compare February 24, 2025 18:54
Collaborator

@njgheorghita njgheorghita left a comment


Yeah, great find! I'm still not completely sure I understand why this change caused the improvement, but those numbers look much better

@KolbyML
Member Author

KolbyML commented Feb 25, 2025

Yeah, great find! I'm still not completely sure I understand why this change caused the improvement, but those numbers look much better

Yeah, it is a little confusing and counterintuitive; I don't fully understand the core issue myself.

@KolbyML KolbyML merged commit e478c5f into ethereum:master Feb 25, 2025
14 checks passed
Collaborator

@morph-dev morph-dev left a comment


LGTM

Same as with other comments, not sure why this would give so much improvement... Glad you found it.
