
perf: switch store from RwLock to Mutex 1.74x performance increase ~98.8% error rate decrease #1700

Merged
merged 1 commit on Feb 25, 2025

Conversation

KolbyML
Member

@KolbyML KolbyML commented Feb 23, 2025

What was wrong?

Currently Trin's performance is very poor: we can only handle around 2-5 megabits per second. To put that into perspective, it isn't even one megabyte per second. In order to get the state network live we need to massively increase Trin's performance, by at least 10x and ideally 100x or more; ideally Trin could saturate a computer's network card. Because of this slowness, Trin has a high error rate even under small load. Last I checked, only 90% of Trin Execution state diff bridges made it onto the network. The state is massive, and an error rate that high makes it hard to run a reliable network.

In short

  • trin performance is bad
  • trin performance isn't good enough to onboard users; the network would die under any real user load

How was it fixed?

This is the first improvement of many. Using the benchmark I wrote, I found a big bottleneck in our database code: if I commented out a few database calls I got 2x performance, which was the theoretical target of my optimizations for this specific bottleneck.

[image]

I ended up finding that `parking_lot::RwLock` blocks new readers to "ensure" fairness to queued writers; the issue is that this greatly limits Trin's performance. When I switched the reads to writes, the performance bottleneck I was trying to debug went away.

| Run | Time | Errors/Transfer Failures |
| --- | --- | --- |
| Run 1 | 15m 27s | 5539 |
| Run 2 | 15m 9s | 969 |
| Run 3 | 13m 5s | 13139 |
| Run 4 | 8m 27s | 89 |
| Run 5 | 8m 26s | 72 |
| Run 6 | 8m 11s | 76 |

[graph]

Here is a chart: the first 3 runs are before this PR, and the last 3 runs are with this PR.

As you can see, there is a speedup of 1.74x and ~98.8% fewer errors. This is significant, and it confirms my belief that in order to increase the reliability of state transfers we must increase the performance of Trin itself, as currently Trin can't handle any reasonable load before throwing errors.

For more information on the benchmark

I kinda hijacked my benchmark PR to benchmark stuff #1660

But the basic idea is that we send a range of era1 files through offers from nodeA to nodeB: all headers first, then all bodies, then all receipts. This avoids the validation/complexity issues we would hit if we sent one block at a time.

I am sending era1 files 1000 to 1010 inclusive, which is around 6.7 GB of data.

Todo in follow up PR's

This change doesn't really change our performance for small transfers like headers and state, but now that this bottleneck is fixed, others should become easier to find (it is like a game of whack-a-mole: you get rid of some and new ones take their place). Currently we can only handle around 20-30k packets sending or receiving when sending headers; for bodies and receipts our rate is significantly higher, especially with this fix. I am assuming a good chunk of those are wasted packets, because currently all talk_responses are empty, so ethereum/devp2p#229 would probably help significantly with that. But I am sure I can find another bottleneck which should give another significant performance gain; they shouldn't be too hard to find, as currently Trin's performance is unbelievably bad.

@KolbyML KolbyML self-assigned this Feb 23, 2025
@KolbyML KolbyML added enhancement New feature or request history network Issue related to portal history network state network Issue related to portal state network priority labels Feb 23, 2025
@KolbyML KolbyML force-pushed the improve-database-performance branch 2 times, most recently from 571a433 to 1ca0472 Compare February 24, 2025 18:49
@KolbyML KolbyML force-pushed the improve-database-performance branch from 1ca0472 to 6373d15 Compare February 24, 2025 18:54
Collaborator

@njgheorghita njgheorghita left a comment


Yeah, great find! I'm still not completely sure I understand why this change caused the improvement, but those numbers look much better

@KolbyML
Member Author

KolbyML commented Feb 25, 2025

Yeah, great find! I'm still not completely sure I understand why this change caused the improvement, but those numbers look much better

Yeah, it is a little confusing and counterintuitive; I don't fully understand the core issue myself.

@KolbyML KolbyML merged commit e478c5f into ethereum:master Feb 25, 2025
14 checks passed
Collaborator

@morph-dev morph-dev left a comment


LGTM

Same as with other comments, not sure why this would give so much improvement... Glad you found it.
