perf: switch store from RwLock to Mutex (1.74x performance increase, ~98.8% error rate decrease) #1700
What was wrong?
Currently Trin's performance is very poor: we can only handle around 2-5 megabits per second. To put that into perspective, it isn't even 1 megabyte per second. In order to get the state network live we need to massively increase Trin's performance, hopefully by at least 10-100x; ideally Trin could saturate a computer's network card. Because of this slowness Trin has a high error rate, even under small load. Last I checked, only 90% of Trin Execution state diff bridges made it onto the network. The state is massive, and an error rate that high makes it hard to run a reliable network.
In short
How was it fixed?
This is the first improvement of many. Using the benchmark I wrote, I found a big bottleneck around our database code: if I commented out a few database calls I got a 2x speedup, which was the theoretical target for my optimizations on this specific bottleneck.
I ended up finding that `parking_lot::RwLock` blocks new readers to "ensure" fairness to writers; the problem is that this greatly limits Trin's performance. When I switched the reads to writes, the bottleneck I was trying to debug went away. Here is a chart; the first 3 runs are before this PR, the last 3 runs are with this PR.
If you notice, there is a speedup of 1.74x and ~98.8% fewer errors. This is significant and confirms my belief that in order to increase the reliability of state transfers, we must increase the performance of Trin itself, as currently Trin can't handle any reasonable load before throwing errors.
For more information on the benchmark, see #1660 (I kinda hijacked my benchmark PR to benchmark this).
The basic idea is that we send a range of era1 files through offers from nodeA to nodeB: all headers first, then all bodies, then all receipts. This avoids the validation complexity we would have if we sent 1 block at a time.
I am sending era1 files 1000 to 1010 inclusive, which is around 6.7GB of data.
Todo in follow-up PRs
This change doesn't really improve our performance for small transfers like headers and state, but now that this bottleneck is fixed, others should become easier to find (it is like a game of whack-a-mole: you get rid of some and new ones take their place). Currently we can only handle around 20-30k packets sending or receiving when sending headers; for bodies and receipts our rate is significantly higher, especially with this fix. I am assuming a good chunk of those are wasted packets, because currently all talk_responses are empty, so ethereum/devp2p#229 would probably help significantly with that. But I am sure I can find another bottleneck that should give another significant performance gain; they shouldn't be too hard to find, as Trin's performance is currently unbelievably bad.