Performance: Shred Repair #4268

Open
bw-solana opened this issue Jan 3, 2025 · 0 comments

With a 10-node invalidator cluster, shred recv spends 20% of its time repairing shreds. Where are these shreds disappearing to? Why so much repair?

When producing a large number of shreds, window service can get congested and go a long time without updating slot metas. Repair uses these slot metas to identify missing shreds, so stale slot metas lead it to request repairs for shreds that we already have. This snowballs in a bad way…
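
The stale-slot-meta problem can be illustrated with a toy model. This is a minimal sketch and not the actual blockstore or repair code: `SlotMetaView`, `repair_candidates`, and `redundant_requests` are hypothetical names, and a real slot meta tracks more state than a set of indexes.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for a slot meta: the shred indexes
/// that window service has recorded so far for one slot.
struct SlotMetaView {
    /// Shred indexes already recorded in the slot meta.
    received: BTreeSet<u64>,
    /// Highest shred index known to exist in the slot.
    last_index: u64,
}

/// Repair asks for every index in 0..=last_index that the slot meta
/// does not list as received.
fn repair_candidates(meta: &SlotMetaView) -> Vec<u64> {
    (0..=meta.last_index)
        .filter(|i| !meta.received.contains(i))
        .collect()
}

/// Requests that are wasted: the shred was already fetched, but the
/// stale slot meta does not reflect it yet.
fn redundant_requests(stale: &SlotMetaView, fetched: &BTreeSet<u64>) -> Vec<u64> {
    repair_candidates(stale)
        .into_iter()
        .filter(|i| fetched.contains(i))
        .collect()
}
```

In this model, the longer window service lags behind fetch, the larger the gap between `fetched` and `received`, and the more of the repair requests are redundant.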

It has been observed to take a long time (>200ms) for a shred to go from fetch to window service. This is probably just a symptom of the above.

The repair interval is only 100ms, which means if our repair peer is across the world, we always request the repair more than once. We should give repairs a better chance to land before spamming more requests (even though they should get deduped before sigverify).
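
To make the arithmetic concrete, here is a rough sketch that counts how many requests go out for a single shred before the first response can possibly arrive. It assumes a request is sent at t = 0 and then once per repair interval, and it ignores processing time on both ends. The function name and the round-trip values used below are illustrative assumptions, not measurements.

```rust
use std::time::Duration;

/// Number of repair requests sent for one shred before the first response
/// can arrive, assuming one request at t = 0 and one more every `interval`
/// for as long as the elapsed time is still below `rtt`.
fn requests_before_first_response(rtt: Duration, interval: Duration) -> u32 {
    assert!(!interval.is_zero(), "interval must be non-zero");
    let rtt_us = rtt.as_micros();
    let interval_us = interval.as_micros();
    // Ceiling division, with a floor of one for the initial request.
    let n = (rtt_us + interval_us - 1) / interval_us;
    n.max(1) as u32
}
```

Under this model, a 100ms interval with a 250ms round trip gives three requests for the same shred, while a peer 80ms away gets only one. Any round trip longer than the interval produces at least one duplicate.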

It’s important to note that in the typical case, we are only inserting 1 or 2 shreds per iteration. Unclear how we sometimes get slammed and fall into the runaway case.

With 32:32 erasure coding, why would we ever need to repair during normal runtime? Seems that repair is more of a bypass for when we get unlucky and have lots of slow shreds in a batch (if we’re in turbine layer 2, shreds take 3 hops to reach us). If we enforce repairing from turbine parent, this trick will be disallowed.
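
The recovery condition behind that question can be written down directly. This sketch assumes 32:32 means 32 data shreds plus 32 coding shreds per erasure batch, and that the code is Reed-Solomon style, so any 32 of the 64 shreds are enough to rebuild the batch. The constant and function names are hypothetical.

```rust
/// Data shreds per erasure batch (assumed from the 32:32 configuration).
const DATA_SHREDS: usize = 32;
/// Coding shreds per erasure batch (assumed from the 32:32 configuration).
const CODING_SHREDS: usize = 32;

/// A batch is recoverable once any DATA_SHREDS of its shreds have arrived,
/// regardless of the mix of data and coding shreds.
fn batch_recoverable(data_received: usize, coding_received: usize) -> bool {
    debug_assert!(data_received <= DATA_SHREDS && coding_received <= CODING_SHREDS);
    data_received + coding_received >= DATA_SHREDS
}

/// How many more shreds (from turbine or repair) are needed before the
/// batch can be recovered.
fn shreds_still_needed(data_received: usize, coding_received: usize) -> usize {
    DATA_SHREDS.saturating_sub(data_received + coding_received)
}
```

Under these assumptions, half of a batch can go missing before repair is strictly required, which fits the reading that repair during normal runtime is mostly covering for late shreds rather than lost ones.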

On the other side of the coin, repair is super slow due to throttling when starting a fresh node. The long-term fix would be to start turbine earlier (while downloading the snapshot), but a simpler stopgap is to unthrottle repair in some cases.

Projects
Status: Backlog