With a 10 node invalidator cluster, shred recv spends 20% repairing shreds. Where are these shreds disappearing? Why so much repair?
When producing a large number of shreds, window service can get congested and not update slot metas for a long time. These slot metas are what repair uses to identify missing shreds, so staleness can lead to requesting repairs for shreds we already have. This snowballs in a bad way: the redundant repair responses are yet more shreds for window service to insert, which keeps the metas stale even longer.
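For context, a rough sketch of how repair derives missing shreds from a slot meta, assuming a simplified meta with `consumed`/`received` and a set of inserted indices (field names are illustrative, not the actual blockstore ones). If this view is stale because window service hasn't worked through its backlog, shreds we already hold still show up as missing:

```rust
use std::collections::BTreeSet;

/// Minimal stand-in for the parts of a slot meta that repair cares about
/// (hypothetical struct; field names are illustrative, not the real ones).
struct SlotMetaView {
    /// All contiguous shreds up to this index have been inserted.
    consumed: u64,
    /// Highest shred index seen so far for this slot.
    received: u64,
    /// Indices inserted beyond `consumed` (arrived out of order).
    present: BTreeSet<u64>,
}

/// Indices repair would consider missing: everything in [consumed, received)
/// not yet inserted. If the meta is stale, shreds that already arrived are
/// still listed here, which is the duplicate-repair problem described above.
fn missing_indices(meta: &SlotMetaView, max_requests: usize) -> Vec<u64> {
    (meta.consumed..meta.received)
        .filter(|idx| !meta.present.contains(idx))
        .take(max_requests)
        .collect()
}

fn main() {
    let meta = SlotMetaView {
        consumed: 10,
        received: 16,
        present: [11u64, 13, 14].into_iter().collect(),
    };
    println!("missing: {:?}", missing_indices(&meta, 32)); // [10, 12, 15]
}
```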
It has been observed to take a long time (>200ms) for a shred to go from fetch to window service. This is probably just a symptom of the above.
The repair interval is only 100ms, which means if our repair peer is across the world, we always request the repair more than once. We should give repairs a better chance to land before spamming more requests (even though they should get deduped before sigverify).
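One way to give requests a chance to land, purely as a sketch (not a claim about the current repair code): track the last request time per (slot, index) and skip re-requesting until a round-trip-ish interval has passed. The 400ms figure is an assumed number for the example:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-shred request tracker: only re-request a repair once the
/// previous request has had a realistic round trip to land.
struct RepairBackoff {
    last_request: HashMap<(u64, u64), Instant>, // keyed by (slot, shred index)
    min_wait: Duration,
}

impl RepairBackoff {
    fn new(min_wait: Duration) -> Self {
        Self { last_request: HashMap::new(), min_wait }
    }

    /// Returns true if we should send (or re-send) a repair request now.
    fn should_request(&mut self, slot: u64, index: u64, now: Instant) -> bool {
        match self.last_request.get(&(slot, index)) {
            Some(&sent) if now.duration_since(sent) < self.min_wait => false,
            _ => {
                self.last_request.insert((slot, index), now);
                true
            }
        }
    }
}

fn main() {
    // With a 100ms repair loop and a far-away peer, waiting ~400ms before
    // re-requesting avoids firing the same request 3-4 times.
    let mut backoff = RepairBackoff::new(Duration::from_millis(400));
    let start = Instant::now();
    assert!(backoff.should_request(42, 7, start));
    assert!(!backoff.should_request(42, 7, start + Duration::from_millis(100)));
    assert!(backoff.should_request(42, 7, start + Duration::from_millis(500)));
}
```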
It’s important to note that in the typical case we are only inserting 1 or 2 shreds per iteration. It’s unclear how we sometimes get slammed and fall into the runaway case.
With 32:32 erasure coding, why would we ever need to repair during normal runtime? It seems that repair is more of a bypass for when we get unlucky and a batch has lots of slow shreds (if we’re in turbine layer 2, shreds take 3 hops to reach us). If we enforce repairing only from our turbine parent, this trick will be disallowed.
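For reference, the recovery condition with 32:32 coding is simply that any 32 of the 64 shreds in the erasure batch arrive. A toy check, assuming a fixed 32-data/32-coding batch:

```rust
/// With 32:32 Reed-Solomon coding, any 32 of the 64 shreds in an erasure
/// batch are enough to reconstruct all 32 data shreds. Illustrative check
/// only, not the real recovery path.
fn batch_recoverable(data_received: usize, coding_received: usize) -> bool {
    const NUM_DATA: usize = 32;
    data_received + coding_received >= NUM_DATA
}

fn main() {
    // Half the data shreds delayed in turbine, but enough coding arrived.
    assert!(batch_recoverable(16, 16));
    // Too few of either kind: recovery impossible, repair is needed.
    assert!(!batch_recoverable(10, 5));
}
```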
On the other side of the coin, repair is super slow due to throttling when starting a fresh node. The long-term fix would be to start turbine earlier (while downloading the snapshot), but a simpler bandaid stopgap is to unthrottle repair in some cases.
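Something like the following could express the stopgap, purely as a sketch (the threshold and limits are made-up numbers, not existing config): widen the request budget only while the node is far behind the cluster tip.

```rust
/// Illustrative unthrottle heuristic: a larger repair request budget while
/// the node is far behind the cluster (e.g. right after a snapshot download),
/// and the normal throttle once it has caught up. All constants are assumed.
fn repair_request_limit(local_slot: u64, cluster_tip_slot: u64) -> usize {
    const NORMAL_LIMIT: usize = 32;      // steady-state throttle (assumed)
    const CATCH_UP_LIMIT: usize = 512;   // widened budget while catching up (assumed)
    const FAR_BEHIND_SLOTS: u64 = 1_000; // "fresh node" heuristic (assumed)

    if cluster_tip_slot.saturating_sub(local_slot) > FAR_BEHIND_SLOTS {
        CATCH_UP_LIMIT
    } else {
        NORMAL_LIMIT
    }
}

fn main() {
    assert_eq!(repair_request_limit(100, 50_000), 512);   // just finished snapshot download
    assert_eq!(repair_request_limit(49_900, 50_000), 32); // caught up, keep the throttle
}
```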