With a 10 node invalidator cluster, shred recv spends 20% repairing shreds. Where are these shreds disappearing? Why so much repair?
When producing a large number of shreds, window service can get congested and not update slot metas for a long time. These slot metas are what repair uses to identify missing shreds, so staleness can lead to requesting repairs for shreds we already have. This snowballs in a bad way: the redundant repair responses are yet more shreds for window service to insert, which keeps the metas stale even longer.
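For context, a rough sketch of how repair derives missing shreds from a slot meta, assuming a simplified meta with `consumed`/`received` and a set of inserted indices (field names are illustrative, not the actual blockstore ones). If this view is stale because window service hasn't worked through its backlog, shreds we already hold still show up as missing:

```rust
use std::collections::BTreeSet;

/// Minimal stand-in for the parts of a slot meta that repair cares about
/// (hypothetical struct; field names are illustrative, not the real ones).
struct SlotMetaView {
    /// All contiguous shreds up to this index have been inserted.
    consumed: u64,
    /// Highest shred index seen so far for this slot.
    received: u64,
    /// Indices inserted beyond `consumed` (arrived out of order).
    present: BTreeSet<u64>,
}

/// Indices repair would consider missing: everything in [consumed, received)
/// not yet inserted. If the meta is stale, shreds that already arrived are
/// still listed here, which is the duplicate-repair problem described above.
fn missing_indices(meta: &SlotMetaView, max_requests: usize) -> Vec<u64> {
    (meta.consumed..meta.received)
        .filter(|idx| !meta.present.contains(idx))
        .take(max_requests)
        .collect()
}

fn main() {
    let meta = SlotMetaView {
        consumed: 10,
        received: 16,
        present: [11u64, 13, 14].into_iter().collect(),
    };
    println!("missing: {:?}", missing_indices(&meta, 32)); // [10, 12, 15]
}
```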
It has been observed to take a long time (>200ms) for a shred to go from fetch to window service. This is probably just a symptom of the above.
The repair interval is only 100ms, which means if our repair peer is across the world, we always request the repair more than once. We should give repairs a better chance to land before spamming more requests (even though they should get deduped before sigverify).
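One way to give requests a chance to land, purely as a sketch (not a claim about the current repair code): track the last request time per (slot, index) and skip re-requesting until a round-trip-ish interval has passed. The 400ms figure is an assumed number for the example:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-shred request tracker: only re-request a repair once the
/// previous request has had a realistic round trip to land.
struct RepairBackoff {
    last_request: HashMap<(u64, u64), Instant>, // keyed by (slot, shred index)
    min_wait: Duration,
}

impl RepairBackoff {
    fn new(min_wait: Duration) -> Self {
        Self { last_request: HashMap::new(), min_wait }
    }

    /// Returns true if we should send (or re-send) a repair request now.
    fn should_request(&mut self, slot: u64, index: u64, now: Instant) -> bool {
        match self.last_request.get(&(slot, index)) {
            Some(&sent) if now.duration_since(sent) < self.min_wait => false,
            _ => {
                self.last_request.insert((slot, index), now);
                true
            }
        }
    }
}

fn main() {
    // With a 100ms repair loop and a far-away peer, waiting ~400ms before
    // re-requesting avoids firing the same request 3-4 times.
    let mut backoff = RepairBackoff::new(Duration::from_millis(400));
    let start = Instant::now();
    assert!(backoff.should_request(42, 7, start));
    assert!(!backoff.should_request(42, 7, start + Duration::from_millis(100)));
    assert!(backoff.should_request(42, 7, start + Duration::from_millis(500)));
}
```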
It’s important to note that in the typical case we are only inserting 1 or 2 shreds per iteration. It’s unclear how we sometimes get slammed and fall into the runaway case.
With 32:32 erasure coding, why would we ever need to repair during normal runtime? It seems that repair is more of a bypass for when we get unlucky and a batch has lots of slow shreds (if we’re in turbine layer 2, shreds take 3 hops to reach us). If we enforce repairing only from our turbine parent, this trick will be disallowed.
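For reference, the recovery condition with 32:32 coding is simply that any 32 of the 64 shreds in the erasure batch arrive. A toy check, assuming a fixed 32-data/32-coding batch:

```rust
/// With 32:32 Reed-Solomon coding, any 32 of the 64 shreds in an erasure
/// batch are enough to reconstruct all 32 data shreds. Illustrative check
/// only, not the real recovery path.
fn batch_recoverable(data_received: usize, coding_received: usize) -> bool {
    const NUM_DATA: usize = 32;
    data_received + coding_received >= NUM_DATA
}

fn main() {
    // Half the data shreds delayed in turbine, but enough coding arrived.
    assert!(batch_recoverable(16, 16));
    // Too few of either kind: recovery impossible, repair is needed.
    assert!(!batch_recoverable(10, 5));
}
```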
On the other side of the coin, repair is super slow due to throttling when starting a fresh node. The long-term fix would be to start turbine earlier (while downloading the snapshot), but a simpler bandaid stopgap is to unthrottle repair in some cases.
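Something like the following could express the stopgap, purely as a sketch (the threshold and limits are made-up numbers, not existing config): widen the request budget only while the node is far behind the cluster tip.

```rust
/// Illustrative unthrottle heuristic: a larger repair request budget while
/// the node is far behind the cluster (e.g. right after a snapshot download),
/// and the normal throttle once it has caught up. All constants are assumed.
fn repair_request_limit(local_slot: u64, cluster_tip_slot: u64) -> usize {
    const NORMAL_LIMIT: usize = 32;      // steady-state throttle (assumed)
    const CATCH_UP_LIMIT: usize = 512;   // widened budget while catching up (assumed)
    const FAR_BEHIND_SLOTS: u64 = 1_000; // "fresh node" heuristic (assumed)

    if cluster_tip_slot.saturating_sub(local_slot) > FAR_BEHIND_SLOTS {
        CATCH_UP_LIMIT
    } else {
        NORMAL_LIMIT
    }
}

fn main() {
    assert_eq!(repair_request_limit(100, 50_000), 512);   // just finished snapshot download
    assert_eq!(repair_request_limit(49_900, 50_000), 32); // caught up, keep the throttle
}
```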