The agent code evolved to include download timeouts outside of the BlockBuffer, but this makes the code around "block reservations" harder to debug. There currently appears to be a bug where block reservations are sometimes created and never cancelled, and if enough of these stack up, syncing stops until the agent is restarted. (This seems to happen especially in “prod-sim” environments where everything is on the same machine: BCHN becomes unresponsive to the agent until its initial sync is complete, and if chipnet/testnet and mainnet are both syncing, the block buffer sometimes fills up with reservations without any active downloads.)
A more defensive design is for the block buffer to manage a timeout for each reservation itself: rather than a simple internal counter, reservations could be an array of objects, each with a cancellation callback and a timestamp at which the reservation should be cancelled. Timed-out reservations can then be cancelled each time the block buffer is cleaned up. So before requesting a block, the agent simply registers the cancellation time and callback with the block buffer (maybe also the block height and node name for monitoring purposes), rather than managing download timers itself.
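As a rough TypeScript sketch of that shape (all names here, like `BlockReservation`, `reserve`, `release`, and `cleanup`, are hypothetical placeholders rather than existing agent APIs):

```typescript
/** One entry per in-flight block request (hypothetical shape). */
type BlockReservation = {
  /** Height of the requested block (for monitoring). */
  height: number;
  /** Node the block was requested from (for monitoring). */
  nodeName: string;
  /** Timestamp (ms) after which this reservation should be cancelled. */
  cancelAt: number;
  /** Callback invoked by the buffer when the reservation times out. */
  onCancel: () => void;
};

class BlockBuffer {
  private reservations: BlockReservation[] = [];

  /** Register a reservation before requesting the block from a node. */
  reserve(reservation: BlockReservation): void {
    this.reservations.push(reservation);
  }

  /** Release a reservation once the block arrives. */
  release(height: number, nodeName: string): void {
    this.reservations = this.reservations.filter(
      (r) => !(r.height === height && r.nodeName === nodeName),
    );
  }

  /** Cancel timed-out reservations; run during each buffer cleanup. */
  cleanup(now = Date.now()): void {
    const expired = this.reservations.filter((r) => r.cancelAt <= now);
    this.reservations = this.reservations.filter((r) => r.cancelAt > now);
    for (const { onCancel } of expired) onCancel();
  }
}
```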
This refactor should also clean up our prioritization when requesting blocks: the current strategy of requesting from the least-synced chain is a great basic behavior, but if it causes the agent to spend all of its time on one mostly-unresponsive node, we're wasting potential sync time on other nodes. So maybe after a block download is cancelled, we should temporarily bias block selection toward other, non-lagging nodes. In the best case, load is balanced so that nodes finish syncing at around the same time, but throughput is never wasted waiting on a slow node (e.g. one finishing an initial sync).
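One way this biasing could look, sketched in TypeScript (the cooldown constant and all names are assumptions for illustration, not existing code):

```typescript
/** Cooldown during which a node that timed out is deprioritized (assumed value). */
const SLOW_NODE_COOLDOWN_MS = 60_000;

/** nodeName -> timestamp until which the node is considered slow. */
const slowUntil = new Map<string, number>();

/** Call when a block download from this node is cancelled. */
const markSlow = (nodeName: string, now = Date.now()): void => {
  slowUntil.set(nodeName, now + SLOW_NODE_COOLDOWN_MS);
};

/**
 * Pick the next node to request from: prefer the least-synced node that
 * isn't in a cooldown; fall back to cooled-down nodes only when no other
 * node still has blocks to download, so throughput is never idle.
 */
const selectNode = (
  nodes: { name: string; blocksRemaining: number }[],
  now = Date.now(),
): string | undefined => {
  const candidates = nodes.filter((n) => n.blocksRemaining > 0);
  const responsive = candidates.filter(
    (n) => (slowUntil.get(n.name) ?? 0) <= now,
  );
  const pool = responsive.length > 0 ? responsive : candidates;
  return pool.sort((a, b) => b.blocksRemaining - a.blocksRemaining)[0]?.name;
};
```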
Related: BCHN gets very behind on serving requests during initial sync and can sometimes send back a requested block >5 minutes later (see BCHN issue). We need to be a bit more intelligent than simply cancelling these requests and forgetting we made them: when the block later comes in, it looks like a block newly mined by that node.
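One possible mitigation (hypothetical names, a sketch rather than a committed design): remember cancelled requests for a window longer than BCHN's worst observed delay, so a late block is recognized as a stale response instead of being attributed to the node as newly mined.

```typescript
/** Window during which a cancelled request is still remembered (assumed value). */
const STALE_RESPONSE_WINDOW_MS = 10 * 60_000;

/** "nodeName:blockHash" -> time the request was cancelled. */
const cancelledRequests = new Map<string, number>();

/** Record a cancellation so a late reply can still be matched. */
const rememberCancelled = (nodeName: string, blockHash: string): void => {
  cancelledRequests.set(`${nodeName}:${blockHash}`, Date.now());
};

/** True if an incoming block is just a late reply to a cancelled request. */
const isStaleResponse = (nodeName: string, blockHash: string): boolean => {
  const cancelledAt = cancelledRequests.get(`${nodeName}:${blockHash}`);
  return (
    cancelledAt !== undefined &&
    Date.now() - cancelledAt < STALE_RESPONSE_WINDOW_MS
  );
};
```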