The agent code evolved to include download timeouts outside of the BlockBuffer, but this makes the code around "block reservations" harder to debug. There currently appears to be a bug where block reservations are sometimes created and never cancelled, and if enough of these stack up, syncing stops until the agent is restarted. (This seems to happen especially in “prod-sim” environments where everything is on the same machine: BCHN becomes unresponsive to the agent until its initial sync is complete, and if chipnet/testnet and mainnet are both syncing, the block buffer sometimes fills up with reservations without any active downloads.)
A more defensive design is for the block buffer to manage a timeout for each reservation itself: rather than a simple internal counter, reservations could be an array of objects, each with a cancellation callback and a timestamp at which the reservation should be cancelled. Timed-out reservations can then be cancelled each time the block buffer is cleaned up. So before requesting a block, the agent simply registers the cancellation time and callback with the block buffer (maybe also the block height and node name for monitoring purposes), rather than managing download timers itself.
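As a rough TypeScript sketch of that shape (all names here, like `BlockReservation`, `reserve`, `release`, and `cleanup`, are hypothetical placeholders rather than existing agent APIs):

```typescript
/** One entry per in-flight block request (hypothetical shape). */
type BlockReservation = {
  /** Height of the requested block (for monitoring). */
  height: number;
  /** Node the block was requested from (for monitoring). */
  nodeName: string;
  /** Timestamp (ms) after which this reservation should be cancelled. */
  cancelAt: number;
  /** Callback invoked by the buffer when the reservation times out. */
  onCancel: () => void;
};

class BlockBuffer {
  private reservations: BlockReservation[] = [];

  /** Register a reservation before requesting the block from a node. */
  reserve(reservation: BlockReservation): void {
    this.reservations.push(reservation);
  }

  /** Release a reservation once the block arrives. */
  release(height: number, nodeName: string): void {
    this.reservations = this.reservations.filter(
      (r) => !(r.height === height && r.nodeName === nodeName),
    );
  }

  /** Cancel timed-out reservations; run during each buffer cleanup. */
  cleanup(now = Date.now()): void {
    const expired = this.reservations.filter((r) => r.cancelAt <= now);
    this.reservations = this.reservations.filter((r) => r.cancelAt > now);
    for (const { onCancel } of expired) onCancel();
  }
}
```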
This refactor should also clean up our prioritization when requesting blocks: the current strategy of requesting from the least-synced chain is a great basic behavior, but if it causes the agent to spend all of its time on one mostly-unresponsive node, we're wasting potential sync time on other nodes. So maybe after a block download is cancelled, we should temporarily bias block selection toward other, non-lagging nodes. In the best case, load is balanced so that nodes finish syncing at around the same time, but throughput is never wasted waiting on a slow node (e.g. one finishing an initial sync).
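One way this biasing could look, sketched in TypeScript (the cooldown constant and all names are assumptions for illustration, not existing code):

```typescript
/** Cooldown during which a node that timed out is deprioritized (assumed value). */
const SLOW_NODE_COOLDOWN_MS = 60_000;

/** nodeName -> timestamp until which the node is considered slow. */
const slowUntil = new Map<string, number>();

/** Call when a block download from this node is cancelled. */
const markSlow = (nodeName: string, now = Date.now()): void => {
  slowUntil.set(nodeName, now + SLOW_NODE_COOLDOWN_MS);
};

/**
 * Pick the next node to request from: prefer the least-synced node that
 * isn't in a cooldown; fall back to cooled-down nodes only when no other
 * node still has blocks to download, so throughput is never idle.
 */
const selectNode = (
  nodes: { name: string; blocksRemaining: number }[],
  now = Date.now(),
): string | undefined => {
  const candidates = nodes.filter((n) => n.blocksRemaining > 0);
  const responsive = candidates.filter(
    (n) => (slowUntil.get(n.name) ?? 0) <= now,
  );
  const pool = responsive.length > 0 ? responsive : candidates;
  return pool.sort((a, b) => b.blocksRemaining - a.blocksRemaining)[0]?.name;
};
```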
Related: BCHN gets very behind on serving requests during initial sync and can sometimes send back a requested block >5 minutes later (see BCHN issue). We need to be a bit more intelligent than simply cancelling these requests and forgetting we made them: when the block later comes in, it looks like a block newly mined by that node.
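One possible mitigation (hypothetical names, a sketch rather than a committed design): remember cancelled requests for a window longer than BCHN's worst observed delay, so a late block is recognized as a stale response instead of being attributed to the node as newly mined.

```typescript
/** Window during which a cancelled request is still remembered (assumed value). */
const STALE_RESPONSE_WINDOW_MS = 10 * 60_000;

/** "nodeName:blockHash" -> time the request was cancelled. */
const cancelledRequests = new Map<string, number>();

/** Record a cancellation so a late reply can still be matched. */
const rememberCancelled = (nodeName: string, blockHash: string): void => {
  cancelledRequests.set(`${nodeName}:${blockHash}`, Date.now());
};

/** True if an incoming block is just a late reply to a cancelled request. */
const isStaleResponse = (nodeName: string, blockHash: string): boolean => {
  const cancelledAt = cancelledRequests.get(`${nodeName}:${blockHash}`);
  return (
    cancelledAt !== undefined &&
    Date.now() - cancelledAt < STALE_RESPONSE_WINDOW_MS
  );
};
```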