feat: checkpoint node recovery progress to skip rescans on restart #3407

Open
halfprice wants to merge 1 commit into main from zhewu/node_recovery_checkpoint

@halfprice halfprice commented May 23, 2026

Collaborator

Description

Previously start_node_recovery rescanned the certified-blob iterator from the first blob_id on every invocation and on every outer-loop pass, so a crash mid-recovery lost all progress and a successful first pass still re-read every blob to verify storage.

Persist a NodeRecoveryProgress {epoch, owned_shards, last_settled_blob_id} checkpoint in a new RocksDB column family. On resume, keep the checkpoint iff the current owned-shards set equals the persisted one — otherwise the "stored at all shards" assertion may no longer hold (new shards are empty and removed-then-regained shards have been wiped).
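
The resume rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: the field types, the `Epoch`, `ShardIndex`, and `BlobId` aliases, and the `resume_cursor` helper are all hypothetical, and the RocksDB read and write are left out.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins; the real types in the PR may differ.
type Epoch = u32;
type ShardIndex = u16;
type BlobId = [u8; 32];

/// Checkpoint shape named in the description (field types are assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
struct NodeRecoveryProgress {
    epoch: Epoch,
    owned_shards: BTreeSet<ShardIndex>,
    last_settled_blob_id: Option<BlobId>,
}

/// Returns the blob ID to resume after, or `None` to rescan from the start.
/// The checkpoint is kept only when the owned-shards set is unchanged.
fn resume_cursor(
    persisted: Option<&NodeRecoveryProgress>,
    current_shards: &BTreeSet<ShardIndex>,
) -> Option<BlobId> {
    match persisted {
        Some(p) if p.owned_shards == *current_shards => p.last_settled_blob_id,
        // Missing checkpoint or a different shard set: discard and rescan.
        _ => None,
    }
}
```

Comparing whole sets rather than checking for a subset matters here: a shard that was removed and later regained appears in both sets only if the checkpoint was rewritten in between, so any difference is treated as a reason to start over.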

Refactor blob_sync_handler.start_sync to return a watch::Receiver<SyncStatus> instead of an Arc-wrapped notification handle, propagating SyncOutcome::{Success, Cancelled, Skipped} to callers. This is multi-consumer safe by construction and incidentally fixes a latent notify_one bug where a second waiter on the same blob would block forever. With real outcomes available, the recovery loop no longer needs the outer "re-scan to verify" pass (TODO(WAL-669)).
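
To illustrate why a stored outcome is multi-consumer safe while a single wake-up is not, here is a std-only analogue. It is not the tokio `watch` API used in this PR: `OutcomeSlot` and `wait_from_two_consumers` are hypothetical names, and the slot is a `Mutex` plus `Condvar` standing in for the channel. The point is that waiters read a retained value, so it does not matter how many there are or when they arrive.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncOutcome {
    Success,
    Cancelled,
    Skipped,
}

/// Shared slot: `None` while the sync runs, `Some(outcome)` once it finishes.
#[derive(Clone)]
struct OutcomeSlot(Arc<(Mutex<Option<SyncOutcome>>, Condvar)>);

impl OutcomeSlot {
    fn new() -> Self {
        OutcomeSlot(Arc::new((Mutex::new(None), Condvar::new())))
    }

    /// Store the outcome and wake every waiter, not just one.
    fn publish(&self, outcome: SyncOutcome) {
        let (lock, cvar) = &*self.0;
        *lock.lock().unwrap() = Some(outcome);
        cvar.notify_all();
    }

    /// Block until an outcome is stored; returns at once if it already is.
    fn wait(&self) -> SyncOutcome {
        let (lock, cvar) = &*self.0;
        let guard = cvar
            .wait_while(lock.lock().unwrap(), |slot| slot.is_none())
            .unwrap();
        (*guard).expect("wait_while returns only once an outcome is stored")
    }
}

/// Two consumers wait on the same sync; both observe the same outcome.
fn wait_from_two_consumers(outcome: SyncOutcome) -> (SyncOutcome, SyncOutcome) {
    let slot = OutcomeSlot::new();
    let (a, b) = (slot.clone(), slot.clone());
    let first = thread::spawn(move || a.wait());
    let second = thread::spawn(move || b.wait());
    slot.publish(outcome);
    (first.join().unwrap(), second.join().unwrap())
}
```

With a one-shot wake-up and no stored value, the second waiter in `wait_from_two_consumers` would have nothing to observe after the first consumed the notification, which matches the blocked-forever behavior described above.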

Cancelled outcomes that leave a blob still certified and unstored are queued for retry with exponential backoff (1s -> 30s cap). The settled cursor never advances past any in-flight or retry-pending blob_id.
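
A minimal sketch of the retry delay, assuming the backoff doubles on each attempt. The description only states the 1s start and the 30s cap, so the doubling factor, the constant names, and the `retry_backoff` helper are assumptions made for illustration.

```rust
use std::time::Duration;

// Bounds taken from the description; the names are hypothetical.
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Delay before retry number `attempt` (0-based).
/// Assumes doubling per attempt, clamped to `MAX_BACKOFF`.
fn retry_backoff(attempt: u32) -> Duration {
    // Saturate rather than overflow for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    INITIAL_BACKOFF.saturating_mul(factor).min(MAX_BACKOFF)
}
```

Under these assumptions the delays run 1s, 2s, 4s, 8s, 16s, and then stay at 30s for every later attempt.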

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item)

  • Storage node: full node recovery now checkpoints its progress, so a restart resumes from the last settled blob instead of rescanning every certified blob.
  • Aggregator:
  • Publisher:
  • CLI:

@github-actions

Contributor

Warning: This PR modifies one of the example config files. Please consider the
following:

  • Make sure the changes are backwards compatible with the current configuration.
  • Make sure any added parameters follow the conventions of the existing parameters; in
    particular, durations should take seconds or milliseconds using the naming convention
    _secs or _millis, respectively.
  • If there are added optional parameter sections, it should be possible to specify them
    partially. A useful pattern there is to implement Default for the struct and derive
    #[serde(default)] on it, see BlobRecoveryConfig as an example.
  • You may need to update the documentation to reflect the changes.

@halfprice halfprice force-pushed the zhewu/node_recovery_checkpoint branch from 7aae512 to f456a66 Compare May 27, 2026 23:59
Previously start_node_recovery rescanned the certified-blob iterator from
the first blob_id on every invocation and on every outer-loop pass, so a
crash mid-recovery lost all progress and a successful first pass still
re-read every blob to verify storage.

Persist a NodeRecoveryProgress {epoch, owned_shards, last_settled_blob_id}
checkpoint in a new RocksDB column family. On resume, keep the checkpoint
iff the current owned-shards set equals the persisted one — otherwise the
"stored at all shards" assertion may no longer hold (new shards are empty
and removed-then-regained shards have been wiped).

Replace the outer "re-scan to verify" loop (TODO(WAL-669)) with a single
pass that drives blob syncs concurrently and trusts the SyncOutcome
returned via watch::Receiver<SyncStatus> (introduced in #3412). Cancelled
outcomes that leave a blob still certified and unstored are queued for
retry with exponential backoff (1s -> 30s cap). The settled cursor never
advances past any in-flight or retry-pending blob_id.
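
The cursor invariant in the last sentence of the commit message can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `CursorTracker` and its methods are hypothetical, `BlobId` is simplified to `u64`, and blobs are assumed to be dispatched in ascending blob_id order, as an ordered iterator would yield them.

```rust
use std::collections::BTreeSet;

// Hypothetical simplification; real blob IDs are not integers.
type BlobId = u64;

#[derive(Default)]
struct CursorTracker {
    /// Every blob ID handed to a sync so far.
    dispatched: BTreeSet<BlobId>,
    /// Blob IDs that are in flight or waiting for a retry.
    unsettled: BTreeSet<BlobId>,
}

impl CursorTracker {
    fn dispatch(&mut self, id: BlobId) {
        self.dispatched.insert(id);
        self.unsettled.insert(id);
    }

    /// Call once a blob finishes with a final outcome (no retry pending).
    fn settle(&mut self, id: BlobId) {
        self.unsettled.remove(&id);
    }

    /// Largest dispatched ID strictly below the oldest unsettled one,
    /// or the largest dispatched ID when nothing is unsettled.
    fn settled_cursor(&self) -> Option<BlobId> {
        match self.unsettled.iter().next() {
            Some(&oldest) => self.dispatched.range(..oldest).next_back().copied(),
            None => self.dispatched.iter().next_back().copied(),
        }
    }
}
```

Persisting `settled_cursor()` instead of the highest completed ID means a crash can only cause some already-stored blobs to be rechecked, never cause an unfinished one to be skipped.
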
@halfprice halfprice force-pushed the zhewu/node_recovery_checkpoint branch from f456a66 to 2861256 Compare June 3, 2026 06:29