feat: checkpoint node recovery progress to skip rescans on restart by halfprice · Pull Request #3407 · MystenLabs/walrus

halfprice · 2026-05-23T05:31:32Z

Description

Previously start_node_recovery rescanned the certified-blob iterator from the first blob_id on every invocation and on every outer-loop pass, so a crash mid-recovery lost all progress and a successful first pass still re-read every blob to verify storage.

Persist a NodeRecoveryProgress {epoch, owned_shards, last_settled_blob_id} checkpoint in a new RocksDB column family. On resume, keep the checkpoint iff the current owned-shards set equals the persisted one — otherwise the "stored at all shards" assertion may no longer hold (new shards are empty and removed-then-regained shards have been wiped).
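The resume rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: the type aliases, the `Option` on the cursor, and the `resume_cursor` helper are assumptions made for the example. The description does not say how `epoch` enters the decision, so the sketch only carries it.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins for the real Walrus types.
type Epoch = u32;
type ShardIndex = u16;
type BlobId = [u8; 32];

/// Field names follow the PR description; the concrete field types are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeRecoveryProgress {
    pub epoch: Epoch,
    pub owned_shards: BTreeSet<ShardIndex>,
    /// `None` means nothing has been settled yet (an assumption of this sketch).
    pub last_settled_blob_id: Option<BlobId>,
}

/// Returns the blob ID to resume after, or `None` to scan from the first blob.
///
/// The checkpoint is kept only if the persisted shard set equals the current
/// one. Any difference (shards gained or lost) discards it, because "stored at
/// all shards" may no longer hold for blobs before the cursor.
pub fn resume_cursor(
    persisted: Option<&NodeRecoveryProgress>,
    current_shards: &BTreeSet<ShardIndex>,
) -> Option<BlobId> {
    match persisted {
        Some(progress) if &progress.owned_shards == current_shards => {
            progress.last_settled_blob_id
        }
        // No checkpoint, or the owned-shards set changed: start over.
        _ => None,
    }
}
```

Comparing whole sets, rather than checking that the old set is a subset of the new one, is what makes the rule safe for both newly assigned shards and removed-then-regained shards.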

Refactor blob_sync_handler.start_sync to return watch::Receiver instead of Arc, propagating SyncOutcome::{Success, Cancelled, Skipped} to callers. This is multi-consumer safe by construction and incidentally fixes a latent notify_one bug where a second waiter on the same blob would block forever. With real outcomes available, the recovery loop no longer needs the outer "re-scan to verify" pass (TODO(WAL-669)).
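To illustrate why a stored outcome is multi-consumer safe while a one-shot wakeup is not, here is a std-only stand-in. It is not tokio's `watch` channel and not the code in this PR: `OutcomeCell` and `outcomes_seen_by_waiters` are hypothetical names, and only the `SyncOutcome` variants come from the description. The point is that the outcome is stored rather than merely signalled, so every waiter observes it, including one that arrives after the sync has finished.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Variants as named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
    Success,
    Cancelled,
    Skipped,
}

/// Hypothetical std-only stand-in for a watch-style channel.
#[derive(Default)]
pub struct OutcomeCell {
    outcome: Mutex<Option<SyncOutcome>>,
    changed: Condvar,
}

impl OutcomeCell {
    /// Stores the outcome and wakes every current waiter.
    pub fn publish(&self, outcome: SyncOutcome) {
        *self.outcome.lock().unwrap() = Some(outcome);
        self.changed.notify_all();
    }

    /// Blocks until an outcome is available, then returns it.
    pub fn wait(&self) -> SyncOutcome {
        let mut guard = self.outcome.lock().unwrap();
        loop {
            if let Some(outcome) = *guard {
                return outcome;
            }
            guard = self.changed.wait(guard).unwrap();
        }
    }
}

/// Spawns `waiters` threads on one cell, publishes once, and collects what
/// each waiter saw, plus one extra read from a waiter that arrives late.
pub fn outcomes_seen_by_waiters(waiters: usize, outcome: SyncOutcome) -> Vec<SyncOutcome> {
    let cell = Arc::new(OutcomeCell::default());
    let handles: Vec<_> = (0..waiters)
        .map(|_| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || cell.wait())
        })
        .collect();
    cell.publish(outcome);
    let mut seen: Vec<SyncOutcome> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    // A late waiter still reads the stored value instead of blocking.
    seen.push(cell.wait());
    seen
}
```

With two waiters and a `Success` outcome, all three reads return `Success`; none of them depends on being the single task that a one-shot notification happened to wake.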

Cancelled outcomes that leave a blob still certified and unstored are queued for retry with exponential backoff (1s -> 30s cap). The settled cursor never advances past any in-flight or retry-pending blob_id.
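The retry delay and the cursor invariant can be sketched as below. This is an illustration under stated assumptions, not the implementation in this PR: the doubling factor is assumed (the description only gives the 1s start and the 30s cap), `BlobId` is simplified to `u64`, and `retry_delay` and `settled_cursor` are hypothetical helper names.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

// Hypothetical simplification; real blob IDs are not integers.
type BlobId = u64;

const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at 30s.
/// Doubling per attempt is an assumption of this sketch.
pub fn retry_delay(attempt: u32) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    INITIAL_BACKOFF
        .checked_mul(factor)
        .map_or(MAX_BACKOFF, |delay| delay.min(MAX_BACKOFF))
}

/// Returns the furthest blob ID that is safe to persist as the settled cursor.
///
/// `scanned` lists blob IDs in iterator order; `unsettled` holds the IDs that
/// are still in flight or waiting for a retry. The cursor stops just before
/// the first unsettled ID, so a restart revisits every blob that was not done.
pub fn settled_cursor(scanned: &[BlobId], unsettled: &BTreeSet<BlobId>) -> Option<BlobId> {
    scanned
        .iter()
        .copied()
        .take_while(|id| !unsettled.contains(id))
        .last()
}
```

For example, with scanned IDs `[10, 20, 30, 40]` and blob `30` pending retry, the cursor stays at `20` even though `40` has already completed; it moves to `40` only once `30` settles.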

Test plan

How did you test the new or updated feature?

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item)

Storage node: full node recovery now checkpoints its progress, so a restarted node resumes from the last checkpoint instead of rescanning every certified blob.
Aggregator:
Publisher:
CLI:

github-actions · 2026-05-23T05:31:59Z

Warning: This PR modifies one of the example config files. Please consider the
following:

Make sure the changes are backwards compatible with the current configuration.
Make sure any added parameters follow the conventions of the existing parameters; in
particular, durations should take seconds or milliseconds using the naming convention
_secs or _millis, respectively.
If there are added optional parameter sections, it should be possible to specify them
partially. A useful pattern there is to implement Default for the struct and derive
#[serde(default)] on it, see BlobRecoveryConfig as an example.
You may need to update the documentation to reflect the changes.

Previously start_node_recovery rescanned the certified-blob iterator from the first blob_id on every invocation and on every outer-loop pass, so a crash mid-recovery lost all progress and a successful first pass still re-read every blob to verify storage.

Persist a NodeRecoveryProgress {epoch, owned_shards, last_settled_blob_id} checkpoint in a new RocksDB column family. On resume, keep the checkpoint iff the current owned-shards set equals the persisted one; otherwise the "stored at all shards" assertion may no longer hold (new shards are empty and removed-then-regained shards have been wiped).

Replace the outer "re-scan to verify" loop (TODO(WAL-669)) with a single pass that drives blob syncs concurrently and trusts the SyncOutcome returned via watch::Receiver<SyncStatus> (introduced in #3412).

Cancelled outcomes that leave a blob still certified and unstored are queued for retry with exponential backoff (1s -> 30s cap). The settled cursor never advances past any in-flight or retry-pending blob_id.

halfprice force-pushed the zhewu/node_recovery_checkpoint branch from 7aae512 to f456a66 Compare May 27, 2026 23:59

halfprice force-pushed the zhewu/node_recovery_checkpoint branch from f456a66 to 2861256 Compare June 3, 2026 06:29

halfprice wants to merge 1 commit into main from zhewu/node_recovery_checkpoint
