feat: checkpoint node recovery progress to skip rescans on restart#3407
Open
halfprice wants to merge 1 commit into
Open
feat: checkpoint node recovery progress to skip rescans on restart#3407halfprice wants to merge 1 commit into
halfprice wants to merge 1 commit into
Conversation
Contributor
|
Warning: This PR modifies one of the example config files. Please consider the
|
7aae512 to
f456a66
Compare
Previously start_node_recovery rescanned the certified-blob iterator from
the first blob_id on every invocation and on every outer-loop pass, so a
crash mid-recovery lost all progress and a successful first pass still
re-read every blob to verify storage.
Persist a NodeRecoveryProgress {epoch, owned_shards, last_settled_blob_id}
checkpoint in a new RocksDB column family. On resume, keep the checkpoint
iff the current owned-shards set equals the persisted one — otherwise the
"stored at all shards" assertion may no longer hold (new shards are empty
and removed-then-regained shards have been wiped).
Replace the outer "re-scan to verify" loop (TODO(WAL-669)) with a single
pass that drives blob syncs concurrently and trusts the SyncOutcome
returned via watch::Receiver<SyncStatus> (introduced in #3412). Cancelled
outcomes that leave a blob still certified and unstored are queued for
retry with exponential backoff (1s -> 30s cap). The settled cursor never
advances past any in-flight or retry-pending blob_id.
f456a66 to
2861256
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Previously start_node_recovery rescanned the certified-blob iterator from the first blob_id on every invocation and on every outer-loop pass, so a crash mid-recovery lost all progress and a successful first pass still re-read every blob to verify storage.
Persist a NodeRecoveryProgress {epoch, owned_shards, last_settled_blob_id} checkpoint in a new RocksDB column family. On resume, keep the checkpoint iff the current owned-shards set equals the persisted one — otherwise the "stored at all shards" assertion may no longer hold (new shards are empty and removed-then-regained shards have been wiped).
Refactor blob_sync_handler.start_sync to return watch::Receiver instead of Arc, propagating SyncOutcome::{Success, Cancelled, Skipped} to callers. This is multi-consumer safe by construction and incidentally fixes a latent notify_one bug where a second waiter on the same blob would block forever. With real outcomes available, the recovery loop no longer needs the outer "re-scan to verify" pass (TODO(WAL-669)).
Cancelled outcomes that leave a blob still certified and unstored are queued for retry with exponential backoff (1s -> 30s cap). The settled cursor never advances past any in-flight or retry-pending blob_id.
Test plan
How did you test the new or updated feature?
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item)