
Do not wait for startup verification before calling shrink in ABS#6402

Merged
brooksprumo merged 1 commit into anza-xyz:master from brooksprumo:abs/unconditional-shrink
Jun 4, 2025

Conversation

@brooksprumo

@brooksprumo brooksprumo commented Jun 4, 2025

Problem

AccountsBackgroundService (ABS) conditionally calls shrink based on whether startup verification is complete, but this is no longer necessary.

Originally, we didn't want to shrink in ABS until startup verification was complete because we needed to ensure the startup accounts hash calculation could get the right storages. If shrink ran before the accounts hash calculation, accounts could be removed, which could cause startup verification to erroneously fail by calculating a different hash. Here's the PR where we added the conditional around shrink: solana-labs#34209 (it includes the reasoning for why this implementation was chosen over getting the storages early and passing them into verify_accounts_hash()).

Now, startup verification grabs the storages before verification even begins, so there is no risk of those storages getting removed/changed. This means we can unconditionally shrink in ABS.

Summary of Changes

Unconditionally shrink in ABS.
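The change itself is just removing the gate. A minimal sketch of the before/after shape of the ABS loop; the type and field names here are illustrative stand-ins, not the actual agave definitions:

```rust
use std::sync::{
    atomic::{AtomicBool, AtomicUsize, Ordering},
    Arc,
};

// Illustrative stand-in for AccountsDb; not the actual agave type.
struct Db {
    startup_verification_complete: Arc<AtomicBool>,
    shrink_calls: AtomicUsize,
}

impl Db {
    fn shrink_candidate_slots(&self) {
        self.shrink_calls.fetch_add(1, Ordering::Relaxed);
    }
}

// Before this PR: shrink only ran once startup verification had finished.
fn abs_iteration_before(db: &Db) {
    if db.startup_verification_complete.load(Ordering::Acquire) {
        db.shrink_candidate_slots();
    }
}

// After this PR: shrink runs unconditionally, since verification already
// holds its own handles to the storages it needs.
fn abs_iteration_after(db: &Db) {
    db.shrink_candidate_slots();
}
```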

Justification

We load the blockstore (i.e. create the bank/bank forks) before creating AccountsBackgroundService (the top of this snippet is where we create bank forks, and the bottom is where we create AccountsBackgroundService):

agave/core/src/validator.rs

Lines 785 to 925 in 6fcda9b

let (
    bank_forks,
    blockstore,
    original_blockstore_root,
    ledger_signal_receiver,
    leader_schedule_cache,
    starting_snapshot_hashes,
    TransactionHistoryServices {
        transaction_status_sender,
        transaction_status_service,
        max_complete_transaction_status_slot,
    },
    blockstore_process_options,
    blockstore_root_scan,
    pruned_banks_receiver,
    entry_notifier_service,
) = load_blockstore(
    config,
    ledger_path,
    &genesis_config,
    exit.clone(),
    &start_progress,
    accounts_update_notifier,
    transaction_notifier,
    entry_notifier,
)
.map_err(ValidatorError::Other)?;

if !config.no_poh_speed_test {
    check_poh_speed(&bank_forks.read().unwrap().root_bank(), None)?;
}

let (root_slot, hard_forks) = {
    let root_bank = bank_forks.read().unwrap().root_bank();
    (root_bank.slot(), root_bank.hard_forks())
};
let shred_version = compute_shred_version(&genesis_config.hash(), Some(&hard_forks));
info!(
    "shred version: {shred_version}, hard forks: {:?}",
    hard_forks
);

if let Some(expected_shred_version) = config.expected_shred_version {
    if expected_shred_version != shred_version {
        return Err(ValidatorError::ShredVersionMismatch {
            actual: shred_version,
            expected: expected_shred_version,
        }
        .into());
    }
}

if let Some(start_slot) = should_cleanup_blockstore_incorrect_shred_versions(
    config,
    &blockstore,
    root_slot,
    &hard_forks,
)? {
    *start_progress.write().unwrap() = ValidatorStartProgress::CleaningBlockStore;
    cleanup_blockstore_incorrect_shred_versions(
        &blockstore,
        config,
        start_slot,
        shred_version,
    )?;
} else {
    info!("Skipping the blockstore check for shreds with incorrect version");
}

node.info.set_shred_version(shred_version);
node.info.set_wallclock(timestamp());
Self::print_node_info(&node);

let mut cluster_info = ClusterInfo::new(
    node.info.clone(),
    identity_keypair.clone(),
    socket_addr_space,
);
cluster_info.set_contact_debug_interval(config.contact_debug_interval);
cluster_info.set_entrypoints(cluster_entrypoints);
cluster_info.restore_contact_info(ledger_path, config.contact_save_interval);
let cluster_info = Arc::new(cluster_info);

assert!(is_snapshot_config_valid(&config.snapshot_config));

let (snapshot_request_sender, snapshot_request_receiver) = unbounded();
let snapshot_controller = Arc::new(SnapshotController::new(
    snapshot_request_sender.clone(),
    config.snapshot_config.clone(),
    bank_forks.read().unwrap().root(),
));

let pending_snapshot_packages = Arc::new(Mutex::new(PendingSnapshotPackages::default()));
let snapshot_packager_service = if snapshot_controller
    .snapshot_config()
    .should_generate_snapshots()
{
    let exit_backpressure = config
        .validator_exit_backpressure
        .get(SnapshotPackagerService::NAME)
        .cloned();
    let enable_gossip_push = true;
    let snapshot_packager_service = SnapshotPackagerService::new(
        pending_snapshot_packages.clone(),
        starting_snapshot_hashes,
        exit.clone(),
        exit_backpressure,
        cluster_info.clone(),
        snapshot_controller.clone(),
        enable_gossip_push,
    );
    Some(snapshot_packager_service)
} else {
    None
};

let (accounts_package_sender, accounts_package_receiver) = crossbeam_channel::unbounded();
let accounts_hash_verifier = AccountsHashVerifier::new(
    accounts_package_sender.clone(),
    accounts_package_receiver,
    pending_snapshot_packages,
    exit.clone(),
    snapshot_controller.clone(),
);

let snapshot_request_handler = SnapshotRequestHandler {
    snapshot_controller: snapshot_controller.clone(),
    snapshot_request_receiver,
    accounts_package_sender,
};
let pruned_banks_request_handler = PrunedBanksRequestHandler {
    pruned_banks_receiver,
};
let accounts_background_service = AccountsBackgroundService::new(
    bank_forks.clone(),
    exit.clone(),
    AbsRequestHandlers {
        snapshot_request_handler,
        pruned_banks_request_handler,
    },
    config.accounts_db_test_hash_calculation,
);

After loading the bank, we run verification. In this snippet we see that accounts verification is the first thing called:

agave/runtime/src/bank.rs

Lines 5993 to 6037 in 6fcda9b

/// A snapshot bank should be purged of 0 lamport accounts which are not part of the hash
/// calculation and could shield other real accounts.
pub fn verify_snapshot_bank(
    &self,
    test_hash_calculation: bool,
    skip_shrink: bool,
    force_clean: bool,
    latest_full_snapshot_slot: Slot,
    base: Option<(Slot, /*capitalization*/ u64)>,
    duplicates_lt_hash: Option<Box<DuplicatesLtHash>>,
) -> bool {
    // If we verify the accounts using the lattice-based hash *and* with storages (as opposed
    // to the index), then we rely on the DuplicatesLtHash as given by generate_index(). Since
    // the duplicates are based on a specific set of storages, we must use the exact same
    // storages to do the lattice-based accounts verification. This means we must wait to
    // clean/shrink until *after* we've gotten Arcs to the storages (this prevents their
    // untimely removal). Simply, we call `verify_accounts_hash()` before we call `clean` or
    // `shrink`.
    let (verified_accounts, verify_accounts_time_us) = measure_us!({
        let should_verify_accounts = !self.rc.accounts.accounts_db.skip_initial_hash_calc;
        if should_verify_accounts {
            info!("Verifying accounts...");
            let verified = self.verify_accounts_hash(
                base,
                VerifyAccountsHashConfig {
                    test_hash_calculation,
                    ignore_mismatch: false,
                    require_rooted_bank: false,
                    run_in_background: true,
                    store_hash_raw_data_for_debug: false,
                },
                duplicates_lt_hash,
            );
            info!("Verifying accounts... In background.");
            verified
        } else {
            info!("Verifying accounts... Skipped.");
            self.rc
                .accounts
                .accounts_db
                .verify_accounts_hash_in_bg
                .verification_complete();
            true
        }
    });
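Note the `run_in_background: true` in the snippet above: verification continues on a background thread while the validator keeps starting up (including starting ABS). What keeps this safe is that the background task owns its own handles to the storages it was given. A toy model of that ownership transfer (the names and the "hash" are hypothetical, not agave's):

```rust
use std::sync::Arc;
use std::thread;

// Toy model of "verify in background": the spawned thread owns clones of
// the storage Arcs, so foreground clean/shrink cannot invalidate them.
type Storage = Vec<u8>;

fn verify_accounts_in_background(storages: Vec<Arc<Storage>>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        // Stand-in "hash": sum of all bytes across the captured storages.
        storages
            .iter()
            .flat_map(|s| s.iter())
            .map(|&b| b as u64)
            .sum()
    })
}
```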

Inside accounts verification we get the snapshot storages (added in #1202, which is the "proper" fix as described in solana-labs#34209):

agave/runtime/src/bank.rs

Lines 5518 to 5521 in 6fcda9b

// The snapshot storages must be captured *before* starting the background verification.
// Otherwise, it is possible that a delayed call to `get_snapshot_storages()` will *not*
// get the correct storages required to calculate and verify the accounts hashes.
let snapshot_storages = self.rc.accounts.accounts_db.get_storages(RangeFull);

These snippets show that we hold the storages required for accounts verification before ABS can call shrink, thus rendering this change safe.
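The safety argument ultimately rests on `Arc` semantics: once verification clones the storage handles, a later clean/shrink can drop those storages from AccountsDb's own collection without invalidating the verifier's copies. A toy model of this (plain `Vec` instead of agave's storage map; names are illustrative):

```rust
use std::sync::{Arc, Mutex};

// Illustrative model: storages live in a shared collection; shrink may
// remove entries from it at any time.
type Storage = Vec<u8>;
type StorageSet = Mutex<Vec<Arc<Storage>>>;

// Verification clones the Arcs up front, so even if shrink later removes a
// storage from the set, the verifier's copy stays alive.
fn capture_storages(set: &StorageSet) -> Vec<Arc<Storage>> {
    set.lock().unwrap().iter().cloned().collect()
}

// Stand-in for clean/shrink dropping storages from the shared collection.
fn shrink_all(set: &StorageSet) {
    set.lock().unwrap().clear();
}
```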

@brooksprumo brooksprumo self-assigned this Jun 4, 2025
@codecov-commenter

codecov-commenter commented Jun 4, 2025

Codecov Report

Attention: Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.

Project coverage is 82.8%. Comparing base (994c56a) to head (843e902).

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #6402     +/-   ##
=========================================
- Coverage    82.8%    82.8%   -0.1%     
=========================================
  Files         848      848             
  Lines      379488   379481      -7     
=========================================
- Hits       314544   314509     -35     
- Misses      64944    64972     +28     

@brooksprumo brooksprumo marked this pull request as ready for review June 4, 2025 12:48
@jeffwashington

is this a simplifying change or does it unblock something?
We just want to make sure we don't introduce a race condition or inconsistent behavior at startup that will pop up from time to time. Did we refactor the startup verification to grab the storages first at some point? i.e., was it previously necessary to wait until it was complete? Maybe this is related to the index-based hash calc from long ago? Just trying to think through any connections or reasoning and remember/imagine/understand what changed.

@brooksprumo
Author

brooksprumo commented Jun 4, 2025

is this a simplifying change or does it unblock something?

Currently it's just for simplifying. But overall it is meant to make it easier to reason about this ABS loop, which makes it easier to prove we've fixed #6295.

We just want to make sure we don't introduce a race condition/inconsistent behavior at startup that will pop up from time to time.

100%

Did we refactor the startup verification to grab the storages first at some point? ie. was this previously necessary to wait until it was complete?

Yes. Please see the "Justification" section in the PR description, which shows where we grab the storages for startup verification. We didn't previously do that, which is why we needed to gate calling shrink in ABS. (Edit: I also added more info to the PR description itself, with context/background on when we added the gate logic to ABS.)

Maybe this is related to index based hash calc from long ago? Just trying to think through any connections or reasoning and remember/imagine/understand what changed.

Yes, thank you. I want to make sure I don't forget anything from the past either! One reason I wanted your review on this one too.

jeffwashington
jeffwashington previously approved these changes Jun 4, 2025

@jeffwashington jeffwashington left a comment


lgtm

HaoranYi
HaoranYi previously approved these changes Jun 4, 2025

@HaoranYi HaoranYi left a comment


looks correct to me.

@brooksprumo brooksprumo dismissed stale reviews from HaoranYi and jeffwashington via 7f1b574 June 4, 2025 16:57
@brooksprumo brooksprumo force-pushed the abs/unconditional-shrink branch from 588426f to 7f1b574 Compare June 4, 2025 16:57
@brooksprumo
Author

Had to resolve merge conflicts due to #6396, hence the force-push. No code was changed.

@brooksprumo
Author

brooksprumo commented Jun 4, 2025

Master is broken, so CI is failing. Will need to rebase again once a fix is merged.

Edit: PR #6412 is the fix.

@brooksprumo brooksprumo force-pushed the abs/unconditional-shrink branch from 7f1b574 to 843e902 Compare June 4, 2025 18:33

@jeffwashington jeffwashington left a comment


lgtm

@brooksprumo brooksprumo merged commit 75ad0c8 into anza-xyz:master Jun 4, 2025
47 checks passed
@brooksprumo brooksprumo deleted the abs/unconditional-shrink branch June 4, 2025 19:35