-
Notifications
You must be signed in to change notification settings - Fork 311
Debugging Consensus Failures
This page is meant to provide some steps to identify and narrow down the cause of consensus failures on nodes. The high level steps include:
- Finding the slot where consensus broke
- Gathering the necessary artifacts to reproduce the issue locally
- Testing with both "good" (consensus matches cluster) and "bad" code versions
- Diffing good/bad output to identify accounts and/or transactions where consensus broke
- Inspiration for root causing beyond above step
As of this change, nodes explicitly panic when they realize they have diverged from consensus.
- Prior to the change, nodes would repeatedly purge/repair/retry a slot that it computed the incorrect hash for
- With the change, the number of retries is limited to 10 before the node panics
This section outlines how to find FAIL_SLOT
; the rest of this wiki will reference this slot as such.
The panic message mentioned above includes a slot number and thus makes finding the divergent slot trivial; just grep for panic
.
One important caveat to note though:
- Suppose a node is running with a restart mechanism, such as auto-restart in systemd
- On process restarts, the node will replay the block that it previously diverged during local ledger replay at startup
- Local replay mode means that
ReplayStage
is NOT yet running, and replay results are NOT compared against the cluster
- Local replay mode means that
- The node will not be aware that it replayed a block in "local mode" incorrectly
- The node replaying the previous block incorrectly will mean a parent bank hash is incorrect, which will manifest as some subsequent slot producing the incorrect hash
Thus, it is important to find the first instance of the panic message in logs.
Even though the newer releases identify the divergent slot for us, the details noted below are still valuable for general understanding / debugging.
Details
Supposing that panic didn't exist, finding the latest root is probably a good starting point:
[TIMESTAMP INFO solana_core::replay_stage] new root {}
With the last new root known, search the surrounding "area" (both before and after - often a tower distance away) for the following log statements:
[TIMESTAMP WARN solana_core::cluster_slot_state_verifier] Cluster duplicate confirmed slot {} with hash {}, but our version has hash {}
[TIMESTAMP WARN solana_vote_program::vote_state] {} dropped vote slots {:?} failed to match hash {} {}
As mentioned, a node that has experienced a consensus divergence can get into a loop of checking / failing / purging / repairing a slot. Seeing the bank frozen log line for the same slot repeatedly is a quick way to identify this (and further inspection should show the repeated purges/repairs as well):
[TIMESTAMP INFO solana_runtime::bank] bank frozen: {} hash: {} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}
Bank hashes are computed here. As of writing this, the bank hash is a "combined" hash of the slots's:
- parent bank hash
- account delta hash
- signature count
- blockhash As such, any of these items being off will cause the bank hash to be wrong.
Once the offending slot is known, the first step is collecting a ledger to reliably and quickly reproduce the issue. This process can be a little complicated in itself, and more details on this can be found TODO. Namely, a ledger with the following is needed to replay:
- A snapshot at slot
SNAPSHOT_SLOT
whereSNAPSHOT_SLOT
<FAIL_SLOT
.- This is the file of form
snapshot-SNAPSHOT_SLOT-HASH.tar.zst
- Creating a snapshot at
FAIL_SLOT - 1
will minimize time spent replaying additional times. You only need to create a new snapshot once, but you might replay from that snapshot numerous times.
- This is the file of form
- The shreds (transaction data) for slots [
SNAPSHOT_SLOT
,FAIL_SLOT
].- This is the data contained in
rocksdb
directory. - For good measure and if available, grab the shreds for slots
SNAPSHOT_SLOT - 1
andFAIL_SLOT + 1
as well - Shreds for a range of slots can be collected using the
solana-ledger-tool copy
command
- This is the data contained in
- The cluster specific
genesis.bin
file.- Running
solana-validator
will sync this so this file can be copied from the node that break was found on.
- Running
With a ledger in-hand, solana-ledger-tool verify
can be used to replay FAIL_SLOT
with the following command:
$ solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT
Note that replaying a ledger requires large amount of system resources. I.e. you may have problems with your personal laptop and will have more success with a proper validator setup.
Replaying will log bank hashes; the log statement to look for is:
[TIMESTAMP INFO solana_runtime::bank] bank frozen: SLOT hash: BANK_HASH accounts_delta: ACCOUNT_HASH signature_count: M last_blockhash: BLOCK_HASH capitalization: N
Perform the following steps to confirm that the correct & incorrect behavior can be reproduced locally:
- Backup the
rocksdb
folder somewhere just incase. - Using the command above, replay with the cluster majority software version and confirm the correct bank hash is calculated for
FAIL_SLOT
. - Using the command above, replay with the version that broke and confirm the incorrect bank hash is calculated for
FAIL_SLOT
.
Looking at the updated accounts for correct vs. incorrect version and noting any differences is a valuable step in debugging. With the differing account(s) known, the divergent behavior can typically be narrowed down to a single transaction.
With #32632, a node creates a "debug" file at LEDGER_DIR/bank_hash_details/
when it hits the aforementioned panic. The file contains details about every account that was updated in the slot. solana-ledger-tool
can also be used to create this file by passing --halt-at-slot FAIL_SLOT --write-bank-file
flags. The created file is pretty-printed json, so it is human readable and can be easily diffed.
Again, software modifications have minimized the manual steps, but these details retained for the sake of general understanding.
Details
The following patch and instructions can be applied to print the account hash of every account that was updated for a given account. If we print all of the correct and incorrect hashes, diffing these two lists (which are sorted by pubkey) will allow us to determine which account(s) contributed to the divergent bank hash.
Here is the patch:
diff --git a/runtime/src/accounts_db.rs b/runtime/src/accounts_db.rs
index 4893f3104c..34a9f02d33 100644
--- a/runtime/src/accounts_db.rs
+++ b/runtime/src/accounts_db.rs
@@ -5937,7 +5937,7 @@ impl AccountsDb {
hashes.retain(|(pubkey, _hash)| !self.is_filler_account(pubkey));
}
- let ret = AccountsHash::accumulate_account_hashes(hashes);
+ let ret = AccountsHash::accumulate_account_hashes(slot, hashes);
accumulate.stop();
let mut uncleaned_time = Measure::start("uncleaned_index");
self.uncleaned_pubkeys.insert(slot, dirty_keys);
diff --git a/runtime/src/accounts_hash.rs b/runtime/src/accounts_hash.rs
index 54aaa5c475..252dd44833 100644
--- a/runtime/src/accounts_hash.rs
+++ b/runtime/src/accounts_hash.rs
@@ -3,6 +3,7 @@ use {
rayon::prelude::*,
solana_measure::measure::Measure,
solana_sdk::{
+ clock::Slot,
hash::{Hash, Hasher},
pubkey::Pubkey,
},
@@ -487,9 +488,13 @@ impl AccountsHash {
)
}
- pub fn accumulate_account_hashes(mut hashes: Vec<(Pubkey, Hash)>) -> Hash {
+ pub fn accumulate_account_hashes(slot: Slot, mut hashes: Vec<(Pubkey, Hash)>) -> Hash {
Self::sort_hashes_by_pubkey(&mut hashes);
-
+ if slot == FAIL_SLOT {
+ for hash in hashes.iter() {
+ warn!("Updated Account: {:?}", hash);
+ }
+ }
Self::compute_merkle_root_loop(hashes, MERKLE_FANOUT, |i| i.1)
}
Perform the following steps to finally narrow down on divergent transaction:
- Apply the above patch - make sure to plug in the actual value of
FAIL_SLOT
. - Replay with both the good and bad software versions using the same command as above; however, make the following tweaks to get more verbose output and write to file (make sure to change the output log filename for good and bad):
$ RUST_LOG=solana=debug solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT 2> {good|bad}_output.log
- Run the following command on each output file to parse out the "Updated Account" statements.
$ grep -o "Updated Account: .*" {good|bad}_output.log > {good|bad}_updated_accounts.txt
- Diff the files from previous step that contain all updated accounts for each software version, and note any differing accounts (the pubkey is first element in tuple). Suppose that
DIFF_PUBKEY
is the / one of the accounts that has different state.
With known differing accounts, detailed information can be printed to stdout (separate from all rust logs which go to stderr). The following patches shows where to plug into to print out specific accounts or transactions.
Patch to print specific account details
diff --git a/runtime/src/bank.rs b/runtime/src/bank.rs
index 50e8aed5fa..a451ba6ed6 100644
--- a/runtime/src/bank.rs
+++ b/runtime/src/bank.rs
@@ -158,6 +158,7 @@ use {
ops::{Deref, Div, RangeInclusive},
path::PathBuf,
rc::Rc,
+ str::FromStr,
sync::{
atomic::{
AtomicBool, AtomicI64, AtomicU64, AtomicUsize,
@@ -6660,6 +6661,10 @@ impl Bank {
/// Hash the `accounts` HashMap. This represents a validator's interpretation
/// of the delta of the ledger since the last vote and up to now
fn hash_internal_state(&self) -> Hash {
+ if self.slot() == FAIL_SLOT {
+ print!("Account details: {:?}", self.get_account_with_fixed_root(&Pubkey::from_str("DIFF_PUBKEY").unwrap()));
+ }
+
// If there are no accounts, return the hash of the previous state and the latest blockhash
let accounts_delta_hash = self
.rc
Patch to print specific transaction details
diff --git a/runtime/src/bank.rs b/runtime/src/bank.rs
index e46faafaae..8cf4427529 100644
--- a/runtime/src/bank.rs
+++ b/runtime/src/bank.rs
@@ -4231,7 +4231,7 @@ impl Bank {
let pre_account_state_info =
self.get_transaction_account_state_info(&transaction_context, tx.message());
- let log_collector = if enable_log_recording {
+ let log_collector = if true {
match log_messages_bytes_limit {
None => Some(LogCollector::new_ref()),
Some(log_messages_bytes_limit) => Some(LogCollector::new_ref_with_limit(Some(
@@ -4318,6 +4317,10 @@ impl Bank {
.ok()
});
+ if format!("{:?}", tx.signature()).contains("[Signature as it appears in the Explorer]") {
+ println!("{}", log_messages.as_ref().unwrap().join("\n"));
+ }
+
let inner_instructions = if enable_cpi_recording {
Some(inner_instructions_list_from_instruction_trace(
&transaction_context,
With this patch in place, run solana-ledger-tool verify
with same arguments to yield the new log statements.
- Run
solana-ledger-tool slot
to display all transaction information for a given slot. Combined with account pubkey(s) that differed between runs before, run the following to see what transaction(s) involve the differing account(s).
$ solana-ledger-tool slot FAIL_SLOT -vv | grep DIFF_PUBKEY
At this point, it should be known which account(s)/transaction(s) caused the consensus break. If the failure can be traced to a specific transaction, the below sections provide some more insight into how one could learn more about why the transaction is differing between two versions. If the failure isn't tied to a specific transaction, it is likely that the affected account is "special" in some way, and digging through the explorer.solana.com and/or the code to see how/when that account is updated is likely necessary.
Look for things that might be "special" about this transaction, such as the slot, involved programs, last deploy date, read/write, etc.
If a transaction had different result between versions, it is very possible that it failed on one version. Log output might contain some more information, such as an error for why the transaction failed. To find this log output:
- Note one of the signatures for the transaction in question from
solana-ledger-tool slot
output. - Search the log output from replay for this signature (log level must at least include debug to get anything useful here):
$ < output.log grep TX_SIGNATURE
Depending on the situation, adding the flags to include lines before and after match might be helpful. For example, if the transaction errored, the following line will give details:
[TIMESTAMP DEBUG solana_runtime::bank] tx error: Err(...
This error will provide a good starting point to search the codebase and determine what is causing logic to fall through to error case.
Given that replaying should only take several minutes if the snapshot is at a slot shortly before FAIL_SLOT
, running a bisect to find the breaking commit is manageable.
- General
- Feature Gates
- Technical
- Policy
- Schedule
- Migration
- Restart Instructions