Hey!
My validator's DB gets corrupted from time to time. I caught it once in real time, when my RAM hit its limit (64 GB + 16 GB swap; I was running some unrelated experiments that started using a ton of RAM).
I suspect the other corruptions were also caused by memory spikes like this, but I can't prove that yet. The spikes likely came from other things running on the machine (such as my hermes relayer, which oddly has moments where it starts using quite a lot of RAM, plus the other nodes I operate).
My other nodes never get corrupted when this happens, though, so I wonder whether the node is, or can be, safeguarded against random usage spikes like these. Have you checked whether a DB corruption occurs when there's a sudden memory spike? And if that is the case, is it possible to implement something that prevents the corruption?
A DB corruption can probably happen in many different ways, but the logs for this issue look similar to those in #86 (thanks for diving deep into the archives, @b0dski!).
Log excerpt:
I[2025-11-19|12:19:46.274] executed block module=state height=4351675 num_valid_txs=0 num_invalid_txs=0
2025-11-19T12:19:46.398914Z INFO namada_node::shell: Committed block hash: 532ac02058d73d22f7ccd36a79b2b9be7043503e88e25951fd7652c1139915a1, height: 4351675
I[2025-11-19|12:19:46.439] committed state module=state height=4351675 num_txs=0 app_hash=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1
I[2025-11-19|12:19:46.440] indexed block events module=txindex height=4351675
I[2025-11-19|12:19:52.030] Timed out module=consensus dur=5.588606414s height=4351676 round=0 step=RoundStepNewHeight
I[2025-11-19|12:19:52.947] received proposal module=consensus proposal="Proposal{4351676/0 (EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687:1:C8D681C2B1D6, -1) 3DEC1ED43EEC @ 2025-11-19T12:19:52.334888612Z}" proposer=0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1
I[2025-11-19|12:19:53.068] received complete proposal block module=consensus height=4351676 hash=EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687
2025-11-19T12:19:53.076745Z INFO namada_node::shell::process_proposal: Received block proposal proposer="0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1" height=4351676 hash="EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687" n_txs=0
I[2025-11-19|12:19:53.572] finalizing commit of block module=consensus height=4351676 hash=EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687 root=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1 num_txs=0
2025-11-19T12:19:53.588833Z INFO namada_node::shell::finalize_block: Block height: 4351676, epoch: 1402, is new epoch: false, is masp new epoch: false.
2025-11-19T12:19:53.785741Z INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 2 - 2, tx cache size 11 - 11
2025-11-19T12:19:53.785750Z INFO namada_node::shell::finalize_block: txs executed: 0
I[2025-11-19|12:19:53.826] executed block module=state height=4351676 num_valid_txs=0 num_invalid_txs=0
The application panicked (crashed).
Message: Encountered a storage error while committing a block: Custom(CustomError(Custom(CustomError(DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4 in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")))))
Location: crates/node/src/shell/mod.rs:823
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-11-19T12:19:53.903389Z INFO namada_node::shims::abcipp_shim: ABCI response channel didn't respond
E[2025-11-19|12:19:53.903] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
The application panicked (crashed).
Message:
I[2025-11-19|12:19:53.903] service stop module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /home/username/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message: flush failed: DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4 in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")
Location: crates/node/src/storage/rocksdb.rs:262
E[2025-11-19|12:19:53.903] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
E[2025-11-19|12:19:53.903] client error during proxyAppConn.CommitSync module=state err="read message: EOF"
E[2025-11-19|12:19:53.903] CONSENSUS FAILURE!!! module=consensus err="failed to apply block; error commit failed for application: read message: EOF" stack="goroutine 829 [running]:\nruntime/debug.Stack()\n\t/opt/hostedtoolcache/go/1.24.7/x64/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:737 +0x46\npanic({0xf8a880?, 0xc005502650?})\n\t/opt/hostedtoolcache/go/1.24.7/x64/src/runtime/panic.go:792 +0x132\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc0003a9c08, 0x4266bc)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1720 +0xec5\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc0003a9c08, 0x4266bc)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1620 +0x2e5\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1555 +0x9c\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc0003a9c08, 0x4266bc, 0x0)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1593 +0xc0f\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc0003a9c08, 0xc0022485a0, {0xc000151380, 0x28})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:2233 +0x17df\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc0003a9c08, 0xc0022485a0, {0xc000151380?, 0x0?})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:2022 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc0003a9c08, {{0x13c5440, 0xc003843cd0}, {0xc000151380, 0x28}})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:866 +0x3d0\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc0003a9c08, 0x0)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:773 +0x3f1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 82\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:384 +0x107\n"
I[2025-11-19|12:19:53.903] service stop module=consensus wal=/home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/cometbft/data/cs.wal/wal msg="Stopping baseWAL service" impl=baseWAL
I[2025-11-19|12:19:53.903] signal trapped module=main msg="captured terminated, exiting..."
I[2025-11-19|12:19:53.903] service stop module=main msg="Stopping Node service" impl=Node
I[2025-11-19|12:19:53.903] Stopping Node module=main
I[2025-11-19|12:19:53.903] service stop module=events msg="Stopping EventBus service" impl=EventBus
I[2025-11-19|12:19:53.903] service stop module=pubsub msg="Stopping PubSub service" impl=PubSub
I[2025-11-19|12:19:53.903] service stop module=txindex msg="Stopping IndexerService service" impl=IndexerService
I[2025-11-19|12:19:53.904] service stop module=blockchain msg="Stopping Reactor service" impl=Reactor
E[2025-11-19|12:19:53.904] Error stopping pool module=blockchain err="already stopped"
I[2025-11-19|12:19:53.904] service stop module=consensus msg="Stopping Consensus service" impl=ConsensusReactor
I[2025-11-19|12:19:53.904] service stop module=consensus msg="Stopping State service" impl=ConsensusState
I[2025-11-19|12:19:53.904] service stop module=consensus msg="Stopping TimeoutTicker service" impl=TimeoutTicker
I[2025-11-19|12:19:53.904] service stop module=consensus wal=/home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/cometbft/data/cs.wal/wal msg="Stopping Group service" impl=Group
I[2025-11-19|12:19:53.904] service stop module=evidence msg="Stopping Evidence service" impl=Evidence
I[2025-11-19|12:19:53.904] service stop module=statesync msg="Stopping StateSync service" impl=StateSync
I[2025-11-19|12:19:53.904] Closing rpc listener module=main listener="&{Listener:0xc000366940 sem:0xc00056f6c0 closeOnce:{_:{} done:{_:{} v:0} m:{_:{} mu:{state:0 sema:0}}} done:0xc00056f730}"
I[2025-11-19|12:19:53.904] New websocket connection module=rpc-server protocol=websocket remote=127.0.0.1:46394
I[2025-11-19|12:19:53.904] Closing blockstore module=main
I[2025-11-19|12:19:53.904] service start module=rpc-server protocol=websocket remote=127.0.0.1:46394 msg="Starting wsConnection service" impl=wsConnection
I[2025-11-19|12:19:53.904] RPC HTTP server stopped module=rpc-server err="accept tcp 127.0.0.1:26857: use of closed network connection"
E[2025-11-19|12:19:53.904] Error serving server module=main err="accept tcp 127.0.0.1:26857: use of closed network connection"
I[2025-11-19|12:19:53.904] Client closed the connection module=rpc-server protocol=websocket remote=127.0.0.1:45454
I[2025-11-19|12:19:53.904] service stop module=rpc-server protocol=websocket remote=127.0.0.1:45454 msg="Stopping wsConnection service" impl=wsConnection
E[2025-11-19|12:19:53.904] error while stopping connection module=rpc-server protocol=websocket error="already stopped"
I[2025-11-19|12:19:53.905] Subscribe to query module=rpc remote=127.0.0.1:46394 query="tm.event = 'NewBlock'"
I[2025-11-19|12:19:53.905] Closing statestore module=main
I[2025-11-19|12:19:53.905] WSJSONRPC module=rpc-server protocol=websocket remote=127.0.0.1:46394 method=subscribe
I[2025-11-19|12:19:53.905] Closing evidencestore module=main
2025-11-19T12:19:53.918889Z INFO namada_node: Tendermint node is no longer running.
2025-11-19T12:19:53.919021Z INFO namada_node::abortable: Tendermint has exited, shutting down...
2025-11-19T12:19:53.919042Z INFO namada_node: Namada ledger node has shut down.
2025-11-19T12:19:53.919050Z INFO namada_node::broadcaster: Shutting down broadcaster...
2025-11-19T12:19:53.919054Z INFO namada_node: Broadcaster is no longer running.
2025-11-19T12:19:53.919071Z INFO namada_node: Shutting down ABCI server...
2025-11-19T12:19:53.945509Z INFO namada_node::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:10545"
The application panicked (crashed).
Message: panic in a destructor during cleanup
Location: library/core/src/panicking.rs:226
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
thread caused non-unwinding panic. aborting.
namadad.service: Main process exited, code=dumped, status=6/ABRT
namadad.service: Failed with result 'core-dump'.
namadad.service: Consumed 22h 25min 15.178s CPU time, 3.0G memory peak, 0B memory swap peak.
namadad.service: Scheduled restart job, restart counter is at 1.
Started namadad.service - Namada Daemon.
2025-11-19T12:20:05.173372Z INFO namada_node: Available logical cores: 22
2025-11-19T12:20:05.173386Z INFO namada_node: Using 11 threads for Rayon.
2025-11-19T12:20:05.173388Z INFO namada_node: Using 11 threads for Tokio.
2025-11-19T12:20:05.179378Z INFO namada_node: VP WASM compilation cache size not configured, using 1/6 of available memory.
2025-11-19T12:20:05.179563Z INFO namada_node: Available memory: 37.751705169677734 GiB
2025-11-19T12:20:05.179573Z INFO namada_node: VP WASM compilation cache size: 6.291950860992074 GiB
2025-11-19T12:20:05.179578Z INFO namada_node: Tx WASM compilation cache size not configured, using 1/6 of available memory.
2025-11-19T12:20:05.179581Z INFO namada_node: Tx WASM compilation cache size: 6.291950860992074 GiB
2025-11-19T12:20:05.179585Z INFO namada_node: Block cache size not configured, using 1/3 of available memory.
2025-11-19T12:20:05.179588Z INFO namada_node: RocksDB block cache size: 12.58390172291547 GiB
2025-11-19T12:20:05.179685Z INFO namada_node: Loading MASP verifying keys.
2025-11-19T12:20:05.179714Z INFO namada_node::ethereum_oracle: Ethereum event oracle is starting url="http://127.0.0.1:10545"
2025-11-19T12:20:05.180699Z INFO namada_node::ethereum_oracle: Oracle is awaiting initial configuration
2025-11-19T12:20:05.190341Z INFO namada_node::tendermint_node: CometBFT node started
I[2025-11-19|12:20:05.200] deprecated usage found in configuration file usage="[fastsync] table detected. This section has been renamed to [blocksync]. The values in this deprecated section will be disregarded."
I[2025-11-19|12:20:05.201] deprecated usage found in configuration file usage="fast_sync key detected. This key has been renamed to block_sync. The value of this deprecated key will be disregarded."
I[2025-11-19|12:20:05.201] deprecated usage found in configuration file usage="unused and deprecated upnp field detected in P2P config."
I[2025-11-19|12:20:05.235] service start module=proxy msg="Starting multiAppConn service" impl=multiAppConn
I[2025-11-19|12:20:05.235] service start module=abci-client connection=query msg="Starting socketClient service" impl=socketClient
E[2025-11-19|12:20:05.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858. Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
2025-11-19T12:20:05.426143Z INFO namada_node: Done loading MASP verifying keys.
2025-11-19T12:20:05.426445Z INFO namada_node::storage::rocksdb: Using 5 compactions threads for RocksDB.
2025-11-19T12:20:05.427540Z INFO namada_node::broadcaster: Starting broadcaster.
E[2025-11-19|12:20:08.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858. Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
E[2025-11-19|12:20:11.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858. Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
E[2025-11-19|12:20:14.239] abci.socketClient failed to connect to tcp://127.0.0.1:26858. Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
2025-11-19T12:20:15.622914Z INFO tower_abci::v037::server: ABCI server starting on tcp socket addr=127.0.0.1:26858
2025-11-19T12:20:15.623049Z INFO namada_node: Namada ledger node started.
2025-11-19T12:20:15.623063Z INFO namada_node: This node is a validator
I[2025-11-19|12:20:17.240] service start module=abci-client connection=snapshot msg="Starting socketClient service" impl=socketClient
2025-11-19T12:20:17.240841Z INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.240960Z INFO tower_abci::v037::server: listening for requests
I[2025-11-19|12:20:17.241] service start module=abci-client connection=mempool msg="Starting socketClient service" impl=socketClient
I[2025-11-19|12:20:17.241] service start module=abci-client connection=consensus msg="Starting socketClient service" impl=socketClient
I[2025-11-19|12:20:17.241] service start module=events msg="Starting EventBus service" impl=EventBus
I[2025-11-19|12:20:17.241] service start module=pubsub msg="Starting PubSub service" impl=PubSub
I[2025-11-19|12:20:17.241] service start module=txindex msg="Starting IndexerService service" impl=IndexerService
2025-11-19T12:20:17.241281Z INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.241290Z INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.241413Z INFO namada_node::shell: Last state root hash: 532ac02058d73d22f7ccd36a79b2b9be7043503e88e25951fd7652c1139915a1, height: 4351675
I[2025-11-19|12:20:17.241] ABCI Handshake App Info module=consensus height=4351675 hash=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1 software-version=v101.1.4 protocol-version=1
I[2025-11-19|12:20:17.241] ABCI Replay Blocks module=consensus appHeight=4351675 storeHeight=4351676 stateHeight=4351675
I[2025-11-19|12:20:17.241] Replay last block using real app module=consensus
2025-11-19T12:20:17.259368Z INFO namada_node::shell::process_proposal: Received block proposal proposer="0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1" height=4351676 hash="EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687" n_txs=0
2025-11-19T12:20:17.259904Z INFO namada_node::shell::finalize_block: Block height: 4351676, epoch: 1402, is new epoch: false, is masp new epoch: false.
2025-11-19T12:20:17.430547Z INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 0 - 0, tx cache size 0 - 0
2025-11-19T12:20:17.430557Z INFO namada_node::shell::finalize_block: txs executed: 0
I[2025-11-19|12:20:17.471] executed block module=consensus height=4351676 num_valid_txs=0 num_invalid_txs=0
The application panicked (crashed).
Message: Encountered a storage error while committing a block: Custom(CustomError(Custom(CustomError(DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4 in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")))))
Location: crates/node/src/shell/mod.rs:823
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message: flush failed: DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4 in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")
Location: crates/node/src/storage/rocksdb.rs:262
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-11-19T12:20:17.588543Z INFO namada_node::shims::abcipp_shim: ABCI response channel didn't respond
E[2025-11-19|12:20:17.588] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2025-11-19|12:20:17.588] service stop module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2025-11-19|12:20:17.588] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
The application panicked (crashed).
Message: called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /home/username/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
E[2025-11-19|12:20:17.588] client error during proxyAppConn.CommitSync module=consensus err="read message: EOF"
ERROR: failed to create node: error during handshake: error on replay: commit failed for application: read message: EOF
2025-11-19T12:20:17.591091Z INFO namada_node: Tendermint node is no longer running.
2025-11-19T12:20:17.591106Z ERROR namada_node: Err(Tendermint(Runtime("exit status: 1")))
2025-11-19T12:20:17.591200Z INFO namada_node::abortable: Tendermint has exited, shutting down...
2025-11-19T12:20:17.591225Z INFO namada_node: Namada ledger node has shut down.
2025-11-19T12:20:17.591236Z INFO namada_node::broadcaster: Shutting down broadcaster...
2025-11-19T12:20:17.591247Z INFO namada_node: Broadcaster is no longer running.
2025-11-19T12:20:17.591247Z INFO namada_node: Shutting down ABCI server...
The application panicked (crashed).
Message: panic in a destructor during cleanup
Location: library/core/src/panicking.rs:226
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
thread caused non-unwinding panic. aborting.
2025-11-19T12:20:17.618302Z INFO namada_node::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:10545"
namadad.service: Main process exited, code=dumped, status=6/ABRT
namadad.service: Failed with result 'core-dump'.
namadad.service: Consumed 22.153s CPU time.
namadad.service: Scheduled restart job, restart counter is at 2.
Started namadad.service - Namada Daemon.
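For anyone triaging similar logs, here is a minimal Python sketch that pulls the corruption details (SST file, offset, size, stored and computed checksums) out of the panic messages. The regular expression is written against the message format seen in the excerpt above, and the function name `find_checksum_mismatches` is hypothetical; other RocksDB versions may word the error differently, so treat it as a starting point rather than a complete parser.

```python
import re

# Matches the "block checksum mismatch" text as it appears in the log excerpt above.
# Assumption: the wording and field order are stable; adjust if your logs differ.
CHECKSUM_MISMATCH = re.compile(
    r"Corruption: block checksum mismatch: "
    r"stored = (?P<stored>\d+), computed = (?P<computed>\d+), "
    r"type = (?P<type>\d+) in (?P<path>\S+\.sst) "
    r"offset (?P<offset>\d+) size (?P<size>\d+)"
)


def find_checksum_mismatches(log_text):
    """Return the distinct checksum-mismatch reports found in log_text."""
    reports = []
    for match in CHECKSUM_MISMATCH.finditer(log_text):
        report = {
            "path": match.group("path"),
            "offset": int(match.group("offset")),
            "size": int(match.group("size")),
            "stored": int(match.group("stored")),
            "computed": int(match.group("computed")),
        }
        # The same error is logged by several panics, so drop exact repeats.
        if report not in reports:
            reports.append(report)
    return reports


if __name__ == "__main__":
    import sys

    # Usage: python find_mismatches.py < namadad.log
    for report in find_checksum_mismatches(sys.stdin.read()):
        print(report)
```

In the excerpt above, all four panic messages report the same file (8375703.sst), offset and size, which is consistent with a single damaged block being hit repeatedly rather than several separate corruptions.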
Byeee :)!
ZEN