bug: db corruption (likely when RAM limits get reached on the running system) #4988

@zenodeapp

Hey!

I have DB corruptions happening on my validator from time to time. I caught one in real time when my RAM hit its limit (64 GB + 16 GB swap; I was running some unrelated experiments that started using a ton of RAM).

I suspect the other corruptions were also caused by memory spikes like this, but I can't prove it yet. Those spikes likely came from other things running on the machine (like my Hermes relayer, which oddly has moments where it starts using quite a lot of RAM, plus the other nodes I operate).
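To help confirm or rule out the memory-spike theory, something like the following could run next to the node and timestamp low-memory moments, so they can be lined up with the panic times in the node log. This is only a minimal Python sketch, not part of Namada: the function names, the 5-second interval and the 2 GiB threshold are assumptions, and it relies only on the standard Linux /proc/meminfo file.

```python
import time
from datetime import datetime, timezone


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total)
    are plain counts without a unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def is_low(fields, min_available_kib):
    """True when MemAvailable is below the (assumed) alert threshold."""
    return fields.get("MemAvailable", 0) < min_available_kib


def format_sample(fields, now, min_available_kib):
    """Render one log line; UTC timestamps match the node's own log."""
    flag = " LOW" if is_low(fields, min_available_kib) else ""
    return "{} MemAvailable={}KiB SwapFree={}KiB{}".format(
        now.isoformat(timespec="seconds"),
        fields.get("MemAvailable", 0),
        fields.get("SwapFree", 0),
        flag,
    )


def watch(path="/proc/meminfo", interval_s=5.0,
          min_available_kib=2 * 1024 * 1024):
    """Print one sample per interval until interrupted (Ctrl+C)."""
    while True:
        with open(path) as f:
            fields = parse_meminfo(f.read())
        now = datetime.now(timezone.utc)
        print(format_sample(fields, now, min_available_kib), flush=True)
        time.sleep(interval_s)
```

Calling `watch()` in a separate terminal or service and redirecting its output to a file would give a record to compare against the next crash: if no `LOW` lines show up around the panic timestamp, the memory-spike explanation becomes less likely.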

My other nodes never get corrupted when this happens, though, so I wonder whether the node is, or can be, safeguarded against random usage spikes like these. Have you checked whether a DB corruption occurs when there's a sudden memory spike? And if that is the case, is it possible to implement something to prevent the corruption?

A DB corruption can likely happen in many different ways, but the logs from this issue look similar to #86 (thanks for diving deep into the archives, @b0dski!).

Log excerpt:

I[2025-11-19|12:19:46.274] executed block                               module=state height=4351675 num_valid_txs=0 num_invalid_txs=0
2025-11-19T12:19:46.398914Z  INFO namada_node::shell: Committed block hash: 532ac02058d73d22f7ccd36a79b2b9be7043503e88e25951fd7652c1139915a1, height: 4351675
I[2025-11-19|12:19:46.439] committed state                              module=state height=4351675 num_txs=0 app_hash=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1
I[2025-11-19|12:19:46.440] indexed block events                         module=txindex height=4351675
I[2025-11-19|12:19:52.030] Timed out                                    module=consensus dur=5.588606414s height=4351676 round=0 step=RoundStepNewHeight
I[2025-11-19|12:19:52.947] received proposal                            module=consensus proposal="Proposal{4351676/0 (EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687:1:C8D681C2B1D6, -1) 3DEC1ED43EEC @ 2025-11-19T12:19:52.334888612Z}" proposer=0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1
I[2025-11-19|12:19:53.068] received complete proposal block             module=consensus height=4351676 hash=EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687
2025-11-19T12:19:53.076745Z  INFO namada_node::shell::process_proposal: Received block proposal proposer="0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1" height=4351676 hash="EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687" n_txs=0
I[2025-11-19|12:19:53.572] finalizing commit of block                   module=consensus height=4351676 hash=EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687 root=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1 num_txs=0
2025-11-19T12:19:53.588833Z  INFO namada_node::shell::finalize_block: Block height: 4351676, epoch: 1402, is new epoch: false, is masp new epoch: false.
2025-11-19T12:19:53.785741Z  INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 2 - 2, tx cache size 11 - 11
2025-11-19T12:19:53.785750Z  INFO namada_node::shell::finalize_block: txs executed: 0
I[2025-11-19|12:19:53.826] executed block                               module=state height=4351676 num_valid_txs=0 num_invalid_txs=0
The application panicked (crashed).
Message:  Encountered a storage error while committing a block: Custom(CustomError(Custom(CustomError(DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")))))
Location: crates/node/src/shell/mod.rs:823
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-11-19T12:19:53.903389Z  INFO namada_node::shims::abcipp_shim: ABCI response channel didn't respond
E[2025-11-19|12:19:53.903] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
The application panicked (crashed).
Message:  
I[2025-11-19|12:19:53.903] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /home/username/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message:  flush failed: DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")
Location: crates/node/src/storage/rocksdb.rs:262
E[2025-11-19|12:19:53.903] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
E[2025-11-19|12:19:53.903] client error during proxyAppConn.CommitSync  module=state err="read message: EOF"
E[2025-11-19|12:19:53.903] CONSENSUS FAILURE!!!                         module=consensus err="failed to apply block; error commit failed for application: read message: EOF" stack="goroutine 829 [running]:\nruntime/debug.Stack()\n\t/opt/hostedtoolcache/go/1.24.7/x64/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:737 +0x46\npanic({0xf8a880?, 0xc005502650?})\n\t/opt/hostedtoolcache/go/1.24.7/x64/src/runtime/panic.go:792 +0x132\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc0003a9c08, 0x4266bc)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1720 +0xec5\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc0003a9c08, 0x4266bc)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1620 +0x2e5\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1555 +0x9c\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc0003a9c08, 0x4266bc, 0x0)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1593 +0xc0f\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc0003a9c08, 0xc0022485a0, {0xc000151380, 0x28})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:2233 +0x17df\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc0003a9c08, 0xc0022485a0, {0xc000151380?, 0x0?})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:2022 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc0003a9c08, {{0x13c5440, 0xc003843cd0}, {0xc000151380, 0x28}})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:866 +0x3d0\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc0003a9c08, 0x0)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:773 +0x3f1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 
82\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:384 +0x107\n"
I[2025-11-19|12:19:53.903] service stop                                 module=consensus wal=/home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/cometbft/data/cs.wal/wal msg="Stopping baseWAL service" impl=baseWAL
I[2025-11-19|12:19:53.903] signal trapped                               module=main msg="captured terminated, exiting..."
I[2025-11-19|12:19:53.903] service stop                                 module=main msg="Stopping Node service" impl=Node
I[2025-11-19|12:19:53.903] Stopping Node                                module=main
I[2025-11-19|12:19:53.903] service stop                                 module=events msg="Stopping EventBus service" impl=EventBus
I[2025-11-19|12:19:53.903] service stop                                 module=pubsub msg="Stopping PubSub service" impl=PubSub
I[2025-11-19|12:19:53.903] service stop                                 module=txindex msg="Stopping IndexerService service" impl=IndexerService
I[2025-11-19|12:19:53.904] service stop                                 module=blockchain msg="Stopping Reactor service" impl=Reactor
E[2025-11-19|12:19:53.904] Error stopping pool                          module=blockchain err="already stopped"
I[2025-11-19|12:19:53.904] service stop                                 module=consensus msg="Stopping Consensus service" impl=ConsensusReactor
I[2025-11-19|12:19:53.904] service stop                                 module=consensus msg="Stopping State service" impl=ConsensusState
I[2025-11-19|12:19:53.904] service stop                                 module=consensus msg="Stopping TimeoutTicker service" impl=TimeoutTicker
I[2025-11-19|12:19:53.904] service stop                                 module=consensus wal=/home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/cometbft/data/cs.wal/wal msg="Stopping Group service" impl=Group
I[2025-11-19|12:19:53.904] service stop                                 module=evidence msg="Stopping Evidence service" impl=Evidence
I[2025-11-19|12:19:53.904] service stop                                 module=statesync msg="Stopping StateSync service" impl=StateSync
I[2025-11-19|12:19:53.904] Closing rpc listener                         module=main listener="&{Listener:0xc000366940 sem:0xc00056f6c0 closeOnce:{_:{} done:{_:{} v:0} m:{_:{} mu:{state:0 sema:0}}} done:0xc00056f730}"
I[2025-11-19|12:19:53.904] New websocket connection                     module=rpc-server protocol=websocket remote=127.0.0.1:46394
I[2025-11-19|12:19:53.904] Closing blockstore                           module=main
I[2025-11-19|12:19:53.904] service start                                module=rpc-server protocol=websocket remote=127.0.0.1:46394 msg="Starting wsConnection service" impl=wsConnection
I[2025-11-19|12:19:53.904] RPC HTTP server stopped                      module=rpc-server err="accept tcp 127.0.0.1:26857: use of closed network connection"
E[2025-11-19|12:19:53.904] Error serving server                         module=main err="accept tcp 127.0.0.1:26857: use of closed network connection"
I[2025-11-19|12:19:53.904] Client closed the connection                 module=rpc-server protocol=websocket remote=127.0.0.1:45454
I[2025-11-19|12:19:53.904] service stop                                 module=rpc-server protocol=websocket remote=127.0.0.1:45454 msg="Stopping wsConnection service" impl=wsConnection
E[2025-11-19|12:19:53.904] error while stopping connection              module=rpc-server protocol=websocket error="already stopped"
I[2025-11-19|12:19:53.905] Subscribe to query                           module=rpc remote=127.0.0.1:46394 query="tm.event = 'NewBlock'"
I[2025-11-19|12:19:53.905] Closing statestore                           module=main
I[2025-11-19|12:19:53.905] WSJSONRPC                                    module=rpc-server protocol=websocket remote=127.0.0.1:46394 method=subscribe
I[2025-11-19|12:19:53.905] Closing evidencestore                        module=main
2025-11-19T12:19:53.918889Z  INFO namada_node: Tendermint node is no longer running.
2025-11-19T12:19:53.919021Z  INFO namada_node::abortable: Tendermint has exited, shutting down...
2025-11-19T12:19:53.919042Z  INFO namada_node: Namada ledger node has shut down.
2025-11-19T12:19:53.919050Z  INFO namada_node::broadcaster: Shutting down broadcaster...
2025-11-19T12:19:53.919054Z  INFO namada_node: Broadcaster is no longer running.
2025-11-19T12:19:53.919071Z  INFO namada_node: Shutting down ABCI server...
2025-11-19T12:19:53.945509Z  INFO namada_node::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:10545"
The application panicked (crashed).
Message:  panic in a destructor during cleanup
Location: library/core/src/panicking.rs:226
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
thread caused non-unwinding panic. aborting.
namadad.service: Main process exited, code=dumped, status=6/ABRT
namadad.service: Failed with result 'core-dump'.
namadad.service: Consumed 22h 25min 15.178s CPU time, 3.0G memory peak, 0B memory swap peak.
namadad.service: Scheduled restart job, restart counter is at 1.
Started namadad.service - Namada Daemon.
2025-11-19T12:20:05.173372Z  INFO namada_node: Available logical cores: 22
2025-11-19T12:20:05.173386Z  INFO namada_node: Using 11 threads for Rayon.
2025-11-19T12:20:05.173388Z  INFO namada_node: Using 11 threads for Tokio.
2025-11-19T12:20:05.179378Z  INFO namada_node: VP WASM compilation cache size not configured, using 1/6 of available memory.
2025-11-19T12:20:05.179563Z  INFO namada_node: Available memory: 37.751705169677734 GiB
2025-11-19T12:20:05.179573Z  INFO namada_node: VP WASM compilation cache size: 6.291950860992074 GiB
2025-11-19T12:20:05.179578Z  INFO namada_node: Tx WASM compilation cache size not configured, using 1/6 of available memory.
2025-11-19T12:20:05.179581Z  INFO namada_node: Tx WASM compilation cache size: 6.291950860992074 GiB
2025-11-19T12:20:05.179585Z  INFO namada_node: Block cache size not configured, using 1/3 of available memory.
2025-11-19T12:20:05.179588Z  INFO namada_node: RocksDB block cache size: 12.58390172291547 GiB
2025-11-19T12:20:05.179685Z  INFO namada_node: Loading MASP verifying keys.
2025-11-19T12:20:05.179714Z  INFO namada_node::ethereum_oracle: Ethereum event oracle is starting url="http://127.0.0.1:10545"
2025-11-19T12:20:05.180699Z  INFO namada_node::ethereum_oracle: Oracle is awaiting initial configuration
2025-11-19T12:20:05.190341Z  INFO namada_node::tendermint_node: CometBFT node started
I[2025-11-19|12:20:05.200] deprecated usage found in configuration file usage="[fastsync] table detected. This section has been renamed to [blocksync]. The values in this deprecated section will be disregarded."
I[2025-11-19|12:20:05.201] deprecated usage found in configuration file usage="fast_sync key detected. This key has been renamed to block_sync. The value of this deprecated key will be disregarded."
I[2025-11-19|12:20:05.201] deprecated usage found in configuration file usage="unused and deprecated upnp field detected in P2P config."
I[2025-11-19|12:20:05.235] service start                                module=proxy msg="Starting multiAppConn service" impl=multiAppConn
I[2025-11-19|12:20:05.235] service start                                module=abci-client connection=query msg="Starting socketClient service" impl=socketClient
E[2025-11-19|12:20:05.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
2025-11-19T12:20:05.426143Z  INFO namada_node: Done loading MASP verifying keys.
2025-11-19T12:20:05.426445Z  INFO namada_node::storage::rocksdb: Using 5 compactions threads for RocksDB.
2025-11-19T12:20:05.427540Z  INFO namada_node::broadcaster: Starting broadcaster.
E[2025-11-19|12:20:08.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
E[2025-11-19|12:20:11.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
E[2025-11-19|12:20:14.239] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
2025-11-19T12:20:15.622914Z  INFO tower_abci::v037::server: ABCI server starting on tcp socket addr=127.0.0.1:26858
2025-11-19T12:20:15.623049Z  INFO namada_node: Namada ledger node started.
2025-11-19T12:20:15.623063Z  INFO namada_node: This node is a validator
I[2025-11-19|12:20:17.240] service start                                module=abci-client connection=snapshot msg="Starting socketClient service" impl=socketClient
2025-11-19T12:20:17.240841Z  INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.240960Z  INFO tower_abci::v037::server: listening for requests
I[2025-11-19|12:20:17.241] service start                                module=abci-client connection=mempool msg="Starting socketClient service" impl=socketClient
I[2025-11-19|12:20:17.241] service start                                module=abci-client connection=consensus msg="Starting socketClient service" impl=socketClient
I[2025-11-19|12:20:17.241] service start                                module=events msg="Starting EventBus service" impl=EventBus
I[2025-11-19|12:20:17.241] service start                                module=pubsub msg="Starting PubSub service" impl=PubSub
I[2025-11-19|12:20:17.241] service start                                module=txindex msg="Starting IndexerService service" impl=IndexerService
2025-11-19T12:20:17.241281Z  INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.241290Z  INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.241413Z  INFO namada_node::shell: Last state root hash: 532ac02058d73d22f7ccd36a79b2b9be7043503e88e25951fd7652c1139915a1, height: 4351675
I[2025-11-19|12:20:17.241] ABCI Handshake App Info                      module=consensus height=4351675 hash=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1 software-version=v101.1.4 protocol-version=1
I[2025-11-19|12:20:17.241] ABCI Replay Blocks                           module=consensus appHeight=4351675 storeHeight=4351676 stateHeight=4351675
I[2025-11-19|12:20:17.241] Replay last block using real app             module=consensus
2025-11-19T12:20:17.259368Z  INFO namada_node::shell::process_proposal: Received block proposal proposer="0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1" height=4351676 hash="EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687" n_txs=0
2025-11-19T12:20:17.259904Z  INFO namada_node::shell::finalize_block: Block height: 4351676, epoch: 1402, is new epoch: false, is masp new epoch: false.
2025-11-19T12:20:17.430547Z  INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 0 - 0, tx cache size 0 - 0
2025-11-19T12:20:17.430557Z  INFO namada_node::shell::finalize_block: txs executed: 0
I[2025-11-19|12:20:17.471] executed block                               module=consensus height=4351676 num_valid_txs=0 num_invalid_txs=0
The application panicked (crashed).
Message:  Encountered a storage error while committing a block: Custom(CustomError(Custom(CustomError(DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")))))
Location: crates/node/src/shell/mod.rs:823
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message:  flush failed: DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")
Location: crates/node/src/storage/rocksdb.rs:262
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-11-19T12:20:17.588543Z  INFO namada_node::shims::abcipp_shim: ABCI response channel didn't respond
E[2025-11-19|12:20:17.588] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2025-11-19|12:20:17.588] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2025-11-19|12:20:17.588] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /home/username/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
E[2025-11-19|12:20:17.588] client error during proxyAppConn.CommitSync  module=consensus err="read message: EOF"
ERROR: failed to create node: error during handshake: error on replay: commit failed for application: read message: EOF
2025-11-19T12:20:17.591091Z  INFO namada_node: Tendermint node is no longer running.
2025-11-19T12:20:17.591106Z ERROR namada_node: Err(Tendermint(Runtime("exit status: 1")))
2025-11-19T12:20:17.591200Z  INFO namada_node::abortable: Tendermint has exited, shutting down...
2025-11-19T12:20:17.591225Z  INFO namada_node: Namada ledger node has shut down.
2025-11-19T12:20:17.591236Z  INFO namada_node::broadcaster: Shutting down broadcaster...
2025-11-19T12:20:17.591247Z  INFO namada_node: Broadcaster is no longer running.
2025-11-19T12:20:17.591247Z  INFO namada_node: Shutting down ABCI server...
The application panicked (crashed).
Message:  panic in a destructor during cleanup
Location: library/core/src/panicking.rs:226
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
thread caused non-unwinding panic. aborting.
2025-11-19T12:20:17.618302Z  INFO namada_node::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:10545"
namadad.service: Main process exited, code=dumped, status=6/ABRT
namadad.service: Failed with result 'core-dump'.
namadad.service: Consumed 22.153s CPU time.
namadad.service: Scheduled restart job, restart counter is at 2.
Started namadad.service - Namada Daemon.
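
For anyone triaging similar logs: the corruption messages above all follow one format, so the affected SST file and block can be pulled out mechanically. Below is a minimal Python sketch (not Namada or RocksDB tooling). The regex is inferred only from the message text in this excerpt and may not match other RocksDB versions, and the names `CorruptBlock` and `find_corrupt_blocks` are made up for illustration.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the "block checksum mismatch" lines in the excerpt.
CORRUPTION_RE = re.compile(
    r"Corruption: block checksum mismatch: "
    r"stored = (?P<stored>\d+), computed = (?P<computed>\d+), "
    r"type = (?P<block_type>\d+)\s+in (?P<path>\S+) "
    r"offset (?P<offset>\d+) size (?P<size>\d+)"
)


@dataclass(frozen=True)
class CorruptBlock:
    """One reported corrupt block (hypothetical helper type)."""
    path: str
    offset: int
    size: int
    stored: int
    computed: int
    block_type: int


def find_corrupt_blocks(log_text):
    """Return the distinct corrupt blocks in log_text, in first-seen order."""
    seen = {}
    for m in CORRUPTION_RE.finditer(log_text):
        block = CorruptBlock(
            path=m.group("path"),
            offset=int(m.group("offset")),
            size=int(m.group("size")),
            stored=int(m.group("stored")),
            computed=int(m.group("computed")),
            block_type=int(m.group("block_type")),
        )
        # Frozen dataclasses are hashable, so a dict de-duplicates repeats.
        seen.setdefault(block, None)
    return list(seen)
```

Fed the excerpt above, this should yield a single entry (8375703.sst, offset 65977035, size 1621), since all four panic messages, before and after the restart, report the same block and the same checksums.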

Byeee :)!
ZEN
