bug: db corruption (likely when RAM limits get reached on the running system)

Hey!

My validator's DB gets corrupted from time to time. I caught it once in real time, when my RAM hit its limit (64 GB + 16 GB swap; I was running some unrelated experiments that started using a ton of RAM).

I suspect that the other corruptions were also caused by memory spikes like this, but I can't prove it yet. Those spikes likely came from other things running on the same machine (like my Hermes relayer, which oddly has moments where it starts using quite a lot of RAM, plus the other nodes I operate).
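
To help prove or disprove the correlation, something like the sketch below could log available memory with timestamps that can be lined up against the panic time in the node logs. It is only a minimal sketch, assuming a Linux host where `/proc/meminfo` exists; the function names (`parse_meminfo`, `format_sample`, `log_samples`) and the 5 second interval are my own choices and not part of Namada.

```python
import time
from datetime import datetime, timezone


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kiB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def format_sample(fields: dict[str, int], now: datetime) -> str:
    """Render one log line with available RAM and free swap in MiB."""
    avail_mib = fields.get("MemAvailable", 0) // 1024
    swap_mib = fields.get("SwapFree", 0) // 1024
    return f"{now.isoformat()} mem_available_mib={avail_mib} swap_free_mib={swap_mib}"


def log_samples(count: int, interval_s: float = 5.0) -> None:
    """Print `count` samples, one every `interval_s` seconds (Linux only)."""
    for i in range(count):
        with open("/proc/meminfo") as f:
            fields = parse_meminfo(f.read())
        print(format_sample(fields, datetime.now(timezone.utc)), flush=True)
        if i + 1 < count:
            time.sleep(interval_s)
```

Calling `log_samples(count=17280)` would cover roughly one day at the default interval. Redirecting the output to a file next to the node logs should make it possible to see whether available memory was near zero right before the next `block checksum mismatch`.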

My other nodes never get corrupted when this happens, though, so I wonder whether the node is _or_ can be safeguarded against random usage spikes like these. Have you guys checked whether a DB corruption occurs when there's a sudden memory spike? And if that is the case, is it possible to implement something that prevents the corruption?

A DB corruption can probably happen in many different ways, but the logs from this issue look similar to #86 (thanks for diving deep into the archives, @b0dski!).

Log excerpt:
```
I[2025-11-19|12:19:46.274] executed block                               module=state height=4351675 num_valid_txs=0 num_invalid_txs=0
2025-11-19T12:19:46.398914Z  INFO namada_node::shell: Committed block hash: 532ac02058d73d22f7ccd36a79b2b9be7043503e88e25951fd7652c1139915a1, height: 4351675
I[2025-11-19|12:19:46.439] committed state                              module=state height=4351675 num_txs=0 app_hash=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1
I[2025-11-19|12:19:46.440] indexed block events                         module=txindex height=4351675
I[2025-11-19|12:19:52.030] Timed out                                    module=consensus dur=5.588606414s height=4351676 round=0 step=RoundStepNewHeight
I[2025-11-19|12:19:52.947] received proposal                            module=consensus proposal="Proposal{4351676/0 (EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687:1:C8D681C2B1D6, -1) 3DEC1ED43EEC @ 2025-11-19T12:19:52.334888612Z}" proposer=0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1
I[2025-11-19|12:19:53.068] received complete proposal block             module=consensus height=4351676 hash=EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687
2025-11-19T12:19:53.076745Z  INFO namada_node::shell::process_proposal: Received block proposal proposer="0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1" height=4351676 hash="EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687" n_txs=0
I[2025-11-19|12:19:53.572] finalizing commit of block                   module=consensus height=4351676 hash=EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687 root=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1 num_txs=0
2025-11-19T12:19:53.588833Z  INFO namada_node::shell::finalize_block: Block height: 4351676, epoch: 1402, is new epoch: false, is masp new epoch: false.
2025-11-19T12:19:53.785741Z  INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 2 - 2, tx cache size 11 - 11
2025-11-19T12:19:53.785750Z  INFO namada_node::shell::finalize_block: txs executed: 0
I[2025-11-19|12:19:53.826] executed block                               module=state height=4351676 num_valid_txs=0 num_invalid_txs=0
The application panicked (crashed).
Message:  Encountered a storage error while committing a block: Custom(CustomError(Custom(CustomError(DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")))))
Location: crates/node/src/shell/mod.rs:823
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-11-19T12:19:53.903389Z  INFO namada_node::shims::abcipp_shim: ABCI response channel didn't respond
E[2025-11-19|12:19:53.903] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
The application panicked (crashed).
Message:  
I[2025-11-19|12:19:53.903] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /home/username/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message:  flush failed: DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")
Location: crates/node/src/storage/rocksdb.rs:262
E[2025-11-19|12:19:53.903] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
E[2025-11-19|12:19:53.903] client error during proxyAppConn.CommitSync  module=state err="read message: EOF"
E[2025-11-19|12:19:53.903] CONSENSUS FAILURE!!!                         module=consensus err="failed to apply block; error commit failed for application: read message: EOF" stack="goroutine 829 [running]:\nruntime/debug.Stack()\n\t/opt/hostedtoolcache/go/1.24.7/x64/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:737 +0x46\npanic({0xf8a880?, 0xc005502650?})\n\t/opt/hostedtoolcache/go/1.24.7/x64/src/runtime/panic.go:792 +0x132\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc0003a9c08, 0x4266bc)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1720 +0xec5\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc0003a9c08, 0x4266bc)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1620 +0x2e5\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1555 +0x9c\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc0003a9c08, 0x4266bc, 0x0)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:1593 +0xc0f\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc0003a9c08, 0xc0022485a0, {0xc000151380, 0x28})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:2233 +0x17df\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc0003a9c08, 0xc0022485a0, {0xc000151380?, 0x0?})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:2022 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc0003a9c08, {{0x13c5440, 0xc003843cd0}, {0xc000151380, 0x28}})\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:866 +0x3d0\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc0003a9c08, 0x0)\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:773 +0x3f1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 
82\n\t/home/runner/work/cometbft/cometbft/consensus/state.go:384 +0x107\n"
I[2025-11-19|12:19:53.903] service stop                                 module=consensus wal=/home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/cometbft/data/cs.wal/wal msg="Stopping baseWAL service" impl=baseWAL
I[2025-11-19|12:19:53.903] signal trapped                               module=main msg="captured terminated, exiting..."
I[2025-11-19|12:19:53.903] service stop                                 module=main msg="Stopping Node service" impl=Node
I[2025-11-19|12:19:53.903] Stopping Node                                module=main
I[2025-11-19|12:19:53.903] service stop                                 module=events msg="Stopping EventBus service" impl=EventBus
I[2025-11-19|12:19:53.903] service stop                                 module=pubsub msg="Stopping PubSub service" impl=PubSub
I[2025-11-19|12:19:53.903] service stop                                 module=txindex msg="Stopping IndexerService service" impl=IndexerService
I[2025-11-19|12:19:53.904] service stop                                 module=blockchain msg="Stopping Reactor service" impl=Reactor
E[2025-11-19|12:19:53.904] Error stopping pool                          module=blockchain err="already stopped"
I[2025-11-19|12:19:53.904] service stop                                 module=consensus msg="Stopping Consensus service" impl=ConsensusReactor
I[2025-11-19|12:19:53.904] service stop                                 module=consensus msg="Stopping State service" impl=ConsensusState
I[2025-11-19|12:19:53.904] service stop                                 module=consensus msg="Stopping TimeoutTicker service" impl=TimeoutTicker
I[2025-11-19|12:19:53.904] service stop                                 module=consensus wal=/home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/cometbft/data/cs.wal/wal msg="Stopping Group service" impl=Group
I[2025-11-19|12:19:53.904] service stop                                 module=evidence msg="Stopping Evidence service" impl=Evidence
I[2025-11-19|12:19:53.904] service stop                                 module=statesync msg="Stopping StateSync service" impl=StateSync
I[2025-11-19|12:19:53.904] Closing rpc listener                         module=main listener="&{Listener:0xc000366940 sem:0xc00056f6c0 closeOnce:{_:{} done:{_:{} v:0} m:{_:{} mu:{state:0 sema:0}}} done:0xc00056f730}"
I[2025-11-19|12:19:53.904] New websocket connection                     module=rpc-server protocol=websocket remote=127.0.0.1:46394
I[2025-11-19|12:19:53.904] Closing blockstore                           module=main
I[2025-11-19|12:19:53.904] service start                                module=rpc-server protocol=websocket remote=127.0.0.1:46394 msg="Starting wsConnection service" impl=wsConnection
I[2025-11-19|12:19:53.904] RPC HTTP server stopped                      module=rpc-server err="accept tcp 127.0.0.1:26857: use of closed network connection"
E[2025-11-19|12:19:53.904] Error serving server                         module=main err="accept tcp 127.0.0.1:26857: use of closed network connection"
I[2025-11-19|12:19:53.904] Client closed the connection                 module=rpc-server protocol=websocket remote=127.0.0.1:45454
I[2025-11-19|12:19:53.904] service stop                                 module=rpc-server protocol=websocket remote=127.0.0.1:45454 msg="Stopping wsConnection service" impl=wsConnection
E[2025-11-19|12:19:53.904] error while stopping connection              module=rpc-server protocol=websocket error="already stopped"
I[2025-11-19|12:19:53.905] Subscribe to query                           module=rpc remote=127.0.0.1:46394 query="tm.event = 'NewBlock'"
I[2025-11-19|12:19:53.905] Closing statestore                           module=main
I[2025-11-19|12:19:53.905] WSJSONRPC                                    module=rpc-server protocol=websocket remote=127.0.0.1:46394 method=subscribe
I[2025-11-19|12:19:53.905] Closing evidencestore                        module=main
2025-11-19T12:19:53.918889Z  INFO namada_node: Tendermint node is no longer running.
2025-11-19T12:19:53.919021Z  INFO namada_node::abortable: Tendermint has exited, shutting down...
2025-11-19T12:19:53.919042Z  INFO namada_node: Namada ledger node has shut down.
2025-11-19T12:19:53.919050Z  INFO namada_node::broadcaster: Shutting down broadcaster...
2025-11-19T12:19:53.919054Z  INFO namada_node: Broadcaster is no longer running.
2025-11-19T12:19:53.919071Z  INFO namada_node: Shutting down ABCI server...
2025-11-19T12:19:53.945509Z  INFO namada_node::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:10545"
The application panicked (crashed).
Message:  panic in a destructor during cleanup
Location: library/core/src/panicking.rs:226
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
thread caused non-unwinding panic. aborting.
namadad.service: Main process exited, code=dumped, status=6/ABRT
namadad.service: Failed with result 'core-dump'.
namadad.service: Consumed 22h 25min 15.178s CPU time, 3.0G memory peak, 0B memory swap peak.
namadad.service: Scheduled restart job, restart counter is at 1.
Started namadad.service - Namada Daemon.
2025-11-19T12:20:05.173372Z  INFO namada_node: Available logical cores: 22
2025-11-19T12:20:05.173386Z  INFO namada_node: Using 11 threads for Rayon.
2025-11-19T12:20:05.173388Z  INFO namada_node: Using 11 threads for Tokio.
2025-11-19T12:20:05.179378Z  INFO namada_node: VP WASM compilation cache size not configured, using 1/6 of available memory.
2025-11-19T12:20:05.179563Z  INFO namada_node: Available memory: 37.751705169677734 GiB
2025-11-19T12:20:05.179573Z  INFO namada_node: VP WASM compilation cache size: 6.291950860992074 GiB
2025-11-19T12:20:05.179578Z  INFO namada_node: Tx WASM compilation cache size not configured, using 1/6 of available memory.
2025-11-19T12:20:05.179581Z  INFO namada_node: Tx WASM compilation cache size: 6.291950860992074 GiB
2025-11-19T12:20:05.179585Z  INFO namada_node: Block cache size not configured, using 1/3 of available memory.
2025-11-19T12:20:05.179588Z  INFO namada_node: RocksDB block cache size: 12.58390172291547 GiB
2025-11-19T12:20:05.179685Z  INFO namada_node: Loading MASP verifying keys.
2025-11-19T12:20:05.179714Z  INFO namada_node::ethereum_oracle: Ethereum event oracle is starting url="http://127.0.0.1:10545"
2025-11-19T12:20:05.180699Z  INFO namada_node::ethereum_oracle: Oracle is awaiting initial configuration
2025-11-19T12:20:05.190341Z  INFO namada_node::tendermint_node: CometBFT node started
I[2025-11-19|12:20:05.200] deprecated usage found in configuration file usage="[fastsync] table detected. This section has been renamed to [blocksync]. The values in this deprecated section will be disregarded."
I[2025-11-19|12:20:05.201] deprecated usage found in configuration file usage="fast_sync key detected. This key has been renamed to block_sync. The value of this deprecated key will be disregarded."
I[2025-11-19|12:20:05.201] deprecated usage found in configuration file usage="unused and deprecated upnp field detected in P2P config."
I[2025-11-19|12:20:05.235] service start                                module=proxy msg="Starting multiAppConn service" impl=multiAppConn
I[2025-11-19|12:20:05.235] service start                                module=abci-client connection=query msg="Starting socketClient service" impl=socketClient
E[2025-11-19|12:20:05.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
2025-11-19T12:20:05.426143Z  INFO namada_node: Done loading MASP verifying keys.
2025-11-19T12:20:05.426445Z  INFO namada_node::storage::rocksdb: Using 5 compactions threads for RocksDB.
2025-11-19T12:20:05.427540Z  INFO namada_node::broadcaster: Starting broadcaster.
E[2025-11-19|12:20:08.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
E[2025-11-19|12:20:11.236] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
E[2025-11-19|12:20:14.239] abci.socketClient failed to connect to tcp://127.0.0.1:26858.  Retrying after 3s... module=abci-client connection=query err="dial tcp 127.0.0.1:26858: connect: connection refused"
2025-11-19T12:20:15.622914Z  INFO tower_abci::v037::server: ABCI server starting on tcp socket addr=127.0.0.1:26858
2025-11-19T12:20:15.623049Z  INFO namada_node: Namada ledger node started.
2025-11-19T12:20:15.623063Z  INFO namada_node: This node is a validator
I[2025-11-19|12:20:17.240] service start                                module=abci-client connection=snapshot msg="Starting socketClient service" impl=socketClient
2025-11-19T12:20:17.240841Z  INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.240960Z  INFO tower_abci::v037::server: listening for requests
I[2025-11-19|12:20:17.241] service start                                module=abci-client connection=mempool msg="Starting socketClient service" impl=socketClient
I[2025-11-19|12:20:17.241] service start                                module=abci-client connection=consensus msg="Starting socketClient service" impl=socketClient
I[2025-11-19|12:20:17.241] service start                                module=events msg="Starting EventBus service" impl=EventBus
I[2025-11-19|12:20:17.241] service start                                module=pubsub msg="Starting PubSub service" impl=PubSub
I[2025-11-19|12:20:17.241] service start                                module=txindex msg="Starting IndexerService service" impl=IndexerService
2025-11-19T12:20:17.241281Z  INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.241290Z  INFO tower_abci::v037::server: listening for requests
2025-11-19T12:20:17.241413Z  INFO namada_node::shell: Last state root hash: 532ac02058d73d22f7ccd36a79b2b9be7043503e88e25951fd7652c1139915a1, height: 4351675
I[2025-11-19|12:20:17.241] ABCI Handshake App Info                      module=consensus height=4351675 hash=532AC02058D73D22F7CCD36A79B2B9BE7043503E88E25951FD7652C1139915A1 software-version=v101.1.4 protocol-version=1
I[2025-11-19|12:20:17.241] ABCI Replay Blocks                           module=consensus appHeight=4351675 storeHeight=4351676 stateHeight=4351675
I[2025-11-19|12:20:17.241] Replay last block using real app             module=consensus
2025-11-19T12:20:17.259368Z  INFO namada_node::shell::process_proposal: Received block proposal proposer="0CE0EEB069DD6344BB639E8AF332E2F16F1FC6B1" height=4351676 hash="EB3A9A59B92B1B9104E27A02B530EB6429C179559128D7C654F40BF94CBE0687" n_txs=0
2025-11-19T12:20:17.259904Z  INFO namada_node::shell::finalize_block: Block height: 4351676, epoch: 1402, is new epoch: false, is masp new epoch: false.
2025-11-19T12:20:17.430547Z  INFO namada_node::shell::finalize_block: Applied 0 transactions. Wrappers: 0, successful inner txs: 0, rejected inner txs: 0, errored inner txs: 0, unrun txs: 0, valid txs discarded by failing atomic batch: 0, vp cache size: 0 - 0, tx cache size 0 - 0
2025-11-19T12:20:17.430557Z  INFO namada_node::shell::finalize_block: txs executed: 0
I[2025-11-19|12:20:17.471] executed block                               module=consensus height=4351676 num_valid_txs=0 num_invalid_txs=0
The application panicked (crashed).
Message:  Encountered a storage error while committing a block: Custom(CustomError(Custom(CustomError(DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")))))
Location: crates/node/src/shell/mod.rs:823
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message:  flush failed: DBError("Corruption: block checksum mismatch: stored = 1624078082, computed = 698440519, type = 4  in /home/username/nodes/namada/home/namada.5f5de2dd1b88cba30586420/db/8375703.sst offset 65977035 size 1621")
Location: crates/node/src/storage/rocksdb.rs:262
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-11-19T12:20:17.588543Z  INFO namada_node::shims::abcipp_shim: ABCI response channel didn't respond
E[2025-11-19|12:20:17.588] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2025-11-19|12:20:17.588] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2025-11-19|12:20:17.588] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /home/username/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
E[2025-11-19|12:20:17.588] client error during proxyAppConn.CommitSync  module=consensus err="read message: EOF"
ERROR: failed to create node: error during handshake: error on replay: commit failed for application: read message: EOF
2025-11-19T12:20:17.591091Z  INFO namada_node: Tendermint node is no longer running.
2025-11-19T12:20:17.591106Z ERROR namada_node: Err(Tendermint(Runtime("exit status: 1")))
2025-11-19T12:20:17.591200Z  INFO namada_node::abortable: Tendermint has exited, shutting down...
2025-11-19T12:20:17.591225Z  INFO namada_node: Namada ledger node has shut down.
2025-11-19T12:20:17.591236Z  INFO namada_node::broadcaster: Shutting down broadcaster...
2025-11-19T12:20:17.591247Z  INFO namada_node: Broadcaster is no longer running.
2025-11-19T12:20:17.591247Z  INFO namada_node: Shutting down ABCI server...
The application panicked (crashed).
Message:  panic in a destructor during cleanup
Location: library/core/src/panicking.rs:226
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
thread caused non-unwinding panic. aborting.
2025-11-19T12:20:17.618302Z  INFO namada_node::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:10545"
namadad.service: Main process exited, code=dumped, status=6/ABRT
namadad.service: Failed with result 'core-dump'.
namadad.service: Consumed 22.153s CPU time.
namadad.service: Scheduled restart job, restart counter is at 2.
Started namadad.service - Namada Daemon.
```
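
One thing that stands out in the restart part of the log is that the three "not configured" caches default to 1/6, 1/6 and 1/3 of available memory. The arithmetic below is just my reading of those log lines, using the numbers printed at 12:20:05; the dictionary keys are labels I made up. As far as I can tell these are upper limits rather than actual usage (systemd reports a 3.0G memory peak for the service), so this does not prove anything about the corruption.

```python
# "Available memory" value copied from the restart log above, in GiB.
AVAILABLE_GIB = 37.751705169677734


def default_cache_caps(available_gib: float) -> dict[str, float]:
    """Cache limits implied by the 'not configured' log lines: 1/6, 1/6 and 1/3."""
    return {
        "vp_wasm_cache": available_gib / 6,
        "tx_wasm_cache": available_gib / 6,
        "rocksdb_block_cache": available_gib / 3,
    }


caps = default_cache_caps(AVAILABLE_GIB)
total = sum(caps.values())
for name, gib in caps.items():
    print(f"{name}: {gib:.2f} GiB")
print(f"total: {total:.2f} GiB ({total / AVAILABLE_GIB:.0%} of available)")
```

Taken together, the default limits add up to about two thirds of whatever memory happened to be available at startup, which may matter on a machine that is shared with other memory-hungry processes.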

Byeee :)!
ZEN

bug: db corruption (likely when RAM limits get reached on the running system) #4988