Is there an existing issue for this?
What happened?
This issue tracks the update status of a report that was confirmed on HackerOne but was marked as a duplicate of #20685.

I found that the current inter-block cache is not thread-safe. Simulation and block commit can run asynchronously, which creates a race condition. Queries can also run asynchronously, but I observed that the `CreateQueryContext` function returns an unwrapped store that bypasses the inter-block cache (probably fixed in the PR).

In Cosmos SDK, when creating the root multi-store, the IAVL stores are wrapped with an inter-block cache, as shown here.

Previously, Cosmos SDK handled queries, simulations, and block commits through Tendermint, which held a global lock across commit and query/simulation operations. However, now that Cosmos SDK provides its own interface for queries and simulations, a problem has arisen. Simulations use the cache-wrapped store (via `checkState`), and block commits also use the cache-wrapped store (via `finalizeBlockState`). I have verified that simulation touches the inter-block cache.

This setup allows multiple goroutines to access and modify the inter-block cache concurrently, which is not thread-safe, as outlined here.

In particular, the simulation goroutine can update the inter-block cache via `.Get`, while the block commit process can modify the cache via `.Get`, `.Set`, or `.Delete`. This concurrent access can lead to data races and eventually to an app hash crash.

We have confirmed that, after the inter-block cache was disabled, the Kami testnet no longer has the app hash issues that previously broke it periodically every day.
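To make the hazard concrete, here is a minimal sketch of a read-through cache and a single-goroutine replay of one unlucky interleaving. Everything in it is hypothetical: `readThroughCache`, `replayRace`, and the two maps are simplified stand-ins and are not the SDK's types. It also assumes that `Set` updates the cache before the backing store.

```go
package main

import "fmt"

// readThroughCache is a hypothetical, simplified model (not the SDK type):
// a cache map in front of a backing store map.
type readThroughCache struct {
	cache map[string][]byte // models the inter-block cache
	store map[string][]byte // models the IAVL store
}

func newReadThroughCache() *readThroughCache {
	return &readThroughCache{
		cache: map[string][]byte{},
		store: map[string][]byte{},
	}
}

// Get returns the cached value, or reads the store and caches the result.
// A miss writes to the cache, so readers are also writers.
func (c *readThroughCache) Get(key string) []byte {
	if v, ok := c.cache[key]; ok {
		return v
	}
	v := c.store[key] // nil when the key is absent
	c.cache[key] = v
	return v
}

// replayRace runs the steps of Set (G1) and Get (G2) one after another in a
// single goroutine, in an order a scheduler could produce without a lock.
// Assumption: Set writes the cache first and the store second.
func replayRace() []byte {
	c := newReadThroughCache()
	key, val := "K1", []byte("V1")

	if v, hit := c.cache[key]; hit { // G2 Get: cache lookup misses
		return v
	}
	c.cache[key] = val    // G1 Set: cache is updated
	stale := c.store[key] // G2 Get: store is still empty, reads nil
	c.cache[key] = stale  // G2 Get: overwrites the cache entry with nil
	c.store[key] = val    // G1 Set: store finally receives V1

	return c.Get(key) // next block: cache hit on the stale entry
}

func main() {
	fmt.Printf("value read in next block: %q\n", replayRace())
}
```

In this model the final read is served from the cache, so it returns the empty value even though the store holds `V1`.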
Details
If simulation and finalize never run at the same time, there is no problem. That is why Cosmos chains previously had no problem, but the latest Cosmos SDK runs simulation and finalize asynchronously. So, with small probability, we hit an app hash crash when a simulation changes the inter-block cache while the finalize block commit is in progress.

ex) Let's say block commit writes the value `V1` for key `K1` in goroutine `G1`, and a simulation tries to read the value of `K1` at the same time in goroutine `G2`.

1. `Get` for `K1` starts in `G2` and misses the cache.
2. `Set` for `K1` starts in `G1` and updates the cache.
3. `Get` in `G2` reads a `nil` (empty) value from `IAVL` and overwrites the cache entry for `K1`.
4. `Set` in `G1` writes `K1`:`V1` to `IAVL`.

Then, in the next block's finalize, a read of `K1` returns a `nil` (empty) value, because `Get` overwrote the cache with the wrong empty value.

Cosmos SDK Version
v0.50~main
How to reproduce?
Here is the test. I assume it should return the committed data, but it sometimes returns `nil`, which was written into the cache by `Get`.

I think we should introduce a lock around this cache.
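As a rough illustration of that suggestion, the sketch below guards a simplified read-through cache with a single mutex. The names `lockedCache` and `hammer` are hypothetical, the maps stand in for the inter-block cache and the IAVL store, and this is not the SDK's implementation or the fix that was merged. It only shows that holding one lock across the whole of `Get` and the whole of `Set` keeps the lookup, store access, and cache update of one call from interleaving with another.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedCache is a hypothetical sketch, not the SDK's cache type.
type lockedCache struct {
	mu    sync.Mutex
	cache map[string][]byte // models the inter-block cache
	store map[string][]byte // models the IAVL store
}

func newLockedCache() *lockedCache {
	return &lockedCache{
		cache: map[string][]byte{},
		store: map[string][]byte{},
	}
}

// Get holds the lock across the cache lookup, the store read, and the
// cache fill, so no Set can slip in between those steps.
func (c *lockedCache) Get(key string) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.cache[key]; ok {
		return v
	}
	v := c.store[key]
	c.cache[key] = v
	return v
}

// Set holds the lock across both writes, so the cache and the store
// always change together from the point of view of other callers.
func (c *lockedCache) Set(key string, value []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[key] = value
	c.store[key] = value
}

// hammer races one writer against many readers and returns the value
// visible after every goroutine has finished.
func hammer(readers int) []byte {
	c := newLockedCache()
	var wg sync.WaitGroup
	wg.Add(readers + 1)
	go func() {
		defer wg.Done()
		c.Set("K1", []byte("V1"))
	}()
	for i := 0; i < readers; i++ {
		go func() {
			defer wg.Done()
			c.Get("K1")
		}()
	}
	wg.Wait()
	return c.Get("K1")
}

func main() {
	fmt.Printf("%q\n", hammer(64))
}
```

A plain `sync.Mutex` is used rather than `sync.RWMutex` because `Get` writes to the cache on a miss, so it cannot safely run under a shared read lock.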