[Bug]: inter-block-cache causes app hash crash #23891

Open · beer-1 (Contributor) opened this issue Mar 5, 2025 · 0 comments

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

This issue tracks the status of a report that was confirmed on HackerOne but marked there as a duplicate of #20685.

I found that the current inter-block cache is not thread-safe. Simulation and block commit operations can run concurrently, which creates a race condition. Queries can also execute concurrently, but I observed that the CreateQueryContext function returns an unwrapped store that bypasses the inter-block cache (probably fixed in the PR).

In the Cosmos SDK, when the root multi-store is created, the IAVL stores are wrapped with an inter-block cache, as shown here.

Previously, the Cosmos SDK handled queries, simulations, and block commits through Tendermint, which enforced a global lock across commit and query/simulation operations. With the new approach, where the Cosmos SDK provides its own interface for queries and simulations, a problem has arisen: simulations use the cache-wrapped store (via checkState), and block commits likewise use the cache-wrapped store (via finalizeBlockState). I have verified that simulation touches the inter-block cache.

This setup allows multiple goroutines to concurrently access and modify the inter-block cache, which is not thread-safe, as outlined here.

In particular, a simulation goroutine can update the inter-block cache via .Get, while the block commit process can modify the cache using .Get, .Set, or .Delete. This concurrent access can lead to data races and, eventually, an app hash crash.
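The reason a "read-only" simulation can mutate the cache is the read-through pattern: a Get that misses the cache writes the fetched value back into it. A minimal self-contained sketch of that pattern (the type and field names are illustrative stand-ins, not the SDK's actual cache code):

package main

import "fmt"

// cacheStore is an illustrative stand-in for the SDK's read-through
// cache wrapper; the names here are hypothetical.
type cacheStore struct {
	cache  map[string][]byte       // shared inter-block cache stand-in
	parent func(key []byte) []byte // underlying IAVL Get stand-in
}

// Get shows why a read is also a cache WRITE: a miss populates the cache.
func (c *cacheStore) Get(key []byte) []byte {
	if v, ok := c.cache[string(key)]; ok {
		return v // cache hit
	}
	v := c.parent(key)       // cache miss: read the underlying store
	c.cache[string(key)] = v // ...then write the result back to the cache
	return v
}

func main() {
	c := &cacheStore{
		cache:  map[string][]byte{},
		parent: func([]byte) []byte { return nil }, // K1 not in IAVL yet
	}
	c.Get([]byte("K1")) // miss: caches nil for K1
	fmt.Printf("cached value for K1: %v\n", c.cache["K1"])
}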

After the inter-block cache was disabled, we confirmed that the Kami testnet is free from the app hash issues that previously broke it roughly once a day.

Details

If simulation and finalize do not run concurrently, there is no problem; that is why Cosmos chains previously had no issues. The latest Cosmos SDK, however, runs simulation and finalize concurrently, so with small probability we hit an app hash crash when a simulation mutates the inter-block cache while a finalize-block commit is happening.

ex) Let's say a block commit writes K1: V1 in goroutine G1, while a simulation simultaneously tries to read the value of K1 in goroutine G2.

  1. G2's Get misses the cache for K1.
  2. G1's Set writes K1 into the cache.
  3. G2's Get reads the nil (empty) value from IAVL and overwrites the cache entry for K1 with it.
  4. G1's Set writes K1: V1 to IAVL.

In the next block's finalize, when we try to read K1, we get the nil (empty) value, because Get overwrote the cache with the wrong, empty value.
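The same interleaving, written out deterministically against toy maps standing in for the cache and IAVL (illustrative only, not SDK code):

package main

import "fmt"

func main() {
	cache := map[string][]byte{} // inter-block cache stand-in
	iavl := map[string][]byte{}  // IAVL stand-in

	// Step 1: G2's Get misses the cache for K1.
	_, hit := cache["K1"] // hit == false
	_ = hit

	// Step 2: G1's Set writes K1 into the cache.
	cache["K1"] = []byte("V1")

	// Step 3: G2's Get reads nil from IAVL (G1 has not written it there
	// yet) and overwrites the cache entry, clobbering V1 with nil.
	cache["K1"] = iavl["K1"]

	// Step 4: G1's Set writes K1: V1 to IAVL.
	iavl["K1"] = []byte("V1")

	// Next block's finalize: the cache says nil even though IAVL has V1.
	fmt.Printf("cache: %q, iavl: %q\n", cache["K1"], iavl["K1"])
}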

Cosmos SDK Version

v0.50~main

How to reproduce?

Here is the test. It should return the committed data, but it sometimes returns nil, because the cache entry is overwritten by Get. (The imports added below assume the cosmossdk.io/store module layout.)

I think we should introduce a lock around this cache; a sketch of that follows after the test.

package cache_test

import (
	"fmt"
	"sync"
	"testing"

	dbm "github.com/cosmos/cosmos-db"
	"github.com/cosmos/iavl"
	"github.com/stretchr/testify/require"

	"cosmossdk.io/log"
	"cosmossdk.io/store/cache"
	iavlstore "cosmossdk.io/store/iavl"
	"cosmossdk.io/store/types"
	"cosmossdk.io/store/wrapper"
)

func Test_Concurrent_ReadWrite(t *testing.T) {
	// wrapper adapts a cosmos-db DB to the db interface iavl expects
	db := wrapper.NewDBWrapper(dbm.NewMemDB())
	mngr := cache.NewCommitKVStoreCacheManager(cache.DefaultCommitKVStoreCacheSize)

	sKey := types.NewKVStoreKey("test")
	tree := iavl.NewMutableTree(db, 100, false, log.NewNopLogger())
	store := iavlstore.UnsafeNewStore(tree)
	kvStore := mngr.GetStoreCache(sKey, store)

	numGoRoutines := 100

	wg := sync.WaitGroup{}
	wg.Add(numGoRoutines * 2)
	for i := 0; i < numGoRoutines; i++ {
		key := []byte(fmt.Sprintf("key_%d", i))
		value := []byte(fmt.Sprintf("value_%d", i))

		// simulation from client: a read-through Get can write to the cache
		go func(key []byte) {
			defer wg.Done()
			_ = kvStore.Get(key)
		}(key)

		// block commit: Set writes to the cache and the underlying store
		go func(key, value []byte) {
			defer wg.Done()
			kvStore.Set(key, value)
		}(key, value)
	}

	wg.Wait()

	// should return the block-committed data, but a racing Get can have
	// overwritten the cache entry with nil
	for i := 0; i < numGoRoutines; i++ {
		key := []byte(fmt.Sprintf("key_%d", i))
		value := []byte(fmt.Sprintf("value_%d", i))

		require.Equal(t, value, kvStore.Get(key))
	}
}
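A minimal sketch of the locking idea, assuming the types.CommitKVStore interface from cosmossdk.io/store/types; the type and field names below are illustrative, not the SDK's actual cache implementation:

package lockedcache // illustrative sketch, not SDK code

import (
	"sync"

	"cosmossdk.io/store/types" // CommitKVStore interface
)

// lockedCacheStore serializes all cache access with one mutex, so the
// read-through Get and the write path can no longer interleave.
type lockedCacheStore struct {
	mtx    sync.Mutex
	parent types.CommitKVStore
	cache  map[string][]byte // stand-in for the LRU inter-block cache
}

func (s *lockedCacheStore) Get(key []byte) []byte {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	if v, ok := s.cache[string(key)]; ok {
		return v
	}
	// The parent read and the cache fill are now atomic with respect to
	// Set/Delete, closing the window described in Details above.
	v := s.parent.Get(key)
	s.cache[string(key)] = v
	return v
}

func (s *lockedCacheStore) Set(key, value []byte) {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	s.cache[string(key)] = value
	s.parent.Set(key, value)
}

func (s *lockedCacheStore) Delete(key []byte) {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	delete(s.cache, string(key))
	s.parent.Delete(key)
}

With a wrapper like this in place of the unlocked cache, the test above should no longer observe the stale nil value.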