Rewrite State Sync, from a giant state machine to proper async code. #12172

Open: wants to merge 1 commit into base: master

Conversation

@robin-near (Contributor) commented Sep 30, 2024

This rewrites state sync. All functionality is expected to continue to work without any protocol or database changes.

See the top of state/mod.rs for an overview.

State sync status is now available on the debug page; an example:
[Screenshot from 2024-09-24: state sync status on the debug page]

@robin-near marked this pull request as ready for review October 1, 2024 02:48
codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 61.94939% with 406 lines in your changes missing coverage. Please review.

Project coverage is 71.43%. Comparing base (359564c) to head (5d423a9).

Files with missing lines Patch % Lines
chain/client/src/sync/state/network.rs 1.57% 188 Missing ⚠️
chain/client/src/sync/state/downloader.rs 75.64% 26 Missing and 12 partials ⚠️
chain/client/src/sync/state/mod.rs 83.56% 33 Missing and 3 partials ⚠️
chain/client/src/sync/state/shard.rs 82.65% 6 Missing and 28 partials ⚠️
chain/client/src/client_actor.rs 25.80% 19 Missing and 4 partials ⚠️
chain/client/src/sync/state/external.rs 86.08% 8 Missing and 8 partials ⚠️
chain/client/src/test_utils/setup.rs 11.11% 16 Missing ⚠️
chain/client-primitives/src/types.rs 31.57% 13 Missing ⚠️
chain/client/src/info.rs 0.00% 13 Missing ⚠️
chain/network/src/peer/peer_actor.rs 0.00% 8 Missing ⚠️
... and 8 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12172      +/-   ##
==========================================
- Coverage   71.61%   71.43%   -0.18%     
==========================================
  Files         824      830       +6     
  Lines      165348   165033     -315     
  Branches   165348   165033     -315     
==========================================
- Hits       118412   117890     -522     
- Misses      41803    42021     +218     
+ Partials     5133     5122      -11     
Flag Coverage Δ
backward-compatibility 0.17% <0.00%> (+<0.01%) ⬆️
db-migration 0.17% <0.00%> (+<0.01%) ⬆️
genesis-check 1.26% <0.00%> (+<0.01%) ⬆️
integration-tests 38.57% <61.94%> (-0.15%) ⬇️
linux 71.22% <61.94%> (-0.18%) ⬇️
linux-nightly 71.00% <61.94%> (-0.19%) ⬇️
macos 53.88% <8.06%> (-0.25%) ⬇️
pytests 1.52% <0.00%> (+<0.01%) ⬆️
sanity-checks 1.32% <0.00%> (+<0.01%) ⬆️
unittests 65.15% <8.06%> (-0.19%) ⬇️
upgradability 0.21% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@shreyan-gupta (Contributor) left a comment:

Looks great!

store: Store,
epoch_manager: Arc<dyn EpochManagerAdapter>,
runtime: Arc<dyn RuntimeAdapter>,
network_adapter: AsyncSender<PeerManagerMessageRequest, PeerManagerMessageResponse>,

Contributor:

nit: Why not use type PeerManagerAdapter here like everywhere else?

)
.await?;
let state_root = header.chunk_prev_state_root();
if runtime_adapter.validate_state_part(

Contributor:

We can move validate_state_part out of runtime and into some util; it doesn't have any need to be in runtime.

@@ -2579,7 +2576,7 @@ impl Client {
sync_hash: CryptoHash,
state_sync_info: &StateSyncInfo,
me: &Option<AccountId>,
) -> Result<HashMap<u64, ShardSyncDownload>, Error> {
) -> Result<HashMap<u64, ShardSyncStatus>, Error> {

Contributor:

ShardSyncDownload and ShardSyncDownloadView are no longer used anywhere, can be deleted.
Nit: update the comment above

@@ -2465,34 +2464,40 @@ impl Client {

for (sync_hash, state_sync_info) in self.chain.chain_store().iterate_state_sync_infos()? {
assert_eq!(sync_hash, state_sync_info.epoch_tail_hash);
let network_adapter = self.network_adapter.clone();

let shards_to_split = self.get_shards_to_split(sync_hash, &state_sync_info, &me)?;

Contributor:

Ugh, I think Alex merged in some change here. Will need to check if we need shards_to_split at all.

network_adapter,
self.runtime_adapter.store().clone(),
self.epoch_manager.clone(),
self.runtime_adapter.clone(),

Contributor:

I'm not super happy passing in the runtime into state sync. On the top level it doesn't seem like there should be any dependency between these two. Could we check if it's possible to decouple? Maybe as a follow up to this PR.

return_if_cancelled!(cancel);

// Finalize; this needs to be done by the Chain.
*status.lock().unwrap() = ShardSyncStatus::StateApplyFinalizing;

Contributor:

Nice! With this setup, it should ideally be possible for us to convert ShardSyncStatus to a string instead, so we can keep adding more and more statuses with more information?
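For readers unfamiliar with the pattern in the snippet above: the per-shard task updates a shared status handle that the debug page reads. Below is a minimal sketch of that pattern using only std types and a free-form status string, as the comment suggests; the names and the string-based status are assumptions for illustration, not the PR's actual types.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical: a free-form status string instead of the ShardSyncStatus enum,
// so new stages can carry arbitrary extra detail.
type ShardStatus = Arc<Mutex<String>>;

async fn sync_one_shard(status: ShardStatus) {
    *status.lock().unwrap() = "downloading header".to_string();
    // ... download and apply parts ...
    *status.lock().unwrap() = "finalizing (waiting for Chain)".to_string();
    // ... request finalization from the Chain ...
}

// Elsewhere (e.g. the debug page), the same handle is read:
fn render_status(status: &ShardStatus) -> String {
    status.lock().unwrap().clone()
}
```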

/// would be blocked by the computation, thereby not allowing computation of other
/// futures driven by the same driver to proceed. This function respawns the future
/// onto the FutureSpawner, so the driver of the returned future would not be blocked.
fn respawn_for_parallelism<T: Send + 'static>(

Contributor:

Instead of this, is it not possible to keep the compute-intensive tasks small enough that they don't block the driver? Effectively "transferring" the future to a different spawner looks a bit odd and awkward overall.
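For illustration, here is a minimal sketch of the respawn pattern being discussed, written against tokio's spawner and a oneshot channel rather than the PR's actual FutureSpawner API (which is not reproduced here): the heavy future is moved onto another task, and the caller only awaits a cheap result handle, so its driver stays responsive.

```rust
use std::future::Future;
use tokio::sync::oneshot;

/// Sketch only: run `fut` on a separate tokio task so the caller's driver is
/// not blocked by its computation; the returned future merely awaits the result.
async fn respawn_for_parallelism<T: Send + 'static>(
    fut: impl Future<Output = T> + Send + 'static,
) -> T {
    let (tx, rx) = oneshot::channel();
    // The compute-heavy future runs on its own task...
    tokio::spawn(async move {
        let _ = tx.send(fut.await);
    });
    // ...while this future only polls the channel, so whichever driver polls it
    // can keep making progress on its other futures.
    rx.await.expect("respawned task dropped before sending its result")
}
```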

/// and then in `run` we process them.
header_validation_queue: UnboundedReceiver<StateHeaderValidationRequest>,
chain_finalization_queue: UnboundedReceiver<ChainFinalizationRequest>,
chain_finalization_sender: UnboundedSender<ChainFinalizationRequest>,

Contributor:

Why do we need to store the sender as part of StateSync?
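For context, a minimal sketch of the queue pattern being asked about, using tokio unbounded channels and a hypothetical request type standing in for ChainFinalizationRequest (an illustration of the pattern, not the PR's code). One plausible reason to keep the sender on the struct is so that each newly spawned per-shard task can be handed a clone of it.

```rust
use tokio::sync::mpsc::{unbounded_channel, UnboundedReceiver, UnboundedSender};

// Hypothetical stand-in for ChainFinalizationRequest.
struct ChainFinalizationRequest {
    shard_id: u64,
}

struct StateSyncSketch {
    // Drained by the owner (e.g. in `run`) and handed to the Chain to process.
    chain_finalization_queue: UnboundedReceiver<ChainFinalizationRequest>,
    // Kept so that newly spawned per-shard tasks can get a clone to send into.
    chain_finalization_sender: UnboundedSender<ChainFinalizationRequest>,
}

impl StateSyncSketch {
    fn new() -> Self {
        let (tx, rx) = unbounded_channel();
        Self { chain_finalization_queue: rx, chain_finalization_sender: tx }
    }

    fn spawn_shard_sync(&self, shard_id: u64) {
        let sender = self.chain_finalization_sender.clone();
        tokio::spawn(async move {
            // ... download and apply the shard's state ...
            // Finalization has to be done by the Chain, so send a request back.
            let _ = sender.send(ChainFinalizationRequest { shard_id });
        });
    }

    fn drain_requests(&mut self) -> Vec<ChainFinalizationRequest> {
        let mut requests = Vec::new();
        while let Ok(req) = self.chain_finalization_queue.try_recv() {
            requests.push(req);
        }
        requests
    }
}
```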

}
Err(TryRecvError::Empty) => entry.get().status(),
},
Entry::Vacant(entry) => {

Contributor:

Could we encapsulate this into a function like self.start_state_sync_for_shard for better readability?
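To make the suggestion concrete, here is a sketch of what such a self.start_state_sync_for_shard helper could look like; the types below are simplified stand-ins rather than the PR's actual ones.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Simplified stand-ins for the real types.
type SyncHash = [u8; 32];
type ShardId = u64;

#[derive(Clone, Copy)]
enum ShardSyncStatus {
    StateDownloadHeader,
}

struct StateSyncShardHandle {
    status: ShardSyncStatus,
}

struct StateSyncSketch {
    shard_syncs: HashMap<(SyncHash, ShardId), StateSyncShardHandle>,
}

impl StateSyncSketch {
    /// Suggested helper: return the current status for the shard, starting a
    /// new sync (the Entry::Vacant branch) if one isn't already running.
    fn start_state_sync_for_shard(
        &mut self,
        key: (SyncHash, ShardId),
    ) -> ShardSyncStatus {
        match self.shard_syncs.entry(key) {
            Entry::Occupied(entry) => entry.get().status,
            Entry::Vacant(entry) => {
                // ... spawn the per-shard future here ...
                let handle = StateSyncShardHandle {
                    status: ShardSyncStatus::StateDownloadHeader,
                };
                entry.insert(handle).status
            }
        }
    }
}
```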

chain_finalization_sender: UnboundedSender<ChainFinalizationRequest>,

/// There is one entry in this map for each shard that is being synced.
shard_syncs: HashMap<(CryptoHash, ShardId), StateSyncShardHandle>,

Contributor:

Wait, where and how are we adding new entries into this?

}

/// Processes the requests that the state sync module needed the Chain for.
fn process_chain_requests(&mut self, chain: &mut Chain) {

Contributor:

I'm slightly wary about having this setup for the sync stuff happening here...
While this is objectively better for our use case, we have a pattern where we send an actix message to the client to handle everything that should be done in sync in the client, and this sort of breaks that... I can't think of better ways or alternatives, though.

sync_status: shards_to_split.clone(),
download_tasks: Vec::new(),
computation_tasks: Vec::new(),
},
BlocksCatchUpState::new(sync_hash, *epoch_id),
)
});

// For colour decorators to work, they need to be printed directly. Otherwise the decorators get escaped, garble the output, and don't add colours.

Collaborator:

nit: This comment is no longer relevant

@@ -2079,13 +2044,21 @@ impl ClientActorInner {

if block_hash == sync_hash {
// The first block of the new epoch.
if let Err(err) = self.client.chain.validate_block(&block) {

Collaborator:

Currently we see a lot of "Received an invalid block during state sync" spam during state sync because the node doesn't know how to validate blocks at the head of the chain. I think it makes sense to only validate the blocks that state sync is specifically looking for.

part_id: *part_id,
},
);
let state_value = PendingPeerRequestValue { peer_id: None, sender };

Collaborator:

In the handling for NetworkRequests::StateRequestPart we select a specific peer from which to request the part, so it should be possible to store that and verify that the response comes back from the expected peer. However, it might be a bit ugly to pass the selected peer id back here from the network side of things, and I expect to redo how the state headers work soon, so I am fine with just leaving this as-is for now.
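As a concrete (and purely hypothetical) illustration of the verification being described, the pending-request value could record the peer the part was requested from and drop responses from anyone else. All names below are stand-ins, not the PR's real types.

```rust
use tokio::sync::oneshot;

// Hypothetical stand-ins; not the PR's real types.
type PeerId = String;
type StatePartResponse = Vec<u8>;

struct PendingPeerRequestValue {
    // Today this is None; the idea is to record the peer the request was routed to.
    expected_peer: Option<PeerId>,
    sender: oneshot::Sender<StatePartResponse>,
}

impl PendingPeerRequestValue {
    /// Deliver a response only if it came from the peer we asked (when known).
    fn deliver(self, from: &PeerId, response: StatePartResponse) -> bool {
        if let Some(expected) = &self.expected_peer {
            if expected != from {
                return false; // ignore a response from an unexpected sender
            }
        }
        self.sender.send(response).is_ok()
    }
}
```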

@@ -1100,7 +1100,14 @@ impl PeerActor {
.map(|response| PeerMessage::VersionedStateResponse(*response.0)),
PeerMessage::VersionedStateResponse(info) => {
//TODO: Route to state sync actor.

Collaborator:

seems outdated

let downloader = Arc::new(StateSyncDownloader {
clock,
store: store.clone(),
preferred_source: peer_source,

Collaborator:

At the moment we are in an ugly situation with state headers; a node needs to be tracking all shards to serve headers for any shard, nodes no longer track all shards, and the strategy for trying to obtain headers from the network is to request them from direct peers of the node at random. It is a remnant of the times when every peer tracked every shard, and it needs to be rewritten entirely.

Before this PR, if an external source was available we would just directly get the headers from it without any attempts to get it from the network. It looks like we are changing that now and will try the network first for headers. I think it should be OK; just giving a heads up that there is a behavior change hidden here.
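To make the described behavior change easier to see, here is a rough sketch of the order of attempts; the helper functions are hypothetical placeholders, and the real downloader has more structure (retries, timeouts, per-shard state).

```rust
// Hypothetical types and helpers, purely to illustrate the fallback order.
type ShardStateHeader = Vec<u8>;

async fn request_header_from_random_peer(_shard_id: u64) -> Option<ShardStateHeader> {
    None // placeholder: ask a random direct peer, which may not track the shard
}

async fn request_header_from_external(_shard_id: u64) -> Option<ShardStateHeader> {
    None // placeholder: fetch from the configured external source
}

/// Per the comment above, after this PR the network is tried first and the
/// external source is only used as a fallback.
async fn download_header(shard_id: u64, has_external: bool) -> Option<ShardStateHeader> {
    if let Some(header) = request_header_from_random_peer(shard_id).await {
        return Some(header);
    }
    if has_external {
        return request_header_from_external(shard_id).await;
    }
    None
}
```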

/// headers and parts in parallel for the requested shards, but externally, all that it exposes
/// is a single `run` method that should be called periodically, returning that we're either
/// done or still in progress, while updating the externally visible status.
pub struct StateSync {

Contributor:

Btw, is the async part compatible with testloop?
