feat: add a sanity check for sync state of rocksdb vs sqlite, minor partitioning changes #1532
Conversation
bin/node/src/commands/store.rs
Outdated
/// has diverged from the database. This will delete existing tree storage and rebuild
/// it from scratch, which may take some time for large databases.
#[arg(long = "force-rebuild-tree-storage", default_value_t = false)]
force_rebuild_tree_storage: bool,
Relevant code to review 👆
I'm not convinced we want this to be an explicit option. We can also trigger this by manually deleting the database, which should be sufficient as long as things are mostly working okay.
From a deployment perspective, this is problematic because we will either have it permanently true or false. A one-off rebuild would require manual intervention in any case.
Good point! I think manually removing the files is fine. We should probably just document this (as the behavior may not be obvious).
I think the option is preferred, since it doesn't require knowing what all of this means. Deleting a specific subset of files seems like an unnecessary footgun for an operator.
But there is no operator that will be able to use it, since you cannot "deploy" with this option? It's only usable in a local context imo.
We similarly don't have an option to rebuild the sqlite database, for example, though technically it's possible from raw blocks.
imo the fewer options there are, the better. The rarer the occurrence, the more "manual" it should be. If this becomes a common problem, then sure, we can address it differently.
I agree that it should be done as a subcommand; the current form would require a separate deployment cycle, which makes it less useful.
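For illustration, a minimal sketch of the subcommand shape being suggested, assuming clap's derive API; StoreCli, StoreCommand, and RebuildTreeStorage are hypothetical names, not the node's actual CLI:

use clap::{Parser, Subcommand};

#[derive(Parser)]
struct StoreCli {
    #[command(subcommand)]
    command: StoreCommand,
}

#[derive(Subcommand)]
enum StoreCommand {
    /// Stands in for the existing command that runs the store service.
    Start,
    /// Hypothetical one-off maintenance command: delete the tree storage,
    /// rebuild it from the sqlite database, and exit.
    RebuildTreeStorage,
}

fn main() {
    match StoreCli::parse().command {
        StoreCommand::Start => { /* normal startup */ }
        StoreCommand::RebuildTreeStorage => { /* rebuild trees, then exit */ }
    }
}

Run once and manually, such a subcommand sidesteps the problem of a flag that is permanently true or false across deployments.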
Left markers for most relevant code changes
bobbinth left a comment
Looks good! Thank you! I left some comments inline - many of them are for the future.
NullifierTree::with_storage_from_entries(self, entries)
    .map_err(StateInitializationError::FailedToCreateNullifierTree)
Not for this PR, but it would be great to make the methodology of loading account and nullifier trees consistent. Currently, for the nullifier tree we use NullifierTree::with_storage_from_entries() and for the account tree we use AccountTree::new().
Some of these changes may need to happen in miden-base. Let's create issues for these (unless we have them already).
My thinking is that with_storage_from_entries won't work anyway for larger nullifier sets, and hence this will go away in the next PR #1536.
bin/node/src/commands/store.rs
Currently we return an error if we have a root mismatch during apply_block.
miden-node/crates/store/src/state.rs, lines 423 to 425 in a2de154:

if nullifier_tree_update.as_mutation_set().root() != header.nullifier_root() {
    return Err(InvalidBlockError::NewBlockInvalidNullifierRoot.into());
}
Should we have a similar plan if the corruption happens "live"? It's a bit of a weird situation, since we can't know if it's an SMT bug or a block-building bug.
I think we'd just bubble the error up and find the corrupted tree store on startup, and do it that way.
Right, but this doesn't actually bubble up anywhere. This is part of the gRPC server on the store side, so it gets sent back to the block-producer, which shrugs, rolls the block back, and tries again.
I'll create a follow-up issue; maybe it's time for fatality to disambiguate which errors are relevant for the query response and which are an actual issue for the running service.
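For illustration, a hedged sketch of what such a fatality split could look like; BlockApplicationError and its variants are hypothetical, not miden-node's actual error types, and the tracing crate is assumed:

/// Hypothetical error type; the names are illustrative only.
#[derive(Debug)]
enum BlockApplicationError {
    /// The submitted block is invalid; reject it and keep serving.
    InvalidNullifierRoot,
    /// Local tree storage disagrees with the database; the store itself is unhealthy.
    TreeStorageDiverged,
}

impl BlockApplicationError {
    /// True if the error means the service is broken, as opposed to
    /// the request merely being rejectable.
    fn is_fatal(&self) -> bool {
        matches!(self, Self::TreeStorageDiverged)
    }
}

fn handle_apply_block_error(err: BlockApplicationError) {
    if err.is_fatal() {
        // Surface to the service supervisor instead of only replying over gRPC,
        // so corruption is noticed immediately rather than on the next restart.
        tracing::error!(?err, "fatal store error; shutting down");
    } else {
        tracing::warn!(?err, "rejecting invalid block");
    }
}

With a split like this, the gRPC handler could send non-fatal errors back to the block-producer as today, while fatal ones take the service down instead of being silently retried.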
ntx_builder_listener,
block_producer_listener,
data_directory: dir,
grpc_timeout: std::time::Duration::from_secs(30),
How did this compile previously??
nullifier_tree: miden_protocol::block::nullifier_tree::NullifierTree<
    miden_protocol::crypto::merkle::smt::LargeSmt<S>,
>,
nit (but let's address in a follow-up): we don't need to use fully-qualified paths here.
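For reference, the same field written with use imports instead of fully-qualified paths; a sketch based only on the paths shown above:

use miden_protocol::block::nullifier_tree::NullifierTree;
use miden_protocol::crypto::merkle::smt::LargeSmt;

// Same type as above, with the paths imported at module scope.
nullifier_tree: NullifierTree<LargeSmt<S>>,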
From #1326 (comment):
Changes:
- --force-rebuild flag to the CLI to force re-building of the tree stores from sqlite