
Add dedicated IBD rules for Nakamoto #5655

Open · wants to merge 39 commits into base: develop
Conversation

@jcnelson (Member) commented on Jan 4, 2025:

This fixes #5642 by adding a dedicated IBD (initial block download) inference rule for Nakamoto. The Stacks node is in IBD mode when either of the following conditions is true:

  • The sortition height is less than the Bitcoin block height
  • The highest available tenure (as determined by the Nakamoto block downloader) is higher than the ongoing Stacks tenure

This (hopefully) fixes some edge cases we've seen on testnet whereby a node can erroneously believe it is not synced when it really is, which impaired the affected node's ability to participate in StackerDB replication. I intend to test this on naka3.sh, on testnet, and on a single mainnet signer.

Writing the code for that second criterion led to the discovery of #5649 and #5650, since I had been using rc_consensus_hash as the ongoing Stacks tenure when in fact it was not.

EDIT: this also contains #5667, so let's merge that first.
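
For illustration, here is a minimal sketch of how the two IBD conditions above could be combined. The names (`sortition_height`, `burnchain_height`, `highest_available_tenure`, `ongoing_tenure`) are assumptions for the example, not the identifiers actually used in this PR.

```rust
// A minimal sketch of the two IBD conditions described above. All names here
// (sortition_height, burnchain_height, highest_available_tenure, ongoing_tenure)
// are illustrative assumptions, not the PR's actual identifiers.
fn is_in_ibd(
    sortition_height: u64,
    burnchain_height: u64,
    highest_available_tenure: Option<u64>,
    ongoing_tenure: u64,
) -> bool {
    // Condition 1: the node's sortitions lag the Bitcoin (burnchain) chain tip.
    let behind_burnchain = sortition_height < burnchain_height;

    // Condition 2: the Nakamoto block downloader knows of a tenure higher than
    // the ongoing Stacks tenure. If no highest tenure is known, this condition
    // does not apply.
    let behind_tenures =
        highest_available_tenure.map_or(false, |highest| highest > ongoing_tenure);

    behind_burnchain || behind_tenures
}
```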

@jcnelson jcnelson requested a review from a team as a code owner January 4, 2025 05:38
@obycode (Contributor) previously approved these changes on Jan 6, 2025, commenting:

LGTM!

@jbencin (Contributor) previously approved these changes on Jan 6, 2025, commenting:

Left some style comments but looks fine to me

@aldur aldur added this to the 3.1.0.0.3 milestone Jan 7, 2025
@jcnelson jcnelson modified the milestones: 3.1.0.0.4, 3.1.0.0.3 Jan 13, 2025
@aldur aldur modified the milestones: 3.1.0.0.3, 3.1.0.0.4 Jan 13, 2025
@aldur aldur modified the milestones: 3.1.0.0.4, 3.1.0.0.5 Jan 21, 2025
@jferrant (Contributor) previously approved these changes on Jan 21, 2025.

@obycode left a comment:

Just some minor comments for now, but there are conflicts and failing tests.

@aldur aldur modified the milestones: 3.1.0.0.5, 3.1.0.0.6 Feb 4, 2025
@kantai kantai changed the title Fix/5642 Add dedicated IBD rules for Nakamoto Feb 4, 2025
Comment on lines 280 to 278
if TEST_BLOCK_ANNOUNCE_STALL.get() {
if relay::fault_injection::stacks_announce_is_blocked() {
A contributor commented:

These seem like they could be independently controlled... TEST_BLOCK_ANNOUNCE_STALL is used in the integration tests to prevent the miner themselves from announcing a new block (this is used to test that the signer set can announce blocks by themselves) -- but if the relayer stall is also active, it seems like it prevents the chains coordinator thread from waking up?

@jcnelson (Member, Author) replied:

At the time this PR was written, this was an attempt to make some CI tests less flaky. It appears that not only is this not necessary anymore, but also my code here has led to CI breakage.
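
As an aside on the reviewer's point about independent control, here is a sketch (using assumed names, not the actual stacks-core API) of keeping the two fault-injection switches separate, so that stalling the miner's own announcement never implies stalling the relayer:

```rust
// Hypothetical stand-ins for TEST_BLOCK_ANNOUNCE_STALL and the relayer's
// announce-blocked fault-injection flag; each can be flipped by test code
// without affecting the other.
use std::sync::atomic::{AtomicBool, Ordering};

static MINER_ANNOUNCE_STALL: AtomicBool = AtomicBool::new(false);
static RELAYER_ANNOUNCE_BLOCKED: AtomicBool = AtomicBool::new(false);

/// Test hook: should the miner skip announcing its own block?
fn miner_announce_is_stalled() -> bool {
    MINER_ANNOUNCE_STALL.load(Ordering::SeqCst)
}

/// Test hook: should the relayer skip announcing Stacks blocks?
fn relayer_announce_is_blocked() -> bool {
    RELAYER_ANNOUNCE_BLOCKED.load(Ordering::SeqCst)
}
```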

Comment on lines 955 to 956
config.tenure_last_block_proposal_timeout = Duration::from_secs(0);

A contributor commented:

How is the forked_tenure_testing test related to this changeset? Why did this test need to be altered, and can you describe the intended behavior changes to the test?

@jcnelson (Member, Author) replied:

I've reverted it -- it was part of me trying to un-flake this test from a while ago.

Comment on lines +417 to +419
// if the highest available tenure is known, then is it the same as the ongoing stacks
// tenure? If so, then we're not IBD. If not, then we're IBD.
// If it is not known, then we're not in IBD.
A contributor commented:

Doesn't this mean that a running node operating at chain tip would switch into IBD = true at every tenure boundary?

@@ -400,6 +400,67 @@ impl RelayerThread {
|| !self.config.miner.wait_for_block_download
}

/// Compute and set the global IBD flag from a NetworkResult
A contributor commented:

Suggested change:
- /// Compute and set the global IBD flag from a NetworkResult
+ /// Compute and set the global initial block download (IBD) flag using data from the given NetworkResult

@kantai (Contributor) left a comment:

This changeset appears to have broken many of the integration tests.

This should also have some kind of unit test (or tight assertions in an integration test) about what the expected value of IBD should be given different values of the NetworkResult: it's not clear to me what the intended behavior around the arrival of a new tenure is, and then what the downstream impact on miner commitments would end up being.

I also have some questions about the necessity of this -- it seems like #5735 actually resolves the testnet genesis sync issues (and the mainnet ones as well), so what is this PR solving? Is it speeding up genesis sync? Restarts? I can't tell from looking at this PR.
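
To illustrate the kind of unit-level assertions requested above, here is a hedged sketch of a test. The helper is a local stand-in for the PR's IBD rule (assumed signature and values, not the actual RelayerThread method); a real test would drive the decision from a NetworkResult rather than bare heights.

```rust
// A sketch of unit tests pinning down the expected IBD value for a few cases.
// `is_in_ibd` below is a stand-in with an assumed signature, not the PR's code.
#[cfg(test)]
mod ibd_flag_tests {
    fn is_in_ibd(
        sortition_height: u64,
        burnchain_height: u64,
        highest_available_tenure: Option<u64>,
        ongoing_tenure: u64,
    ) -> bool {
        sortition_height < burnchain_height
            || highest_available_tenure.map_or(false, |h| h > ongoing_tenure)
    }

    #[test]
    fn ibd_when_sortitions_lag_burnchain() {
        assert!(is_in_ibd(90, 100, None, 5));
    }

    #[test]
    fn ibd_when_a_newer_tenure_is_downloadable() {
        assert!(is_in_ibd(100, 100, Some(6), 5));
    }

    #[test]
    fn not_ibd_at_chain_tip() {
        assert!(!is_in_ibd(100, 100, Some(5), 5));
    }
}
```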

@aldur aldur removed this from the 3.1.0.0.6 milestone Feb 11, 2025
Labels: none yet
Projects: 📋 Backlog
6 participants