Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dynamic Protocol State] EpochStateContainer stores epoch active identities #4834

Conversation

durkmurder
Copy link
Member

Context

This PR is an attempt to change ActiveIdentities to store only participants that can contribute to extending the chain instead of storing participants for current and prev/next epoch depending on phase.

Previously ActiveIdentities were storing current + prev epoch participants in staking phase and current + next in setup/commit phases. By storing identities for previous epoch it allows us to store only identities that were introduced in current epoch as part of ActiveIdentities. In nutshell ActiveIdentities holds dynamic part of identities that were introduced in epoch setup event.

I will be updating docs and tests but don't expect any major changes to the implementation itself, so feel free to review.

Alexander Hentschel added 2 commits October 20, 2023 01:15
• revised goDoc of protocol state implementation
• changed `ProtocolStateEntry.PreviousEpoch` to pointer (consistent with `ProtocolStateEntry.NextEpoch`)
• There should be no significant algorithmic changes. Though, I have switched the order of the if and else branches when processing an Epoch Setup event. For me, this is much more in line with the intuitive order of documentation.

state/protocol/protocol_state/updater.go:
• notable updates and revisions of goDoc
• tried to address concerns around inconsistent handling of invariances
• updated code to work with `ProtocolStateEntry.PreviousEpoch` being potentially a nil pointer

storage/badger/protocol_state_test.go:
• detailed goDoc revisions for test `assertRichProtocolStateValidity`
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This greatly improved code readability and mental structure for me. Amazing work. 👏

There are minor some concerns about a couple places in the code. I have created PR #4854 (targeting your branch) that addresses the first of my concerns but not the second one. Furthermore, the PR updated, polishes and extends a lot of goDocs.

storage/badger/protocol_state_test.go Outdated Show resolved Hide resolved
model/flow/protocol_state.go Outdated Show resolved Hide resolved
model/flow/protocol_state.go Show resolved Hide resolved
model/flow/protocol_state.go Outdated Show resolved Hide resolved
model/flow/protocol_state_test.go Outdated Show resolved Hide resolved
state/protocol/protocol_state/updater.go Outdated Show resolved Hide resolved
@@ -187,17 +150,20 @@ func (u *Updater) ProcessEpochCommit(epochCommit *flow.EpochCommit) error {
// No errors are expected during normal operations.
func (u *Updater) UpdateIdentity(updated *flow.DynamicIdentityEntry) error {
Copy link
Member

@AlexHentschel AlexHentschel Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️

I have a hard time convincing myself that this code is correct. Just as a brief sketch of my thoughts:

  1. at the point when we are leaving an epoch, the Identity Table (EpochStateContainer.ActiveIdentities) essentially snapshots the latest state when these nodes were active.
    • In my opinion, changes of the dynamic identities essentially should only carry forward (causality) but not backwards.
    • Hence, ProtocolStateEntry.PreviousEpoch should never be modified. But we do here:
      prevEpochIdentity, foundInPrev := u.prevEpochIdentitiesLookup[updated.NodeID]
      if foundInPrev {
      prevEpochIdentity.Dynamic = updated.Dynamic
      }
  2. changing dynamic properties of node X affects the current and next epoch. However, as the weight might change at epoch boundaries, we cannot simply overwrite the weight for the current and next epoch by the same value. I think the following code is an incorrect simplification:
    currentEpochIdentity, foundInCurrent := u.currentEpochIdentitiesLookup[updated.NodeID]
    if foundInCurrent {
    currentEpochIdentity.Dynamic = updated.Dynamic
    }
    nextEpochIdentity, foundInNext := u.nextEpochIdentitiesLookup[updated.NodeID]
    if foundInNext {
    nextEpochIdentity.Dynamic = updated.Dynamic
    }

This is not addressed in my PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand we will address it in subsequent PR

Copy link
Member

@AlexHentschel AlexHentschel Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update:

  1. We discussed this point and decided that we really should continue to modify the identities also from past epochs.

    • We are essentially creating a snapshot of the identity table as of a certain block. By modifying an identity, we only carry this change forward.
    • Furthermore, we need to be able to modify "our current value" for a node X that participated in the last epoch but not in the current. Otherwise, we would not be able to eject X in the current epoch despite it misbehaving.

    Conclusion: current approach is correct.

  2. This will be addressed in a subsequent PR: we will replace (dynamic) weight by an enum representing the participation state (joining, active, leaving)

    • Currently, we maintain a dynamic weight but only differentiate between zero weight (not active) and positive (active).
    • It is not clear whether the protocol would benefit from dynamically changing trust weight in the future. Reasoning:
      • For consensus participants (consensus nodes, collector nodes), we anyway use their initial weight. The practical benefits are massive, because this enables light clients that don't need to download and locally check every block header.

        Leader selection is also pre-determined for an entire epoch based on initial weight.

      • For other Verification [VNs] and Execution [ENs] nodes we currently don't have meaningful ways for weight-based load balancing. ENs anyway need to execute every block and our chunk-assignment algorithm generates uniform load across all VNs. It is not clear whether weighted protocol variants even exist (and even if such algorithms exist [open research topic], a large amount of complicated software changes would likely be needed).

    Reworked TODO in [Dynamic Protocol State] Remaining work and ToDos #4649 (second item in High priority) to reflect our latest approach to this challenge

state/protocol/protocol_state/updater.go Outdated Show resolved Hide resolved
state/protocol/protocol_state/updater.go Outdated Show resolved Hide resolved
state/protocol/protocol_state/updater.go Outdated Show resolved Hide resolved
Comment on lines 92 to 93
// - Per convention, the system smart contracts should list the IdentitySkeletons in canonical order. This is useful for
// most efficient construction of the full active Identities for an epoch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The system contract does not list identities in canonical order. Conceptually, the identity table is an unordered set. We assume it is ordered a particular way at various points in the flow-go code to simplify various implementation details, like the DKG.

We enforce that the identity list used within flow-go is ordered by ordering it during the conversion from cadence to Go:

participants = participants.Sort(order.Canonical)
.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update our smart contract to do this? Either way we would need to work on it in future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have the smart contract do sorting. It's a resource-constrained environment and it is not critical that the set be sorted within the system contract.

It seems fine to me for the protocol layer to translate the event into a representation that is easier for it to work with.

Copy link
Member

@AlexHentschel AlexHentschel Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jordan for flagging that my comment in the code was incorrect. I have updated it in 49934ce to read:

// sanity checking SAFETY-CRITICAL INVARIANT (II):
// - Per convention, the `flow.EpochSetup` event should list the IdentitySkeletons in canonical order. This is useful
// for most efficient construction of the full active Identities for an epoch. We enforce this here at the gateway
// to the protocol state, when we incorporate new information from the EpochSetup event.
// - Note that the system smart contracts manage the identity table as an unordered set! For the protocol state, we desire a fixed
// ordering to simplify various implementation details, like the DKG. Therefore, we order identities in `flow.EpochSetup` during
// conversion from cadence to Go in the function `convert.ServiceEvent(flow.ChainID, flow.Event)` in package `model/convert`

Thereby, the documentation precisely describes the current state, which I think is all that is needed for this PR (?)

My thoughts on where ordering could be implemented in the future

  1. On the one hand, I see (weak) benefits of not having an ordering requirement in the system smart contracts. Conceptually, a set is sufficient for the current functionality of the system smart contracts. Not requiring an ordering is one less details that engineers have know about and that could accidentally be broken.

    For similar reasons, a block proposer is not required to include seals in an ordered manner. Instead, we re-order them when ingesting the block, which improves modularity of the code by removing inter-dependencies.

  2. On the other hand, in the future, the system smart contracts will also need to have the canonical ordering. This is because we want to eventually implement performance-dependent rewards for consensus nodes, which in turn requires a smart contract to decode the signer indices for blocks to determine which node contributed to a QC (the basis for the subsequent reward payout at the end of the epoch). To decode the signer indices, the smart contract needs to know the ordering.

    • We then have two choices: (i) the system smart contract maintains an un-ordered set and we re-order the identities at every block (a huge waste of computation) or (ii) the system smart maintains the identities in canonical ordering
    • Btw, note that to decode QC signer identities, we also need a notion of identities for each epoch. In other words, we would no longer be able top the current epoch's identity table at the end of the staking phase 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires a smart contract to decode the signer indices for blocks to determine which node contributed to a QC

Or, we could implement an API on the injected Block Cadence type that does this?

Copy link
Member

@AlexHentschel AlexHentschel Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could implement an API on the injected Block Cadence type that does this?

We could. At which point to we change the publisher of the data (system smart contract), vs layering on logic on the consumers (protocol state, dynamic reward's logic)? If we have only one consumer, I don't feel it makes much of a difference for the implementation because the ordering has to live either on the consumer or the publisher -- but putting the ordering into the consumer improves modularity. With two consumers, we are starting to duplicate the ordering logic. To me, that would be the point, where I think it would be more beneficial to put the ordering in the publisher.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insightful discussion on this point. I feel for the scope of this PR, the comment is addressed (?)

state/protocol/protocol_state/updater.go Outdated Show resolved Hide resolved
state/protocol/protocol_state/updater.go Show resolved Hide resolved
}
}
currentEpochParticipants := entry.CurrentEpochIdentityTable.Filter(func(identity *flow.Identity) bool {
_, found := entry.CurrentEpochSetup.Participants.ByNodeID(identity.NodeID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May not matter for a fixture, but ByNodeID does linear search so this is going to be $O(n^2)$

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I feel it's more expressive and it's worth paying the cost in such case(especially in tests)

Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, ignoring our discussion about removing Weight altogether from this morning, which we'll address in a separate PR.

The one change I'd like to make sure we include is using sentinels to denote when a service event is invalid: InvalidServiceEventError. That way we can handle that failure path in the upper-level logic later on.

Base automatically changed from yurii/4649-todos-and-refactoring-part-1 to feature/dynamic-protocol-state October 23, 2023 08:07
@codecov-commenter
Copy link

codecov-commenter commented Oct 23, 2023

Codecov Report

Attention: 60 lines in your changes are missing coverage. Please review.

Comparison is base (de3c468) 49.69% compared to head (c638874) 55.83%.

Additional details and impacted files
@@                        Coverage Diff                         @@
##           feature/dynamic-protocol-state    #4834      +/-   ##
==================================================================
+ Coverage                           49.69%   55.83%   +6.13%     
==================================================================
  Files                                 480      946     +466     
  Lines                               47751    87792   +40041     
==================================================================
+ Hits                                23729    49017   +25288     
- Misses                              22253    35092   +12839     
- Partials                             1769     3683    +1914     
Flag Coverage Δ
unittests 55.83% <61.53%> (+6.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
state/protocol/inmem/convert.go 66.54% <100.00%> (ø)
storage/badger/protocol_state.go 66.98% <100.00%> (ø)
state/protocol/protocol_state/updater.go 97.72% <93.47%> (ø)
state/protocol/badger/mutator.go 66.94% <0.00%> (ø)
model/flow/protocol_state.go 48.59% <71.01%> (ø)
utils/unittest/fixtures.go 0.00% <0.00%> (ø)

... and 465 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +1124 to +1128
if protocol.IsInvalidServiceEventError(err) {
// we have observed an invalid service event, which triggers epoch fallback mode
updater.SetInvalidStateTransitionAttempted()
return dbUpdates, nil
}
Copy link
Member

@AlexHentschel AlexHentschel Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️

I am a little worried about this:
At this point, we don't know that the updater's internal state is a valid Protocol State. For all we know, it could be complete garbage. There is no atomicity requirement on the StateUpdater as far as I can tell, in that it either applies the update entirely or not at all.

Suggestion:

  • I think it it is ok to treat all protocol state operations mutations in a block as one atomic state transition. It either succeeds in its entirety or in case of a failure, there is no update to the protocol state (except setting InvalidStateTransitionAttempted to true).
  • I think it is straight forward to extend the protocol state Updater to "drop all modifications if InvalidStateTransitionAttempted" and document this properly throughout the code base. In a nutshell, within the Build method I would check the state.InvalidStateTransitionAttempted first; in that case we could just return a copy of the parentState with only the state.InvalidStateTransitionAttempted set to true.
  • While the code changes are relatively small, we need to properly test this and add detailed documentation to the Updater, the corresponding interface, and ideally also the places where we call the Updater

Further thoughts:

  • If the updater is signalling that it encountered an invalid state transition via an InvalidStateTransitionAttempted, should the updater maybe already set its own InvalidStateTransitionAttempted flag? We would still raise the error to signal the failure to the caller. However, the caller would no longer be required to explicitly call updater.SetInvalidStateTransitionAttempted() (I'd leave that method, to allow Updater external logic to set this flag for other reasons).

I think we should not do this as part of the current PR. The PR is already big enough and we keep layering on changes which makes it hard to keep track of without re-reviewing the entire PR. Added todo to #4649

…`Updater.ProcessEpochSetup()` such that the documentation and implementation are in the same order
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is ready to be merged. Great work!

There are a few remaining comments, but I think those should all be addressed in subsequent PRs. I went through the PR comments that are still open and added the respective items as todos to #4649

Comment on lines 70 to 72
if u.state.InvalidStateTransitionAttempted {
return nil // won't process new events if we are in EECC
return nil // won't process new events if we are in epoch fallback mode.
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding these checks:

I would be inclined to move these check to the very beginning of the respective methods. Otherwise, we would probably continue encountering errors after we once set InvalidStateTransitionAttempted, because later events will not make sense anymore as the protocol state stoped updating.

Comment on lines 106 to 116
// sanity checking invariant (I):
currentEpochDynamicProperties, found := activeIdentitiesLookup[nextEpochIdentitySkeleton.NodeID]
if found && currentEpochDynamicProperties.Dynamic.Ejected { // invariance violated
return protocol.NewInvalidServiceEventErrorf("node %v is ejected in current epoch %d but readmitted by EpochSetup event for epoch %d", nextEpochIdentitySkeleton.NodeID, u.parentState.CurrentEpochSetup.Counter, epochSetup.Counter)
}

nextEpochIdentities := make(flow.DynamicIdentityEntryList, 0, len(currentEpochIdentities))
currentEpochIdentitiesLookup := currentEpochIdentities.Lookup()
// For an `identity` participating in the upcoming epoch, we effectively perform steps 2 and 3 from above within a single loop.
for _, identity := range epochSetup.Participants {
// Step 2: node is _not_ participating in the current epoch, but joining in the upcoming epoch.
// The node is allowed to join the network already in this epoch's Setup Phase, but has weight 0.
if _, found := currentEpochIdentitiesLookup[identity.NodeID]; !found {
currentEpochIdentities = append(currentEpochIdentities, &flow.DynamicIdentityEntry{
NodeID: identity.NodeID,
Dynamic: flow.DynamicIdentity{
Weight: 0,
Ejected: false,
},
})
// sanity checking invariant (II):
if idx > 0 && !order.IdentifierCanonical(prevNodeID, nextEpochIdentitySkeleton.NodeID) {
return protocol.NewInvalidServiceEventErrorf("epoch setup event lists active participants not in canonical ordering")
}
prevNodeID = nextEpochIdentitySkeleton.NodeID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty to reorder these and the respective in-code documentation above such that the documentation and implementation are in the same order 👉 c638874

@durkmurder durkmurder merged commit e210519 into feature/dynamic-protocol-state Oct 24, 2023
36 checks passed
@durkmurder durkmurder deleted the yurii/4649-prev-epoch-refactoring-attempt branch October 24, 2023 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants