
[BFT] Recoverable Random Beacon State Machine #6725

Open
durkmurder opened this issue Nov 15, 2024 · 2 comments
durkmurder commented Nov 15, 2024

🤯 would desire stronger isolation between the business logic of ReactorEngine and BeaconKeyRecovery

I have spent multiple hours on the possible interactions of the ReactorEngine with the BeaconKeyRecovery, and I am still very much worried about open edge cases. Aspects that possibly overlap across both components:

  • When precisely the ReactorEngine declares flow.DKGEndStateSuccess.
  • When precisely the key is labeled as safe (which is decided by yet another component: SafeBeaconPrivateKeys).
  • Asynchronicity between ReactorEngine and BeaconKeyRecovery, despite them both accessing the same database fields (ReactorEngine acts upon EpochCommittedPhaseStarted while BeaconKeyRecovery listens to EpochFallbackModeExited, which are possibly emitted at the same time).

I have been digging into the code, and my gut feeling is that it's correct. But the problem is that there is a whole bunch of hypothetical edge cases and scenarios where we have to confirm that things can't go wrong. I don't think there are any broken edge cases, but we also provide no exhaustive argument why. In my opinion, the logic is distributed over two asynchronous components with weak isolation modifying the same state, which makes it too time-intensive to affirm correctness just implicitly from the code.

Thoughts / my mindset

I now finally understand the critical role of the DKGState as an isolation layer. What I am struggling with is the precise specification of which state transitions are allowed.

Concrete suggestion:

  1. I have tried to draw the state machine, which I think DKGState should enforce:
    Scratches - page 17. Let's review that and make sure we all agree.
  2. My gut feeling is that the DKGState permits all those state transitions, but also a few more that shouldn't be allowed. For example,
    • I would be inclined to reject modifications of the private Random Beacon key once the flow.DKGEndState reaches Success or Recovered (we should silently swallow identical write requests, in the spirit of building an information-driven system; only inconsistent changes of MyBeaconPrivateKey would be refused with an exception). At this point, a "safe read" of the key could already succeed, and we can't change the value anymore without risking slashing. This should be documented and enforced in the DKGState (this removes interdependencies between ReactorEngine and BeaconKeyRecovery for correct behaviour); see the sketch after the interface listing below.
  3. I would be inclined to rename the DKGState implementation to RecoverablePrivateBeaconKeyState
  4. I would recommend restructuring the interfaces (my gut feeling is this wouldn't break a lot of code):
    // SafeBeaconKeys is a safe way to access beacon keys.
    type SafeBeaconKeys interface {
    
    	// RetrieveMyBeaconPrivateKey retrieves my beacon private key for the given
    	// epoch, only if my key has been confirmed valid and safe for use.
    	//
    	// Returns:
    	//   - (key, true, nil) if the key is present and confirmed valid
    	//   - (nil, false, nil) if the key has been marked invalid or unavailable
    	//     -> no beacon key will ever be available for the epoch in this case
    	//   - (nil, false, storage.ErrNotFound) if the DKG has not ended
    	//   - (nil, false, error) for any unexpected exception
    	RetrieveMyBeaconPrivateKey(epochCounter uint64) (key crypto.PrivateKey, safe bool, err error)
    }
    
    // DKGStateReader is a read-only interface for accessing the current DKG state.
    type DKGStateReader interface {
    	SafeBeaconKeys
    	
    	// GetDKGStarted checks whether the DKG has been started for the given epoch.
    	// No errors expected during normal operation.
    	GetDKGStarted(epochCounter uint64) (bool, error)
    
    	// GetDKGEndState retrieves the end state for the given DKG.
    	// Error returns: storage.ErrNotFound
    	GetDKGEndState(epochCounter uint64) (flow.DKGEndState, error)
    
    	// UnsafeRetrieveMyBeaconPrivateKey retrieves the random beacon private key for an epoch.
    	//
    	// CAUTION: these keys are stored before they are validated against the
    	// canonical key vector and may not be valid for use in signing. Use SafeBeaconKeys
    	// to guarantee only keys safe for signing are returned.
    	// Error returns: storage.ErrNotFound
    	UnsafeRetrieveMyBeaconPrivateKey(epochCounter uint64) (crypto.PrivateKey, error)
    }
    
    // DKGState is the storage interface for storing all artifacts and state
    // related to the DKG process, including the latest state of a running or
    // completed DKG, and computed beacon keys.
    type DKGState interface {
    	DKGStateReader
    	
    	// SetDKGStarted sets the flag indicating the DKG has started for the given epoch.
    	// Error returns: storage.ErrAlreadyExists
    	SetDKGStarted(epochCounter uint64) error
    
    	// SetDKGEndState stores that the DKG has ended, and its end state.
    	// Error returns: storage.ErrAlreadyExists
    	SetDKGEndState(epochCounter uint64, endState flow.DKGEndState) error
    
    	// InsertMyBeaconPrivateKey stores the random beacon private key for an epoch.
    	//
    	// CAUTION: these keys are stored before they are validated against the
    	// canonical key vector and may not be valid for use in signing. Use SafeBeaconKeys
    	// to guarantee only keys safe for signing are returned.
    	// Error returns: storage.ErrAlreadyExists
    	InsertMyBeaconPrivateKey(epochCounter uint64, key crypto.PrivateKey) error
    }
    
    // EpochRecoveryMyBeaconKey is a specific interface that allows overwriting the beacon private key for a given epoch.
    // This interface is used *ONLY* in the epoch recovery process and only by the consensus participants.
    // Each consensus participant takes part in the DKG; after finishing the DKG protocol, each replica obtains a random beacon
    // private key, which is stored in the database along with the DKG end state, which will be equal to flow.DKGEndStateSuccess.
    // If the DKG fails for any reason, the private key will be nil and the DKG end state will be equal to flow.DKGEndStateDKGFailure.
    // This is not a problem by itself, but when epoch recovery takes place, we need to query the last valid beacon private key for
    // the current replica and set it for the recovered epoch; otherwise, replicas won't be able to vote for blocks in the recovered epoch.
    type EpochRecoveryMyBeaconKey interface {
    	DKGStateReader
    
    	// UpsertMyBeaconPrivateKey overwrites the random beacon private key for the epoch that recovers the protocol from
    	// epoch fallback mode. Effectively, this function overwrites whatever might be available in the database with
    	// the given private key for the current consensus participant.
    	// No errors are expected during normal operations.
    	UpsertMyBeaconPrivateKey(epochCounter uint64, key crypto.PrivateKey) error
    }
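
As an illustration of point 2, here is a minimal sketch of how the storage layer could enforce write-once semantics with idempotent writes. It assumes the interfaces above plus a hypothetical end state flow.DKGEndStateRecovered for the recovery path; the helper name and import paths are illustrative, not actual flow-go code:

    package sketch
    
    import (
    	"errors"
    	"fmt"
    
    	"github.com/onflow/flow-go/crypto"
    	"github.com/onflow/flow-go/model/flow"
    	"github.com/onflow/flow-go/storage"
    )
    
    // enforceBeaconKeyWrite sketches the rule from point 2: once the DKG end state is
    // Success or Recovered, an identical re-write of the key is silently swallowed,
    // while an inconsistent overwrite is rejected with an exception.
    func enforceBeaconKeyWrite(store storage.DKGState, epochCounter uint64, newKey crypto.PrivateKey) error {
    	endState, err := store.GetDKGEndState(epochCounter)
    	if errors.Is(err, storage.ErrNotFound) {
    		// no end state recorded yet -> DKG still in progress, writing is permitted
    		return store.InsertMyBeaconPrivateKey(epochCounter, newKey)
    	}
    	if err != nil {
    		return fmt.Errorf("could not retrieve DKG end state: %w", err)
    	}
    	// flow.DKGEndStateRecovered is an assumed name for the recovery end state
    	if endState == flow.DKGEndStateSuccess || endState == flow.DKGEndStateRecovered {
    		existing, err := store.UnsafeRetrieveMyBeaconPrivateKey(epochCounter)
    		if err != nil {
    			return fmt.Errorf("could not retrieve existing beacon key: %w", err)
    		}
    		if existing.Equals(newKey) {
    			return nil // idempotent write of the identical key: silently swallow
    		}
    		return fmt.Errorf("rejecting inconsistent overwrite of finalized beacon key for epoch %d", epochCounter)
    	}
    	return store.InsertMyBeaconPrivateKey(epochCounter, newKey)
    }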

This PR is already large enough. I would appreciate it if we could create an issue and address this in a follow-up PR.

Originally posted by @AlexHentschel in #6424 (comment)

durkmurder commented Nov 21, 2024

@AlexHentschel @jordanschalm I've made an iteration on your state machine representation which I think simplifies complexity and makes reasoning a bit simpler. It has same number of states(except I think yours has one extra recovery unless I missed something).
I've tried to remove DKG Started and rely only on DKGState which will represent current state. Merging "DKG started" into the state will simplify state management in my opinion.

Regarding DKG recovery, I have tried to make it explicit that before we can execute recovery, we have to have the previous epoch committed; it is not possible to commit the DKG and then enter the recovery state while staying in the same epoch.

[Image: revised DKG state machine diagram]
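
As a textual companion to the diagram, here is a minimal Go sketch of the revised state machine. The state names and the transition table are illustrative assumptions based on this thread, not actual flow-go types (the diagram is authoritative):

    // BeaconKeyState is an illustrative enum for the revised state machine,
    // with "DKG Started" folded into the state itself rather than a separate flag.
    type BeaconKeyState int
    
    const (
    	StateInitial BeaconKeyState = iota // no DKG information for this epoch yet
    	StateDKGStarted
    	StateDKGCompleted // DKG finished locally; key not yet confirmed safe
    	StateDKGFailed
    	StateDKGCommitted // terminal: key matches EpochCommit and is safe for signing
    	StateRecovered    // beacon key obtained via EFM recovery
    )
    
    // allowedTransitions encodes which target states are reachable from each state.
    // Per the discussion below, recovery is reachable from every state except the
    // terminal DKG Committed state.
    var allowedTransitions = map[BeaconKeyState][]BeaconKeyState{
    	StateInitial:      {StateDKGStarted, StateRecovered},
    	StateDKGStarted:   {StateDKGCompleted, StateDKGFailed, StateRecovered},
    	StateDKGCompleted: {StateDKGCommitted, StateRecovered},
    	StateDKGFailed:    {StateRecovered},
    	StateDKGCommitted: nil, // terminal
    	StateRecovered:    nil, // terminal in this sketch
    }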

@AlexHentschel commented

Very nice work, Yurii. I agree with all your revisions to the state machine:

  • I've tried to remove DKG Started and rely only on DKGState which will represent current state. Merging

    great simplification! Thanks for discovering that.

  • yours has one extra recovery [state]

    Yes, I agree there is no benefit in differentiating the two recovery states. From my perspective, there is a difference in the state transitions to DKG Recovered (specifically, where the private key is coming from). But for the resulting state, we don't really care about the origin of the private key -- we only care that we have a valid private key.

Minor suggestions

I have some minor thoughts / suggestions for extensions, which weren't part of my initial state machine sketch either ... my thinking has also evolved a little since reviewing your PR (maybe I just forgot some details already 😅).

(i) wording below the DKG Recovered state

pk* for epoch N+1 is the pk for epoch N if the state machine has entered "DKG Committed"

I am confused why the explanation of the DKG Recovered state mentions the DKG Committed state, because DKG Committed is a terminal state. Once we are in that state, the state machine will no longer change states. I completely agree with your comment:

it's not possible to commit DKG and then enter recovery state while staying in the same epoch.

so should the red description below DKG Recovered maybe read:

pk* for epoch N+1 is the pk for epoch N if the state machine has entered "DKG Completed"

?

(ii) specification of which paths can lead to DKG Recovered

pk* for epoch N+1 is the pk for epoch N if the state machine has entered "DKG ~~Committed~~ Completed" state. In other words, it's the safe pk from previous epoch. pk* must match EpochCommit for the next epoch (N+1)

I am not sure about this wording. I would guess we agree. Though, let's try to remove any ambiguity in the scientific wording. The reason why I got tripped up by the statement is the first sentence, which can be re-worded to (preserving the mathematical meaning):

"If the state machine has entered the "DKG Committed" state, then pk* for epoch N+1 is the pk for epoch N". You follow with the phrase "in other words, ...", which means we are expressing the previous as a different but equivalent statement. The problem is that the previous statement is incomplete, in my opinion. I'll try to provide a precise definition of all the cases which I think we should consider in the state machine. Ideally, I'd like to define the state machine such that the private key can come from anywhere, not only the last (or most recent) DKG.

  • situation (a): we are in the DKG Completed state and observe an OnEpochFallbackModeExited event, and the public key for epoch N+1 (from EpochCommit) matches our private key pk for epoch N+1 from the DKG Completed state. In this case we use pk.

    This could be the case if, for example, the collectors failed to submit their votes for the collector root blocks, but the DKG completed. I think it makes sense to allow the governance committee to still use the successful DKG results.

  • situation (b): we are in the DKG Completed state and observe an OnEpochFallbackModeExited event, and the public key (from EpochCommit) for epoch N+1 does not match our private key for epoch N+1 from the DKG Completed state, but the private key pk* for epoch N does. Then we use pk*.

  • situation (c): we are in the initialization state, the DKG Started state, or the DKG Failed state, and the public key (from EpochCommit) for epoch N+1 matches our private key pk* for epoch N. Then we use pk*.

  • situation (d): we are in the initialization state, the DKG Started, DKG Failed, or DKG Completed state, and the public key (from EpochCommit) for epoch N+1 matches the private key pk' manually provided by the operator (to be implemented). Then we use pk'.

    We don't have this path ("private key manually provided by the operator") implemented, but I think we should include this case in the state machine. To add this case in the future, implementing some way to inject an additional key into the node is relatively trivial. But updating the state machine would probably not be, while I think the overhead of covering this case already is not much extra work.

In terms of the state machine, my gut feeling is that we should allow state transitions to DKG Recovered from all states (including the initial state) except for DKG Committed. I think we can algorithmically consolidate all cases:
Let $pk^*$ denote the private key for epoch $N$, $pk$ the key for epoch $N+1$, and $pk'$ a private key manually provided by the operator. If a key doesn't exist, its value is nil (and in our implementation, $pk'$ would always be nil for now). Then, we iterate over the list of keys $[pk^*, pk, pk']$ and pick the first private key that matches the public key provided in the EpochCommit event.
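
A minimal Go sketch of this consolidated selection rule, using the crypto package from the interface listing above (the helper is hypothetical; canonicalPublicKey stands for this node's beacon public key as listed in the EpochCommit event):

    // selectRecoveryKey iterates over the candidate private keys in priority order,
    // e.g. [pk*, pk, pk'], and returns the first one whose public key matches the
    // canonical public key from the EpochCommit event. Nil candidates (keys that
    // don't exist) are skipped.
    func selectRecoveryKey(candidates []crypto.PrivateKey, canonicalPublicKey crypto.PublicKey) (crypto.PrivateKey, bool) {
    	for _, sk := range candidates {
    		if sk == nil {
    			continue // key does not exist, e.g. pk' while operator injection is unimplemented
    		}
    		if sk.PublicKey().Equals(canonicalPublicKey) {
    			return sk, true
    		}
    	}
    	return nil, false // no candidate matches -> we cannot transition to DKG Recovered
    }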

(iii) I don't think we have an ordering convention of OnEpochCommit and OnEpochFallbackModeExited

Maybe we should have one, e.g. OnEpochCommit is always emitted before OnEpochFallbackModeExited ... maybe you have used this to cover situation (a)? If we are relying on that property in the state machine diagram, let's include it as an explicit prerequisite in the diagram:

In case of EFM recovery, this state machine requires that OnEpochCommit is observed before OnEpochFallbackModeExited

(iv) let's limit the usage of "DKG" to the cases where it is strictly the DKG

Specifically, the naming of the state DKG Recovered is, in my opinion, overly specific with the term DKG. From my perspective, we don't really care where the Random Beacon key came from; all we care about is that we have a valid Random Beacon key. Hence, I would like to rename this state to BeaconKey Recovered.
