-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a journal, to persist and restore the algorithms' state? #421
Comments
On the other hand, that makes it easy to mistakenly persist secret keys and key shares, since those are part of the algorithm structs. Also, We should probably encourage the user to store the So we could implement a serializable state representation for each algorithm, which wouldn't include the Alternatively maybe we could add a journal to |
As discussed privately the "journal" function would be best. An essential feature for "production-ready" status. |
Conceptually, the event handler methods (
and they output:
The journal would be added to The naive way of implementing that is not an option: If we try to go through the code and create a journal entry in every one of the hundreds of places where we modify the mutable algorithm state, we'll never get it right, and the code wouldn't be maintainable; every change where we forget to add or update the corresponding journal entry creation would break it. I see three possible ways forward: 1. Only implement checkpoints.Instead, just make sure that the event handlers are all fully deterministic. In fact, this should already be the case! So the user can already emulate journaling functionality by:
Then after a crash or restart, the algorithm's state can be restored by replaying all inputs and using the same seeds. The output, outgoing messages and fault logs can all be thrown away, since they have already been handled. To avoid having to store all of this forever, it would suffice to make the algorithm state serializable. Then the log can periodically be deleted and replaced with the current snapshot. I suspect that this option would actually not be very wasteful: hbbft's message types don't contain much redundant information, so in normal operation the log might not even require more space than journal entries would. (To avoid storing messages that didn't change the state, we could add some flag to 2. Diff the states after event handling.All However, it can be wasteful to clone the full state every time. In some places, this could be optimized a bit, e.g. we'd certainly put the 3. Encapsulate the algorithm states.This would be the most invasive change, but it would avoid the cloning: Separate all the algorithms' states, and make their fields private. They can only be modified by calling This would mean almost a complete rewrite, though. |
Interesting approach using deterministic replay of hbbft input to get back to the state before shutdown. In Parity we already will have to persist outgoing, but unsent, messages - persisting incoming messages as long as a hbbft epoch is active should be similar in implementation. |
I don't think it makes sense to implement snapshot functionality for all the one-shot subalgorithms. And for So I'm actually not sure we need to add any functionality to hbbft at all. Here's what I think should already work:
|
Please reopen if there's anything we need to add to hbbft to support restarts and crash recovery. |
For Parity (poanetwork/parity-ethereum#181) and other applications, we need to make it possible to persist the state of consensus to disk. That allows a node to e.g. restart without becoming faulty.
We will add a journaling feature to hbbft: Each
Step
returned by the algorithms will contain journal log entries that represent the internal state change. These can be persisted, and even after a crash, the full state can be recovered by replaying the journal. That way, a node can continue to operate without being marked as faulty or skipping an epoch.To keep the journal from growing indefinitely, the entries' epoch number should be user-visible: The upper layer can always delete all entries from previous epochs.
So let's make all algorithm structs implement the serde traits (Serialize
andDeserialize
).The text was updated successfully, but these errors were encountered: