-
Notifications
You must be signed in to change notification settings - Fork 228
Self healing open #7189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Self healing open #7189
Conversation
This reverts commit e9cb10d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long review. The core logic looks sound as far as I've traced it, this is mostly about naming/coding style. I'll take another pass on the actual message flow logic tomorrow.
include/ccf/node/startup_config.h
Outdated
{ | ||
std::vector<std::string> addresses; | ||
ccf::ds::TimeString retry_timeout = {"100ms"}; | ||
ccf::ds::TimeString timeout = {"2000ms"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest we call this failover_timeout
, if that's the term for the fallback state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not unqualified timeout, is this when we stop waiting for new votes? If so something like ballot_timeout or recovery_ballot_timeout would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achamayou It is when we use the failover path to advance to the next phase rather than the election path.
I've tentatively renamed it to failover_retry.
src/node/rpc/node_frontend.h
Outdated
return make_error( | ||
HTTP_STATUS_INTERNAL_SERVER_ERROR, | ||
ccf::errors::InternalError, | ||
"This replica has already voted"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this more verbose - say who we did vote for. Even in plain-text, it'll save us debugging time somewhere down the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not introduce the term replica in the code, which is used nowhere at the moment. It's a node.
src/node/rpc/node_frontend.h
Outdated
if (is_invalid.has_value()) | ||
{ | ||
auto [code, message] = is_invalid.value(); | ||
return make_error(code, ccf::errors::InvalidQuote, message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we make everything InvalidQuote
here? That seems wrong in the BAD_REQUEST
path, we can add a new value to odata_error
for this, or use an existing generic error.
src/service/network_tables.h
Outdated
} | ||
|
||
// Self-healing open tables | ||
const SelfHealingOpenNodeInfo self_healing_open_node_info = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need these definitions any more? To access a table, you can have an instance:
struct Tables
{
MyTableType my_table_instance = {MY_TABLE_NAME};
};
Tables& tables = ...;
auto handle = tx.rw(tables.my_table_instance);
Or you can just access it directly by type and name:
auto handle = tx.rw<MyTableType>(MY_TABLE_NAME);
The former helps use a consistent type+name, but at the cost of a lot of boiler-plate. We used to use it to auto-generate wrappers for built-in tables (in which case these instances also need to be returned from something like get_all_internal_tables()
). But that's all vestigial and unused, and I think we should avoid this boilerplate for new tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent can do! I had a mix of both at one point and converged it in the wrong direction seemingly :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all updated.
tests/infra/clients.py
Outdated
55, | ||
60, | ||
]: # PEER_FAILED_VERIFICATION, SSL_CONNECT_ERROR | ||
]: # COULDNT_CONNECT, PEER_FAILED_VERIFICATION, SEND_ERROR, SSL_CONNECT_ERROR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't make a direct suggestion since some lines are untouched, but suggest we pair these comment lines directly:
# COULDNT_CONNECT
7,
# PEER_FAILED_VERIFICATION
35,
# SEND_ERROR
55,
# SSL_CONNECT_ERROR
60,
@@ -0,0 +1,11 @@ | |||
# Self-healing-open specification in [stateright](https://github.com/stateright/stateright) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not checking this stays fresh in any workflow, I don't think it should be checked into main
. We can put a pointer to this spec, on a branch, in the GH Discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, there is surely a way to add this to one of the verification pipelines we already have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory added now.
Co-authored-by: Eddy Ashton <[email protected]>
Co-authored-by: Eddy Ashton <[email protected]>
Co-authored-by: Eddy Ashton <[email protected]>
This PR is the reification of: #7003
The idea is that if a service has fully crashed, it should be able to heal itself so long as it isn't too damaged.
Specifically, the restarting nodes should gossip the knowledge they have locally to try and elect the replica with the best local state.
The result is that after the self-healing-open one replica is chosen to recover and open, while all others restart to then join it.
The protocol can be roughly surmised as:
transition_to_open
and broadcastIAmOpen
to the other nodes.IAmOpen
from a trusted node, restart and join it.This still requires the submission of ledger recovery shares, however if local sealing is available those can be used instead.