Self healing open #7189

cjen1-msft · 2025-08-15T10:15:55Z

This PR is the reification of: #7003

The idea is that if a service has fully crashed, it should be able to heal itself so long as it isn't too damaged.
Specifically, the restarting nodes should gossip the knowledge they have locally to try to elect the replica with the best local state.

The result is that, after the self-healing-open, one replica is chosen to recover and open, while all others restart and then join it.

The protocol can be roughly summarised as:

1. Start up.
2. Gossip your state (claimed length of ledger) and authenticate the other replicas' attestations and identities.
3. Once you have heard from everyone you expect to hear from, vote for the node with the longest ledger (ties broken by identity).
4. If you receive votes from a majority of your expected cluster, transition_to_open and broadcast IAmOpen to the other nodes.
5. If you receive IAmOpen from a trusted node, restart and join it.

This still requires the submission of ledger recovery shares; however, if local sealing is available, that can be used instead.
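The voting steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `Gossip` shape, the function names, and the choice that ties go to the highest identity are all assumptions (the description only says ties are broken by identity, not in which direction).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gossip:
    """What a restarting node claims about its local state (hypothetical shape)."""

    node_id: str  # identity, used here only for tie-breaking
    ledger_length: int  # claimed length of the local ledger


def choose_vote(gossips):
    """Pick the node with the longest claimed ledger.

    Assumption: ties are broken in favour of the highest node_id.
    """
    best = max(gossips, key=lambda g: (g.ledger_length, g.node_id))
    return best.node_id


def has_majority(votes_received, expected_cluster_size):
    """True once strictly more than half of the expected cluster has voted."""
    return votes_received > expected_cluster_size // 2


gossips = [Gossip("node-a", 120), Gossip("node-b", 150), Gossip("node-c", 150)]
chosen = choose_vote(gossips)
```

Because every node applies the same deterministic rule to the same gossiped data, nodes that heard from the same set of peers vote for the same candidate.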

eddyashton

Sorry for the long review. The core logic looks sound as far as I've traced it; this is mostly about naming and coding style. I'll take another pass on the actual message flow logic tomorrow.

include/ccf/service/tables/self_healing_open.h

eddyashton · 2025-09-23T12:44:22Z

include/ccf/node/startup_config.h

+  {
+    std::vector<std::string> addresses;
+    ccf::ds::TimeString retry_timeout = {"100ms"};
+    ccf::ds::TimeString timeout = {"2000ms"};


Suggest we call this failover_timeout, if that's the term for the fallback state?

Definitely not an unqualified timeout. Is this when we stop waiting for new votes? If so, something like ballot_timeout or recovery_ballot_timeout would be better.

@achamayou It is when we use the failover path to advance to the next phase rather than the election path.
I've tentatively renamed it to failover_retry.
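To make the distinction in this thread concrete, here is a minimal sketch of how I read it: a phase advances via the election path once every expected node has been heard from, and via the failover path if the timeout elapses first. The function name, parameter names, and return values are hypothetical, and the real condition in the PR may differ; only the 2000ms default comes from the diff above.

```python
def next_step(heard_from, expected, elapsed_ms, failover_timeout_ms=2000):
    """Return which path advances the phase, or None to keep waiting.

    Assumption: the election path requires hearing from every expected node,
    and the failover path is taken once the timeout has elapsed.
    """
    if heard_from >= expected:
        return "election"
    if elapsed_ms >= failover_timeout_ms:
        return "failover"
    return None
```

Under this reading, a name such as failover_timeout says what happens when the timer fires, which an unqualified timeout does not.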

doc/operations/recovery.rst

doc/host_config_schema/cchost_config.json

eddyashton · 2025-09-23T13:12:26Z

src/node/rpc/node_frontend.h

+          return make_error(
+            HTTP_STATUS_INTERNAL_SERVER_ERROR,
+            ccf::errors::InternalError,
+            "This replica has already voted");


We should make this more verbose - say who we did vote for. Even in plain-text, it'll save us debugging time somewhere down the line.

Let's not introduce the term replica in the code, which is used nowhere at the moment. It's a node.

eddyashton · 2025-09-23T13:15:03Z

src/node/rpc/node_frontend.h

+        if (is_invalid.has_value())
+        {
+          auto [code, message] = is_invalid.value();
+          return make_error(code, ccf::errors::InvalidQuote, message);


Why do we make everything InvalidQuote here? That seems wrong in the BAD_REQUEST path. We can add a new value to odata_error for this, or use an existing generic error.

eddyashton · 2025-09-23T13:21:37Z

src/service/network_tables.h

    }

+    // Self-healing open tables
+    const SelfHealingOpenNodeInfo self_healing_open_node_info = {


I think we don't need these definitions any more? To access a table, you can have an instance:

struct Tables
{
  MyTableType my_table_instance = {MY_TABLE_NAME};
};
Tables& tables = ...;
auto handle = tx.rw(tables.my_table_instance);

Or you can just access it directly by type and name:

auto handle = tx.rw<MyTableType>(MY_TABLE_NAME);

The former helps use a consistent type+name, but at the cost of a lot of boilerplate. We used to use it to auto-generate wrappers for built-in tables (in which case these instances also need to be returned from something like get_all_internal_tables()). But that's all vestigial and unused, and I think we should avoid this boilerplate for new tables.

Excellent, can do! I had a mix of both at one point and seemingly converged it in the wrong direction :)

Now all updated.

eddyashton · 2025-09-23T13:24:07Z

tests/infra/clients.py

+                    55,
                    60,
-                ]:  # PEER_FAILED_VERIFICATION, SSL_CONNECT_ERROR
+                ]:  # COULDNT_CONNECT, PEER_FAILED_VERIFICATION, SEND_ERROR, SSL_CONNECT_ERROR


I can't make a direct suggestion since some lines are untouched, but suggest we pair these comment lines directly:

# COULDNT_CONNECT
7,
# SSL_CONNECT_ERROR
35,
# SEND_ERROR
55,
# PEER_FAILED_VERIFICATION
60,
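One way to keep each code and its name together permanently is a mapping rather than a list with a trailing comment. This is a sketch only: `RETRYABLE_CURL_EXIT_CODES` and `is_retryable` are hypothetical names, not what tests/infra/clients.py uses. The names follow libcurl's error code list, in which 35 is SSL_CONNECT_ERROR and 60 is PEER_FAILED_VERIFICATION.

```python
# Hypothetical constant: curl exit codes treated as retryable, paired with
# their libcurl names so the two cannot drift apart.
RETRYABLE_CURL_EXIT_CODES = {
    7: "COULDNT_CONNECT",
    35: "SSL_CONNECT_ERROR",
    55: "SEND_ERROR",
    60: "PEER_FAILED_VERIFICATION",
}


def is_retryable(exit_code):
    """True if the curl exit code is one we are willing to retry on."""
    return exit_code in RETRYABLE_CURL_EXIT_CODES
```

A mapping also lets a log line name the failure, for example `RETRYABLE_CURL_EXIT_CODES[rc]`, instead of printing a bare number.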

eddyashton · 2025-09-23T13:27:32Z

tla/disaster-recovery/Readme.md

@@ -0,0 +1,11 @@
+# Self-healing-open specification in [stateright](https://github.com/stateright/stateright)


If we're not checking this stays fresh in any workflow, I don't think it should be checked into main. We can put a pointer to this spec, on a branch, in the GH Discussion.

Agreed, there is surely a way to add this to one of the verification pipelines we already have.

In theory added now.


cjen1-msft added 30 commits May 9, 2025 12:38

Add tla spec

0a63230

Update spec to refine safety property

859d30d

Add basic fizzbee spec

9eb305d

Add stateright model

0bf26f9

Update stateright dr spec

2b8a1d6

Update Readme.md

bad8a13

Update Readme.md

1965453

broken version

5f066e5

refactor

09388e7

Restore correct liveness property.

b991b9d

Add more checked conditions

4f6de45

Add reasonably clean curlm support

5a98922

Add proper curl and libuv interaction

9edf637

Pass curl singleton over enclave barrier

c745ade

Ensure singleton is initialised

71c1fb3

Make quote endorsement client use curl_multi

695f351

Add curl to public ccf linked libraryes

2956c38

fix cond

709228f

Initialise request

32d1361

Fix handler

f88d7b5

fiddle with pointers

4458c8b

Fix timeout

cdebe29

Maybe fix issue?

4ea2bb7

refmt

6214b6c

Merge branch 'main' into curlm

a4be0c3

Update

fce77da

fmt

58eb20c

remove static_cast

5b52e3d

Fix url query

b876cca

Add kickstart for curlm and document interaction between libuv and curlm

68aff99

cjen1-msft requested a review from a team as a code owner September 22, 2025 16:21

cjen1-msft changed the title ~~Self healing open~~ [Draft] Self healing open Sep 22, 2025

cjen1-msft marked this pull request as draft September 23, 2025 08:45

cjen1-msft changed the title ~~[Draft] Self healing open~~ Self healing open Sep 23, 2025

cjen1-msft added 9 commits September 23, 2025 10:56

Add docs

afb4a1d

Add flag for detecting whether a timeout has occurred during self-hea…

fd8750b

…ling-open

Doc update

13e05b7

typo

a9ab437

Update path names

474b199

Revert "Allow curl handles to fix themselves during shutdown."

c06895c

This reverts commit e9cb10d.

Merge branch 'main' into self-healing-open

499ea78

Update docs

7e5af0d

Make clang-tidy happy

009a1c1

cjen1-msft added the run-long-test Run Long Test job label Sep 23, 2025

eddyashton reviewed Sep 23, 2025

cjen1-msft and others added 15 commits September 23, 2025 14:36

Update doc/host_config_schema/cchost_config.json

9e7d6e0

Co-authored-by: Eddy Ashton <[email protected]>

Update doc/operations/recovery.rst

f01931a

Co-authored-by: Eddy Ashton <[email protected]>

Update src/common/configuration.h

125a0fb

Co-authored-by: Eddy Ashton <[email protected]>

typoing

be6edc9

config snags

7ee3d5b

inline restarter

ccb43b7

Refactoring

af6757f

Don't use network.tables anymore

a83a44e

Refactor and document

56ebb3e

rejig

cb23343

de-replica-ing

e977b74

improved error messages

c3a8a46

Refactor node_frontend

d6827c9

fmt

eacbf66

Add model checking

0e25ca7
