
Spike: Use raft consensus for networking #1591

Open
ch1bo opened this issue Sep 2, 2024 · 1 comment · May be fixed by #1632
ch1bo commented Sep 2, 2024

Why

We created a new test suite about resilience of our network stack in #1532 (see also #1106, #1436, #1505). With this in place, we can now explore various means to reach our goal of a crash-tolerant network layer.

This fairly old research paper explores various consensus protocols used in the blockchain space and reminds us of the correspondence between consensus and broadcasts:

"the form of consensus relevant for blockchain is technically known as atomic broadcast"

Furthermore, it listed at least one of these early, permissioned blockchains that achieved crash-tolerance of $t < n/2$ by simply using etcd with its Raft consensus algorithm.

What

  • Run fault injection tests with a hydra-node network connected through etcd
  • Create a PR with results of the spike, but do not merge it

How

  • Create a Hydra.Network.Etcd network component that implements broadcast using the replicated log of etcd
    • Move authentication into application
    • Re-use --peer from command line
    • Fork etcd when starting
    • Message type + sender as key
    • Use watch and revisions to be notified of messages while offline
  • Maybe: Compact messages (would this upper-bound resilience to faults?) or use leases
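The component outlined above could take roughly the following shape. This is a sketch with assumed names (`withEtcdNetwork`, `msgKey`), not the actual `Hydra.Network.Etcd` API, and an in-memory log stands in for the real etcd cluster:

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical sketch: 'broadcast' appends to a replicated log under a key
-- derived from message type and sender ("message type + sender as key").
newtype Network msg = Network {broadcast :: msg -> IO ()}

msgKey :: String -> String -> String
msgKey msgType sender = "msg/" <> msgType <> "/" <> sender

-- A real component would fork an etcd process here (and stop it on exit);
-- the 'putKV' argument stands in for a put against the etcd cluster.
withEtcdNetwork :: ((String, String) -> IO ()) -> String -> (Network String -> IO a) -> IO a
withEtcdNetwork putKV sender action =
  action (Network {broadcast = \m -> putKV (msgKey "AppMsg" sender, m)})

main :: IO ()
main = do
  replicatedLog <- newIORef []
  withEtcdNetwork (\kv -> modifyIORef replicatedLog (<> [kv])) "alice" $ \net ->
    broadcast net "hello"
  readIORef replicatedLog >>= print
  -- [("msg/AppMsg/alice","hello")]
```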
@ch1bo ch1bo added the spike label Sep 2, 2024
@ch1bo ch1bo changed the title Spike: Using raft consensus for networking Spike: Use raft consensus for networking Sep 2, 2024
@ch1bo ch1bo self-assigned this Sep 2, 2024
@ch1bo ch1bo linked a pull request Sep 13, 2024 that will close this issue

ch1bo commented Sep 13, 2024

After identifying a required change in the semantics of our NetworkComponents in #1624, I started the implementation yesterday and have achieved so far:

  • Initialization of an etcd cluster by re-using the existing --peer (and other) command line arguments
  • Sending and receiving of messages: very basic, hex-encoded, using etcdctl as the "client" and polling a single key
  • First manual tests and some end-to-end tests / benchmarks are working, but fragile (likely because of the process calls and polling)
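The hex marshalling around those etcdctl calls can be sketched as follows. The helper names are assumptions, and plain Strings stand in for the real binary ByteString payloads (assuming single-byte characters):

```haskell
import Data.Char (chr, digitToInt, intToDigit, ord)

-- Hex-encode a payload as piped to `etcdctl put` on stdin (sketch only;
-- assumes every character fits in one byte).
hexEncode :: String -> String
hexEncode = concatMap (\c -> [intToDigit (ord c `div` 16), intToDigit (ord c `mod` 16)])

-- Decode it back after reading the value out of etcd again.
hexDecode :: String -> String
hexDecode (hi : lo : rest) = chr (digitToInt hi * 16 + digitToInt lo) : hexDecode rest
hexDecode _ = []

main :: IO ()
main = do
  putStrLn (hexEncode "hello") -- 68656c6c6f
  putStrLn (hexDecode "68656c6c6f") -- hello
```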

Notes so far:

  • First step: Starting an etcd instance in the background of the network component.
  • When making etcd available to hydra-node (repl), I stumbled over cached paths (somewhere in dist-newstyle)
  • Separate --host and --port options are annoying; we should just parse a full Host.
  • etcd has a few command line arguments to get right, this guide was helpful: https://etcd.io/docs/v3.5/op-guide/clustering/
  • Etcd client libraries are not in the best state in Haskell land, so I resorted to just invoking etcdctl for now (this will have horrible performance)
    • Need to marshal binary encoded messages to etcdctl put through stdin using hex
    • Can use etcdctl get -w json to get a JSON encoded result with base64 encoded values (of hex encoded bytes)
    • Should really use a gRPC or at least an HTTP/JSON client
  • Who should be signing and verifying messages? Re-use authentication layer or build it into etcd network component (and re-use keys for transport-level security)?
    • Decided to re-use withAuthentication and do some plumbing
    • Expand the list of valid senders to include "us" (because we do not take short-cuts anymore)
  • Stopping the last etcd instance does not work properly. It seems like it's not gracefully handling SIGTERM while reconnecting to other nodes.
  • Surprisingly, the etcd network component even works if we just poll a single key in a busy loop and re-deliver this message over and over. At least in a manual, interactive test using the hydra-tui.
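Tracking the mod revision of each polled value (which a `watch` would report anyway, per the plan above) would avoid that re-delivery. A small sketch with assumed names:

```haskell
import Data.List (mapAccumL)

-- Polling a single key re-delivers the same value over and over; comparing
-- each poll result's etcd mod revision against the last one seen filters
-- those re-deliveries (sketch, not the actual component).
deliverOnce :: Integer -> (Integer, msg) -> (Integer, Maybe msg)
deliverOnce lastRev (rev, m)
  | rev > lastRev = (rev, Just m)
  | otherwise = (lastRev, Nothing)

main :: IO ()
main = print (snd (mapAccumL deliverOnce 0 [(1, "a"), (1, "a"), (2, "b")]))
-- [Just "a",Nothing,Just "b"]
```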
