
Spike: Use raft consensus for networking #1591

Open
ch1bo opened this issue Sep 2, 2024 · 1 comment · May be fixed by #1632
ch1bo commented Sep 2, 2024

Why

We created a new test suite about resilience of our network stack in #1532 (see also #1106, #1436, #1505). With this in place, we can now explore various means to reach our goal of a crash-tolerant network layer.

This fairly old research paper explores various consensus protocols used in the blockchain space and reminds us of the correspondence between consensus and broadcasts:

"the form of consensus relevant for blockchain is technically known as atomic broadcast"

Furthermore, it listed at least one of these early, permissioned blockchains that achieved crash-tolerance of $t < n/2$ by simply using etcd with its Raft consensus algorithm.

What

  • Run fault injection tests with a hydra-node network connected through etcd
  • Create a PR with results of the spike, but do not merge it

How

  • Create a Hydra.Network.Etcd network component that implements broadcast using the replicated log of etcd
    • Move authentication into application
    • Re-use --peer from command line
    • Fork etcd when starting
    • Message type + sender as key
    • Use watch and revisions to be notified of messages while offline
  • Maybe: Compact messages (would this upper-bound resilience to faults?) or use leases
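The component outlined above could take roughly the following shape. This is a sketch with assumed names (`withEtcdNetwork`, `msgKey`), not the actual `Hydra.Network.Etcd` API, and an in-memory log stands in for the real etcd cluster:

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical sketch: 'broadcast' appends to a replicated log under a key
-- derived from message type and sender ("message type + sender as key").
newtype Network msg = Network {broadcast :: msg -> IO ()}

msgKey :: String -> String -> String
msgKey msgType sender = "msg/" <> msgType <> "/" <> sender

-- A real component would fork an etcd process here (and stop it on exit);
-- the 'putKV' argument stands in for a put against the etcd cluster.
withEtcdNetwork :: ((String, String) -> IO ()) -> String -> (Network String -> IO a) -> IO a
withEtcdNetwork putKV sender action =
  action (Network {broadcast = \m -> putKV (msgKey "AppMsg" sender, m)})

main :: IO ()
main = do
  replicatedLog <- newIORef []
  withEtcdNetwork (\kv -> modifyIORef replicatedLog (<> [kv])) "alice" $ \net ->
    broadcast net "hello"
  readIORef replicatedLog >>= print
  -- [("msg/AppMsg/alice","hello")]
```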
@ch1bo ch1bo added the spike label Sep 2, 2024
@ch1bo ch1bo changed the title Spike: Using raft consensus for networking Spike: Use raft consensus for networking Sep 2, 2024
@ch1bo ch1bo self-assigned this Sep 2, 2024
@ch1bo ch1bo linked a pull request Sep 13, 2024 that will close this issue

ch1bo commented Sep 13, 2024

After identifying a required change in the semantics of our NetworkComponents in #1624, I started the implementation yesterday and have achieved so far:

  • Initialization of an etcd cluster by re-using the existing --peer (and other) command line arguments
  • Sending and receiving of messages: very basic, hex-encoded, using etcdctl as the "client" and polling a single key
  • First manual tests and some end-to-end tests / benchmarks are working, but fragile (likely because of the process calls and polling)
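The hex marshalling around those etcdctl calls can be sketched as follows. The helper names are assumptions, and plain Strings stand in for the real binary ByteString payloads (assuming single-byte characters):

```haskell
import Data.Char (chr, digitToInt, intToDigit, ord)

-- Hex-encode a payload as piped to `etcdctl put` on stdin (sketch only;
-- assumes every character fits in one byte).
hexEncode :: String -> String
hexEncode = concatMap (\c -> [intToDigit (ord c `div` 16), intToDigit (ord c `mod` 16)])

-- Decode it back after reading the value out of etcd again.
hexDecode :: String -> String
hexDecode (hi : lo : rest) = chr (digitToInt hi * 16 + digitToInt lo) : hexDecode rest
hexDecode _ = []

main :: IO ()
main = do
  putStrLn (hexEncode "hello") -- 68656c6c6f
  putStrLn (hexDecode "68656c6c6f") -- hello
```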

Notes so far:

  • First step: Starting an etcd instance in the background of the network component.
  • When making etcd available to hydra-node (repl), I stumbled over cached paths (somewhere in dist-newstyle)
  • Separate --host and --port options are annoying; we should just parse a full Host.
  • etcd has a few command line arguments to get right, this guide was helpful: https://etcd.io/docs/v3.5/op-guide/clustering/
  • Etcd client libraries are not in the best state in Haskell land, so I resorted to just invoking etcdctl for now (this will have horrible performance)
    • Need to marshal binary encoded messages to etcdctl put through stdin using hex
    • Can use etcdctl get -w json to get a JSON encoded result with base64 encoded values (of hex encoded bytes)
    • Should really use a gRPC or at least an HTTP/JSON client
  • Who should be signing and verifying messages? Re-use authentication layer or build it into etcd network component (and re-use keys for transport-level security)?
    • Decided to re-use withAuthentication and do some plumbing
    • Expand the list of valid senders to include "us" (because we do not take short-cuts anymore)
  • Stopping the last etcd instance does not work properly. It seems like it's not gracefully handling SIGTERM while reconnecting to other nodes.
  • Surprisingly, the etcd network component even works if we just poll a single key in a busy loop and re-deliver this message over and over. At least in a manual, interactive test using the hydra-tui.
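Tracking the mod revision of each polled value (which a `watch` would report anyway, per the plan above) would avoid that re-delivery. A small sketch with assumed names:

```haskell
import Data.List (mapAccumL)

-- Polling a single key re-delivers the same value over and over; comparing
-- each poll result's etcd mod revision against the last one seen filters
-- those re-deliveries (sketch, not the actual component).
deliverOnce :: Integer -> (Integer, msg) -> (Integer, Maybe msg)
deliverOnce lastRev (rev, m)
  | rev > lastRev = (rev, Just m)
  | otherwise = (lastRev, Nothing)

main :: IO ()
main = print (snd (mapAccumL deliverOnce 0 [(1, "a"), (1, "a"), (2, "b")]))
-- [Just "a",Nothing,Just "b"]
```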
