Update networking page (#1606)
This is a short writeup of the findings; added to the "networking" page
as an investigation.

I think a broader re-write of the page is in order, but can come
separately, as it's a bit more work to document all the existing
knowledge and implementation details.
noonio authored Sep 9, 2024
1 parent c4baec6 commit d2785e7
Showing 2 changed files with 61 additions and 7 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci-nix.yaml
@@ -316,7 +316,7 @@ jobs:
- name: 🚧 Setup Node.js
uses: actions/setup-node@v4
with:
node-version: 16
node-version: 18
cache: 'yarn'
cache-dependency-path: docs/yarn.lock

66 changes: 60 additions & 6 deletions docs/docs/dev/architecture/networking.md
@@ -1,11 +1,7 @@
# Networking

This document provides details about the Hydra networking layer, which encompasses the network of Hydra nodes where heads can be opened.

:::warning

🛠 This document is a work in progress. We recognize that the current state of networking is suboptimal, serving as an initial implementation to establish a functional basis. Efforts are underway to enhance the network dynamics through a proposed improvement initiative, detailed in [this proposal](https://github.com/input-output-hk/hydra/pull/237).
:::
This page provides details about the Hydra networking layer, which encompasses
the network of Hydra nodes where heads can be opened.

## Questions

@@ -30,6 +26,64 @@ This document provides details about the Hydra networking layer, which encompass

## Investigations

### Network resilience

In August 2024 we added some network resilience tests, implemented as a GitHub
Actions workflow in [network-test.yaml](https://github.com/cardano-scaling/hydra/blob/master/.github/workflows/network-test.yaml).

The approach is to use [Pumba](https://github.com/alexei-led/pumba) to inject
networking faults into a Docker-based setup. This is effective because of its
[NetEm](https://srtlab.github.io/srt-cookbook/how-to-articles/using-netem-to-emulate-networks.html)
capability, which allows for very powerful manipulation of the networking
stack of the containers.
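
For a rough idea of what such a fault injection looks like, the sketch below
applies a fixed percentage of packet loss to a running container with Pumba.
The container name, duration, and percentage are placeholders rather than what
the workflow actually uses; see
[network-test.yaml](https://github.com/cardano-scaling/hydra/blob/master/.github/workflows/network-test.yaml)
for the real invocation.

```shell
# Sketch only: drop roughly 5% of outgoing packets from a hypothetical
# container named "hydra-node-alice" for 10 minutes. Depending on the image,
# a --tc-image flag may also be needed so that `tc` is available inside it.
pumba netem --duration 10m \
  loss --percent 5 \
  hydra-node-alice
```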

Initially, we set up percentage-based packet loss in some very specific
scenarios; namely, a three-node setup between `Alice`, `Bob`, and `Carol`.

With this setup, we tested the following scenarios:

- Three nodes, 900 transactions ("scaling=10"):
  - 1% packet loss to both peers: ✅ Success
  - 2% packet loss to both peers: ✅ Success
  - 3% packet loss to both peers: ✅ Success
  - 4% packet loss to both peers: ✅ Success
  - 5% packet loss to both peers: Sometimes works, sometimes fails
  - 10% packet loss to both peers: Sometimes works, sometimes fails
  - 20% packet loss to both peers: ❌ Failure

- Three nodes, 4500 transactions ("scaling=50"):
  - 1% packet loss to both peers: ✅ Success
  - 2% packet loss to both peers: ✅ Success
  - 3% packet loss to both peers: ✅ Success
  - 4% packet loss to both peers: Sometimes works, sometimes fails
  - 5% packet loss to both peers: Sometimes works, sometimes fails
  - 10% packet loss to both peers: ❌ Failure
  - 20% packet loss to both peers: ❌ Failure

"Success" here means that _all_ transactions were processed; "Failure" means
one or more transactions did not get confirmed by all participants within a
particular timeframe.

The main conclusion here is that there is a limit to the amount of packet loss
we can sustain, and that limit depends on how many transactions we are trying
to send (naturally, [given that the loss percentage applies per
packet](http://www.voiptroubleshooter.com/indepth/burstloss.html)).
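
As a back-of-the-envelope illustration (assuming, purely for simplicity, that
a run needs `N` independent packets to arrive, and ignoring retransmission and
burst-loss correlation), the chance that every packet survives a per-packet
loss rate `p` is:

```latex
% Sketch: probability that all N packets of a run arrive, assuming an
% independent per-packet loss rate p and no retransmission.
P(\text{all delivered}) = (1 - p)^{N}
```

For example, with `p = 0.05` and `N = 100` this is roughly `0.95^100 ≈ 0.6%`,
so larger runs lean ever more heavily on retransmission and are more likely to
push confirmations past the timeout.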

You can keep an eye on the runs of this action here: [Network fault
tolerance](https://github.com/cardano-scaling/hydra/actions/workflows/network-test.yaml).

The main things to note are:

- Overall, the CI job will succeed even if every scenario fails. This is,
  ultimately, due to a bug in [GitHub
  Actions](https://github.com/actions/runner/issues/2347) that prevents one
  from declaring an explicit pass-or-fail expectation per scenario. The impact
  is that you should check this job manually on each of your PRs.
- It's okay to see certain configurations fail, but it is certainly not
  expected to see them _all_ fail, and least of all the zero-loss cases.
  Anything that looks suspicious should be investigated.


### Ouroboros

We held a meeting with the networking team on February 14, 2022, to explore the integration of the Ouroboros network stack into Hydra. During the discussion, there was a notable focus on performance, with Neil Davies providing insightful performance metrics.
