Logbook 2023 H2

December 2023

2023-12-27

SB on Fixing smoke-tests

  • Our smoke-tests failed when we tried to prepare everything for release 0.15: https://github.com/input-output-hk/hydra/actions/runs/7336777595/job/19976632279

  • We got QueryEraMismatchException when trying to publish hydra scripts - this is something I observed locally.

hydra-cluster: QueryEraMismatchException (EraMismatch {ledgerEraName = "Alonzo", otherEraName = "Babbage"})
  • It seems like the cardano-node starts but it's not fully synced when we query it.

  • In turn this means that waitForFullySynchronized is not working properly.

  • After investigating: the latest changes that removed protocol-parameters now require that we query them from a running node, BUT if you query a cardano-node that is not yet in sync you get the aforementioned exception.

  • I made changes to wait for the node to be in sync before proceeding further, which solved the problem.

  • We still need to check whether these latest changes related to query exceptions make sense to users who try to run the hydra-node against an unsynced cardano-node.
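
For reference, this is roughly the shape of such a wait loop (an illustrative sketch only, with a hypothetical queryPercentSynced action; the real hydra-cluster helper compares the tip's slot time against the wall clock):

import Control.Concurrent (threadDelay)
import Control.Monad (unless)

-- Illustrative sketch: 'queryPercentSynced' is a hypothetical action reporting
-- the node's sync progress (e.g. by comparing the tip's slot time with the wall
-- clock); the real hydra-cluster code differs in detail.
waitUntilFullySynchronized :: IO Double -> IO ()
waitUntilFullySynchronized queryPercentSynced = go
 where
  go = do
    progress <- queryPercentSynced
    unless (progress >= 100) $ do
      threadDelay 1000000 -- poll once per second before checking again
      go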

2023-12-22

  • SN: The treefmt cache was tricked into believing all is fine, but the CI failed on a branch for me. In such cases, use treefmt --fail-on-change --no-cache to fix formatting and see the same errors as the CI.

SB on incremental decommits

  • Came to a point where the functions needed for decommit to work are all stubbed out (tx creation, observation, head logic changes) so the interesting work of constructing the decrement tx now begins.

  • I should write a test to assert the decrement tx is valid even before actually writing any code, but in this case I think I'd first like to see what kind of things I need for this tx to work.

  • Implemented decrement by following the same pattern as other functions we have (like collectCom) and stubbed out decrementTx, which will hold the code to actually construct the decrement tx.

  • Now the fun starts!

2023-12-19

FT on: macOS updates often break nix installation

This issue still persists, and I stumbled upon it again after upgrading my system to Sonoma v14.2.

Basically, the problem is that after a system upgrade nix becomes a command not found.

To avoid having to re-install the tool, I found a workaround by following this comment.

2023-12-18

SB on incremental decommits

  • Current work so far gives us a way of requesting a decommit and then having all parties observe the request and alter their local state with the utxo to decommit.

  • When a ReqDec event occurs we also issue a ReqSn since all parties need to agree on decommitting something from a head, and that requires all of them to sign a new Snapshot that will actually do the decrement.

  • I'd like to write more tests around the existing logic so far before moving to changes in the AckSn.

  • Added tests to make sure the local state is updated when ReqDec is observed. We want to record the utxo someone wants to decommit and later on when we see a ReqSn we would like to check if it actually applies. The only check we do on ReqDec is to make sure there is no decommit in flight already.

  • Added some more tests to cover the decommit logic, like one where we check that the local state is updated and that we issue ReqSn on observing valid ReqDec.

  • Came to a point where I should make sure to update the confirmed snapshot on decommit, but what I see is that if you can apply a decommit tx to the local ledger state then the decommitted UTxO is already removed from that state and we don't need to do anything special?
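
To make the ReqDec rule above concrete, here is a self-contained toy sketch (toy types, not the actual HeadLogic ones): reject if a decommit is already in flight, otherwise record the requested UTxO and ask for a new snapshot.

newtype UTxO = UTxO [String]

newtype DecommitState = DecommitState {decommitInFlight :: Maybe UTxO}

data Outcome
  = RecordDecommitAndReqSn UTxO -- update local state and request a new snapshot
  | WaitOnDecommitInFlight      -- another decommit is still being processed

onReqDec :: DecommitState -> UTxO -> Outcome
onReqDec st requested =
  case decommitInFlight st of
    Just _ -> WaitOnDecommitInFlight
    Nothing -> RecordDecommitAndReqSn requested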

2023-12-11

SB on removing protocol-parameters from hydra-cluster

  • The final piece of work for this is to use ledger PParams everywhere instead of cardano-api ProtocolParameters.

  • The problem with this is that cardano-ledger is missing a FromJSON instance for the BabbagePParams type.

  • Easy-peasy, I'll just add the orphan instance, submit the upstream PR to add it to cardano-ledger and then remove the orphan once the PR is merged...NOT

  • When adding an orphan instance and running the tests I observe an error related to the protocolVersion JSON field:

user error (Error in $: key "protocolVersion" not found)
  • This happens on hydra-node start when we try to read the passed-in pparams file. We construct this file in the cluster tests by querying the cardano-node and saving the result to a file which is later used by the hydra-node.

  • Observing the PParams I get from the cardano-node I see that there is a protocol version field, BUT when encoding them to a JSON string the version is stripped for some reason.

  • When I just want to use emptyBabbagePParams from cardano-ledger in the repl I get this:

<interactive>:10:1: error: [GHC-64725]
    • Cannot satisfy: Cardano.Ledger.Binary.Version.MinVersion <= ProtVerHigh
                                                                    era0
    • In the first argument of 'System.IO.print', namely 'it'
      In a stmt of an interactive GHCi command: System.IO.print it

So there is definitely something funny going on.
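
The missing field would already show up in a plain Aeson round-trip property; a generic sketch (not the actual hydra test code) of what we expect to hold:

import Data.Aeson (FromJSON, ToJSON, decode, encode)

-- Holds for well-behaved instances; fails for BabbagePParams if the encoder
-- drops the "protocolVersion" field that the decoder then insists on.
roundTripsJSON :: (Eq a, ToJSON a, FromJSON a) => a -> Bool
roundTripsJSON x = decode (encode x) == Just x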

SB on incremental decommits

  • The workflow we would most probably like in the end is to have the user submit a decommit request with the UTxO they want to decommit using the /decommit http route.

  • The response should contain the decommit tx which the user then needs to sign and re-submit to the hydra-node. This is probably a bit of a pain in the ass for users, but what we need to make sure is that the user is able to spend the aforementioned UTxO. Otherwise someone could DDoS a Head by continuously submitting requests to decommit some UTxO which they don't own. This is because our previous plan was to provide a command to decommit some UTxO, but we discovered the shortcomings of this approach.

  • If we make the users sign a tx and then try to apply it to the Hydra local ledger (without modifying the ledger state) then we would know the specific UTxO can be spent by this user and could continue further with processing the decommit from a Head.
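
A self-contained toy sketch of that ownership check (toy stand-ins for the hydra-node ledger interface): we only continue processing the decommit if the signed tx applies cleanly to the current local ledger state.

type UTxO = [String]                      -- toy stand-in for a UTxO set
type ApplyTx = UTxO -> Either String UTxO -- toy stand-in for ledger application

canDecommit :: ApplyTx -> UTxO -> Bool
canDecommit applyDecommitTx localUTxO =
  case applyDecommitTx localUTxO of
    Left _err -> False -- tx does not apply: requester cannot spend that UTxO
    Right _ -> True    -- tx applies: safe to continue processing the decommit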

Pair on debugging the non-processing head on mainnet

  • Investigating why mainnet stopped processing transactions
  • Check who signed the last snapshot and who is missing in state changed stream -> Sasha is missing
  • Sasha's node has the SnapshotRequested in its state, but is missing its own signature
  • At this point the state persistence is not revealing anything further, but the logs are: quickly after processing the ReqSn event into the SnapshotRequested state change, Sasha's node seemingly restarted because we see a NodeOptions log
  • There are some stderr hints where we see a PersistenceException. The only location where that type is used and thrown is when loading messages and decoding fails (with a NOTE about partial writes). As this loadAll is used together with append in the reliability layer on network-messages, it could be that a load triggered by resendMessagesIfLagging was still processing the file while we already did an appendMessage.
  • Ironically, the Reliability layer is likely what crashed Sasha's node..

SB on committing from eternl wallet

  • Spinning up the node on preprod so that I can test what is preventing us from using eternl to commit to a Head.

  • For this I am using this [tool](https://github.com/Devnull-org/easy-rider)

  • I have cardano-node running on preprod and I am experimenting with a single party Hydra Head.

  • The Head is in the initializing state and it seems like eternl requires us to set the lower and upper validity bounds -> related [issue](https://github.com/orgs/input-output-hk/projects/25/views/2?filterQuery=eternl&pane=issue&itemId=31875917)

  • I don't want to change the code yet so I just added the lower/upper validity slots in the commitTx manually.

  • Now eternl imports the transaction! But we get invalid signatures error:

Some transactions failed to submit (400)
failed:
c05ddb19eee7b30c151a513acb7821a9d354b8761fd95551fe9b26d1b2b5f1b4
Some signatures are invalid. 'data.invalidSignatories' contains a list of keys for which the signature didn't verify. As a reminder, you must only sign the serialised transaction *body*, without metadata or witnesses.

- Seems like eternl does not support signing with multiple keys yet since the
  commit tx needs to be already signed by the hydra-node.

This is the commit tx CBOR for reference:

84a9008382582086257012e2185cd0234d980cba014289af3ac522f35dec6bfd86b522081213b500825820dcde9a5ff7efee28999f391dad4a5a2fa51909192af81463f86b8cbc53a3857d01825820dcde9a5ff7efee28999f391dad4a5a2fa51909192af81463f86b8cbc53a3857d020d81825820dcde9a5ff7efee28999f391dad4a5a2fa51909192af81463f86b8cbc53a3857d021281825820f917dcd1fa2653e33d6d0ca5a067468595b546120c3085fab60848c34f92c265000182a300581d708dcc1fb34d1ba168dfb0b82e7d1a31956a2db5856f268146b0fd7f2a01821b00000002542a6880a1581c5d82894940af39859b76983501941661a3f51569cf8065eeaf5f449ba1581cf8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d010282005820bdc6480864cd21b1aa4adbd9d5833e94c44256795af2db6e869f49f8b1299428a200581d60f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d011b000000022c21d652021a0035d860031a02c8ee92081a02c767f20e81581cf8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d0b5820802e05eca956063227bafd16e6458564467f25270f89dbc2a6564d6a7df822f1a30081825820eb94e8236e2099357fa499bfbc415968691573f25ec77435b7949f5fdfaa5da05840c43b5cbc595be17f6385166379022d5a9697869c49cf11430a8d7dd16a36db84de4a2e344452f5f4436646fa28b631cd78f284a3d57d8839645db1b36a0284090482581c5d82894940af39859b76983501941661a3f51569cf8065eeaf5f449bd8799f5820b37aabd81024c043f53a069c91e51a5b52e4ea399ae17ee1fe3cb9c44db707eb9fd8799fd8799fd8799f582086257012e2185cd0234d980cba014289af3ac522f35dec6bfd86b522081213b5ff00ff5f5840d8799fd8799fd8799f581c4b8bff029bc5382d236278c70b4c64a2a5ee0d0f5a87f4bb0f1c2d1bffd8799fd8799fd8799f581ccefb9314e2a2c369f01fc678525827294b26f21c118fd481af420d937ae6ffffffffa140a1401b00000002540be400d87980d87a80ffffffff581c5d82894940af39859b76983501941661a3f51569cf8065eeaf5f449bff0581840001d87a9f9fd8799fd8799f582086257012e2185cd0234d980cba014289af3ac522f35dec6bfd86b522081213b5ff00ffffff821a00d59f801b00000002540be400f5f6
  • I asked on their discord so let's see. Perhaps we can use some other wallet that supports multiple vkey signatures?

SN on markdown spec

  • Further converting markdown with latex to markdown rendered by docusaurus.
  • Fixed rendering of figures by upgrading docusaurus to version 3, which uses mdxjs 3, which seemingly treats markdown in between <figure> html better.
  • Fixing the references which had weird url (from .bib) rendering.
  • Suddenly everything stops working as docusaurus chokes with Error: MDX compilation failed for file "/home/ch1bo/code/iog/hydra/docs/docs/core-concepts/hydra-spec.md" Cause: Could not parse expression with acorn on seemingly random points of the markdown.
  • Removing the sections in the offending file keeps the error at the same line.. this can't be right.
  • It was a leftover hydra-spec.md copied from a previous run.. I need to clean the working copy better.

2023-12-04

SB on commit mutation debugging

  • Looking to finish work on the stateless-observation we ran into one red test: commit/only proper head is observed

  • The commit tx is valid, but when trying to apply a mutation to it we see a "redeemer not found" error when trying to find the initial input redeemer.

  • Hypothesis: We observe the initial transaction. After that we generate a commit utxo (empty or belonging to some generated key). When trying to mutate the commit tx together with the known/spendable utxo + the committed utxo, the tx redeemer was not found because we didn't generate this committed utxo correctly, e.g. we didn't properly create the initial output for the party to consume in the commit tx.

November 2023

2023-11-23

SB: TextEnvelope description not compatible with cardano-cli

  • While working on Hydra-Poll I noticed that our draft commit tx is no longer compatible with the cardano-cli version we have in scope 8.1.2. cardano-cli throws an error related to the text envelope description which it thinks should be Witnessed Tx BabbageEra while what our Tx text envelope instance produces as a description is Tx BabbageEra.

  • When looking at cardano-api and the HasTextEnvelope instance for the Tx type in the version Hydra uses (cardano-api >=8.20.0 && <8.21) I can assert that it is indeed what we [produce](https://github.com/input-output-hk/cardano-api/blob/cardano-api-8.20.0.0/cardano-api/internal/Cardano/Api/Tx.hs#L247).

  • So why doesn't cardano-cli accept this format, failing with:

Command failed: transaction sign  Error: Failed to decode neither the cli's serialisation format nor the ledger's CDDL serialisation format.
TextEnvelope error: /tmp/cardano-cli-0f44de8773765037/tx.raw: TextEnvelope type error:
Expected one of: TxUnsignedByron, TxUnsignedShelley, TxBodyAllegra, TxBodyMary, TxBodyAlonzo, TxBodyBabbage Actual: Tx BabbageEra
  • Investigating the cardano-cli package and its bounds on cardano-api reveals that it has a much looser constraint, cardano-api ^>=8.2, without an upper bound ([here](https://input-output-hk.github.io/cardano-haskell-packages/package/cardano-cli-8.1.2/)).

  • So how can we fix this? We don't!

  • I was using the wrong flag :( it should be tx-file instead of tx-body-file, derp... This is just one example of how you can lose time on dumb errors like this one, but that's life. Live and learn.

SN Reason why min utxo value was off on mainnet

  • Extracted the transaction output from the error from yesterday and ran it through getMinCoinTxOut and it yields the same coins
  • Double- and triple-checked the protocol parameters
  • After some further debugging and rubber-ducking the whole #ledger channel, it dawned on me why the cost is higher..
  • 💡 The head we tried to open on mainnet has 5 participants, also we are using inline datums now.. so that is 5 public keys in the datum.. which is more costly than only 1 or 3 keys (as used in our tests)
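
As a rough reminder of why the datum size matters here (a back-of-the-envelope sketch; the 4310 lovelace/byte figure is the assumed mainnet coinsPerUTxOByte): since Babbage the minimum lovelace of an output grows with its serialized size, so each extra verification key hash in an inline datum bumps the required deposit.

-- Sketch of the Babbage min-ADA rule: (160 + serialized size) * coinsPerUTxOByte.
-- Assuming 4310 lovelace/byte and 28-byte key hashes, going from 3 to 5 keys in
-- the datum adds roughly 2 * 28 * 4310 ≈ 0.24 ADA to the required deposit.
minLovelaceForOutput :: Integer -> Integer -> Integer
minLovelaceForOutput coinsPerUTxOByte serializedSizeInBytes =
  (160 + serializedSizeInBytes) * coinsPerUTxOByte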

2023-11-22

SN Debugging min utxo values

  • Debugging why the min utxo value is too low on mainnet while it seems to be enough for our local devnets in e2e tests?
  • Running the smoke test on preview hydra_cluster_datadir=hydra-cluster cabal exec hydra-cluster -- --preview --state-directory state-preview --publish-hydra-script
    • Learning: cabal run does not work with the hydra_cluster_datadir override
    • Learning: When using the hydra-cluster locally against a persistent directory, make sure to clean up the persisted state
  • With preview network protocol parameters we compute also low min utxo values for the initTx: StrictSeq {fromStrict = fromList [Coin 1611940,Coin 1293000,Coin 857690]}
  • Can't reproduce. On all networks tested I saw min utxo values < 2 ADA.

  • Instead of fixing the hard-coded value and hoping this will not occur on mainnet again, let's remove the hard-coded ada value and add a proper min utxo value calculation to the internal wallet.
  • Surprise: commit transactions fail on the initial script. Reason: equality in value is required, but it should be okay if the ADA deposit is higher than before.
  • Encountered another problem where the initial script was failing in end-to-end tests. Checked output values (no debug log from plutus) and realized that the min commit output value was lower than committed + initial value -> the wallet was reducing it again and we need to make sure to use ensureMinCoinTxOut!
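
A toy illustration of the difference that mattered here, with Coin as a plain Integer (the real getMinCoinTxOut/ensureMinCoinTxOut helpers live in cardano-ledger):

type Coin = Integer

-- Overwriting with the computed minimum can *reduce* an output that already
-- carries more (e.g. committed value plus the initial output's value), which is
-- what tripped the initial script:
setMinCoin :: Coin -> Coin -> Coin
setMinCoin minCoin _current = minCoin

-- Only topping up when below the minimum keeps larger outputs untouched, which
-- is the ensureMinCoinTxOut behaviour we want:
ensureMinCoin :: Coin -> Coin -> Coin
ensureMinCoin minCoin current = max minCoin current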

2023-11-14

Ensemble on stateless observation

  • Put observeHeadTx into the hydra-node chain handler
  • No ChainState is produced by an observation
  • Switched ChainStateAt to be a slot-indexed UTxO only
  • Mocking out all usage sites of ChainState into a form where we would not do any state switching, but just construct transactions given the last known ChainStateAt holding a UTxO
  • When considering the abort tx creation we see three options to come up with the right inputs to abortTx
  1. Add more parameters and construct the necessary datums to spend UTxOs (HeadParameters etc.)
    • Con: the committed UTxOs might be tricky (and other transactions? all seem possible though)
    • Pro: Most flexible tracking of UTxOs, no need to keep track of datums
  2. Track a "spendable" UTxO which not only holds TxIn and TxOut, but also a ScriptData for each entry
    • Outlined in ADR25
    • Basically what we do right now, but generalized
    • Pro: similar to current "ThreadOutput" chain state
    • Con: Quite some code changes
  3. Switch the head protocol to using inline datums so the normal UTxO can hold everything needed to spend.
    • Would need to update the ADR25 (or a separate)
    • Pro: Simpler protocol and easier exploration (explorers do show inline datums)
    • Con: Need to put an ADA deposit into the tx out (but: we currently overpay that already with 2 ADA)
    • Maybe: smaller transactions when spending from a tx out with inline datum?
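
For illustration, a toy sketch of what option 2 above amounts to, tracking the datum needed to spend each entry alongside the UTxO (names are illustrative, not the hydra-node types):

import Data.Map (Map)

type TxIn = String
type TxOut = String
type ScriptData = String

-- Each entry carries the datum needed to spend it (Nothing for plain outputs).
newtype SpendableUTxO = SpendableUTxO (Map TxIn (TxOut, Maybe ScriptData))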

AB fixing #1161

Fixing review comments and failing tests:

  • Changing the ordering of serialisation of the snapshot's content obviously ripples in various tests: It changes the plutus scripts and the online representation which has an impact on a few golden tests
  • Needed to add a HeadId parameter to the genConfirmedSnapshot function in order for it to be consistent with the previous state of the head. We still need to refactor the state and transition generators to take the new observation stuff into account
  • Keeping the spec in sync is burdensome: I thought there was no need to update it, because it already states the cid is used for signing, but there's subtle coloring in play to identify gaps :rolling_eyes:

2023-11-13

AB & SB on #1087

Starting work on the issue, looking at the corresponding GHSA. Now need to write a failing test...

  • How to write a test to ensure the CID is also signed as part of the close transaction? Ideally I would produce a signature with a different CID through a mutation and it would be caught by the test, but this means the signature needs to have the right structure?
  • This test would be green for the "wrong reason" because the signatures wouldn't match from the get-go if the structure is already different
  • There are property tests that check the on-chain verifySignature functions, they could be a good start

Current plan:

  • Adapt unit tests for on-chain verification
  • Observe mutation/higher-level tests now failing because off-chain signing is off
  • Change HeadLogic to produce the right signatures (including the CID)
  • Introduce a Mutation for the CID???

While writing unit tests with CID tacked in on-chain we realise we also need the headId in the snapshot, which is a significant change as it impacts a lot of components.

  • Need to break some cycle as HeadId is defined in Chain which depends on Snapshot -> extract to own module
  • We make the unit tests for on-chain validation functions pass by ensuring we use the correct HeadId
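
A toy illustration of what "signing the CID" means here (the real code CBOR-encodes these parts via its signable representation; types below are stand-ins):

type HeadId = String
type SnapshotNumber = Int
type UTxOHash = String

-- The signed message now covers the head id in addition to the snapshot content,
-- so a signature produced for one head cannot be replayed on another.
signableRepresentation :: HeadId -> SnapshotNumber -> UTxOHash -> String
signableRepresentation headId number utxoHash =
  concat [headId, show number, utxoHash]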

Some tests are now hanging forever although they seem unrelated

  • Tests were hanging because they failed to observe the right value and the wait time was incorrectly scaled by 1000000. There's been some changes in the io-classes API's handling of time, where threadDelay moved from using a DiffTime to an Integer number of microseconds which is more consistent with the existing API, but the timeout function still used a DiffTime

2023-11-08

Ensemble on #1096

While implementing observation of CollectComTx we realise we only need the UTxO to observe all transaction types, and this is all the state we need to carry around.

  • We introduce a utxo argument to observeHeadTx function and reuse the existing CollectComObservation directly
  • In the ChainSync implementation we pass around the known UTxO set and update it with the transactions we observe
  • Et voilà 🎉

From that point on, adding more transaction types is very mechanical so does not need to be done in the ensemble -> We decide to work on a property-based test for the observation code as we now have all the elements to make it, which would alleviate the need to make the ETE test more complex

We cover more of the lifecycle in our ChainObserverSpec but then the test fails unexpectedly to observe the CloseTx

  • We fail to observe properly the CloseTx from the ChainObserver even though the node emits it and observes it
  • We suspect there's something wrong in the way to maintain the UTxO set passed to the observation functions. We add some traces to the ChainObserver to list the txIds and render the Tx

⚠️ We keep accumulating UTxOs in a map, but then when we resolve the head output from this UTxO, we can find a different output than the one we are expecting, e.g. a UTxO that's been consumed but is still in our map.

Reusing the adjustUTxO function from MockChain allowed us to get the ChainObserver ETE test to pass 🍾 as the UTxO set is now properly updated when new txs are observed
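
For reference, a self-contained toy version of that adjustUTxO-style update (stand-in types; the real MockChain helper works on cardano-api transactions):

import qualified Data.Map as Map

type TxIn = String
type TxOut = String
type UTxO = Map.Map TxIn TxOut

data Tx = Tx {txId :: String, txInputs :: [TxIn], txOutputs :: [TxOut]}

-- Drop the inputs a transaction spends and add the outputs it creates.
adjustUTxO :: Tx -> UTxO -> UTxO
adjustUTxO tx utxo =
  let consumed = foldr Map.delete utxo (txInputs tx)
      produced =
        Map.fromList
          [ (txId tx <> "#" <> show ix, out)
          | (ix, out) <- zip [0 :: Int ..] (txOutputs tx)
          ]
   in produced <> consumed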

  • Added a test to ensure the ChainObserver covers all types of transactions from the Head. There's some redundancy with a TxSpec test but it's useful on its own as it covers us in case we add more transaction types, which we will when we work on incremental commits and decommits
  • Added another test (or rather 2 properties) to check we only update the UTxO upon seeing a Head lifecycle transaction
    • This is done in order to avoid the UTxO set growing too large and including every single tx we see
    • The code to do that is quite ugly and would benefit from some love

2023-11-07

Ensemble on #1096

Working on stateless chain observer tool, starting from existing ETE test:

  • We make the test pass for observing a raw InitTx easily, then work on the commit tx observation.
  • It seems we could reuse the CommitTxObservation directly but we need to pass a list of TxIn which we don't have.
  • Those initials are used only as a sanity check, leading to an error (and therefore a potential DoS) if we observe something which does not match what we expect
  • As a first step we separate the raw observation from the specific head observation, following the same pattern as in the initTx, introducing a RawCommitObservation. We notice that all of the information in the CommitObservation is extracted from the outputs of the tx, so there would be no need to pass additional information to observe it.
  • Passing the expected TxIns would be useful as an additional filter and we could just ignore a RawCommitObservation that does not match what we expect

Trying to implement raw observation for CollectComTx, we observe we actually need the ResolvedTx there, because we want to check the datum and redeemer for the spent inputs:

  • We could kick the can a bit more for the collectCom, using only the head output's datum, but in the close and contest case we would need that anyway
  • Looking at previous work implementing observation on top of ResolvedTx

Discussing more what information we need to observe CollectCom/Close/Contest, we observe that we probably don't need the tx inputs' resolved TxOut, but we need to know the seed for the head policy id, which implies we need to add it to the datum

  • By induction, if we have an output that has a valid ST token for some seed, then it must have been produced from a valid output
  • This means we could simplify our observation

Next steps:

  • Keep on with current strategy to separate "raw" observations from "per-head" observations
  • In the context of a single head we could do the TxIn resolution easily by keeping the Head's transactions outputs resolved, e.g maintaining an evolving map of TxIn ↦ TxOut as we observe the various txs
  • We could also do it in the chain observer to maintain a broader context, or just do a state query to resolve TxIns on the fly

2023-11-06

From Tactical

  • There's a PR in the making to ensure formatting CI check in place in order to ensure consistency. Question is: Should there be one workflow with multiple tools, or could we use something like treefmt to provide a single entry point?
    • We need to format different languages (haskell, cabal, nix), also need to run hlint
    • Linters and formatters should be easily runnable from the command-line in order to minimise friction with the CI and shorten the feedback loop
    • Having another binary be part of the devshell does not seem like a big issue, let's experiment and timebox it
  • SN has put up a contribution to foliage to enable retrieval of git submodules. This is useful in general not only for us
    • Perhaps we do not really need/want submodules. We are using cardano-configurations in the smoke test and we could either pull them from cardano world, or check them in if we really want to guarantee they are not pulled from under our feet

October 2023

2023-10-26

  • We are switching back and forth between Lucid and Meshjs to build and submit txs into the head from the frontend Poll dApp
  • The main problem we are facing is that those frameworks are not intended for people wanting to build transactions "by hand" and therefore we don't have access to a lot of fields in a Tx

We spend time debugging and investigating the various options to build a Tx and finally end up with the following structure:

    // Assumes Meshjs' Transaction is in scope, e.g. via: import { Transaction } from '@meshsdk/core'
    const tx: any = new Transaction({ initiator: wallet, parameters })
      .redeemValue({
          value: utxo,
          script,
          datum: { alternative: 0, fields: [] },
          redeemer: { alternative: 0, fields: [] }
      })
      .sendLovelace({
          address: scriptAddress,
          datum: {
              value: { alternative: 0, fields: [] }, inline: true
          }
      }, '98000000') // lovelace amount is passed as a string
      .setCollateral([utxo])

    // HACK: prevent the builder from balancing the tx with wallet inputs
    tx["__visits"].push("setTxInputs");
    const unsignedTx = await tx.build()
    const signedTx = await wallet.signTx(unsignedTx, true)

Key points:

  • the redeemValue is the meshjs method to consume a script, it takes care of adding the script and the hashes
  • the sendLovelace is the simplest method to pay to an address, here a script
  • we use the script input itself as collateral
  • the "hack" is there to prevent the builder from trying to balance the tx and adding inputs from our wallet which of course won't be spendable in the Head
  • the true argument to signTx is a partial signature for the case of txs with multiple required signers. It's not the case for our tx so not sure why we need this

The produced transaction looks OK:

$ ./cardano-cli transaction view --tx-file tx.raw
auxiliary scripts: null
certificates: null
collateral inputs:
- 8dcd2cf83a72b4f3891e5acb8984bef599f56a8a1ddfef2b7a4a364b9be87361#0
era: Babbage
fee: 0 Lovelace
inputs:
- 8dcd2cf83a72b4f3891e5acb8984bef599f56a8a1ddfef2b7a4a364b9be87361#0
metadata: null
mint: null
outputs:
- address: addr_test1wrrjcfmvzn3tjuwa92yagf079lfh7xfpawak47jxjmjm62snxuzpf
  address era: Shelley
  amount:
    lovelace: 98000000
  datum:
    constructor: 0
    fields: []
  network: Testnet
  payment credential script hash: c72c276c14e2b971dd2a89d425fe2fd37f1921ebbb6afa4696e5bd2a
  reference script: null
  stake reference: null
reference inputs: []
required signers (payment key hashes needed for scripts):
- c1bad7b2e02d94573e0dbdae531927cfd3ed33c6aab9dbc76e6c5edc
return collateral: null
total collateral: null
update proposal: null
validity range:
  lower bound: null
  upper bound: null
withdrawals: null
witnesses: []

After much fiddling the tx is still rejected by the Head with an error related to fees! So it seems the head has been opened with a ledger requiring fees, e.g. default parameters.

  ApplyTxError
    [ UtxowFailure (AlonzoInBabbageUtxowPredFailure (NonOutputSupplimentaryDatums (fromList [SafeHash "923918e403bf43c34b4ef6b48eb2ee04babed17320d8d1b9ff9ad086e86f44ec"]) (fromList [])))
    , UtxowFailure (AlonzoInBabbageUtxowPredFailure (PPViewHashesDontMatch (SJust (SafeHash "0a354d10d96547b543a22f8be5fe9c1c0953d5d230f69b0279fd42ac3dfce97a")) (SJust (SafeHash "5f272c20ed07e9634e7bd811910936f9936e7d17778c6ef1bc432065f4b813ac"))))
    , UtxowFailure (UtxoFailure (AlonzoInBabbageUtxoPredFailure (UtxosFailure (CollectErrors [NoCostModel PlutusV2]))))
    , UtxowFailure (UtxoFailure (AlonzoInBabbageUtxoPredFailure (ScriptsNotPaidUTxO (UTxO (fromList [(TxIn (TxId{unTxId = SafeHash "8dcd2cf83a72b4f3891e5acb8984bef599f56a8a1ddfef2b7a4a364b9be87361"}) (TxIx 0), (Addr Testnet (ScriptHashObj (ScriptHash "c72c276c14e2b971dd2a89d425fe2fd37f1921ebbb6afa4696e5bd2a")) StakeRefNull, MaryValue 98000000 (MultiAsset (fromList [])), Datum "\216y\128", SNothing))])))))
    , UtxowFailure (UtxoFailure (AlonzoInBabbageUtxoPredFailure (FeeTooSmallUTxO (Coin 620200) (Coin 0))))
    ]

There are other errors we need to cater for.

2023-10-25

AB on #529

Found a way to separate the 2 parts of observeInitTx:

  • observeRawInitTx only takes on-chain information, mostly the raw tx and the network id to resolve script addresses, and produces, if it matches, a RawInitTxObservation which has extracted all the relevant information to qualify the tx as a head init tx
  • then observeInitTx takes a RawInitTxObservation and, using the ChainContext which is all the static information about the head I am interested in, tells whether or not the tx is relevant and is the actual initialisation of the head we want to be part of
  • Coming back to continue with committing a script to a head...

  • AB points out that I am not actually creating a proper script utxo since I am not setting any datum or redeemer in the producing tx.

  • So the script output produced needs to specify the datum and shouldn't use a reference script: we would like to be able to consume the output over and over, and if we used a reference script then the user would need to know its address in order to use it, which we want to avoid.

  • I had to actually add a function to print the script utxo so that I could use it in a curl call.

curl -i \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -X POST -d '{"e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c5#0":{"address":"addr_test1wrrjcfmvzn3tjuwa92yagf079lfh7xfpawak47jxjmjm62snxuzpf","datum":null,"inlineDatum":{"constructor":0,"fields":[]},"inlineDatumhash":"923918e403bf43c34b4ef6b48eb2ee04babed17320d8d1b9ff9ad086e86f44ec","referenceScript":null,"value":{"lovelace":913720}}}' \
    http://localhost:4001/commit

and BAM! we got the commit transaction back:


{"cborHex":"84a700838258205deab14528dca483af678779af037fec26f2cb42d1cb381ec9808738d062bbf401825820e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c500825820e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c5010d81825820e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c5011281825820f917dcd1fa2653e33d6d0ca5a067468595b546120c3085fab60848c34f92c265000182a300581d708dcc1fb34d1ba168dfb0b82e7d1a31956a2db5856f268146b0fd7f2a01821a002c75b8a1581c44e32df2ca810d3b9190acadc5469f0b0b48e8d7814bc4236d30e44fa1581cf8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d0102820058201bce4115046ca8dc59dc19070976178e1169e7b098d79e57448fe76c8f628efda200581d60f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d011b0000000248b47790021a0035d8600e81581cf8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d0b5820e44102692bd73ded9b73661862fc3e011ffbca1e337d040a773b37e17f9b95dea30081825820eb94e8236e2099357fa499bfbc415968691573f25ec77435b7949f5fdfaa5da058402e9c2f764d706992b7853b8999ba8aff538f4be61ba2dd93393de91b08ea1cfc796f5a2a99bd04f3207d91c3f6a95814afdf91ecf1543f499a81f4fc3120a5040482d8799f58208e4522169051101c4675c92a9a21fe11e4fc6f1ad57903efa4ff273f995b0d2b9fd8799fd8799fd8799f5820e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c5ff00ff5840d8799fd8799fd87a9f581cc72c276c14e2b971dd2a89d425fe2fd37f1921ebbb6afa4696e5bd2affd87a80ffa140a1401a000df138d87b9fd87980ffd87a80ffffff581c44e32df2ca810d3b9190acadc5469f0b0b48e8d7814bc4236d30e44fff581c44e32df2ca810d3b9190acadc5469f0b0b48e8d7814bc4236d30e44f0581840000d87a9f9fd8799fd8799f5820e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c5ff00ffffff821a00d59f801b00000002540be400f5f6","description":"Hydra commit transaction","type":"Tx BabbageEra"}

  • Now I signed the tx with Alice's key using cardano-cli, which worked, but when trying to submit this is the error:

Command failed: transaction submit  Error: Error while submitting tx: ShelleyTxValidationError ShelleyBasedEraBabbage (ApplyTxError [UtxowFailure (AlonzoInBabbageUtxowPredFailure (ShelleyInAlonzoUtxowPredFailure (MissingScriptWitnessesUTXOW (fromList [ScriptHash "c72c276c14e2b971dd2a89d425fe2fd37f1921ebbb6afa4696e5bd2a"]))))])
  • After inspecting our API code (I worked on it but I am obviously a forgetful zombie) we realize that there is a field to specify script witnesses. If you don't populate this field the hydra-node will just provide the payment key witness for the transaction.

  • This is how the final curl call looks:


curl -i \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -X POST -d '{"e020bac0c84b9f317ecbab2f8632f085579e3becb4d990f28b13eb4d8e9ef0c5#0":{"address":"addr_test1wrrjcfmvzn3tjuwa92yagf079lfh7xfpawak47jxjmjm62snxuzpf","datum":null,"inlineDatum":{"constructor":0,"fields":[]},"inlineDatumhash":"923918e403bf43c34b4ef6b48eb2ee04babed17320d8d1b9ff9ad086e86f44ec","referenceScript":null,"value":{"lovelace":913720}, "witness": { "plutusV2Script": { "cborHex": "59064559064201000032323232332222253353332221220023333573466e1cd55ce9baa0034800080208c98c8028cd5ce0038040041999ab9a3370e6aae74dd5001240004010464c6401466ae7001c020020c8cccd5cd19b8735573a0029000119191919191919910919800801801191999ab9a3370e6aae74005200023232323232323232323232323232323232323232323232323232323333332222221233333333333300100701200600500400e00d00c00300a0020083300d2323333573466e1cd55ce800a4000464646466442466002006004604c00460280026ae84d5d128011aba15001135573c004464c6406266ae700b80bc0bcdd5000806198068070051998083ae500f00933301075ca01e0106601aeb8010ccc041d710008011aba135744a0206ae85403cd5d0a8079aba1500f35742a01e6ae85403cd5d0a8079aba1500f35742a01e6ae85403cd5d0a8079aba1500f2322300237580026044446666aae7c00480848cd4080c010d5d080118019aba20020222323333573466e1cd55ce800a4000464646464646464646666444424666600200a008006004646666ae68cdc39aab9d001480008c8c8c8cc8848cc00400c008c090008cc02808c004d5d09aba2500235742a00226aae780088c98c80b4cd5ce0150158159baa00433300d75ca01800664646666ae68cdc3800a4008464244460040086ae84d55cf00191999ab9a3370e0049001119091118008021bae357426aae780108cccd5cd19b87003480008488800c8c98c80c0cd5ce0168170170168161aab9d00137540046600aeb8004d5d09aba2500535742a0086ae854010d5d0a8021119191999ab9a3370e002900011919091180100198030009aba135573c00646666ae68cdc3801240044244002464c6405866ae700a40a80a80a4d55ce8009baa001135744a00226ae8940044d55cf00111931901199ab9c0200210213754002266002eb9d69119118011bab00130202233335573e002403e46466a03e66442466002006004600c6aae754004c014d55cf280098021aba200313574200404026ae8940044d5d1280089aba25001135744a00226ae8940044d5d1280089aba25001135744a00226ae8940044d5d1280089aab9e0022326320133357380200220226ea8008c8c8cccd5cd19b87001480188c848888c010014c8c8cccd5cd19b870014803084888888800c8c8c8c8cccd5cd19b87005480288488888880108cccd5cd19b87006480208c8c8cc8848888888cc004024020dd70011bad001357426ae894010d5d0a80191999ab9a3370e00e900311919199109111111198010048041bae002375c0026ae84d5d128031aba1500523333573466e1c021200423232332212222222330060090083013002375c0026ae84d5d128041aba1500723333573466e1c0252002232321222222230070083013001357426aae7802c8cccd5cd19b8700a480008c8c848888888c014020c050004d5d09aab9e00c23263202033573803a03c03c03a03803603403226aae780144d55cf00209aab9e00301535573a0026ea8d5d09aab9e00323333573466e1c0092004232321222230020053009001357426aae780108cccd5cd19b87003480088c8c848888c004014c024004d5d09aab9e00523333573466e1c0112000232122223003005375c6ae84d55cf00311931900b99ab9c01401501501401301235573a0026ea8004d5d09aba2500535742a0084646666ae68cdc39aab9d001480008c8c8c8cc8848cc00400c008c8cccd5cd19b8735573a002900011bae357426aae780088c98c8058cd5ce00980a00a1baa002375a0026ae84d5d128011aba15001135573c004464c6402266ae7003803c03cdd5000919191999ab9a3370e0029001119191919191999110911998008028020019bad003375a0046eb4004d5d09aba2500335742a0046ae8540084d5d1280089aab9e00323333573466e1c00920002323212230020033007001357426aae780108c98c8048cd5ce0078080080079aab9d0013754002464646666ae68cdc3800a400446424460020066eb8d5d09aab9e00323333573466e1c00920002321223002003375c6ae84d55cf00211931900899ab9c00e00f00f00e35573a0026ea80044d55cf00111931900599ab9c0080090093754002200e264c6401266ae7124103505435000071220021221
223300100400349010350543100120014988c8c00400488cc00cc0080080041", "description": "", "type": "PlutusScriptV2" }, "redeemer": "d87980" }}}' \
    http://localhost:4001/commit

  • And voila! HeadIsOpened!

  • This is probably the first script being committed to a Hydra head.

  • Now back to the frontend problems with browser wallet...

2023-10-24

Ensemble on stateless observation

Starting work on stateless observation, looking at SN's work and ADR, trying to find what tests would be interesting to write or modify. We currently do not have direct tests for observeXXXTx, those functions are only indirectly used in some other tests.

  • Seems like #529 would be a good start: We want to notify the user when an InitTx fails, which is hard to tell from the logs or messages because we need to wait for the tx to be observed which can happen after a significant amount of time depending on the network conditions.
  • The problem with the proposed design is that moving all the observations into the HeadLogic will bloat it significantly, and I am pretty sure this will make it harder and harder to understand over time. The HeadLogic should be dedicated to handling a single head lifecycle which is already intricate (and will become even more so with incremental commits and decommits)

Interestingly, there's another way to notify clients of things happening in the node than going through the HeadLogic, and it is the one used by the network layer:

  connectionMessages Server{sendOutput} = \case
    Connected nodeid -> sendOutput $ PeerConnected nodeid
    Disconnected nodeid -> sendOutput $ PeerDisconnected nodeid

Here connectionMessages is passed to the network layer to propagate connection information directly, without needing to add those messages to the HeadLogic. We could apply the same logic to the observations made on-chain that might or might not be relevant to the user. We could even channel the logs directly to the client, much like we do for the monitoring stuff?

And we have the same couple of types in the Chain interface as we have in Network, so we could compose them easily:

withChainObserver (withDirectChain ....) callback action = ...

We start writing an ETE test to manifest the problem with #529 hoping this will lead us into needing stateless observation for InitTx

  • Trying to write a proper ETE is not easy, the hydra-node is crashing without a log...
  • Seems like we need to generate a proper key? We had some odd test failures because we were using the wrong key for bob, thanks to the -Wno-name-shadowing option being present in the EndToEndSpec file, meaning we were using the wrong vkey/party id for Bob
  • Now node 3 is not observing the head initialising because we are passing its own vk in the ChainConfig so it cannot observe the initTx either

In the observation code, we mix 2 concerns:

  • Observing transactions that "look like" Head transactions
  • Filtering those txs according to our context

We introduce a ServerOutput constructor to expose a SomeHeadInitializing message.

SB on Commiting a Script to a Head

  • In order to make Hydra Poll a bit better, in the sense that we would like to build transactions on the frontend, our plan is to create an always-spendable script output, commit that to a Head, and then from the browser wallet construct a transaction that consumes this utxo and also recreates it in the outputs, which should work inside of a Head (NewTx).

  • When constructing a NewTx there are no further checks but one: that you are actually allowed to spend some utxo belonging to an open Head.

  • So theoretically, if you are able to produce an always-spendable utxo and commit it to a head, any wallet could produce a valid NewTx since we don't check whether the signing key is part of the Head (we assume that being able to consume the head utxo should be enough)

  • I added some Haskell code to produce a script output that is always spendable.

  • Had minor problems because I was using a protocol parameters file with zeroed fees, so I wasn't able to construct a valid tx because of the min utxo rule, and just had to query the node to obtain pparams.

    
    

submitTestScript :: SocketPath -> FilePath -> NetworkId -> IO TxId
submitTestScript socketPath skPath networkId = do
  (vk, sk) <- readKeyPair skPath
  utxo <- queryUTxOFor networkId socketPath QueryTip vk
  pparams <- queryProtocolParameters networkId socketPath QueryTip
  let script = examplePlutusScriptAlwaysSucceeds WitCtxTxIn
      scriptAddress = mkScriptAddress networkId script
      changeAddress = mkVkAddress networkId vk
  let output = mkTxOutAutoBalance pparams scriptAddress mempty TxOutDatumNone ReferenceScriptNone
      totalDeposit = selectLovelace $ txOutValue output
      someUTxO =
        maybe mempty UTxO.singleton $
          UTxO.find (\o -> selectLovelace (txOutValue o) > totalDeposit) utxo
  buildTransaction networkId socketPath changeAddress someUTxO [] [output] >>= \case
    Left e -> do
      print (renderUTxO someUTxO)
      error $ show e
    Right body -> do
      let tx = makeSignedTransaction [makeShelleyKeyWitness body (WitnessPaymentKey sk)] body
      print $ renderTx tx
      submitTransaction networkId socketPath tx
      void $ awaitTransaction networkId socketPath tx
      return $ getTxId body


- I called this function from ghci like so:

ghci> submitTestScript "db/node.socket" "alice.sk" (Testnet (NetworkMagic 1))


- In the explorer we can see this transaction
  https://preprod.cexplorer.io/tx/8f45e5dfd5475baaea15b157e8975d303199268f10655f4485fa312bdc033334
  and now I need to spin up the hydra-node to obtain a commit transaction using
  this script utxo. That should give me a transaction that I can sign and submit
  to get the head opened using just Alice as a party.

- Then we'll connect Hydra Poll frontend to this head and test out the browser
  wallet functionality.

- I would need to do this on the aws server where I am actually planning to run
  a Head since the hydra-node needs to be in the initializing phase.

- Actually it is really nice to use hydra-node from the user perspective, it
  makes us see all the pain points.

- I had to expose the port for the Hydra's api server so that I could hit it
  using the request body constructed from the script utxo.

- Now we come to the problem of how to serialize the transaction we see on the
  cardano explorer to the json our hydra-node expects.

- We have nice docs when it comes to `commit` api endpoint and just using the
  json from there works but of course I need to construct one from the script
  utxo and the main problem here is how to obtain `cborHex` field of the
  `witness` part.

- I am struggling to create json in expected format:

ubuntu@ip-172-31-45-153:~$ curl -i \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -X POST -d '{"8f45e5dfd5475baaea15b157e8975d303199268f10655f4485fa312bdc033334#0":{"address":"addr_test1wpv9qwsa3x3plj0u20t20nxw73e5z9663arkxm6hej7u5tg954ruz", "datum":null, "inlineDatum":null, "inlineDatumHash":null, "referenceScript":null, "value":{ "lovelace": 874930}, "witness": {"datum": null, "plutusV2Script":{"cborHex": "84a30081825820f5f17200d9cdd7dbb002e6026ca13a27bba6b5ae841c43a078bee6427b45479e010182a200581d7058503a1d89a21fc9fc53d6a7cccef47341175a8f47636f57ccbdca2d011a000d59b2a200581d60f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d011b00000002491aa5eb021a00028ff1a10081825820eb94e8236e2099357fa499bfbc415968691573f25ec77435b7949f5fdfaa5da0584094a2c4419a00b4d0894ae36a8b277400ed2494d6d27055f62514cdddb5ae477a653ca606010f092e4b158ae6daff5af98ae37bc528bdeddb0682d8448c1a5107f5f6", "description": "", "type": "PlutusScriptV2"}, "redeemer": "02" }}}' \
    http://localhost:4001/commit

HTTP/1.1 400 Bad Request
Transfer-Encoding: chunked
Date: Tue, 24 Oct 2023 15:02:01 GMT
Server: Warp/3.3.29

"Error in $['8f45e5dfd5475baaea15b157e8975d303199268f10655f4485fa312bdc033334#0'].witness.plutusV2Script: TextEnvelopeDecodeError (DecoderErrorDeserialiseFailure "UsingRawBytes (PlutusScript PlutusScriptV2)" (DeserialiseFailure 0 "expected bytes"))"


- Just realized that I need the TextEnvelope serialization for the script, not the complete tx, derp...

- The always-succeeding script is actually a PlutusV1 script, so now I need to see what changed between versions:

ubuntu@ip-172-31-45-153:~$ curl -i \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -X POST -d '{"8f45e5dfd5475baaea15b157e8975d303199268f10655f4485fa312bdc033334#0":{"address":"addr_test1wpv9qwsa3x3plj0u20t20nxw73e5z9663arkxm6hej7u5tg954ruz", "datum":null, "inlineDatum":null, "inlineDatumHash":null, "referenceScript":null, "value":{ "lovelace": 874930}, "witness": {"datum": null, "plutusV1Script":{"cborHex": "484701000022220011", "description": "", "type": "PlutusScriptV1"}, "redeemer": "02" }}}' \
    http://localhost:4001/commit

HTTP/1.1 400 Bad Request
Transfer-Encoding: chunked
Date: Tue, 24 Oct 2023 15:09:52 GMT
Server: Warp/3.3.29

"Error in $['8f45e5dfd5475baaea15b157e8975d303199268f10655f4485fa312bdc033334#0'].witness: parsing Hydra.API.HTTPServer.ScriptInfo(ScriptInfo) failed, key "plutusV2Script" not found"


- Ok, I could just write the simple validator myself and construct a tx using this one.

- I added a Dummy validator that does nothing but return True and after issuing the request I got this response:

ubuntu-hydra-node-1 | {"timestamp":"2023-10-24T16:12:56.782777486Z","threadId":10014,"namespace":"HydraNode-"Sasha"","message":{"api":{"reason":"ScriptFailedInWallet {redeemerPtr = "RdmrPtr Spend 0", failureReason = "RedeemerPointsToUnknownScriptHash (RdmrPtr Spend 0)"}","tag":"APIConnectionError"},"tag":"APIServer"}}
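
- For reference, the dummy validator boils down to something like this (an untyped Plutus sketch; the import path and the PlutusTx.compile wrapping into a PlutusV2 script are assumptions, not the exact hydra code):

import PlutusTx (BuiltinData)

-- Succeeds for any datum, redeemer and script context; it would then be
-- compiled with PlutusTx.compile and serialised as a PlutusV2 script.
dummyValidator :: BuiltinData -> BuiltinData -> BuiltinData -> ()
dummyValidator _datum _redeemer _context = ()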



## 2023-10-23

### SB on gen-hydra-key

- While working on Hydra Poll I noticed that our long running Head on preprod
  stopped working.

- Initial exploration revealed that the docker container for the hydra-node keeps
  restarting because of a Misconfiguration error where the loaded party keys don't
  match.

- When looking at the logs I noticed that one of the keys does not have the same
  hash.

- Realized that by accident I ran `gen-hydra-key` on this server instead of the Hydra Poll one,
  which silently altered my hydra keys!

- Luckily I had the same keys locally and re-uploaded them to the server, which
  made the Head live again, and I opened this issue to prevent someone else from
  running into the same problem:
  https://github.com/input-output-hk/hydra/issues/1136

- Fix should be simple and we should issue a warning if we detect that a key
  file already exists.

### SB on Hydra Poll

- The extended key format mentioned for the hydra-node is not really needed right now
  since we are building the transaction on the backend (we suck at frontend,
  basically).

- The nice idea of building a transaction on the frontend for Hydra Poll is on
  hold since it takes time for us to explore and implement the frontend code
  and we just don't want to spend that time now for this purpose.

- Plan is to try and get help from our colleagues at IOG since what we have is
  already working but it isn't optimal.

- When deploying the app on aws all sorts of problems arose, and now at
  this last stage I would like to be able to run a production build of the
  frontend app, not just a lame `npm run dev`.

- We used Franco's knowledge of the frontend to convert the app to use React.
  Stubbed a nice websocket interface and had to fix a few bugs along the way but
  overall we are in a much better position to extend the app now.

- Everything works, I exposed the remote ports so you can even run the frontend
  locally. The plan is to try and build a transaction on the frontend now and
  hopefully if this works we will commit _always returns true_ script into a
  Head and use that utxo to build a frontend transaction.

## 2023-10-20

### AB + FT on fixing dev environment on Mac OS

* Trying to run the docker-based demo from `master` but it fails: the version number says 0.14.0 which does not exist as this is the _next_ version to be released. This was changed by running the `release.sh` script _after_ the release was made, passing the future version
  * We [change it back](https://github.com/input-output-hk/hydra/pull/1134) to `unstable` on master, expecting it to be set correctly at the next release tag, but we should probably clarify the release process
  * Anyhow, Docker is not working reliably on newer M1/M2 MBPs!
* We resort to use locally built binaries for both hydra-node and cardano-node which "just works"

### AB Fixing red master

* Build of static executables (and docker) has been _red_ without anyone noticing since we [merged ghc 9.6.2](https://github.com/input-output-hk/hydra/actions/runs/6558028968) upgrade
* The build is failing because of the lack of a musl-built GHC.
* This seems to be a [known issue](https://github.com/input-output-hk/cardano-db-sync/blob/20952ca316d03d2f27ea1212f3399e961e6c55dc/flake.nix#L194) in newer GHC versions which is fixed in nix using some attribute
* The [PR](https://github.com/input-output-hk/hydra/pull/1133) fixes the build issue but also removes branch filtering on static executables build job so that they are now executed on every PR

### Protocol parameters for Demo/Devnet

* We merged a [PR](https://github.com/input-output-hk/hydra/pull/1122) to upgrade the checked-in `protocol-parameters.json` file to the latest serialisation format used by the ledger
* We should probably [remove hardcoded parameters](https://github.com/input-output-hk/hydra/issues/1124) file and extract it from the devnet cardano-node when starting up the demo in order to keep it in sync

## 2023-10-19

### AB on Flaky ETE tests

Some ETE tests are failing unreliably, namely the ones related to misconfiguration: https://github.com/input-output-hk/hydra/actions/runs/6571475876/job/17850707077?pr=1128#step:5:305
This is an ETE test because we call the `checkHeadState` function that verifies the parameters mismatch in `Main.hs`:

    withCardanoLedger chainConfig pparams $ \ledger -> do
      persistence <- createPersistenceIncremental $ persistenceDir <> "/state"
      (hs, chainStateHistory) <- loadState (contramap Node tracer) persistence initialChainState
      checkHeadState (contramap Node tracer) env hs
      nodeState <- createNodeState hs

This code feels very imperative and hard to understand, also mixing various responsibilities

We had some ETE tests checking misconfiguration that were failing
frequently. These tests were checking the hydra-node process would
crash in the event of a misconfiguration and would log the issue, by:
* Capturing the process output and check some string is present
* Capturing the process' logs and checking some log entries are present

These tests are run in ETE because the checks are done during the
"startup" phase in the `Main.hs`, but they are actually implemented by
a function called `checkHeadState` which could more easily be tested
independently, so this is what I did: [Create several unit tests](https://github.com/input-output-hk/hydra/pull/1130) to
expose the behaviour of this function and remove the ETE tests.

In earnest, I should add a test that checks the hydra-node actually
does this check when starting up but this would require extracting the
"boot" function from the Main module

### AB on TUISpec tests breaking the build

The upgrade to GHC 9.6.2 and brick 1.1 surfaced various problems with the `TUISpec` tests:
* They sometimes segfault: https://github.com/input-output-hk/hydra/actions/runs/6569639856/job/17845836466?pr=1128#step:6:4502
* We ignore when they fail in CI, but the way we run them means we also ignore when the _build_ fails
* When the tests run and fail, they [break the XML](https://github.com/input-output-hk/hydra/issues/1126) file used to report tests execution

Trying to fix the CI in order to ensure the [TUI code _builds_](https://github.com/input-output-hk/hydra/pull/1131) but we skip the tests _execution_. Finally found the right command to only build the hydra-tui derivation:

nix develop .?submodules=1#tests.hydra-tui --build


Kinda makes sense :shrug:

Let's discuss the strategy moving forward. I see 3 paths:

1. drop the TUI in favor of a simpler REPL-like CLI, perhaps based on [hint](https://hackage.haskell.org/package/hint) although I fail to see the added value for the added weight (hint basically embeds a whole GHC)
2. Work to improve the brick testing strategy (see [this issue](https://github.com/jtdaugherty/brick/issues/447)), contributing upstream
3. Find a way to run those tests in a proper terminal in CI (which probably requires a custom runner?)

## 2023-10-18

### [Summit slides](https://github.com/input-output-hk/hydra/issues/1109)

* Start from RareEvo slidedeck
* Include content from last Cardano Summit and presentation at the CF in March 2023
* Add room for 3 or even 4 demos
  * Basic tutorial walkthrough to show the components in action (eg. _Basic Hydra Head_ topology)
  * Hydraw + [HydraPoll](https://github.com/input-output-hk/hydra/issues/1110) to demonstrate a dApp using _Delegated Hydra Head_
  * HydraNow
* We include a section on Mithril which is reaped from some high-level presentation
* We update the "branding" to use _Scaling Tribe_ logo and name

### TUI tests issues

As part of [#590](https://github.com/input-output-hk/hydra/issues/590#issuecomment-1307421209) we decided to ignore TUI test failures and this is now coming back to bite us.

2 PRs are failing because the TUISpec tests are failing and their output crashes the JUnit reporter:
* https://github.com/input-output-hk/hydra/pull/1119/checks?check_run_id=17832266763 : failure caused by TUISpec output not being correctly escaped
* https://github.com/input-output-hk/hydra/pull/1121/checks?check_run_id=17832359325 : same

We also discovered that some PRs have been green and merged while the TUISpec tests were not even compiling, or sometimes even crashing. This is caused by the CI ignoring the errors, but because of how we run the tests, this not only ignores the test failures, which _could_ be fine, but also the compilation errors.

  # TUI specs are flaky. They are failing because of SIGSEGV.
  # There is an open issue to tackle this problem. https://github.com/input-output-hk/hydra/issues/590
  continue-on-error: true

Wrote an [issue](https://github.com/input-output-hk/hydra/issues/1126) to clarify what the problem is, namely the fact that test failures output non-escaped ANSI codes which break the JUnit XML report

## 2023-10-16

### AB tinkering

Writing a games app, trying to provide a turnkey installation experience to run a hydra-node and cardano-node, leveraging mithril-client
* Wrapping the cardano-node was relatively straightforward, got it to install binaries in `~/.local/share/cardano`, config in XDG config directory, and DB in XDG data directory, and logs in XDG cache dir.
* Did not do the retrieval through mithril-client but it should be straightforward
* Having released artifacts for Mac OS is very useful, and surprisingly they work out of the box even though the cardano-node binaries are probably compiled for x86

Trying to do the same for hydra-node, I realise (again) how cumbersome configuring and running a hydra-node is:
* It needs the protocol parameters which can only be retrieved from the node, which requires the cardano-cli...
* It needs to expose a port for connections which is extremely annoying, as this port must be known in advance and advertised to peers
  * I realised that using UDP would alleviate a lot of those problems thanks to [hole punching](https://tailscale.com/blog/how-nat-traversal-works/) techniques. It might require us running a STUN/TURN service but that's probably easy to do, and does not need any trust
  * With UDP, a hydra-node could easily initialise a Head dynamically, binding to a single, possibly even randomly allocated, port and then allowing the user to select its peers before starting the head
  * This could be coupled with sharing some symmetric encryption key in the Head inittx so that participants can actually extract the off-chain connection information from the chain
* It needs a bunch of keys, both private and public
* It needs a cardano-node running with an open socket otherwise it just crashes

## 2023-10-13

### SB on Hydra Poll

- One nice idea for the Dubai summit is to have a poll dApp running on Hydra

- Since we are Haskell developers, building a frontend is not an easy task for us.

- Our findings are that to use any frontend library for building transactions
  you need to have your wallet loaded.

- We have the frontend talking to a hydra-node over websocket and we loaded the Nami wallet
to build a new tx to send to the open Hydra head.

- How do we load the key/address into Nami?

- Hydra doesn't understand the extended key format so it seems like a good idea to
  add that functionality.

- This would give us a way to have a Cardano key generated from some
  recovery phrase. This key could be loaded into the hydra-node and also recovered
  in the Nami wallet so that we could build a tx on the frontend and have it
  recognized by the hydra-node in an open head.

## 2023-10-12

### Building a DApp on Hydra

We brainstorm what it would look like to build a DApp (game) on top of Hydra that would be mostly like another app for end-users, eg. a [Chess game](https://github.com/input-output-hk/hydra/issues/1098). We spent some time on Miro trying to [storymap](https://miro.com/app/board/uXjVMA_OXQ0=/?moveToWidget=3458764566602788714&cot=14) a 2 players game, and [high-level architecture](https://miro.com/app/board/uXjVMA_OXQ0=/?moveToWidget=3458764561779396079&cot=14).

## 2023-10-11

### R&D meeting

* Short meeting to discuss incremental commits and [decommits](https://github.com/input-output-hk/hydra/issues/1057) with Mathias and Sandro
* There seem to be two competing approaches:
  * One approach based on "Snapshot Preparation" whereby the parties prepare off-chain a snapshot that's multisigned and authorises the Open -> Open transition
  * Another based on "Transaction Multisig" where the parties agree out-of-band on an Open -> Open transition, craft and multisign the relevant transaction, and then reflect the result off-chain
* Reached out to @chak to check whether or not the latter solution still makes sense

## 2023-10-10

### Grooming session

* Discussing remaining work to be done on [Network resilience](https://github.com/input-output-hk/hydra/issues/1079), we think it's "good enough" for now and we need more evidence to go the extra mile and guarantee the node can always pick up in cases of crashes
  * We close the issue and create a [follow-up](https://github.com/input-output-hk/hydra/issues/1106) issue to implement some tests that will explicitly try to expose those issues and test the cluster for consistency in face of non-byzantine failures
* Discussed incremental commit/decommits, whether there's more grooming to be done, and where to put them on the roadmap
  * We agree it's a feature that quite a few people expressed interest in but it would be great to have some (one) concrete use case demonstrating the need and how to use that feature
  * SB mentions that we experienced a problem incremental commits would have solved on Hydraw, namely the situation where one party holds all UTxOs, therefore preventing other parties from posting a tx

Getting the list of versions along with their release dates, it seems like we consistently have a six-week rhythm, so the next version (0.14.0) would be due mid-November:

% for t in $(git tag ) ; do echo "$t $(git log -n 1 --format=%ad --date=short $t)"; done | sort -V
0.1.0 2021-09-30
0.2.0 2021-12-14
0.3.0 2022-02-02
0.4.0 2022-03-23
0.5.0 2022-05-06
0.6.0 2022-06-22
0.7.0 2022-08-23
0.8.0 2022-10-27
0.8.1 2022-11-17
0.9.0 2023-03-02
0.10.0 2023-05-11
0.11.0 2023-06-30
0.12.0 2023-08-18
0.13.0 2023-10-03


## 2023-10-06

### Weekly Update

#### Achievements

* 🎉Release 0.13.0
* 🦺Published vulnerability reports
* 📶Merged network resilience work part I
* 📣Planning for Cardano Summit participation
* 💸Discussions with funded Catalyst projects wanting to build on Hydra for support
* 📄Merged typos fix PR from @omahs

#### Plan for next week

* Complete Aiken commit validator script
* Complete Kupo integration
* 🧱Brick upgrade on TUI
* 🧹Clean backlog
* 🎭Prepare and rehearse demo and talk for Cardano Summit

## 2023-10-05

### Aiken Commit script

* Mutation tests for the commit validator pass when they should fail: when we change the error code in the commit validator, the `Commit` mutations still succeed
  * Actually the commit validator is not run at all in the `Commit` transaction :facepalm:
  * Running `Abort` or `CollectCom` yields the same problem though
  * We don't have a mutation check in place for the `νCommit` validator!

We want to write mutation tests in the `Abort` and `CollectCom` context to check this validator is correctly run
* We add a test to the `Abort` for the commit validator by not burning the ST. The validators fail for different reasons

Checking that νCommit fails in `CollectCom` tx because the ST is not present requires changing the value

Now we want to tackle the same for the `νInitial` script which fails for the same reason in the `Abort` tx case -> we realise it's a bit cumbersome and perhaps it would be just fine to check a _list of errors_ and use the same mutation to handle several cases.

Rebased the aiken branch on top of our PR for commit/initial mutation tests, those fail because the expected error is not there.
* This seems to be related to a comment made in passing by SN that aiken strips away all error strings from the produced script by default
* Need to add the `-k` flag to aiken to keep the failure strings around because otherwise the tests fail

Checking impact of new script:
* We checked with the benchmarks that the maximum parties for `Abort` and `CollectCom` transactions is 10.
* The size of the `νCommit` script is about 1100 bytes and using inlined strings in the `traceError` does not change it significantly.
  What has a drastic impact is the complete removal of `trace` but this means we don't get any error reporting.
* Note that even though the benchmarks say we can host 10 parties, the `StateSpec` tests fail above 7 parties.

### Tactical - Discussion about [Chess game proposal](https://github.com/input-output-hk/hydra/issues/1098)

* Could be ready for Dubaï if we start now, at least with very simple TUI?
* We want to clean the backlog first and work on this showcase
* Q: Is this worthwhile for attendees at Dubaï? People might be more interested in payments, auctions,... In Lausanne people were asking questions about those topics

* What would chess prove that auctions wouldn't prove?
  * The auctions demo is not runnable right now, nobody is working on it anymore
  * ➡️Running a game requires moving scripts to/from the Head
  * Chess is an application that's usable by "anyone"
  * It would prove real-world applications can be built on top of Hydra
  * It would lead us to do the work to have it running and used by us

## 2023-10-02

### AB and SB on Network Resilience

* We realised that PR
  [#1074](https://github.com/input-output-hk/hydra/pull/1074) has been
  red since last Wednesday: The hydra-cluster benchmarks are not able
  to run to completion anymore. There's been quite a few new commits
  pushed on top of a red CI without us noticing and it's now really
  hard to troubleshoot
* We `git bisect` the `cabal bench hydra-cluster` run that's failing and find the ["guilty" commit](https://github.com/input-output-hk/hydra/commit/2e3aed05f95619f79f44c51399028d8a796c728f): This commit fixed the "network stress test" but broke the benchmark
* Trying to troubleshoot the issue, experimenting with different
  changes but without much success initially. The problem seems to have
  its root in the _messages garbage collection_ and the way it
  updates various counters, but it's unclear why. Reverting the change
  on top of the branch's HEAD does not help and we can either run the
  benchmarks or make the test pass, but not both
* Investigating a bit more we found several issues:
  * The benchmark can fail because of a "resend storm": The nodes are
    so busy resending messages they cannot make progress anymore. This
    appears to be caused by the fact we don't check anymore for node's
    quiescence before deciding to resend messages
  * The tests can fail because one or the other peer stops earlier and
    the other peer's timeout is triggered
  * Somewhere in the refactoring the atomicity of counter updates has been
    lost, which means that messages can be "lost" by duplicating their
    ids (this is observable sometimes in the benchmark runs)
  * The underlying network layer runs one thread per connection which
    means `callback` **must** be threadsafe. This is not observable in
    the stress test because we have only 2 parties and the network
    layer is a simple queue

* Created another [draft
  PR](https://github.com/input-output-hk/hydra/pull/1094) to restart
  from a sane base, eg. the last commit with a [green
  benchmarks](https://github.com/input-output-hk/hydra/commit/15bf75f7f38df4e7a899ee61f3057a2d4418acf8),
  then slowly cherry-picked and adapted various changes:
  * Remove all messages garbage collection logic which seems
    problematic. Also, tracking messages with an `IntMap` seems a bit
    odd as a `Sequence` with an offset would be simpler and more
    efficient but we leave that aside rn
  * Only check messages resending when the peer is quiescent, eg. upon a `Ping`
  * Update comments

# September 2023

## 2023-09-28

### PG and SB on Network resilience testing

- Open a head

- Issue NewTx and see snapshot confirmed

- Use this iptables rule to shut down Alice's incoming connections

sudo iptables -A OUTPUT -p tcp -d 127.0.0.1 --dport 5001 -j REJECT --reject-with tcp-reset

- Alice doesn't see anybody connecting.

- Other peers see Alice as still connected because we didn't kill the outgoing
  connection.

- While Alice's incoming connection is down we issue a new tx from Bob or Carol.

- The transaction is submitted successfully for Bob and Carol but Alice of course doesn't
  see this one.

- Delete the iptables rule for alice:

sudo iptables -D OUTPUT -p tcp -d 127.0.0.1 --dport 5001 -j REJECT --reject-with tcp-reset


- Bob and Carol reconnect to Alice but Alice doesn't see this new tx
  submitted which is expected and no snapshot is produced.

- This should already be enough to demonstrate that the network is not working
  optimally on current master.

- When we submit new txs from Bob and Carol we observe that no new snapshots get confirmed and
the Head can be closed with the first snapshot seen by all parties.

- When we do all of the steps above on `network_model` branch we do observe Alice is able to _recover_
once it is up and the snapshots continue to get confirmed.

## 2023-09-25

### Tactical notes

+ FT:
  - Plutus errors when trying to load file commit.commit.uplc as compiled code

+ PG:
  - Can’t compile anymore 🙁 -> we need libblst dependency
  - Close the head? Yes, let’s close

+ SN:
  - Monthly report reminder (let’s finish soon)
     - Next monthly? All clear?
     - Arnaud owns the organization
     - Franco does the demo
  - Weekly updates
     - Just do on cardano-updates
     - Aiken script discussion - shall we have one already?
        -> Grooming Tomorrow

### SN on Kupo

- Many tests depend on a busy node, which is synchronizing (or has already
  synchronized) quite some history. How do we replicate that for Hydra?
- For a basic, idle Hydra head we can use `hydra-cluster --devnet`
- For a busy Hydra head we could maybe use a variant of the `cabal bench
  hydra-cluster` run?

## 2023-09-22

### SN/SB on Hydra support in Kupo

- To test kupo integration, a `hydra-cluster` command that spins up a devnet,
  opens a head with some `UTxO` inside and just waits, would be handy to have.
- Hacking it quickly into `hydra-cluster --devnet` at first and see what is
  needed for a more polished command line later.
- `waitSlot` requires `/checkpoints` to be non-empty, however these only
  get set `onRollForward` and our hydra client currently only does that on
  `SnapshotConfirmed`. Do it also on `HeadIsOpen`?
- We wanted to represent the `HeadIsOpen` as the `GenesisPoint`, but the
  `insertCheckpoints` function fails on using `pointToRow` as it seemingly does
  not expect to insert a genesis point into the db.
- By producing a "normal block" point we get a checkpoint at slot 0
- Now to add some utxo, we must handle `TxValid` messages, store them and yield
  a "hydra block" with those transactions whenever we see a `SnapshotConfirmed`
  message
  + We introduced a `TransactionStore` handle which captures the idea of pushing
    transactions and retrieving them later by id (exactly once)
  + Could/should unit test this
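
A minimal sketch of such a handle (the field names are assumptions, not the actual code): push transactions as they come in and retrieve each one exactly once by id.

```haskell
import Control.Concurrent.MVar (modifyMVar, modifyMVar_, newMVar)
import qualified Data.Map.Strict as Map

-- | Push transactions as they are observed and pop them exactly once by id
-- when a snapshot gets confirmed.
data TransactionStore tx txId = TransactionStore
  { pushTx :: txId -> tx -> IO ()
  , popTxById :: txId -> IO (Maybe tx) -- ^ Returns the tx at most once, then forgets it
  }

newTransactionStore :: Ord txId => IO (TransactionStore tx txId)
newTransactionStore = do
  var <- newMVar Map.empty
  pure
    TransactionStore
      { pushTx = \txId tx -> modifyMVar_ var (pure . Map.insert txId tx)
      , popTxById = \txId ->
          modifyMVar var $ \m -> pure (Map.delete txId m, Map.lookup txId m)
      }
```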

## 2023-09-21

### SB on Network Resilience delete sent messages

- Now that we have a skeleton working, let's improve it by deleting old/already
  seen messages.

- The algorithm should be fairly simple - we need a data structure to record whether all
  parties have seen a message we sent out. When that happens we can
  remove the message at that index.

- Since messages are currently stored in a `Vector` we should probably convert it to a
  `Map`, since a vector re-indexes after deleting elements and we would
  lose track of our indices (see the sketch below).

- Let's start by writing a test TM
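
A rough sketch of the pruning step described above, assuming we keep sent messages in a `Map` keyed by their broadcast index and track, per party, the highest index that party has acknowledged (all names here are illustrative, not the actual `Reliability` code):

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- | Sent messages, keyed by their broadcast index.
type SentMessages msg = Map Int msg

-- | Drop every message whose index has been acknowledged by *all* parties.
-- 'acks' holds, per party, the highest message index that party acknowledged.
pruneSeenMessages :: [Int] -> SentMessages msg -> SentMessages msg
pruneSeenMessages acks sent
  | null acks = sent
  | otherwise =
      let lowestAck = minimum acks
       in Map.filterWithKey (\ix _ -> ix > lowestAck) sent
```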

## 2023-09-14

### SB on Network Resilience

- I noticed that the benchmarks are failing on our CI.

- That is understandable since our code is just a POC and suboptimal. We are
  still accessing list indices and keeping the messages in the `Sequence`
  data structure.

- Rewrote the Reliability module to use `Vector`s everywhere and I am noticing an
  improvement when running the benchmarks but the code is still slow.

- Don't want to profile this code but at the same time I'd like to see green
  CI.

- Perhaps I could make a simple improvement by deleting already seen messages
  and see if that helps enough to get green benchmarks.

### SB on Running Hydraw

- We currently use PG's instance of Hydraw and it has been broken for a week now, so
  we are not able to run the daily checks in tactical meetings.

- The head is still alive and we can do `NewTx` using the tui clients.

- Trying to run Hydraw on my AWS instance, the container just won't start:

OCI runtime execution failed executable "hydraw" not found in $PATH

- We use `entrypoint` in our docker-compose.

- Since I am familiar with docker (not), I of course try to brute-force my way out
of this.

- I compiled hydraw on the machine and copied the binary to the folder I added
as the new volume in the docker-compose file.

- Still it doesn't work so I dig around the internet searching for something
that could help me out.

- I switched from using `entrypoint` to `command` and voila it works!

- Testing that I can actually draw and I see pixels! Yay!

- Updating tactical document to use this new instance.

- Tempted to rewrite the hydraw frontend completely in Haskell and see if this
solves our concurrency issues.

## 2023-09-13

### SB on Network Resilience

- Back to the laboratory! What we want is to be able to say that the new code
outperforms the current master when it comes to network reliability.

- We had the idea of using the _pumba_ software to stress test our hydra-nodes
docker images but experienced problems building the docker image:

Error relocating /nix/store/0kng9q4xk57w2f399vmxxlb15881m55w-plutus-ledger-api-lib-plutus-ledger-api-x86_64-unknown-linux-musl-1.7.0.1/lib/x86_64-linux


- To resolve this we forked the project and ran the CI for building docker
  images, which takes some sweet time (> 1 hour)

- In the meantime I will go through running the demo locally one more time and
  make sure all the steps are the same between the two runs on the two different
  branches (yesterday I had spun up one of the tui clients wrongly)

- There is also the option to write a test and use some linux binary capable of
  interacting with the network to stop the connectivity on one of the nodes and
  assert it is able to catch up with the messages sent while it was down.

- Running the demo with three nodes I created three new txs and saw three
  snapshots confirmed. Sweet! Now it's time to bring Alice's node down:

sudo iptables -A INPUT -p tcp -d 127.0.0.1 --dport 5001 -j DROP
sudo iptables -A OUTPUT -p tcp -d 127.0.0.1 --dport 5001 -j DROP


- After this change I don't see anybody connected to Alice any more. Let's
submit a new tx from Bob to Alice (while Alice is down).

- This yields _Transaction submitted successfully_ in the tui client.

- Now let's do one more tx, from carol to bob.

- This succeeds and now I see two new tx submissions in the history log for Bob
and Carol:

2023-09-13 08:56:20.750442372 UTC | Transaction submitted successfully!
2023-09-13 08:55:19.772470448 UTC | Transaction submitted successfully!
2023-09-13 08:51:10.766242255 UTC | Snapshot #3 confirmed.

while Alice only sees _Snapshot #3 confirmed_.

- On current master I expect that Alice will not be able to catch up and
see these new txs, and that I could close using Snapshot #3 now.

- Let's bring Alice's connectivity up again:

sudo iptables -D INPUT -p tcp -d 127.0.0.1 --dport 5001 -j DROP
sudo iptables -D OUTPUT -p tcp -d 127.0.0.1 --dport 5001 -j DROP

- Ok, this didn't work, since I see that Alice caught up and saw the txs sent while her
node was down.

- Now let's try to submit a tx from Alice's node while it is down and see if
this breaks the networking.

- I sent out a tx from Alice (while she was down) to Bob and see in the tui log
for all parties that the tx was submitted successfully.

- Seems like I will not be able to break this networking that easily using
ouroboros-network. Let's switch the network stack to use UDP instead and try
the same. But first let's bring Alice's node up to see a new snapshot
confirmed.

- Ok, interesting. Now I see Snapshot #6 confirmed by Bob and Carol but not
by Alice. Might be a tui problem?

- Let's try to close from Alice and see which snapshot gets used.

- So the head got contested and we closed using Snapshot #6 instead.
- Trying to fanout from Alice I see this error:

2023-09-13 09:10:17.08378109 UTC | An error happened while trying to post a transaction on-chain: ScriptFailedInWallet {redeemerPtr = "RdmrPtr Spend 0", failureReason = "ValidationFailure (ValidationFailedV2 (CekError An error has occurred: User error:\nThe machine terminated because of an error, either from a built-in function or from an explicit use of 'error'.) ["H25","PT5"] PlutusDebug Omitted)"}

- Let's see what this `H25` error is about...

- So this error happens in the `checkFanout` function and it seems like the utxo
hash we are trying to close with (coming from the `Closed` datum) is
not the same as the hash of the outputs produced by the fanout transaction.

hasSameUTxOHash = traceIfFalse $(errorCode FannedOutUtxoHashNotEqualToClosedUtxoHash) $ fannedOutUtxoHash == utxoHash

fannedOutUtxoHash = hashTxOuts $ take numberOfFanoutOutputs txInfoOutputs

- There is a different error when trying to fanout from Bob or Carol, `H26`,
which is `LowerBoundBeforeContestationDeadline`.

- This happens on fanout in the head contract here:

 afterContestationDeadline =
   case ivFrom (txInfoValidRange txInfo) of
     LowerBound (Finite time) _ ->
       traceIfFalse $(errorCode LowerBoundBeforeContestationDeadline) $
         time > contestationDeadline
So the validator requires the fanout tx's lower validity bound to be after the
contestation deadline, and this error means our fanout tx's lower bound is not
after the deadline. So something is off there.

- Ok, this is nice. I could run the same demo on the proper branch and at least
make sure this fanout error is not happening there before trying the UDP network.

- Since FT is around we decided to give the docker demo option a shot using the
pumba software.

- We tried running the master version of the demo, restarting Alice's container
every 10 seconds. Submitting new txs from two different clients while
restarting Alice's node constantly leads to closing with the last known
snapshot and ignoring the new txs that Alice didn't see.

- This is not optimal so let's see what happens when we run the same using the
new code.

- The results are pretty similar and I witnessed some of the txs being ignored
by some client and closing only with the last known snapshot.

- Now I could run the nix version of the demo to make sure I don't end up with
a stuck head like this morning.

- It is very hard to test run these networking changes.

- Replaying the same steps I did this morning on the master branch, now on the
network-model branch, leads to the head actually not getting stuck. So I could
fanout without any problems and didn't run into the hash mismatch or tx lower
bound problems like before.

- Now the thing is: how do I convince myself this was not some intermittent thing
that happened accidentally? :headbangingagainstthewall:

- Still not happy.

- Let's try to write a test that disables networking on one of the nodes for some time,
try to reason about it, and also look at the logs to check the `Reliability` traces.

## 2023-09-12

### SB on Network Resilience

- Continuation of yesterday's work: I am not happy since I couldn't break the
_master_ branch with the iptables changes I did yesterday on the _network
resilience_ branch. So I didn't prove anything and we need to witness that
these network changes are actually making the network more resilient.

- Arnaud had another idea: replace the ouroboros-network stack we are using with UDP.
This should make it a bit easier to make the main branch fail while our new
code should still continue to work.

- There was some prior work/experiment related to UDP network so I'll just plug
this in and run the demo again.

- Ok, after plugging in the UDP network layer I am running the demo on code from master.

- It is a bit hard to simulate the network going down, since if I kill the `outgoing`
connection on a node then the other node sending a tx gets killed with a
permission denied error.

hydra-node: RunServerException {ioException = Network.Socket.sendBuf: permission denied (Operation not permitted), host = 127.0.0.1, port = 4003}


- Let's try to kill `incoming` connections and see the results.
  `sudo iptables -A INPUT -p udp -d 127.0.0.1 --dport 5001 -j DROP`

- I can't say the things I do are a reliable way of stress testing this thing at
  all. ~~When killing incoming connections on node 1 I somehow cause node 2
  and 3 to not see each other~~ (I had spun up the tui clients wrongly, derp). I
  was trying to submit new txs from all three clients and observed no new
  snapshots created, with occasional tx errors or successfully submitted txs,
  but definitely no new snapshots. My hope is to not experience this with the
  new code we wrote. Just sorry I can't provide a reliable set of steps since it
  is hard to get everything right.

- I can close with the latest snapshot and also fanout.

- One note from FT - there is a [pumba](https://github.com/alexei-led/pumba)
  docker stress test app that we could use. I think this is something I'd like
  to try tomorrow also.

- Now I am running the new code using the same UDP network.

- What I observe after killing the incoming connection to one node is that the
  transaction is successfully created but no snapshot is. As soon as I reconnect the
  client I see the new snapshot, so it seems that the message resending works
  and all nodes see the tx created while one was down!

- I need to re-do all these steps again and make sure to run the same steps on
  both versions of the code in order to get something deterministic.

## 2023-09-11

### SB on Network Resilience

- Had some fun with AB working on the network resilience problem.

- What we aim to do is improve robustness of the hydra networking layer so that
  we are able to detect which messages reached which party and resend them if
  necessary.

- Current situation is that all tests are green and we are trying to _break_
  this code if we can.

- For this purpose I am running the demo using nix. When the three nodes are spun up
  we use the tui to interact between them and I will add some iptables rules to
  make one of the nodes drop incoming/outgoing messages.

- I have the head opened and I can send some new txs and see the snapshots get
  confirmed.

- One note is that when I try to send a `NewTx` I am for some reason seeing only
  two recipients and not three (perhaps some iptables rules left over from when I was
  blocking the incoming/outgoing connections?)

- I am now going to drop the outgoing connections for `Alice` on port 5001:

sudo iptables -A OUTPUT -p tcp -d 127.0.0.1 --dport 5001 -j DROP


- This should prevent `Alice` from seeing the `NewTx`.

- I see in Alice's logs a new tx but no snapshot confirmation, which is good.

- The other two nodes already confirmed snapshot no 5.

- I expect when I remove the added rule that `Alice` will also confirm snapshot no 5.

- And indeed `Alice` confirms snapshot no 5! great!

- Ok, now I will try to block both incoming and outgoing connections on port
  5002 (Bob's node). After some delay when I enable the port's connectivity I
  expect Bob's node to catch up. I'll send two transactions while Bob is down.

sudo iptables -A OUTPUT -p tcp -d 127.0.0.1 --dport 5002 -j DROP
sudo iptables -A INPUT -p tcp -d 127.0.0.1 --dport 5002 -j DROP


- After I sent the first `NewTx` I see `Alice` and `Carol` received this tx while
Bob didn't see anything (as expected). The same happens for another tx I just did.

- Let's enable Bob's node connectivity and observe what happens.

- I observe new snapshots 6 and 7 confirmed! Whoo-ho!

- Now it is time to take a look at the newly added logs and make sure they make
  sense.

- The first thing I noticed is that we need extra knowledge to figure out which peer is which in
the list of acknowledged messages:

{"timestamp":"2023-09-11T15:07:05.82541267Z","threadId":85,"namespace":"HydraNode-"1"","message":{"reliability":{"localCounter":[0,1,0],"tag":"BroadcastCounter"},"tag":"Reliability"}}
{"timestamp":"2023-09-11T15:07:05.83087419Z","threadId":85,"namespace":"HydraNode-"1"","message":{"reliability":{"localCounter":[2,2,0],"tag":"BroadcastCounter"},"tag":"Reliability"}}
{"timestamp":"2023-09-11T15:07:05.831095188Z","threadId":3131,"namespace":"HydraNode-"1"","message":{"reliability":{"acknowledged":[2,1,1],"localCounter":[2,2,0],"missing":[2],"party":{"vkey":"f68e5624f885d521d2f43c3959a0de70496d5464bd3171aba8248f50d5d72b41"},"tag":"Resending"},"tag":"Reliability"}}
{"timestamp":"2023-09-11T15:07:05.831505733Z","threadId":3131,"namespace":"HydraNode-"1"","message":{"reliability":{"acknowledged":[2,1,1],"localCounter":[2,2,1],"missing":[2],"party":{"vkey":"f68e5624f885d521d2f43c3959a0de70496d5464bd3171aba8248f50d5d72b41"},"tag":"Resending"},"tag":"Reliability"}}
{"timestamp":"2023-09-11T15:07:25.088206461Z","threadId":85,"namespace":"HydraNode-"1"","message":{"reliability":{"localCounter":[2,3,2],"tag":"BroadcastCounter"},"tag":"Reliability"}}


- So when looking at this `[0,1,0]` I don't have the knowledge of the party
  order in this list which is something we should improve.

- We could probably add a friendly name to the party so that when we log there
  is no need to search for the verification key through the logs, we could have
  a party name.
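
A small sketch of what a friendlier trace payload could look like, pairing each counter with a human-readable alias for the party (the type and field names are made up for illustration, not the actual log format):

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson (ToJSON)
import Data.Text (Text)
import GHC.Generics (Generic)

-- | Hypothetical trace payload: each counter is labelled with a party alias
-- instead of relying on its position in a bare list like [0,1,0].
data LabelledCounter = LabelledCounter
  { partyAlias :: Text -- ^ e.g. "alice", "bob", "carol"
  , acknowledged :: Int
  }
  deriving (Generic, ToJSON)

-- | Pair the counters with the party aliases, assuming both lists share the same order.
labelCounters :: [Text] -> [Int] -> [LabelledCounter]
labelCounters = zipWith LabelledCounter
```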


## 2023-09-06

### FT & SB on Refactor chain state

* Idle actually records the information that we are in Idle just after Abort and Fanout.
  This is in fact the reason why we can't just say we should never be in a Maybe ChainState.

## 2023-09-04

### Network resilience discussion

* After spending some time discussing over
  https://github.com/input-output-hk/hydra/pull/1050 and
  https://hackmd.io/ir-9KyDcQZyB8cXm3vae1Q, we acknowledge that the
  topic is heavily loaded, with everyone having different ideas,
  expectations, assumptions, mental model, of what's needed and what
  should be implemented
* We tried to start writing a q-d based test to express the
  property we want, at this stage, from this feature, something along
  the lines of: all messages sent by Alice are ultimately received by Bob,
  given Bob recovers from crashes (a rough sketch of such a property is
  given after this list). Ideally those tests should be
  independent of the actual Transport layer used and could run in
  `IOSim` which would allow us to explore many more possible
  interleavings
* This formulation highlighted, or permitted the emergence of, a
  disagreement, or rather a misalignment, on what we really want:
  * This statement says nothing about Bob's actual processing of the
    messages received from Alice, so in essence it expresses a desired
    property of the system at something like OSI Layer 5 (eg. session)
  * We might want to express a (stronger?) property at the level of
    the `HeadLogic`, eg. OSI Layer 7, like _All network messages
    emitted by Alice are ultimately processed by Bob_
  * These properties might depend on assumptions about the behaviour
    of lower levels (eg. might require messages are delivered
    reliably, whatever their content may be) that can be mocked for
    testing purposes, and tested separately. Eg. they are not
    exclusive of each other
* A lot of things depend on the overall "vision" of where hydra and
  hydra-node should be going, in which environment they should be able
  to operate, and how they should be able to form a cluster...
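
A rough, self-contained sketch of the kind of property discussed above, expressed against a toy pure model rather than the real transport (all names and the model itself are illustrative, not the actual hydra-node code): Bob persists every message he processes, a crash only loses what was still in flight, and Alice resends everything past the index Bob acknowledges on reconnection.

```haskell
import Test.QuickCheck

-- | One (re)connection round: Alice resends everything Bob has not persisted yet;
-- Bob processes the pending messages until he (maybe) crashes after 'crashAfter' of them.
receiveRound :: [msg] -> [msg] -> Maybe Int -> [msg]
receiveRound sent persisted crashAfter =
  let pending = drop (length persisted) sent
      taken = maybe pending (`take` pending) crashAfter
   in persisted ++ taken

-- | All messages sent by Alice are ultimately received by Bob, given Bob
-- eventually stays up long enough to drain the pending messages.
prop_aliceMessagesUltimatelyReachBob :: [Int] -> [NonNegative Int] -> Property
prop_aliceMessagesUltimatelyReachBob sent crashes =
  let afterCrashes =
        foldl (\got (NonNegative c) -> receiveRound sent got (Just c)) [] crashes
      final = receiveRound sent afterCrashes Nothing
   in final === sent

main :: IO ()
main = quickCheck prop_aliceMessagesUltimatelyReachBob
```

The real test would ideally drive the actual network components in `IOSim` to explore interleavings; the point here is only the shape of the property.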

### Ensemble on stateless observation

- Using the `ModelSpec` as a driver, started by updating the `onRollForward`
  callback to use a new `ResolvedTx`
- After fixing compilation issues and implementing an `undefined`, we realized
  we cannot just directly use the `utxo` of the mockchain to `resolveTx`
- Surprised that `resolveTx` fails after `applyTransactions`. Reason: we were
  evaluating transactions in the wrong order and a `reverse` was missing.
- Model fails to resolve transactions still. After some digging, it is the
  competing `collectComTx`, where each node tries to collect. Our `resolveTx`
  is now acting like a ledger and we have "unknown TxIn" for the "losing"
  collect transactions after the "winning" tx spent the head output.

# August 2023

## 2023-08-31

### Tactical

- SN:
  - Had a chat with NMKR about book.io
  - The protocol logic approach on ensuring liveness reminds me of: https://github.com/input-output-hk/hydra/issues/612#issuecomment-1415197707
- SB:
  - Reminder: Cannot find the link for the security advisory on changing the contestation period Open -> Close transition
- PG:
  - Restarting hydraw did not help
  - See last exploration in the logbook
- FT:
  - What should we do with the state after a head gets closed?
   + create a backup and start fresh
   + delete it and start fresh
   + nothing: continue appending on top of the history
       If so, then don’t we have to have the headId on every event?
       or is it enough to group head state events based on reference points, like init and final states?

### SB on Support inline datums in the commit endpoint

- We have a new user issue which mentions they would like to be able to use a
  script with inline datums.

- Currently when we construct the datum in commit tx we use `TxOutDatumInTx`
  pattern to construct `TxOutDatum`.

- I thought of supporting directly `TxOutDatum` but I don't see json or cbor
  instances for this type.

- So let's start with writing a new end-to-end scenario where we will try to
  hit the commit endpoint using a script utxo with inline datum to see a red
  test.

- Seems like writing a test for this would not work so easily since we
  first need to change our api, which is restrictive in this sense
  (we basically need to construct the `ScriptInfo` type).

- Maybe we can alter our api so that if you provide the `datum` field of
  `ScriptInfo` we set `ScriptDatumForTxIn` with that datum, and if it is
  not present then we assume the datum is inlined and build the witness
  using `InlineScriptDatum`?

- In this case we would also check the sent `TxOut` and match on its
  `TxOutDatum`: if we get a `TxOutDatumHash` then the `datum` in the
  `ScriptInfo` needs to be populated, and if we get a `TxOutDatumInline` we proceed
  to create the witness using `InlineScriptDatum` (see the sketch after this list).

- Not sure if we need to tackle right now `NoScriptDatumForMint`?

- I made the initial changes and now let's run the test to see what failure we get.

- We get the `MissingDatum` error from our api, sweet, we got ourselves a failing test.

- I checked also that the tests that use `TxOutDatumHash` also work.

- Now it is time to rewrite the failing test to use the inline datum when
  constructing the request and see it green.

- Moving the datum construction outside of `createOutputAtAddress` and
  providing the wanted datum type before calling this function yields a green
  test.

- I am kinda surprised since I was expecting to see errors coming from the
  commit tx further down in `Tx`, but it might be that we just reuse the witness we
  constructed. I am starting the investigation.

- Ok, it seems we are just re-using the provided witnesses in the commit utxo
  so no changes are needed in how we handle the datums inside of `commitTx`.
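
A simplified sketch of the branching described in this entry, using stand-in types rather than the real cardano-api/hydra-node ones (the constructor names below only mirror the ideas behind `ScriptDatumForTxIn` and `InlineScriptDatum`):

```haskell
-- Stand-in types, only to illustrate the decision logic.
newtype DatumPayload = DatumPayload String

data ProvidedDatum = DatumProvided DatumPayload | NoDatumProvided
data OutputDatumKind = OutputHasDatumHash | OutputHasInlineDatum

data WitnessDatum
  = WitnessCarriesDatum DatumPayload -- ^ corresponds to building a ScriptDatumForTxIn witness
  | WitnessUsesInlineDatum           -- ^ corresponds to building an InlineScriptDatum witness

-- | Decide how to witness the script input depending on how the output stores its
-- datum and whether the API request supplied one.
chooseWitnessDatum :: OutputDatumKind -> ProvidedDatum -> Either String WitnessDatum
chooseWitnessDatum OutputHasDatumHash (DatumProvided d) = Right (WitnessCarriesDatum d)
chooseWitnessDatum OutputHasDatumHash NoDatumProvided = Left "MissingDatum"
chooseWitnessDatum OutputHasInlineDatum _ = Right WitnessUsesInlineDatum
```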

## 2023-08-30

### Tactical
- DF:
  + Is there an open issue for incremental commits? Yes, it is in our roadmap but now moved to 0.13 release.
- PG:
  + How to move forward with network resilience?

### FT/PG on recovering from connection errors with rollbacks

A strategy to recover from a connection error could be for all the peers to rollback to the last agreed snapshot and forget about everything that happened since then. Long story short: this can't work. But let's share the exploration here.

A peer has two types of TCP connections: outgoing connections to every peer and incoming connections from every peer.

When a peer receives a new incoming connection it will always roll back, setting its ledger to the state of the last signed snapshot.

When a peer (re-)opens an outgoing connection, it will roll back and re-open all its other outgoing connections. This will make the other peers roll back thanks to the previous rule.

With this strategy:

* we do not need specific messages or state management
* We ensure cohesive view of the world between peers
* Pending transactions need to be posted again by the clients

Here is an example with Bob crashing:

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over A: leader

    B ->> A: ReqTx T1
    B ->> C: ReqTx T1

    note over B: crash

    A ->> C : ReqSn 1 [T1]
    A -->> B : ReqSn 1 [T1]
    note over A: connection from Bob down
    note over C: connection from Bob down

    note over B: restart
    note over B: Bob rolls back to ledger snapshot 0

    note over A: connection from Bob up
    note over A: Alice rolls back to ledger snapshot 0

    note over C: connection from Bob up
    note over C: Carol rolls back to ledger snapshot 0
```

With Bob crashing, we cannot be sure which messages were handled by Bob and which were not. Since Bob missed the reqSn for snapshot 1, the head could well be stuck with Bob waiting forever to receive this reqSn. But with this solution, all the nodes are back to the state they all agreed upon with snapshot 0 and can restart safely from there.

#### Split brain

What happens in case of split brain? Here, the connection from Alice to Bob is dead for some reason but all the other connections and the peers are all fine.

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over A: leader

    B ->> A: ReqTx T1
    B ->> C: ReqTx T1

    A ->> C : ReqSn 1 [T1]

    A -->> B : ReqSn 1 [T1]
    note over A: connection to Bob down
    note over A: Alice rolls back to ledger snapshot 0

    note over A: connection to Bob up
    note over B: connection from Alice up
    note over B: Bob rolls back to ledger snapshot 0


    note over A: connection to Carol re-initialized
    note over C: connection from Alice up
    note over C: Carol rolls back to ledger snapshot 0

    B ->> A: ReqTx T2
    B ->> C: ReqTx T2
```

All looks fine and well, but focus on the last ReqTx messages from Bob: should this message reach Carol just before Carol decides to roll back, the head would be stuck again. We did not see any easy way to ensure this never happens.

#### AckSn not received

What if we did not receive the AckSn of a snapshot signed by everybody?

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over A: leader

    B ->> A: ReqTx T1
    B ->> C: ReqTx T1

    A ->> C : ReqSn 1 [T1]
    A ->> B : ReqSn 1 [T1]
    B ->> A : AckSn 1
    B ->> C : AckSn 1
    A ->> B : AckSn 1
    A ->> C : AckSn 1
    C ->> A : AckSn 1
    note over B: crashes
    C -->> B : AckSn 1

    note over B: nobody will agree to revert to snapshot 0 now that they've seen snapshot 1
```

When Bob comes back online, he needs to see the AckSn from Carol for this message, or he, and so the head, will be stuck forever. But we only want to rely on rolling back to the previously agreed snapshot, and Carol and Alice have seen the AckSn from all the peers, so for these nodes the snapshot is valid and they should never agree to cancel it and go back to the previous one, as that could be an attack vector.

The only way out of this situation is to either close the head or replay the AckSn 1 message from Carol to Bob. Since replaying messages solves our problems without relying on a rollback strategy, let's forget about rollbacks.

### FT/PG on recovering from connection errors by replaying messages leveraging AckSn

Each node has a notion of the last signed snapshot. Knowing that snapshot n has been signed by everyone is enough to know that all peers have received every message up until the reqSn for snapshot n. We can leverage this information to only replay the messages we sent after this reqSn.

Doing that, we should consider that it's ok to send a message twice to a peer, that is, the messages are idempotent. This seems to be the case with the current implementation and, if not, the changes seem quite lightweight.

So when Alice connects to Bob she will push to Bob all the messages she sent since the last signed snapshot she knows of (including AckSn for snapshot n-1).

Note that it also means that the node should send the messages to itself too on startup.
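
A small sketch of the replay rule described above, assuming each node keeps the messages it sent, tagged with the snapshot that was in flight at the time of sending (types and names are illustrative only):

```haskell
import Numeric.Natural (Natural)

type SnapshotNumber = Natural

-- | On (re)connection to a peer, resend every message we sent while working on
-- the last snapshot we know to be signed by everyone, or on any later snapshot.
messagesToReplay ::
  SnapshotNumber ->          -- ^ Last snapshot number signed by everyone, as known locally
  [(SnapshotNumber, msg)] -> -- ^ Messages we sent, tagged with the snapshot in flight
  [msg]
messagesToReplay lastSigned sentLog =
  [msg | (sn, msg) <- sentLog, sn >= lastSigned]
```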

With this approach:

* we leverage the existing protocol information
* we don't have to store any state of any peer as we only rely on our own view of the head state
* on startup, we could reconstruct the messages to send to our peers (and ourselves) just by looking at the persisted event store

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over A: leader

    B ->> A: ReqTx T1
    B ->> C: ReqTx T1
    B ->> B: ReqTx T1

    note over B: crash

    A ->> C : ReqSn 1 [T1]
    A -->> B : ReqSn 1 [T1]
    note over A: connection to Bob down
    note over C: connection to Bob down

    note over B: restart

    note over A: connection to Bob up
    note over C: connection to Bob up

    B ->> B: ReqTx T1
    A ->> B : ReqSn 1 [T1]
```

In this case, Bob sent the message ReqTx T1 so it needs to send it to itself again on startup (when it reconnects to itself) to ensure it handles this transaction nicely. That would involve some persistence here, but note that this list of messages could well be computed from the event stream used to restore the node on startup.

#### Split brain

Let's see what happens when the connection from Alice to Bob is interrupted and then restored:

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over C: leader

    A ->> B: reqTx T20
    A ->> C: reqTx T20

    C ->> A: reqSn 9 [T20]
    C ->> B: reqSn 9 [T20]

    C ->> B: ackSn 9
    C ->> A: ackSn 9
    A ->> B: ackSn 9
    A ->> C: ackSn 9
    B ->> A: ackSn 9
    B ->> C: ackSn 9

    note over A: leader

    C ->> A: ReqTx T21
    C ->> B: ReqTx T21

    A ->> C : ReqSn 10 [T21]

    A -->> B : ReqSn 10 [T21]
    note over A: connection to Bob down

    note over A: connection to Bob up
    A ->> B: ackSn 9
    A ->> C: ReqSn 10 [T21]
    A -->> B: ReqSn 10 [T21]
```

Here, all looks fine.

#### AckSn not received

What happens if, for some reason, some ackSn message is not received?

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over A: leader

    B ->> A: reqTx T1
    B ->> C: reqTx T1

    A ->> B: reqSn 10 [T1]
    A ->> C: reqSn 10 [T1]

    A ->> B: ackSn 10
    A ->> C: ackSn 10
    B ->> A: ackSn 10
    B ->> C: ackSn 10
    C ->> A: ackSn 10

    note over B: crashes

    C -->> B: ackSn 10

    note over B: restart
    note over B: connection to Carol is up
    note over B: connection to Alice is up
    note over A: connection to Bob is up
    note over C: connection to Bob is up

    A ->> B: ackSn 10
    C ->> B: ackSn 10

    B ->> A: reqTx T1
    B ->> C: reqTx T1
    B ->> A: ackSn 10
    B ->> C: ackSn 10

```

Here, note that Bob also sends his messages concerning snapshot 10, although Carol and Alice don't care about them.

#### reqTx for the next snapshot

Let's consider the case where a node would send a reqTx just before receiving a reqSn, meaning the transaction would be included in the next snapshot, not this one. Here we ignore messages sent by Bob on restart for clarity:

```mermaid
sequenceDiagram
    actor A as Alice
    actor B as Bob
    actor C as Carol

    note over A: leader

    A ->> B: reqTx T1
    A ->> C: reqTx T1

    A ->> B: reqSn 10 [T1]
    A ->> C: reqSn 10 [T1]

    A ->> B: ackSn 10
    A ->> C: ackSn 10
    B ->> A: ackSn 10
    B ->> C: ackSn 10

    note over B: crashes

    C ->> A: reqTx T2
    C -->> B: reqTx T2

    C ->> A: ackSn 10
    C -->> B: ackSn 10

    note over B: restart
    note over B: connection to Carol is up
    note over B: connection to Alice is up
    note over A: connection to Bob is up
    note over C: connection to Bob is up

    A ->> B: ackSn 10

    C ->> B: reqTx T2
    C ->> B: ackSn 10
```

Since Alice knows that snapshot 10 has been signed, she knows that Bob has seen all her messages, at least, up until AckSn 10.

But note how Carol has to also send reqTx T2 to Bob although it's been sent before her AckSn. This is needed because T2 is not included in snapshot 10 but should be in a later one so Carol has no way of knowing that Bob received this message.

Saying that all peers have received all your messages up until your AckSn of the last signed snapshot is an approximation. What is important here is that you send your peers all the messages you would have sent to them to construct the next snapshot.

## 2023-08-29

### Notes about RareEvo

- Discussion w/ book.io on supporting their use case on Cardano w/ layer 2
  - They want to represent ownership of every single book as an NFT, which implies minting and tracking a lot of NFTs (eg. 10s of millions potentially)
  - They come from the publishing industry and partner with publishers, printers, book distributors
  - We discussed various ideas, generally converging towards having the NFTs in the L2 and posting proofs on the L1
  - Biggest question would be: Why do that on a blockchain or a L2? Who would be running nodes? Is there an interest in including other parties in the process/network?
  - We agreed a deeper dive into the envisioned business process would be useful before tackling the tech challenge
- Got into a similar discussion with someone already met in Lausanne who is in the wine business, and a musician (http://cullah.com) who wants to handle rights and payments distribution for music
- Workshop feedback:
  - We did not really gather formal feedback at the end of the workshop as we spent a significant amount of time on Hydra auctions
  - There were a lot of attendees at the beginning and we even had Charles joining, but as soon as we went into the tech/hands-on part of the workshop, we only kept about 10-12 people
  - 2 pairs managed to open a head, one locally and the other one on remote VMs
  - We did not provide pre-built binaries for x86-darwin, which 2 participants actually had!
  - Overall, the setup part took the most time and was the most complicated, as expected, but most people at least managed to have a cardano-node up and running and primed with a Mithril snapshot!
  - We probably want to improve how we distribute our software and make it easier to install on a variety of platforms. The Wireguard install is a good reference point, even though we probably won't be supporting so many platforms, but the major ones should be within our reach
  - Being able to spin up a node as a service would greatly simplify setup in such workshops => demeter.run or provide a simpler homemade solution
  - Having a head explorer would have been quite useful to share what's going on
- There would be an interest in having a dedicated event for developers which would host workshops and talks from actual devs and tech people
  - At least 2 people mentioned that to me
- It would have been great to demo HydraPay but unfortunately no one from Obsidian was around

## 2023-08-28

### Tactical

- AB:
  - Security issues disclosure/release strategy
    - Following the node strategy?
    - GHSA is good
    - Restricting ourselves to talk about it and hiding fixes - is it really worth it?
    - We are not at 1.0.0 yet
    - Write down a first, simple strategy in SECURITY.md and discuss later
- FT:
  - Are we going to groom about network resilience?
    - Yes!
- SN:
  - Monthly report
    - What's left to do?
    - Noticed that we are re-summarizing features -> feels redundant
    - Maybe only focus on the engineering details (show off / brag) again
  - RareEvo insights
    - Big expectations & excitement
    - Workshop: good attendance
    - Interesting / challenging use cases

### FT, PG and SB on: Network Resilience #188

#### Proposal

To tackle this issue, we propose to adapt the current architecture to a pull-based approach:

The current network code architecture relies on the broadcast and callback handles to, respectively, send and receive messages to and from peers.

The proposed pull-based architecture will expose the following interface:

- `publishMessage` just stores the message locally and makes it available for the peers to pull. Of course, we still need to authenticate the peers to prevent external parties from pulling messages from us.
- `onMessageCallback` is activated by some thread that is in charge of pulling messages from peers. Again, we keep the authentication layer to ensure the messages we pull are still legit.

To ensure robustness on the writing side, published messages are persisted on disk before being enqueued in memory, so if the node restarts it can recover the state of the queues. Optionally, we can define a retention period in order to garbage collect old messages that have already been pulled by all peers.

Also, to ensure robustness on the reading side, the index (sequence number) of the messages pulled by each party is stored on disk so that, upon a peer restart and a request to pull messages, the server knows from where to start sending. The server thus maintains the state of its clients, i.e. it knows which messages have been sent.
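
A rough sketch of the proposed interface, with illustrative types (not a concrete design):

```haskell
newtype PeerId = PeerId String
newtype MessageIndex = MessageIndex Int

-- | Illustrative shape of the pull-based network handle: messages are persisted
-- before being made available, and each authenticated peer pulls from the index
-- it has already consumed.
data PullBasedNetwork m msg = PullBasedNetwork
  { publishMessage :: msg -> m ()
    -- ^ Persist the message locally and make it available for peers to pull.
  , pullMessages :: PeerId -> MessageIndex -> m [msg]
    -- ^ Serve an authenticated peer every message published after the given index.
  }
```

On the receiving side, a puller thread per peer would then repeatedly call that peer's `pullMessages` and feed the results into the existing `onMessageCallback`.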

The ouroboros-framework can be used to implement one or more mini protocols (spec). The cardano-node has a bunch of them, and they are all pull-based; the ChainSync protocol in particular is very close to the above-proposed pull-based architecture.

#### Notes

- Nothing against the pull-based protocol though! Maybe realizing it through protobufs, grpc, http/websockets,...? something more open?
- the pulling process only makes sense after the head is open
- while the --start-chain-from option only makes sense before the head is open
- special care needs to be taken for those transactions with a TTL

#### Concerns

- do we need synchronization between chain slots and messages? (i.e.: to support timed transactions)
- can we throw away the Heartbeat and/or the Authentication middleware using ouroboros-network-protocols?

### SN on incremental commits

- Read up on (compact) sparse merkle trees (C-SMT)

- Any of the SMT variants can prove inclusion/exclusion of values (or keys)

- Assuming an SMT holding the UTxO, with out-refs as keys or $H(i, o)$ as key

- Increment:
  0. Head is open, $U_0$ locked, off-chain busy transacting
  1. incrementTx
     - evolves on-chain head state $\eta_0$ with an input spending UTxO $U_\alpha$, and
     - given some inclusion proofs for each entry in $U_\alpha$,
     - leading to head state $\eta_1$
  2. Head participants observe the tx with added $U_\alpha$
  3. Snapshot leader requests inclusion: $ReqSn(sn, txids, U_\alpha)$
  4. All participants acknowledge the snapshot (some $\bar{U}$ incl. $U_\alpha$) by checking that they also observed the increment $U_\alpha$

- Decrement:
  0. Head is open, $\eta_0$ locked, off-chain busy transacting
  1. Snapshot leader requests exclusion of some UTxO $U_\omega$: $ReqSn(sn, txids, U_\alpha, U_\omega)$
  2. All participants acknowledge $\bar{U}$ excl. $U_\omega$
  3. decrementTx
     - evolves head state $\eta_0$,
     - given an off-chain certificate $\xi_1$ + exclusion proofs for each entry in $U_\omega$ for the new (SMT root) $\eta_1$,
     - leading to head state $\eta_1$ + outputs equivalent to $U_\omega$

## 2023-08-23

### SN on hydra-pay

- Locking funds into hydra-pay requires more than 30 ADA at the address.
- Both wallets need to fund and prepare a UTxO before a head is opened

### SB on P2P Network

- One possible improvement over our fire-and-forget strategy for the Ouroboros network stack is implementing a P2P network between Hydra parties.

- Looking into what exists regarding Haskell p2p implementations just to get a feel for it.

- Ended up copying over a p2p implementation using the pipes library to test it out.

- After some exploration I decided to just implement a network protocol over tcp as a solution idea.

- Things should not be very complex imo, so hydra-node should run a tcp server on a specified port as well as clients for each of the peers. We use the server to send the messages to our peers and clients to receive incoming messages (a rough sketch of this shape follows this list).

- Network.Simple.TCP exposes the functions needed to do this and what I want to do is plug it in and see what comes out of this experiment.

- What we want in the end is to have at least some assurance the messages reach the peers, and tcp should provide this better than the fire-and-forget flavor of Ouroboros.

- One thing I would also like to explore is using the ouroboros p2p network between the nodes, but I am not sure at this point how involved that is and wouldn't want to spend too much time on it right now.

- Ended up just using functions from Network.Simple.TCP to do this experiment.

- Had some problems tweaking the async calls to spin up the server and clients, but it worked nicely at least for the hydra-node tests I added.

- These tests are basically the same as the ones in NetworkSpec except I am now calling the new withP2PNetwork function instead.

- hydra-cluster tests now fail consistently (an example test is "two heads on the same network do not conflict") with a surprising message from cardano-node:

DiffusionError thread killed

- So I suspect there is a conflict between this tcp network I am now using and the cardano-node.

- I tried searching the cardano-node codebase to try and figure out what is happening, but without any luck.

- I decided to run the smoke-tests on preview to see if the actual Head state machine goes through all possible states before fanning out, and what I observed is that indeed all works, but then it seems that the hydra-node is re-spawned again! After this there are no errors in the logs, but this is a good lead.
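
A rough sketch of the shape of that experiment using Network.Simple.TCP, wired as described above: our server pushes our messages to whoever connects, while one client connection per peer receives their messages. Everything here is illustrative, not the actual withP2PNetwork code.

```haskell
import Control.Concurrent.Async (forConcurrently_, withAsync)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TChan (dupTChan, newBroadcastTChanIO, readTChan, writeTChan)
import Data.ByteString (ByteString)
import Network.Simple.TCP (HostPreference (HostAny), connect, recv, send, serve)

withP2PNetworkSketch ::
  String ->                           -- ^ Port our own server listens on
  [(String, String)] ->               -- ^ Peers as (host, port)
  (ByteString -> IO ()) ->            -- ^ Callback for messages received from peers
  ((ByteString -> IO ()) -> IO ()) -> -- ^ Action run with a broadcast function
  IO ()
withP2PNetworkSketch port peers callback action = do
  outgoing <- newBroadcastTChanIO
  let broadcast = atomically . writeTChan outgoing
  withAsync (server outgoing) $ \_ ->
    withAsync receivers $ \_ ->
      action broadcast
 where
  -- Every peer connecting to our server gets its own copy of our outgoing stream.
  server outgoing = serve HostAny port $ \(sock, _) -> do
    chan <- atomically (dupTChan outgoing)
    let loop = atomically (readTChan chan) >>= send sock >> loop
    loop
  -- One client connection per peer, receiving their messages and feeding the callback.
  receivers = forConcurrently_ peers $ \(host, peerPort) ->
    connect host peerPort $ \(sock, _) ->
      let loop = recv sock 4096 >>= maybe (pure ()) (\bytes -> callback bytes >> loop)
       in loop
```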

## 2023-08-22

### PG on broken Hydraw

Setting up a local hydraw instance so I can see more traces. The hydraw will connect to my remote hydra-node through ssh tunneling:

# Open ssh tunnel
#> ssh -L 4001:localhost:4001 hydra
# spawn hydraw locally
#> export HYDRA_API_HOST=localhost:4001
#> export HYDRAW_CARDANO_SIGNING_KEY=external_wallet.sk
#> cabal run hydraw
# Let’s try to break it:
#> for i in 1 2 3 4 5 6 7 8 9 ; do (curl localhost:1337/paint/3/$i/82/216/42 &); done
➜  hydra git:(master) ✗ OKOKSomething went wrongOKSomething went wrongSomething went wrongSomething went wrong

We made 9 http requests, some are OK and some are not. Looking at the logs we see:

Unexpected server answer:  {"headId":"6b8d1fc1693c91e0a96f64bf0bc5233bdb5342920fdcd6205c77d731","seq":687,"tag":"TxValid","timestamp":"2023-08-22T14:47:46.352235292Z","transaction":{"auxiliaryData":"d90103a100a10e850308185218d8182a","body":{"auxiliaryDataHash":"53c2f64365c74ba4773e34fcec26ec6004e55e39ef5da5427c06a8c9f6ef8a1f","fees":0,"inputs":["594310b83ca98e7f26e4a6cba4f8c7de6c1b236b3486c8340aaa198925892f8e#0"],"outputs":[{"address":"addr_test1vq82h3tch52unc35c5q75hjsxu2a3qnhgyktc79ttgu7juswjr3hu","datum":null,"datumhash":null,"inlineDatum":null,"referenceScript":null,"value":{"lovelace":9693287010}}]},"id":"65ee9f5fdbad91977ce451a6387fc1c80d13295176ba4a0230801b055fd27af6","isValid":true,"witnesses":{"keys":["825820d9733ae9b9183cfc393960c7ea2d13368fa0c7a50169694e3c0a4487cbb67e42584072f8e33ba7d270562dffd537bf5985615ed9ca0d97ba3af2c24a52e1b672d4791ea92c073109e1000954986262400ce952fd3270f4aa8e405db1ac1d7b0d2201"]}}}
CallStack (from HasCallStack):
  error, called at src/Relude/Debug.hs:289:11 in rld-1.2.0.0-fb84d480:Relude.Debug
  error, called at src/Hydra/Painter.hs:43:21 in hydraw-0.0.1-inplace:Hydra.Painter
(3,3) -> (82,216,42)

Although it looks like it’s reproducing the issue, I think it’s another, maybe related, problem; in any case it’s a transient problem that only appears while trying to paint one pixel, and then the next pixel works just fine. The hydraw software is not really built for concurrent requests: what we see is a hydraw thread receiving, from the hydra-node, notifications for another thread’s posted transaction which it does not expect, and so it just fails.

On the other hand, it appears that I am not able to break my remote hydraw anymore. Not sure what happened. Hydraw being just a demo application, I’ll let it be as is for now.

### Ensemble on broken Hydraw

When trying to paint a pixel on Hydraw we observe an HTTP/500 error. Restarting the Hydraw container does not fix the issue.

In the log of Hydraw we can see ParseException "not enough bytes" but not sure if it's related

The next morning, it just happens to work although nothing has changed except time passing.

Then, using Hydraw a bit, it breaks again and we see the following errors in logs:

hydra_setup2-hydraw-1  | 2023-08-22T07:31:30.538716000Z CloseRequest 1002 "Protocol Error"
hydra_setup2-hydraw-1  | 2023-08-22T07:31:30.570880000Z ParseException "Unknown opcode: 6"

Later, we can see:

hydra_setup2-hydraw-1  | 2023-08-22T07:32:22.861741000Z ConnectionClosed
hydra_setup2-hydraw-1  | 2023-08-22T07:32:32.597767000Z ParseException "Unknown opcode: 6"
hydra_setup2-hydraw-1  | 2023-08-22T07:32:41.108658000Z ParseException "Unknown opcode: 6"
hydra_setup2-hydraw-1  | 2023-08-22T07:34:58.020600000Z ParseException "Unknown opcode: 6"
hydra_setup2-hydraw-1  | 2023-08-22T07:35:28.232343000Z ParseException "Unknown opcode: 6"
hydra_setup2-hydraw-1  | 2023-08-22T07:38:34.414077000Z ParseException "Control Frames must not be fragmented!"

Restarting both hydra-node and hydraw fixes the problem this time but later, restarting both of them did not fix the problem although the application was, again, working later in the day.

Clicking fast in hydraw seems to reproduce the issue.

## 2023-08-21

### SN on stateless observation

- Idea: onRollForward not only provides the Tx, but a UTxO which contains its inputs.
- Instead of changing all interfaces, let’s work from the inside out by using getKnownUTxO chainState and see what UTxO we are missing in the end.
- A UTxO alone will not be enough, we need a “spendable” UTxO
- When moving away from the state-ful types to a more generic SpendableUTxO, the tx construction can fail now. For example: fanout will need to take a HeadId and the SpendableUTxO, which does not necessarily include the right head output.
- It's a bit annoying that the HeadId is not enough. We also need the seedTxIn (at least on fanout) and this one is not (yet) represented on the head logic layer.

### SB on Chain state in head state not updated on replayed observation

- We started to work on this item, which came out as a consequence of running the Head ourselves.

- When you restart your hydra-node and use the --start-chain-from flag, ideally your starting state would actually be the one you are expecting at that point.

- What we observed happening was that you would end up in Idle state but also have some existing previous state.

- It is a bit hard to write a test for this.

## 2023-08-18

### SB on Hydra-tools gen-hydra-key should be available in hydra-node

  • After doing a 0.12.0 release I wanted to pick something fun and easy to work on this Friday.

  • Moving one command to hydra-node is not a hard task so let's do it!

  • Write a failing test in hydra-node that mentions this new command line argument that we will port.

  • Move the command implementation to hydra-node (a rough sketch of the option parser shape follows this list)

  • Remove the hydra-tools executable

  • This one was short and easy to do; I wonder if I should have left it as a good first issue for somebody who would like to start working on Hydra...
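
Since the change boils down to exposing the former hydra-tools command as a mode of hydra-node, here is a rough, self-contained sketch of the parser shape using optparse-applicative. Names and defaults are illustrative only and not the actual hydra-node Options module:

-- Illustrative only: a hydra-node-like parser growing a gen-hydra-key mode.
import Options.Applicative

data Command
  = Run FilePath
  -- ^ Normal node operation (stand-in for the real, much larger option set).
  | GenHydraKey FilePath
  -- ^ Generate a Hydra signing/verification key pair at the given basename.
  deriving (Show)

commandParser :: Parser Command
commandParser = genHydraKeyParser <|> runParser
 where
  genHydraKeyParser =
    subparser $
      command "gen-hydra-key" $
        info
          (GenHydraKey <$> strOption (long "output-file" <> metavar "FILE" <> value "hydra-key" <> help "Basename of the generated key files"))
          (progDesc "Generate a pair of Hydra keys")
  runParser =
    Run <$> strOption (long "persistence-dir" <> metavar "DIR" <> value "./" <> help "Directory used for persistence")

main :: IO ()
main = execParser (info (commandParser <**> helper) fullDesc) >>= print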

2023-08-17

Fanout on preprod

When trying to fanout on preprod we see some occurrences where the fanout transaction is sent too soon to the cardano-node. There is probably room for some improvement in the TUI (see the sketch after the logs below):

hydra_setup2-cardano-node-1  | [5811da14:cardano.node.ChainDB:Notice:44] [2023-08-17 13:31:47.21 UTC] Chain extended, new tip: ea7826a19e42af5f10ee3a86f8174da6c0219408feef78b0965b04d11853827d at slot 36595907
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.Mempool:Info:17183] [2023-08-17 13:32:24.91 UTC] fromList [("err",Object (fromList [("kind",String "ExpiredUTxO"),("slot",Number 3.6595908e7),("validityInterval",Object (fromList [("invalidBefore",Number 3.6595931e7)]))])),("kind",String "TraceMempoolRejectedTx"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("tx",Object (fromList [("txid",String "99cab2df")]))]
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.Mempool:Info:17183] [2023-08-17 13:32:29.16 UTC] fromList [("err",Object (fromList [("kind",String "ExpiredUTxO"),("slot",Number 3.6595908e7),("validityInterval",Object (fromList [("invalidBefore",Number 3.6595931e7)]))])),("kind",String "TraceMempoolRejectedTx"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("tx",Object (fromList [("txid",String "99cab2df")]))]
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.Mempool:Info:17183] [2023-08-17 13:32:37.37 UTC] fromList [("err",Object (fromList [("kind",String "ExpiredUTxO"),("slot",Number 3.6595908e7),("validityInterval",Object (fromList [("invalidBefore",Number 3.6595931e7)]))])),("kind",String "TraceMempoolRejectedTx"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("tx",Object (fromList [("txid",String "99cab2df")]))]
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.Mempool:Info:17183] [2023-08-17 13:32:43.22 UTC] fromList [("err",Object (fromList [("kind",String "ExpiredUTxO"),("slot",Number 3.6595908e7),("validityInterval",Object (fromList [("invalidBefore",Number 3.6595931e7)]))])),("kind",String "TraceMempoolRejectedTx"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("tx",Object (fromList [("txid",String "99cab2df")]))]
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.ChainDB:Notice:44] [2023-08-17 13:32:52.18 UTC] Chain extended, new tip: f60c920e41fcfbb3b4bf26399288020771e9cdb8f055c60076bf2ef15495e051 at slot 36595972
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.Mempool:Info:17183] [2023-08-17 13:33:06.79 UTC] fromList [("kind",String "TraceMempoolAddedTx"),("mempoolSize",Object (fromList [("bytes",Number 5092.0),("numTxs",Number 1.0)])),("tx",Object (fromList [("txid",String "99cab2df")]))]
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.ChainDB:Notice:44] [2023-08-17 13:33:27.22 UTC] Chain extended, new tip: 807b72f35695e129939fd72fbbb2450d8baf7be3e494568c7e4b225a2303b174 at slot 36596007
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.Mempool:Info:52] [2023-08-17 13:33:27.23 UTC] fromList [("kind",String "TraceMempoolRemoveTxs"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("txs",Array [Object (fromList [("txid",String "99cab2df")])])]
hydra_setup2-cardano-node-1  | [5811da14:cardano.node.ChainDB:Notice:44] [2023-08-17 13:33:59.05 UTC] Chain extended, new tip: 88d2b50548b51f174ad456f22b5c80584abc2ed62c94a6220184a09a7a506927 at slot 36596039
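
The ExpiredUTxO errors above show the fanout being rejected while the current slot (36595908) is still before the transaction’s lower validity bound (invalidBefore 36595931). A minimal sketch of the client-side improvement hinted at above, with hypothetical names (not existing hydra-tui code), would be to wait until the lower bound has passed before submitting:

-- Hypothetical helper: poll the node's current slot until the fanout's lower
-- validity bound has been reached, then return so the caller can submit.
import Control.Concurrent (threadDelay)
import Control.Monad (unless)

type SlotNo = Integer

waitForLowerBound ::
  IO SlotNo -> -- ^ Query the current slot from the cardano-node.
  SlotNo ->    -- ^ Lower validity bound (invalidBefore) of the fanout tx.
  IO ()
waitForLowerBound queryCurrentSlot invalidBefore = go
 where
  go = do
    currentSlot <- queryCurrentSlot
    unless (currentSlot >= invalidBefore) $ do
      threadDelay 1000000 -- wait one second and re-check
      go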

Close a head on preprod

Trying to close a head we observe some silent errors: we post a close transaction which is never included in any block, yet we do not observe any error anywhere, neither in the hydra-node logs nor in the cardano-node logs. After several retries, it works.

For instance these close attempts did not succeed:

docker compose logs hydra-node | grep PostedTx
hydra_setup2-hydra-node-1  | {"timestamp":"2023-08-16T15:47:46.323264872Z","threadId":74,"namespace":"HydraNode-\"12043\"","message":{"directChain":{"tag":"PostedTx","txId":"9e89c7668edf06e2b79db6ccb5a9b42d7a593e0bc38126fa4ad3ba9341b855d2"},"tag":"DirectChain"}}
hydra_setup2-hydra-node-1  | {"timestamp":"2023-08-17T06:59:14.420182437Z","threadId":74,"namespace":"HydraNode-\"12043\"","message":{"directChain":{"tag":"PostedTx","txId":"b3851c92146bccbdcf65bc8178915c1955aeb1f133d7a80d5e0e0ee5e411585c"},"tag":"DirectChain"}}
hydra_setup2-hydra-node-1  | {"timestamp":"2023-08-17T07:26:04.842436886Z","threadId":74,"namespace":"HydraNode-\"12043\"","message":{"directChain":{"tag":"PostedTx","txId":"e177621205d169738a8a4b7416fbb6770335041e566839c6cc834646da09ed8b"},"tag":"DirectChain"}}

For instance, transaction e177621205d169738a8a4b7416fbb6770335041e566839c6cc834646da09ed8b is never included in any block. Looking into the cardano-node logs for it here is what we observe:

hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:3388] [2023-08-17 07:26:04.84 UTC] fromList [("kind",String "TraceMempoolAddedTx"),("mempoolSize",Object (fromList [("bytes",Number 1138.0),("numTxs",Number 1.0)])),("tx",Object (fromList [("txid",String "e1776212")]))]
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:27:24.40 UTC] Chain extended, new tip: df7490e22406d718f2f72b1f00dd79e50dccb19445d65fe750cc4eba32a9ef19 at slot 36574044
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:55] [2023-08-17 07:27:24.40 UTC] fromList [("kind",String "TraceMempoolRemoveTxs"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("txs",Array [Object (fromList [("txid",String "e1776212")])])]

We observe the same logs when we look for transaction eccddc46fe8ce5c3bf0dba895df8722ce492828e8214d71b82519a2479f049ad which did succeed:

hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:3388] [2023-08-17 07:37:36.02 UTC] fromList [("kind",String "TraceMempoolAddedTx"),("mempoolSize",Object (fromList [("bytes",Number 1138.0),("numTxs",Number 1.0)])),("tx",Object (fromList [("txid",String "eccddc46")]))]
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:37:49.19 UTC] Chain extended, new tip: affd5a34f5aeeaf89c4649be3f8807d95ea3d340090771f81b11c6c3d55f5e28 at slot 36574669
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:55] [2023-08-17 07:37:49.19 UTC] fromList [("kind",String "TraceMempoolRemoveTxs"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("txs",Array [Object (fromList [("txid",String "eccddc46")])])]

What we see in the cardano node is:

  • Transaction e1776212 (close attempt) is not included in block df7490e22406d718f2f72b1f00dd79e50dccb19445d65fe750cc4eba32a9ef19 but removed from mempool anyway.
  • Transaction eccddc46 (second close attempt) is included in block affd5a34f5aeeaf89c4649be3f8807d95ea3d340090771f81b11c6c3d55f5e28 and removed from mempool.

Here are the full cardano-node logs covering both the failed and the successful close attempts:

hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:26:02.24 UTC] Chain extended, new tip: 89e26d06bda0957f774ac3bae3dbf6d13e3357b87133eeabdcc0939bbe223117 at slot 36573962
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:3388] [2023-08-17 07:26:04.84 UTC] fromList [("kind",String "TraceMempoolAddedTx"),("mempoolSize",Object (fromList [("bytes",Number 1138.0),("numTxs",Number 1.0)])),("tx",Object (fromList [("txid",String "e1776212")]))]
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:27:24.40 UTC] Chain extended, new tip: df7490e22406d718f2f72b1f00dd79e50dccb19445d65fe750cc4eba32a9ef19 at slot 36574044
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:55] [2023-08-17 07:27:24.40 UTC] fromList [("kind",String "TraceMempoolRemoveTxs"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("txs",Array [Object (fromList [("txid",String "e1776212")])])]
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:27:58.47 UTC] Chain extended, new tip: 65207ff048ff5c5aab8c78172dd158da32c731dd2454e4291deddc684f8dee6d at slot 36574078
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:28:01.33 UTC] Chain extended, new tip: 1acacb42f9f26b372a7e9fed1641d210f13c8ac0d44f405234336313ef208423 at slot 36574081
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:28:52.09 UTC] Chain extended, new tip: e9cd1cb3306d000de9f82f1de9ee4b2bcfa6b81f22854343e16cd8f566f248a4 at slot 36574132
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Info:51] [2023-08-17 07:28:59.96 UTC] Took ledger snapshot DiskSnapshot {dsNumber = 36519110, dsSuffix = Nothing} at b1c738e53453a1e66df0d25562425559d7676b9b65458171ec4931f8019875cc at slot 36519110
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:29:03.15 UTC] Chain extended, new tip: 426f9f26b832278b8e0ebc9fa1c99235cb9dcf842b0d5a9e0da9132657a99c7a at slot 36574143
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:29:24.25 UTC] Chain extended, new tip: ec3d57d80bc0f2c601e26be7efbcee16bb24a52c46d658a6a053a2f81fcec59a at slot 36574164
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:29:51.20 UTC] Chain extended, new tip: 2576a5693de7e5bf468b0b2bf0d7c0df3c52d779f11574bd5ea0ad90b46fcc62 at slot 36574191
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:29:57.30 UTC] Chain extended, new tip: 336e67d12b33e4067370a8d5197bb22b6894d0ded0830b2d856cf9099525d183 at slot 36574197
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:30:37.12 UTC] Chain extended, new tip: 0d85f8716a219d53f9c5c2857dbb021746c810ae45ed402c3fcbb5758b3d1532 at slot 36574237
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:31:24.30 UTC] Chain extended, new tip: 9754b0fa1d23f6891e6497aadf0f004799de003dae306aa1898ada6f5a8fd2df at slot 36574284
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:31:32.20 UTC] Chain extended, new tip: bc9f2a0217390e899bef179b0a71d49ecaacca918997852286161014cef87a55 at slot 36574292
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:31:56.09 UTC] Chain extended, new tip: 5eb30e945b806e7d5884d50d403d3e93db82b108484fb907362b05961cbabc0a at slot 36574316
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:32:20.13 UTC] Chain extended, new tip: 0f81005cf3441e3dc2bc76d967cddc28a1bd505ceb1b1ea65d13a9670d7fd474 at slot 36574340
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:32:25.14 UTC] Chain extended, new tip: c59abf63bc0d2b0490a5161dc7a40c86d16e7facd9c6188c6ab50f3a10c5f805 at slot 36574345
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:33:11.18 UTC] Chain extended, new tip: ce9eaf1feb43d692f5c9df05de1d21620f8659c2e4cd349cc04fe29172691c47 at slot 36574391
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:34:11.22 UTC] Chain extended, new tip: 46c8000d494a9de8bbf9258d2539405ef717f27247716ec49401aaa96a5706d3 at slot 36574451
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:34:14.11 UTC] Chain extended, new tip: e8d98a92773f033f9408b6de8ed20eb16c726fa2547bae2a11bcafbfea498ccb at slot 36574454
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:34:52.21 UTC] Chain extended, new tip: 44862146b40833f190d4731161b1efc2563b863740acdaeb5b2f81b1769f63c7 at slot 36574492
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:35:33.19 UTC] Chain extended, new tip: 427519899e21409058576383d761b48e37eb85a3d22e6676be6718b891fc4185 at slot 36574533
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:35:39.19 UTC] Chain extended, new tip: adda5a75733b8a897e59cb13a7ffbc706dd0f5e03a1299a1592325a2fccd4477 at slot 36574539
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:36:35.44 UTC] Chain extended, new tip: a9aa993a709ed0d92517d8bd7fe8cc03ded2f1c06ef16c8981ce2427597b792c at slot 36574595
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:36:56.37 UTC] Chain extended, new tip: e17bc809fa9dde6ab7cf7d182d666c500d1c590ec179b0a304db5c7c8153ea67 at slot 36574616
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:3388] [2023-08-17 07:37:36.02 UTC] fromList [("kind",String "TraceMempoolAddedTx"),("mempoolSize",Object (fromList [("bytes",Number 1138.0),("numTxs",Number 1.0)])),("tx",Object (fromList [("txid",String "eccddc46")]))]
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:37:49.19 UTC] Chain extended, new tip: affd5a34f5aeeaf89c4649be3f8807d95ea3d340090771f81b11c6c3d55f5e28 at slot 36574669
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.Mempool:Info:55] [2023-08-17 07:37:49.19 UTC] fromList [("kind",String "TraceMempoolRemoveTxs"),("mempoolSize",Object (fromList [("bytes",Number 0.0),("numTxs",Number 0.0)])),("txs",Array [Object (fromList [("txid",String "eccddc46")])])]
hydra_setup2-cardano-node-1  | [e4211c2e:cardano.node.ChainDB:Notice:47] [2023-08-17 07:38:01.22 UTC] Chain extended, new tip: ec26174c31df8a3362904d928b0fb0c49f78f44a7e64e2e8b37532375fe55f95 at slot 36574681

The close transaction is the only one exhibiting this issue and the only one with an upper validity bound. We suspect it becomes invalid too soon for preprod, but we don’t have any information in the logs about the validity period of the transaction, so we need to investigate more to check that (a sketch of the missing logging follows).
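
As a sketch of the missing observability (purely illustrative; these types and the tracing hook are hypothetical, not the actual hydra-node logging), the idea would be to include the transaction’s validity interval in the PostedTx trace so a too-tight upper bound shows up directly in the logs:

-- Hypothetical logging types: record the validity interval alongside the tx id.
data ValidityInterval = ValidityInterval
  { invalidBefore :: Maybe Integer -- lower bound (slot), if any
  , invalidHereafter :: Maybe Integer -- upper bound (slot), if any
  }
  deriving (Show)

data PostedTxLog = PostedTxLog
  { postedTxId :: String
  , postedValidity :: ValidityInterval
  }
  deriving (Show)

-- Emit the enriched trace whenever we post a transaction.
logPostedTx :: (PostedTxLog -> IO ()) -> String -> ValidityInterval -> IO ()
logPostedTx trace txId validity =
  trace PostedTxLog{postedTxId = txId, postedValidity = validity}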

Open a head on preprod

We open a four member head on preprod. Long story short, one of the members had different protocol parameters than the other three, so the first transaction submitted is rejected by this node and it refuses to sign the snapshot that includes it. Since the other three nodes have accepted and signed the snapshot, the head is stuck. We fixed the misconfiguration but the closing then seems problematic; we need to see what's happening there (a sketch of a simple pre-flight check follows the list below).

  • docker image for hydra node: http://ghcr.io/input-output-hk/hydra-node@sha256:6234d12419d27456a13f68c34ed8c67468f79e3caab7dd8a3e66263023a96b43
  • docker image for cardano node: inputoutput/cardano-node:8.1.2
  • script tx id: e5eb53b913e274e4003692d7302f22355af43f839f7aa73cb5eb53510f564496
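
A simple pre-flight check could have caught the mismatch before opening the head. A minimal sketch (hypothetical helper, not part of hydra-node or hydra-cluster) comparing the protocol-parameters files used by each member:

import qualified Data.ByteString as BS
import Data.List (nub)

-- Returns True if all given protocol-parameters files have identical content.
-- A byte-wise comparison is crude but enough to spot a misconfigured member.
sameProtocolParameters :: [FilePath] -> IO Bool
sameProtocolParameters paths = do
  contents <- mapM BS.readFile paths
  pure (length (nub contents) <= 1)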

July 2023

2023-07-31

Tactical

2023-07-24

Ensemble on event sourced persistence

  • We started implementing the actual changes now incorporating the knowledge from previous spikes
  • Adding the first event onInitialChainCommitTx
    • How much logic, where? Let's keep it as close to current implementation.
    • Should we add the headId? We did use the one from state in current implementation, so no.
  • How should the aggregate function handle a commit when the head is not in Initial state, both concretely and generically? Decided policy: ignore the event and keep the function total (see the sketch after this list)
  • We realize that the HeadLogicSpec tests often assert on the state. While we could now make these assert on the StateChanged event (which would often simplify things), we do not have tests covering the aggregate function, and hence that logic would be untested.
    • Can we come up with nice properties for aggregate?
    • We hope we would remove the combinatorial explosion we saw in the past
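
As mentioned in the list above, a tiny sketch of the "keep aggregate total" policy: applying a commit event in any state other than Initial simply returns the state unchanged. Types are simplified stand-ins, not the real hydra-node HeadState/StateChanged:

{-# LANGUAGE NamedFieldPuns #-}

-- Simplified stand-ins for illustration only.
data HeadState
  = Idle
  | Initial {pendingCommits :: [String]}
  | Open
  deriving (Show)

data StateChanged
  = CommittedUTxO {committedBy :: String}
  deriving (Show)

-- Total by construction: unexpected (state, event) combinations are ignored.
aggregate :: HeadState -> StateChanged -> HeadState
aggregate st event =
  case (st, event) of
    (Initial{pendingCommits}, CommittedUTxO{committedBy}) ->
      Initial{pendingCommits = filter (/= committedBy) pendingCommits}
    _ -> st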

2023-07-21

SB on Remove commit from internal wallet

  • It is a nice Friday afternoon and I wanted to start investigating related code changes.

  • I'll start by leveraging GHC here: if I remove the Commit client input I should get compilation errors all over the place.

  • Looking at BehaviorSpec - can we alter the Commit occurrences to use the externalCommit function and commit externally?

  • externalCommit is part of the hydra-cluster package so we need to move it to hydra-node package and use it in hydra-cluster and hydra-tui.

  • Seems like we would need to also run the api server as part of this piece of code to be able to commit externally and keep the specs pretty much the same.

  • These tests are valuable and I wouldn't want to just nuke them because they are tied to internal commits.

  • It seems like the trickiest changes will need to land in the test code. I'll keep the library-related changes since we will need to do them anyway and then have a pairing session where we will think about the best course of action for the tests.

2023-07-20

SB on [Return Protocol parameters in REST endpoint](https://github.com/input-output-hk/hydra/issues/735)

  • To please our users, the hydra HTTP server needs one extra endpoint which provides protocol parameters to callers.

  • We didn't groom this user issue but it seems pretty straightforward.

  • We could perhaps advertise issues like this on Discord so somebody from the community takes a stab at it? It feels like you could do it even without being very familiar with the Hydra protocol, only knowing some Haskell.

  • Steps:

  • Writing a test that will issue a request to this new endpoint and see it fail.
  • Pass protocol parameters to the api code in the main.
  • This raises the question of why we have so many arguments to the function that runs the websocket/REST server. Reader monad?
  • Simply JSON-encode the protocol parameters and return them from the server (a rough sketch follows after this list).
  • Adjust the test to make sure response body contains appropriate PP.
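
A rough sketch of such an endpoint, just to illustrate the shape (this uses wai/warp directly with an arbitrary port and is not the actual hydra-node API server, which wires things differently):

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (ToJSON, encode)
import Network.HTTP.Types (status200, status404)
import Network.Wai (Application, pathInfo, responseLBS)
import Network.Wai.Handler.Warp (run)

-- Serve the given protocol parameters as JSON on GET /protocol-parameters.
serveProtocolParameters :: ToJSON pparams => pparams -> Application
serveProtocolParameters pparams req respond =
  case pathInfo req of
    ["protocol-parameters"] ->
      respond $ responseLBS status200 [("Content-Type", "application/json")] (encode pparams)
    _ ->
      respond $ responseLBS status404 [] "not found"

main :: IO ()
main = run 4001 (serveProtocolParameters placeholderParams)
 where
  -- Placeholder; the real server would pass the ledger protocol parameters.
  placeholderParams = ()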

Tactical

  • PG:
  • Scaling tribe final plan review, what is it? Make sure we are aligned in our goals for the next quarter.
  • Muda of the day: #982
    • Description: trying to figure out why the seenSnapshot wouldn’t be the same thing in two different contexts
    • Category: motion
    • Hypotheses: none
    • Experiment: none

2023-07-19

Tactical

  • PG:
  • Automatically add draft PR in the in review column: done already?
  • SN:
  • Flipping the metric from "days without accident" to "days without an incident" is enabled. We will hopefully see higher numbers now.

ensemble on emit-snapshot refactoring perf issue

The emit snapshot refactoring PR gives poor performance.

For instance this is the output when running on master:

#> git switch master
...
#> cabal bench bench-e2e --benchmark-options="datasets datasets/3-nodes.json  --output-directory $(pwd)/benchmarks --timeout 1000s"
...
Running 1 benchmarks...
Benchmark bench-e2e: RUNNING...
Test logs available in: /tmp/bench-2ba5d061db070e82/1/test.log
Starting benchmark
Seeding network
Initializing Head
Comitting initialUTxO from dataset
HeadIsOpen
Client 1 (node 0): 0/300 (0.00%)
Client 2 (node 1): 0/300 (0.00%)
Client 3 (node 2): 0/300 (0.00%)
Client 1 (node 0): 62/300 (20.67%)
Client 2 (node 1): 63/300 (21.00%)
Client 3 (node 2): 65/300 (21.67%)
Client 1 (node 0): 125/300 (41.67%)
Client 2 (node 1): 131/300 (43.67%)
Client 3 (node 2): 131/300 (43.67%)
Client 1 (node 0): 191/300 (63.67%)
Client 2 (node 1): 197/300 (65.67%)
Client 3 (node 2): 192/300 (64.00%)
Client 1 (node 0): 249/300 (83.00%)
Client 2 (node 1): 257/300 (85.67%)
Client 3 (node 2): 261/300 (87.00%)
All transactions confirmed. Sweet!
All transactions confirmed. Sweet!
All transactions confirmed. Sweet!
Closing the Head
Finalizing the Head
Writing results to: /tmp/bench-2ba5d061db070e82/1/results.csv
Benchmark bench-e2e: FINISH

When running on the branch:

#> git switch ensemble/emit-snapshot
...
#> cabal bench bench-e2e --benchmark-options="datasets datasets/3-nodes.json  --output-directory $(pwd)/benchmarks --timeout 1000s"
...
Running 1 benchmarks...
Benchmark bench-e2e: RUNNING...
Test logs available in: /tmp/bench-6e16725638a73b65/1/test.log
Starting benchmark
Seeding network
Initializing Head
Comitting initialUTxO from dataset
HeadIsOpen
Client 2 (node 1): 0/300 (0.00%)
Client 3 (node 2): 0/300 (0.00%)
Client 1 (node 0): 0/300 (0.00%)
Client 2 (node 1): 51/300 (17.00%)
Client 3 (node 2): 50/300 (16.67%)
Client 1 (node 0): 52/300 (17.33%)
Client 2 (node 1): 84/300 (28.00%)
Client 3 (node 2): 86/300 (28.67%)
Client 1 (node 0): 90/300 (30.00%)
Client 2 (node 1): 109/300 (36.33%)
Client 3 (node 2): 112/300 (37.33%)
Client 1 (node 0): 116/300 (38.67%)
Client 2 (node 1): 134/300 (44.67%)
Client 3 (node 2): 136/300 (45.33%)
Client 1 (node 0): 137/300 (45.67%)
Client 2 (node 1): 149/300 (49.67%)
Client 3 (node 2): 154/300 (51.33%)
Client 1 (node 0): 153/300 (51.00%)
Client 2 (node 1): 166/300 (55.33%)
Client 3 (node 2): 169/300 (56.33%)
Client 1 (node 0): 169/300 (56.33%)
Client 2 (node 1): 181/300 (60.33%)
Client 3 (node 2): 185/300 (61.67%)
Client 1 (node 0): 183/300 (61.00%)
Client 2 (node 1): 194/300 (64.67%)
Client 3 (node 2): 197/300 (65.67%)
Client 1 (node 0): 198/300 (66.00%)
Client 2 (node 1): 206/300 (68.67%)
Client 1 (node 0): 210/300 (70.00%)
Client 3 (node 2): 210/300 (70.00%)
Client 2 (node 1): 217/300 (72.33%)
Client 3 (node 2): 221/300 (73.67%)
Client 1 (node 0): 222/300 (74.00%)
Client 2 (node 1): 226/300 (75.33%)
Client 1 (node 0): 233/300 (77.67%)
Client 3 (node 2): 231/300 (77.00%)
Client 2 (node 1): 236/300 (78.67%)
Client 1 (node 0): 243/300 (81.00%)
Client 3 (node 2): 241/300 (80.33%)
Client 1 (node 0): 253/300 (84.33%)
Client 3 (node 2): 251/300 (83.67%)
Client 2 (node 1): 246/300 (82.00%)
Client 1 (node 0): 263/300 (87.67%)
Client 3 (node 2): 261/300 (87.00%)
Client 2 (node 1): 255/300 (85.00%)
Client 3 (node 2): 269/300 (89.67%)
Client 1 (node 0): 272/300 (90.67%)
Client 2 (node 1): 263/300 (87.67%)
Client 3 (node 2): 278/300 (92.67%)
Client 1 (node 0): 279/300 (93.00%)
Client 2 (node 1): 272/300 (90.67%)
Client 1 (node 0): 286/300 (95.33%)
Client 2 (node 1): 281/300 (93.67%)
Client 3 (node 2): 286/300 (95.33%)
Client 2 (node 1): 288/300 (96.00%)
Client 1 (node 0): 292/300 (97.33%)
Client 3 (node 2): 294/300 (98.00%)
All transactions confirmed. Sweet!
All transactions confirmed. Sweet!
Client 2 (node 1): 296/300 (98.67%)
All transactions confirmed. Sweet!
Closing the Head
Finalizing the Head
Writing results to: /tmp/bench-6e16725638a73b65/1/results.csv

We tried:

  • Changing the order of the effects to mimic master: no improvement
  • Changing the if with a if/guard to mimic master: no improvement
  • Putting all the Effects in a single list instead of Combining them: no improvement
  • building with ghc-9.2.7: performance improves drastically but is still not as good as on master

ensemble on emit-snapshot refactoring perf issue with GHC 9.2.7

We tried our new branch with GHC 9.2.7, the original GHC 9.2.7 branch being the reference implementation to compare performance against.

Running the bench on abailly-iohk/ghc-9.2.7:

#> git switch abailly-iohk/ghc-9.2.7
...
#> cabal bench bench-e2e --benchmark-options="datasets datasets/3-nodes.json --output-directory $(pwd)/benchmarks --timeout 1000s"
...
Running 1 benchmarks...
Benchmark bench-e2e: RUNNING...
Test logs available in: /tmp/bench-fd90dc5a088b5aba/1/test.log
Starting benchmark
Seeding network

Initializing Head
Comitting initialUTxO from dataset
HeadIsOpen
Client 1 (node 0): 0/300 (0.00%)
Client 2 (node 1): 0/300 (0.00%)
Client 3 (node 2): 0/300 (0.00%)
Client 1 (node 0): 64/300 (21.33%)
Client 2 (node 1): 68/300 (22.67%)
Client 3 (node 2): 65/300 (21.67%)
Client 1 (node 0): 130/300 (43.33%)
Client 2 (node 1): 142/300 (47.33%)
Client 3 (node 2): 133/300 (44.33%)
Client 1 (node 0): 196/300 (65.33%)
Client 2 (node 1): 210/300 (70.00%)
Client 3 (node 2): 206/300 (68.67%)
Client 1 (node 0): 263/300 (87.67%)
Client 2 (node 1): 278/300 (92.67%)
Client 3 (node 2): 274/300 (91.33%)
All transactions confirmed. Sweet!
All transactions confirmed. Sweet!
All transactions confirmed. Sweet!
Closing the Head
Finalizing the Head
Writing results to: /tmp/bench-fd90dc5a088b5aba/1/results.csv
Benchmark bench-e2e: FINISH

Running the bench on the emit-snapshot branch after rebasing it to use GHC 9.2.7:

#> git switch ensemble/emit-snapshot
...
#> cabal bench bench-e2e --benchmark-options="datasets datasets/3-nodes.json --output-directory $(pwd)/benchmarks --timeout 1000s"
...
Running 1 benchmarks...
Benchmark bench-e2e: RUNNING...
Test logs available in: /tmp/bench-ff2c8c6fafeae1a0/1/test.log
Starting benchmark
Seeding network
Initializing Head
Comitting initialUTxO from dataset
HeadIsOpen
Client 2 (node 1): 0/300 (0.00%)
Client 1 (node 0): 0/300 (0.00%)
Client 3 (node 2): 0/300 (0.00%)
Client 2 (node 1): 55/300 (18.33%)
Client 1 (node 0): 56/300 (18.67%)
Client 3 (node 2): 58/300 (19.33%)
Client 2 (node 1): 106/300 (35.33%)
Client 1 (node 0): 92/300 (30.67%)
Client 3 (node 2): 108/300 (36.00%)
Client 2 (node 1): 149/300 (49.67%)
Client 1 (node 0): 123/300 (41.00%)
Client 3 (node 2): 149/300 (49.67%)
Client 2 (node 1): 181/300 (60.33%)
Client 1 (node 0): 145/300 (48.33%)
Client 3 (node 2): 180/300 (60.00%)
Client 2 (node 1): 214/300 (71.33%)
Client 1 (node 0): 174/300 (58.00%)
Client 3 (node 2): 211/300 (70.33%)
Client 2 (node 1): 245/300 (81.67%)
Client 1 (node 0): 204/300 (68.00%)
Client 3 (node 2): 240/300 (80.00%)
Client 2 (node 1): 274/300 (91.33%)
Client 1 (node 0): 228/300 (76.00%)
Client 3 (node 2): 264/300 (88.00%)
Client 1 (node 0): 254/300 (84.67%)
Client 3 (node 2): 281/300 (93.67%)
All transactions confirmed. Sweet!
All transactions confirmed. Sweet!
Client 1 (node 0): 286/300 (95.33%)
All transactions confirmed. Sweet!
Closing the Head

Finalizing the Head
Writing results to: /tmp/bench-ff2c8c6fafeae1a0/1/results.csv

2023-07-18

Tactical

2023-07-17

SN on protocol logic

  • Investigate how we could write a test to not drop transactions from allTxs upon ttl going to zero.
  • The key seems to be that we want to have conflicting transactions where each node sees one of them first and the other one times out (ttl); only then should we request a snapshot of them. In this situation, we would expect the leader to be able to decide which one is the valid transaction.
  • A first test was to pause a node processing events for some time, but that does not play out well.
  • Another idea would be to "disconnect" the simulated network and reconnect it, only delivering messages later. But that would model a behavior which is not present in the current networking stack (buffering and/or retransmission).
  • Another idea I had was to make two conflicting transactions first stay invalid until expiry (internally only kept in allTxs) and then, upon unlocking them, have the leader decide which one they want to have snapshotted. This is kind of a "resurrecting" semantics for previously deemed invalid transactions... which is super weird to have!?

Tactical

FT: Why do we send ReqSn during AckSn handling?

  • If we remove it, all tests are still passing
  • Should also check if this is the case on master
  • There is also snapshot emission when handling ReqTx (the other situation)
  • Intuitively: we need to request snapshots on new transactions and also after a snapshot got signed by everyone
  • Snapshot emission logic may be sensitive to order and timing of L2 messages received
  • Maybe create tests where messages are artificially delayed or shuffled? (a rough sketch follows below)
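
A rough sketch of the shuffled-delivery idea from the last item (hypothetical helper, not existing hydra-node test code), using QuickCheck's shuffle:

import Test.QuickCheck (Gen, shuffle)

-- Deliver a batch of network messages in a random order and fold them into a
-- node's state, so a property can then check the head logic still converges.
deliverShuffled :: [msg] -> (msg -> node -> node) -> node -> Gen node
deliverShuffled msgs deliver node = do
  shuffled <- shuffle msgs
  pure (foldl (flip deliver) node shuffled)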

SB: Test failure on our new PR https://github.com/input-output-hk/hydra/actions/runs/5555603860/jobs/10180521542

SN: How to progress #974?

  • Updating the LaTeX to match the implementation vs. logic based on TTL
  • Using seenTxs to resolve would change the high-level behavior back to what it was before (as we were resolving conflicts with ReqSn)

2023-07-13

Tactical

  • Had a team learning session about waste today, what's next?
    • Ideas
      • Something on nix
      • Specification / code equivalence
      • Agda - overview and some intuition
      • Explore more the cardano ecosystem (tools, products, ideas, catalyst proposals?)
      • Something on GHC?
    • Schedule
      • On demand
  • Draft PRs are not covered by metrics... I think they are waste, shall we track them?
    • Update the metric to include them?
    • We decided to put them in progress instead and tackle them now

2023-07-12

Tactical

  • What to do if the refactor to the emit snapshot logic does not go as expected?
    • I.e. what if we cannot take the decision on the not-yet-updated state
    • emitSnapshot was done “outside” of update at one point in the past
    • Emitting snapshot is also changing the state
    • Not strictly against modeling snapshotting as an event/command + logic acting on it

2023-07-11

Tactical

  • Worried about diverging the spec from implementation

    • #974 for instance
    • Parts of code and spec don’t look aligned now (snapshot emitting for instance)
    • The best we have now is reading spec and code line by line to check they align
    • Alternative to ease this alignment?
    • Formal tool (Agda) from which to derive Haskell code
    • Really small and clear Haskell core code directly used in the spec
  • Let's tackle the emitSnapshot concern

    • weird already on master, let’s simplify it to make persistence easier
    • Need to change the spec accordingly?
  • Remove commit client input vs. new endpoint to submit tx?

    • shall we do both?

2023-07-10

Tactical

2023-07-07

Ensemble session on #904

  • As always, it's been a bit painful to keep rebasing the PR on top of master as other stuff touching the Head logic was merged
    • We really should stop having this kind of long running branches and stick to PRs lasting at most a couple of days
    • There's no need to wait for some "feature" to be completed as long as it's not wired in the Node
  • While reviewing the code we uncover a couple issues related to multisignatures that lead us to submit a security report
    • We kept it in draft to experiment with the process of qualifying and handling these types of issues
    • There's a PR in a private dedicated repository that fixes the issues, along with unit tests manifesting them
  • Docker builds take ages, which increases the lead time for the CI to stratospheric values (e.g. > 1.5 hours)
  • Issue is ultimately merged 🎉

2023-07-06

Ensemble session on #727

  • With the heartbeat decorrelated from the Message tx, we are now ready to implement the separation between Signed and Authenticated messages in the Authenticate network "middleware"
  • We can remove the Party from the Message and put it in the NetworkEvent. It's injected as part of the "wiring" of the various layers, in the Main file
  • Change is pretty mechanical and ripples over various modules, mainly in HeadLogic and HeadLogicSpec
    • We should refactor the latter to introduce builders in order to reduce the number of occurrences relying on the details of how a NetworkEvent is built
  • Everything is fine but some tests end up being stuck (BehaviorSpec and ModelSpec) => we need to fix the message dispatching in the mock networks we are using

2023-07-05

Ensemble session on #969

  • We realise that to introduce the Signed/Authenticated message separation cleanly, we need to refactor Heartbeat, which uses a hardcoded Message tx and could be made more polymorphic.
  • We introduce a ConnectionMessage handle that will "listen" to Connected/Disconnected notifications as a first baby step, to disconnect from the HeadLogic.
    • As a first step, we simply use the same logic as for network messages, e.g. wrapping the Connected/Disconnected messages into NetworkEvent and putting them in the queue
  • In the HeadLogic we see that those Messages are just passed on directly as a ClientEffect and the HeadLogic has nothing to do with them, so we could do that directly in the Heartbeat handler.
    • We can then separate the Connected messages from the core network Message, which cleans up the handling in the HeadLogic
    • Last step is to wire in the direct transformation of Connectivity -> ServerOutput into the withHeartbeat
  • withHeartbeat now becomes independent of Message tx and can wrap arbitrary messages coming from the underlying transport layer (a simplified Heartbeat type is sketched right after), so its type becomes much cleaner:
    withHeartbeat ::
      NodeId ->
      NetworkComponent m (Heartbeat msg) a ->
      NetworkComponent m msg a
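
For context, a simplified version of the Heartbeat wrapper this signature refers to (the actual hydra-node type may differ slightly): either an application message from a peer or a bare ping, which is what makes withHeartbeat independent of Message tx:

newtype NodeId = NodeId String deriving (Eq, Show)

-- A network message wrapped by the heartbeat layer: real payload or a ping.
data Heartbeat msg
  = Data NodeId msg
  | Ping NodeId
  deriving (Eq, Show)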
    

2023-07-03

SB on Smoke test failed to recollect funds#960

  • The issue appeared in our smoke tests where, at the end of a successful run, we send all the actor funds back to our faucet. This makes sense since we would like to reuse this sweet tADA.

  • All of a sudden we started experiencing errors like:

FaucetFailedToBuildTx {reason = TxBodyErrorAdaBalanceTooSmall (TxOutInAnyEra BabbageEra (TxOut (AddressInEra (ShelleyAddressInEra ShelleyBasedEraBabbage) (ShelleyAddress Testnet (KeyHashObj (KeyHash "9783be7d3c54f11377966dfabc9284cd6c32fca1cd42ef0a4f1cc45b")) StakeRefNull)) (TxOutValue MultiAssetInBabbageEra (valueFromList [(AssetId "c0f8644a01a6bf5db02f4afe30d604975e63dd274f1098a1738e561d" "Mona Lisa by Leonardo da Vinci",20)])) TxOutDatumNone ReferenceScriptNone)) (Lovelace 1124910) (Lovelace 0)}
  • This started happening after our work on external commits and the way we generate UTxOs: here we see that some "Mona Lisa" token is present in the output, but this output has no ada while it needs at least 1124910 Lovelace.

  • Since we are already selecting all of the lovelace when sending funds back to the faucet, the simplest solution is to also send any arbitrary tokens back to the faucet (see the sketch below).

  • A more durable solution would be to burn the tokens we don't want to see in the faucet, or to send them back to the actor.
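
A minimal sketch of the "send everything back" fix (illustrative; sweepValue is a hypothetical helper, not the actual hydra-cluster faucet code, and the imports assume cardano-api re-exports these names, which may vary by version): when sweeping an actor's funds, return the full value of every output, lovelace and native assets alike, so no token-only change output is left behind:

import Cardano.Api (TxOut (..), UTxO (..), Value, txOutValueToValue)
import qualified Data.Map as Map

-- Total value held by a UTxO set, including any native assets.
sweepValue :: UTxO era -> Value
sweepValue (UTxO utxo) =
  foldMap (\(TxOut _ value _ _) -> txOutValueToValue value) (Map.elems utxo)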

Tactical

  • AB:

    • Meeting with SundaeSwap
      • Catalyst proposals for Hydra additions (overview of needed features)
  • PG:

    • Hydraw alternative
      • Should we run other hydraw instances? Arnaud will check if head is live.
    • 6/4 in review items
      • More reviews needed.

SB on Authenticate network messages

  • We had the whole ensemble today and a complete change of plans occurred.

  • The plan changed and what we did was implement a middleware in the networking part of hydra-node. This middleware accepts all incoming messages and verifies them, and at the same time signs all outgoing messages (a rough sketch follows below).

  • We did proper TDD and had fun implementing this feature in like an hour or two.
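
A rough sketch of the middleware shape (simplified stand-in types and toy signing; the actual hydra-node Authenticate module differs): sign every outgoing message, verify every incoming one, and drop anything that does not carry a valid signature from a known party:

{-# LANGUAGE NamedFieldPuns #-}

newtype Party = Party String deriving (Eq, Show)
newtype Signature = Signature String deriving (Eq, Show)

data Signed msg = Signed {payload :: msg, signature :: Signature, party :: Party} deriving (Show)
data Authenticated msg = Authenticated {message :: msg, sender :: Party} deriving (Show)

-- Toy stand-ins for the real signing/verification primitives.
sign :: Show msg => Party -> msg -> Signature
sign (Party p) msg = Signature (p <> ":" <> show msg)

verify :: Show msg => Signed msg -> Bool
verify Signed{payload, signature = Signature s, party = Party p} = s == p <> ":" <> show payload

-- The middleware: given the callback towards the node and the underlying
-- broadcast, return an authenticated broadcast and an incoming-message handler.
withAuthentication ::
  Show msg =>
  Party ->                        -- ^ Our own party (used to sign).
  [Party] ->                      -- ^ Known parties (used to verify senders).
  (Authenticated msg -> IO ()) -> -- ^ Deliver verified messages to the node.
  (Signed msg -> IO ()) ->        -- ^ Underlying broadcast of signed messages.
  (msg -> IO (), Signed msg -> IO ())
withAuthentication us parties deliver broadcastSigned = (broadcast, receive)
 where
  broadcast msg =
    broadcastSigned Signed{payload = msg, signature = sign us msg, party = us}
  receive signed@Signed{payload, party}
    | party `elem` parties && verify signed =
        deliver Authenticated{message = payload, sender = party}
    | otherwise = pure () -- silently drop unauthenticated messages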
