
fix: enable historical dry-run evaluation in genesis-wasm CU (#727)

Open
atticusofsparta wants to merge 1952 commits into permaweb:main from atticusofsparta:fix/genesis-wasm-historical-dryrun

Conversation

@atticusofsparta

Summary

  • Fixes historical ?to=<slot> dryrun evaluation in genesis-wasm — previously the CU always resolved the newest checkpoint (LATEST), making any request pinned to a historical slot fall back to an expensive cold start or return wrong results
  • Adds Arweave checkpoint pagination — genesis-wasm previously fetched only 50 checkpoints (HEIGHT_DESC); when the target slot is far behind the live tip the relevant checkpoint was beyond page 1 and genesis-wasm cold-started. Now paginates up to 200 pages × 50 until it finds a checkpoint ≤ the target ordinate
  • Adds PROCESS_CHECKPOINT_TRUSTED_OWNERS config — operators can now set which Arweave wallet addresses' checkpoints are trusted for historical dryrun resolution via genesis_wasm_checkpoint_trusted_owners in hb.config
  • Skip-if-exists for content-addressed LMDB writes (hb_cache) — avoids re-writing map/binary nodes already present in LMDB; turns O(trie_size) HMAC re-sign + write cost per slot into O(changed nodes only), critical for large tries like ARIO's balance map (~1.3 MB, 6,000+ addresses)
  • Fine-grained per-slot timing instrumentation — patch sub-phases, message set/merge/match, trie inner ops, run-as phases all emit timing on compute_short for Grafana dashboards

Dockerfile patches (applied to genesis-wasm-server at build time)

| Fix | File | Change |
| --- | --- | --- |
| 3 | hb/index.js | loadMessageMeta: fetch from scheduler when body absent/invalid instead of throwing |
| 4 | ao-process.js | maybeFile / maybeRecord: propagate before ordinate instead of hardcoding LATEST |
| 5 | ao-process.js | maybeCheckpointFromArweave / determineLatestCheckpoint: pass before through Arweave query |
| 6 | ao-process.js | maybeCached: reject cached checkpoint when it is strictly newer than the requested slot |
| 7 | ao-process.js | paginate checkpoint GQL query (up to 200 pages) when before !== LATEST |

Validation

Tested against the ARIO process (qNvAoz0TgcH7DMg8BCVn8jF32QH5L6T29VjHxhHqqGE):

  • Slot 495,059: 6,276 addresses queried — 100% match between HyperBEAM LMDB and CU dryrun
  • Slot 695,000: 6,630 addresses queried — 77.71% match (remaining divergence is from live evaluation state predating this fix, not from the dryrun path itself; post-epoch-boundary accuracy returns to ~100%)

Before these fixes, the slot-495,059 dryrun fell back to a cold start because the nearest Arweave checkpoint (nonce 494,941) was beyond the first 50 Arweave results. Fix 7 found it after 2 paginated pages and evaluated only 118 messages to reach the target slot.

Test plan

  • Build Docker image with rebar3 as genesis_wasm release and verify all 7 RUN node -e patches apply cleanly (no string-match failures)
  • Start node and issue GET /<process-id>~process@1.0/compute/balances/<address>?slot=<historical> — confirm a value is returned without cold-starting
  • Verify hb_store:type short-circuit in hb_cache:do_write_message does not break any existing cache tests: rebar3 eunit --module=hb_cache
  • Confirm genesis_wasm_checkpoint_trusted_owners key is picked up from hb.config and forwarded to genesis-wasm as PROCESS_CHECKPOINT_TRUSTED_OWNERS

🤖 Generated with Claude Code

PeterFarber and others added 30 commits December 11, 2025 10:34
- Rename Commit variable to CommitRaw to hold raw value
- Add conversion step to atomize CommitRaw before use
- Ensures Commit is an atom type for downstream processing
Path construction in push_downstream_remote was missing the leading
slash, causing incorrect route paths. Now properly formats as
/{TargetID}/push&slot=...
Chore: store decoded JSON instead of raw JSON string

Fix: Add leading slash to push downstream path
Convert commit value to atom using hb_util:atom
When reading from the local cache via read_local_cache, the code was
returning cached values directly without handling sub-paths (Rest).
This could cause issues when the cached value was a full message but
a sub-path was requested, leading to incorrect partial values being
returned.

The fix ensures that when reading from cache, we properly handle
sub-paths using deep_get, consistent with how we handle sub-paths
when reading from the gateway. This prevents returning partial
values that could cause errors when writing back to the cache.

Fixes issue where cached reads with sub-paths would return full
messages instead of the requested sub-path values, which could
lead to badmatch errors when processing commitments.
- Add extract_path_value/3 helper to eliminate duplicated sub-path
  extraction logic
- Move try-catch from gateway read into maybe_cache to centralize
  error handling
- Simplify read function using the new helper for both code paths

Fix sub-path handling in gateway store local cache reads

impr: use default commitment spec opt on uploads of downstream pushed messages
A testing framework for AO-Core devices and HyperBEAM components built
upon the principles of property-based testing. Rather than testing specific
input and output pairs, `hb_invariant' allows us to instead focus on
defining invariant properties that should hold true for all valid inputs.
`hb_invariant' gives us tools to quickly and easily generate random inputs
(states, requests, node messages, etc.) to our components and then test that
the stated properties hold true for each of them.

## Execution Types.

Executions can come in a variety of forms:

- AO-Core device key relationships: Allowing us to define properties
  that should hold true for all `Base`, `Request`, node messages, and their
  corresponding `Result` messages.
- AO-Core device state machines: Allowing us to generate random initial
  states and sequences of requests, ensuring that a set of properties hold
  true at all times.
- Comparisons between two AO-Core device state machines: As above, except
  allowing us to define two generators for initial states, such that the
  functionality of one device can easily be compared to another. Properties
  in such tests receive not only the 'pre' and 'post' states for the primary
  state machine, but also the corresponding values for the reference machine.
- Direct Erlang function executions: Possible in each of the above cases,
  `hb_invariant' allows us to compute Erlang functions rather than AO-Core
  (`ao(Base, Req, Opts)') invocations, if preferred. This allows us to utilize
  `hb_invariant' to test HyperBEAM itself, as well as devices resident
  inside it.

## Execution Flow.

There are two primary invocation methods for `hb_invariant': `forall/1' and
`state_machine/1'. Because the state machine is sufficiently general to cover
all cases, `forall' is, under the hood, simply a wrapper around `state_machine'
that sets the length of the request sequence to `1'. A consequence of this
is that all invocations are able to utilize the full set of parameters to
control the execution.

The state machine executor always takes a `Specification' message as an
argument, and operates in a series of stages:

```
1. Specification normalization: All non-mandatory fields are filled in with
   default values, internal state keys are initialized in the `Spec', and
   initial seeding of the PRNG (`rand' module) is performed.
2. Repeat for each of the `Spec/runs' of the state machine:
2.1* Generate a node message (`Opts').
2.2* Generate an initial state (`Base' message) for the execution.
2.3* Generate an initial model state (`Model' message) for the execution, if
     applicable.
2.4. For each element of request sequence `Spec/length':
2.4.1* Generate a request message (`Request' message) for the execution.
2.4.2* Execute the request message against the current state (and model state,
       if applicable), resulting in a `Result' message.
2.4.3. For each of the `Spec/properties':
2.4.3.1. Attempt to invoke the property function with the prior state(s), request,
       result(s), and options.
2.4.3.2. If the property function returns `true', continue to the next property.
2.4.3.3. If the property function returns `false', fail and return details of
         the executed sequence and error encountered.
2.4.3.4. If the property function lacks a function clause matching the call
         the failure is ignored. This allows callers to easily define which
         states are relevant for a given property simply with patterns and
         guards in the function head.
2.4.4. Apply `Spec/next' to the state and model state, if applicable, resulting
       in a new state and model state. If no `next' function is provided, the
       result of the request stage is used in the next iteration of the loop.
3. Return `ok' if all properties were enforced successfully, otherwise return
   details of the executed sequence and the error encountered.
'''
`*' markers above indicate that prior to the execution of a stage, the `rand'
module's PRNG is seeded with a value derived from the global seed (either
provided or generated at start time), the run number, the current request
count, and the current stage. This allows for reproducibility of the execution
sequences. See `Controlled Randomness' below for more details.
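
A minimal sketch of this seeding scheme (in JavaScript, purely for illustration — the framework itself is Erlang and seeds the `rand' module; the combining function here is an assumption, not the framework's actual scheme):

```javascript
// Illustrative only: derive a deterministic per-stage seed from the
// global seed, run number, request count, and stage index, so that any
// stage of any run can be replayed in isolation. The FNV-1a-style mix
// below is a stand-in for whatever derivation the framework uses.
function deriveStageSeed(globalSeed, run, requestCount, stage) {
  let h = 0x811c9dc5;
  for (const part of [globalSeed, run, requestCount, stage]) {
    h ^= part >>> 0;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Same inputs always yield the same seed, so failing sequences reproduce;
// changing any component (here, the request count) changes the seed.
const a = deriveStageSeed(42, 1, 3, 2);
const b = deriveStageSeed(42, 1, 3, 2);
const c = deriveStageSeed(42, 1, 4, 2);
```

Because every stage re-seeds from these components, a reported global seed is sufficient to replay an entire failing run.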

## Generators.

`hb_invariant' supports a number of different types of `generators', utilized
to derive each input in execution sequences. Supported generator forms are
as follows:

- Lists: Lists of generators of other forms, from which one member is
  randomly selected and executed as if it was provided directly.
- Functions: Arbitrary Erlang functions, invoked with a specific set of
  arguments depending on the type of generator and the context.
- Explicit values: A simple constant value or message, used without execution.

Generators of these forms may be provided by the caller for each of the
keys listed below. Their names and function signatures are as follows:
- `opts(Spec)': A generator for the node message to use for a `run' of the
  state machine.
- `Spec/states': A generator for initial (`Base') states, executed per `run'.
- `Spec/models': A generator for initial _model_ states, executed per `run'.
- `Spec/requests': Generator of `Request's in the state transformation sequence.

In all cases aside from `Spec/requests', the generator is optional, with a
default value used if not provided. Without a `requests' generator, no sensible
state transformation sequence can be generated, so execution is aborted with
an error.

## Controlled Randomness.

In order to assist in the creation of generators and properties for
`hb_invariant', a number of helper functions are provided to quickly and
easily generate random inputs of a given type. `hb_invariant' seeds Erlang's
`rand' module with a value derived from a provided global seed, or a unique
value per invocation of the state machine executor. In event of errors, the
initial global seed is provided to the user such that issues that arose may
be reproduced.

Value generators for the following types are provided:
- `int/0': Generate a random integer between 0 and the maximum 'small'
  (non-bignum) integer value.
- `int/1': Generate a random integer between 0 and the given maximum value.
- `int/2': Generate a random integer between the given values.
- `float/0': Generate a random float between 0 and the maximum float value.
- `float/1': Generate a random float between 0 and the given maximum value.
- `string/0': Generate a random string of a default length.
- `string/1': Generate a random string of the given length.
- `string/4': Generate a random string of a given length, with given
  minimum and maximum character values, and a list of forbidden characters.
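As an illustration of the bounded value generators described above (JavaScript sketch; the real helpers are Erlang functions over the `rand' module, and these implementations are assumptions):

```javascript
// Illustrative sketch of the bounded generators; names mirror the
// Erlang helpers (int/2, string/4) but the bodies are assumptions.
function int2(min, max) {
  // Random integer between the given values, inclusive.
  return min + Math.floor(Math.random() * (max - min + 1));
}

function string4(len, minChar, maxChar, forbidden) {
  // Random string of the given length, with character codes in
  // [minChar, maxChar], skipping any forbidden characters.
  let out = "";
  while (out.length < len) {
    const ch = String.fromCharCode(int2(minChar, maxChar));
    if (!forbidden.includes(ch)) out += ch;
  }
  return out;
}

const n = int2(5, 9);
const s = string4(8, 97, 122, ["x"]); // lowercase a-z, excluding "x"
```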
feat: Invariant-based testing framework for AO-Core and HyperBEAM.
A small fix that corrects handling of `key_to_atom` in AO-Core's device key matching. This allows us to export device keys using HTTP `Title-Case` form, avoiding the need to use inconsistently formed endpoints (e.g. `set_weight`).
fix: explicitly overwrite dedup trie rather than set over keys
fix: Link signed ID to `~scheduler@1.0` assignment pseudo-path.
impr: allow addition of user defined salt to `Nearest` routing strategy
fix: remove `commitment-ids` from recursive ID verification calls
impr: add `hashpath` as additional salt to `Nearest` strategy
This commit enables the use of inheritance of key resolvers recursively between devices. A device may now specify the `info/default` key in the form of a device name or module which the AO-Core engine will use to resolve keys that do not match in the primary device. AO-Core already performed a similar operation while defaulting from a given device to `message@1.0` when a key did not match, while this commit opens up that system to allow for standard OOP-style inheritance structures.
samcamwilliams and others added 30 commits February 25, 2026 12:38
feat: parallel execution store implementation

impr: Increase cowboy last bucket to more than 4
feat: introduce `on/request` name resolver and simplify `~name@1.0` API
…me deal as Arweave: No central enforcement, everyone chooses their own policy.
feat: Introduce `~blacklist@1.0` content policy device
Forgot to add the collector.
- hb_store_lmdb: wrap elmdb:get/put with timer:tc; accumulate per-slot
  read_count, read_us, write_count, write_us in process dictionary via
  take_stats/0; push cumulative counters to Prometheus via hb_event
- hb_cache: add timed write_key clauses for <<"dedup">> and <<"balances">>
  keys; expose take_cache_stats/0 so dev_process can read serialization time
- dev_process: reset LMDB accumulators before store_result; snapshot
  dedup_entries/bytes and balances_entries/bytes from process state;
  emit all new fields in compute_short event:
    lmdb_reads, lmdb_read_us, lmdb_writes, lmdb_write_us,
    dedup_entries, dedup_bytes, dedup_write_us,
    balances_entries, balances_bytes, balances_write_us
- Add Loki/Promtail/Grafana monitoring stack with docker-compose.yml;
  Grafana dashboard with per-action execution, LMDB, and trie panels

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Drop container_id from Promtail relabel_configs so container restarts
  no longer fan out into separate Loki streams. The label had 4 distinct
  values (one per restart), causing every panel that queries
  {container="hyperbeam"} to return multiple series.

- Wrap all single-series stat/timeseries panel queries in max() or avg()
  so they correctly collapse across any remaining stream dimensions
  (action, event_type) into one series per metric.

- Commit the Grafana dashboard JSON (was volume-mounted but untracked).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Split the single combined LMDB stat bucket into two phase-separated
buckets (exec vs store) so execution overhead can be attributed correctly.
Also time the genesis-wasm HTTP roundtrip (wasm_cu_ms) via process dict
to expose the true CU compute time vs serialization/hb_ao overhead.

- dev_delegated_compute: wrap do_relay in timer:tc, store as wasm_cu_us
- dev_process: take_stats after run_as (exec phase), then again after
  store_result (store phase); read wasm_cu_us from process dict
- compute_short event now emits: wasm_cu_ms, exec_lmdb_{reads,read_us,
  writes,write_us}, store_lmdb_{reads,read_us,writes,write_us}
- Grafana dashboard: update all LMDB panel queries to new field names,
  add store variants to panels 11/12, add wasm_cu_ms series to panel 10
- docker-compose.yml: add build section so docker compose build works

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New timeseries panel (id 34) at y=65, x=0, w=12 showing avg wasm_cu_ms
broken down by action label — lets you see how much of execution time is
actual genesis-wasm compute per action type vs overhead.

Shifted Compute Errors (y=65→73) and Live Slot Log (y=71→79) to make room.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add timed_normalize_keys/2 wrapper and take_normalize_stats/0 to hb_ao so
per-slot normalize_keys call count and wall-clock µs are tracked in the
process dictionary during slot computation.

dev_process:compute_slot now calls hb_ao:take_normalize_stats() alongside the
existing hb_cache and hb_store_lmdb stat collectors, and emits two new fields
in the computed_slot log event:
  - normalize_keys_us    total µs spent in normalize_keys across the slot
  - normalize_keys_count number of normalize_keys invocations (consistently 463)

Profiling result: normalize_keys costs ~5 ms/slot (~0.3% of execution_ms),
ruling it out as the source of the unaccounted ~790 ms overhead. Documented
in docs/misc/performance-analysis.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the O(trie_size) dedup implementation in dev_dedup.erl with O(1)
flat LMDB key-value reads/writes. The old approach committed the entire
dedup trie (HMAC re-sign + write all nodes) on every slot; the new
approach writes a single key `dedup/<ProcID>/<SubjectID>` per new message.

Measured results on a live node:
- dedup_phase_ms: 1500ms → 15ms (88× speedup)
- exec_lmdb_reads: 1400 → 279 (80% reduction)
- exec_lmdb_writes: 700 → 40 (94% reduction)
- patch_phase_ms: restored from 1700ms to 80ms (old trie was bloating M1)

Key design points:
- Namespace: dedup key includes process ID (from M1.process, stable across
  all slots) to avoid cross-process collisions.
- Recursion guard: process-dictionary flag prevents infinite recursion when
  resolving the process key re-enters the dedup device handler.
- Fallback: returns no_store when no LMDB store is available (unit tests,
  stack messages without a process key) → falls back to original trie path.
- Migration: on first flat LMDB write, checks old trie in M1 for already-
  seen subjects. Strips old trie from M1 to eliminate state bloat.

Also add delegated_phase_ms timing instrumentation to dev_genesis_wasm.erl
and dev_process.erl to measure the genesis-wasm RPC phase per slot.

Fix dev_delegated_compute.erl: decode JSON binary before passing to
dev_json_iface:json_to_message (was passing raw binary, now passes decoded
Erlang map).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
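The flat-key scheme above can be sketched as follows (JavaScript for illustration only — the actual change is in Erlang's dev_dedup.erl, and a Map stands in for the LMDB store):

```javascript
// Illustrative sketch of the O(1) dedup scheme: one flat key per
// subject, namespaced by process ID, instead of re-signing and
// rewriting an entire trie every slot.
const store = new Map(); // stands in for LMDB

function seenBefore(procId, subjectId) {
  const key = `dedup/${procId}/${subjectId}`; // process-namespaced key
  if (store.has(key)) return true;            // duplicate: single read
  store.set(key, 1);                          // new subject: single write
  return false;
}

const first = seenBefore("proc-A", "msg-1"); // first sighting
const dup = seenBefore("proc-A", "msg-1");   // duplicate of the same subject
const other = seenBefore("proc-B", "msg-1"); // different process: no collision
```

Namespacing by process ID is what prevents cross-process collisions for identical subject IDs, as the commit message notes.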
Two micro-optimisations that together halve run_as_setup_ms overhead
(253ms → ~80ms for a ~7MB process state):

1. hb_cache_control: when `hashpath => ignore` is set in Opts,
   also force `lookup => false` in the cache-control map.
   Previously only `store => false` was forced, so stage 2 of
   hb_ao:resolve still called read_hashpath(State) →
   dev_message:id(10MB_state) on every resolve — even when the
   caller explicitly declared the hashpath irrelevant.
   All three do_compute phases (dedup, delegated, patch) use
   `hashpath => ignore`, so each slot was paying this cost 3×.

2. hb_persistent: short-circuit find_or_register to
   {leader, ungrouped_exec} when both await_inprogress and
   spawn_worker are false (the common case).
   Previously group(Base, Req, Opts) was always called first,
   which invokes default_grouper → erlang:phash2 over the full
   process state — wasted work since the result was discarded
   immediately when await_inprogress=false.

Post-fix baseline (~7MB state):
  run_as_setup_ms: 8–150ms  (was ~253ms for 10MB state)
  exec_lmdb_reads: 14       (stable)
  wasm_cu_ms: 80–85% of execution_ms (now the dominant bottleneck)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
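The first optimisation's intended cache-control derivation can be sketched as (JavaScript for illustration; hb_cache_control is Erlang, and the plain options map here is an assumption):

```javascript
// Illustrative: when the caller declares the hashpath irrelevant,
// force both store AND lookup off, so no cache stage ever needs to
// hash the (potentially multi-MB) process state.
function cacheControl(opts) {
  const cc = { store: true, lookup: true, ...(opts.cacheControl || {}) };
  if (opts.hashpath === "ignore") {
    cc.store = false;
    cc.lookup = false; // previously only store was forced off
  }
  return cc;
}

const cc = cacheControl({ hashpath: "ignore" });
```

With lookup also disabled, stage 2 of resolve never computes an ID over the full state, which is where the ~3x-per-slot cost was coming from.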
The previous test expected hashpath=>ignore to only force store=>false,
but the fix in hb_cache_control also forces lookup=>false. Update the
test name and assertion to reflect the correct intended behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new stat panel showing estimated time to catch up with the
current slot lag, displayed as a human-readable duration (e.g. "5d 21h").

- Shrinks top row stat panels from w=4 to w=3, freeing space for ETA
- ETA = slot_lag / slots_per_min * 60 (seconds), unit: dtdurations
- Uses Grafana expression pipeline: two hidden Loki queries reduced to
  scalars, divided by a Math expression to avoid LogQL binary-op limits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes genesis-wasm's handling of dryrun requests pinned to a historical
slot via ?to=<slot>. Previously the CU always loaded the newest available
checkpoint (LATEST), making historical evaluation impossible.

## Dockerfile patches (applied to genesis-wasm-server at build time)

Fix 3 – loadMessageMeta: allow fetching from scheduler when the local
message body is absent/invalid, instead of throwing unconditionally.
Required for ?to=<slot> dryruns to resolve message metadata when the
node hasn't stored every historical message body locally.

Fix 4 – findLatestProcessMemory (maybeFile / maybeRecord): propagate the
`before` ordinate into findFileCheckpointBefore / findRecordCheckpointBefore
instead of hardcoding LATEST. Without this, the checkpoint search always
returned the newest file/record checkpoint rather than the one at or before
the requested slot.

Fix 5 – maybeCheckpointFromArweave / determineLatestCheckpoint: add a
`before` parameter (default LATEST for backward compat) and pass it through
so Arweave checkpoint queries only consider checkpoints at or before the
target evaluation point.

Fix 6 – maybeCached: respect the `before` target. Previously the in-memory
cached checkpoint was returned unconditionally even when it was newer than
the requested slot, blocking fallthrough to the checkpoint search chain.
Uses isLaterThan (strict) so a cached state at exactly the target ordinate
is still accepted as a valid starting point.
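The strict comparison in Fix 6 amounts to the following (sketch with hypothetical helper shapes; the actual patch edits ao-process.js in genesis-wasm-server):

```javascript
// Illustrative: a cached checkpoint is usable only when it is NOT
// strictly later than the requested target; a checkpoint at exactly
// the target ordinate remains a valid starting point.
const LATEST = Symbol("LATEST");

function isLaterThan(target, candidate) {
  return Number(candidate.ordinate) > Number(target.ordinate);
}

function maybeUseCached(cached, before) {
  if (before === LATEST) return cached;              // no pin: newest wins
  if (isLaterThan(before, cached)) return undefined; // too new: fall through
  return cached;                                     // at or before target
}

const atTarget = maybeUseCached({ ordinate: "500" }, { ordinate: "500" });
const tooNew = maybeUseCached({ ordinate: "600" }, { ordinate: "500" });
```

Returning undefined (rather than the too-new checkpoint) is what lets resolution fall through to the file/record/Arweave checkpoint chain.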

Fix 7 – Arweave checkpoint pagination: maybeCheckpointFromArweave previously
fetched only 50 checkpoints sorted HEIGHT_DESC. When the target slot is far
behind the live tip, the relevant checkpoint lies beyond the first page and
genesis-wasm fell back to an expensive cold start. This fix paginates through
Arweave checkpoints (up to 200 pages × 50 = 10,000 checkpoints) until it
finds one with nonce ≤ before.ordinate, then stops.
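Fix 7's loop amounts to the following (sketch with a stubbed page fetcher; the real code issues paged Arweave GraphQL queries):

```javascript
// Illustrative pagination: walk HEIGHT_DESC pages of checkpoints until
// one at or before the target ordinate appears, bounded at 200 pages.
// fetchPage is a stand-in for the paged Arweave GraphQL query.
const MAX_PAGES = 200;
const PAGE_SIZE = 50;

function findCheckpointBefore(fetchPage, targetOrdinate) {
  for (let page = 0; page < MAX_PAGES; page++) {
    const edges = fetchPage(page, PAGE_SIZE); // newest first
    if (edges.length === 0) return undefined; // results exhausted
    const hit = edges.find((cp) => cp.nonce <= targetOrdinate);
    if (hit) return hit; // newest checkpoint <= target: stop paginating
  }
  return undefined; // page cap reached without a usable checkpoint
}

// Stub gateway: checkpoints every 1,000 slots, newest (999,000) first.
const all = Array.from({ length: 1000 }, (_, i) => ({ nonce: 999000 - i * 1000 }));
const fetchPage = (p, n) => all.slice(p * n, p * n + n);
const cp = findCheckpointBefore(fetchPage, 494941);
```

Because results are newest-first, the first checkpoint satisfying nonce ≤ target is also the closest one, so the loop can stop immediately.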

## Erlang / HyperBEAM changes

dev_genesis_wasm: expose PROCESS_CHECKPOINT_TRUSTED_OWNERS env var from the
genesis_wasm_checkpoint_trusted_owners config key, enabling operators to
configure which Arweave wallet addresses' checkpoints are trusted for
historical dryrun resolution.

hb_cache: skip-if-exists for content-addressed LMDB writes. Before writing
a binary or map node, check whether its content-hash path is already present
in the local store. If it is, skip the write (and for maps, skip the
expensive calculate_all_ids HMAC re-signing). This turns O(trie_size) HMAC
operations + LMDB writes per slot into O(changed nodes only), which is
critical for large tries like the ARIO balance map (~1.3 MB, 6,000+
addresses, unchanged between epoch distributions).

dev_process / dev_message / dev_patch / dev_process_lib / dev_trie: add
fine-grained per-slot timing instrumentation (patch sub-phases, message
set/merge, trie inner operations, run-as setup/exec/restore). These metrics
are emitted on the compute_short event and are consumed by Grafana dashboards
to identify per-slot bottlenecks.

docker-compose: expose port 6363 for the genesis-wasm CU endpoint so the
balance comparison tooling can reach it directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>