fix: enable historical dry-run evaluation in genesis-wasm CU#727
Open
atticusofsparta wants to merge 1952 commits intopermaweb:mainfrom
Open
fix: enable historical dry-run evaluation in genesis-wasm CU#727atticusofsparta wants to merge 1952 commits intopermaweb:mainfrom
atticusofsparta wants to merge 1952 commits intopermaweb:mainfrom
Conversation
- Rename Commit variable to CommitRaw to hold raw value - Add conversion step to atomize CommitRaw before use - Ensures Commit is an atom type for downstream processing
Path construction in push_downstream_remote was missing the leading
slash, causing incorrect route paths. Now properly formats as
/{TargetID}/push&slot=...
Chore: store decoded JSON instead of raw JSON string
…lash Fix: Add leading slash to push downstream path
Convert commit value to atom using hb_util:atom
When reading from the local cache via read_local_cache, the code was returning cached values directly without handling sub-paths (Rest). This could cause issues when the cached value was a full message but a sub-path was requested, leading to incorrect partial values being returned. The fix ensures that when reading from cache, we properly handle sub-paths using deep_get, consistent with how we handle sub-paths when reading from the gateway. This prevents returning partial values that could cause errors when writing back to the cache. Fixes issue where cached reads with sub-paths would return full messages instead of the requested sub-path values, which could lead to badmatch errors when processing commitments.
- Add extract_path_value/3 helper to eliminate duplicated sub-path extraction logic - Move try-catch from gateway read into maybe_cache to centralize error handling - Simplify read function using the new helper for both code paths
…th-handlings Fix sub-path handling in gateway store local cache reads
…ans104 impr: use default commitment spec opt on uploads of downstream pushed messages
A testing framework for AO-Core devices and HyperBEAM components built
upon the principles of property-based testing. Rather than testing specific
input and output pairs, `hb_invariant' allows us to instead focus on
defining invariant properties that should hold true for all valid inputs.
`hb_invariant' gives us tools to quickly and easily generate random inputs
(states, requests, node messages, etc.) to our components and then test that
the stated properties hold true for each of them.
## Execution Types.
Executions can come in a variety of forms:
- AO-Core device key relationships: Allowing us to define properties
that should hold true for all `Base`, `Request`, node messages, and their
corresponding `Result` messages.
- AO-Core device state machines: Allowing us to generate random initial
states and sequences of requests, ensuring that a set of properties hold
true at all times.
- Comparisons between two AO-Core device state machines: As above, except
allowing us to define two generators for initial states, such that the
functionality of one device can easily be compared to another. Properties
in such tests receive not only the 'pre' and 'post' states for the primary
state machine, but also the corresponding values for the reference machine.
- Direct Erlang function executions: Possible in each of the above cases,
`hb_invariant' allows us to compute Erlang functions rather than AO-Core
(`ao(Base, Req, Opts)') invocations, if preferred. This allows us to utilize
`hb_invariant' to test HyperBEAM itself, as well as devices resident
inside it.
## Execution Flow.
There are two primary invocation methods for `hb_invariant': `forall/1' and
`state_machine/1'. Because the state machine is sufficiently general to cover
all cases, under-the-hood `forall' is simply a wrapper around `state_machine'
that sets the length of the request sequence to `1'. A consequence of this
is that all invocations are able to utilize the full set of parameters to
control the execution.
The state machine executor always takes a `Specification' message as an
argument, and operates in a series of stages:
```
1. Specification normalization: All non-mandatory fields are filled in with
default values, internal state keys are initialized in the `Spec', and
initial seeding of the PRNG (`rand' module) is performed.
2. Repeat for each of the `Spec/runs' of the state machine:
2.1* Generate a node message (`Opts').
2.2* Generate an initial state (`Base' message) for the execution.
2.3* Generate an initial model state (`Model' message) for the execution, if
applicable.
2.4. For each element of request sequence `Spec/length`:
2.4.1* Generate a request message (`Request' message) for the execution.
2.4.2* Execute the request message against the current state (and model state,
if applicable), resulting in a `Result' message.
2.4.3. For each of the `Spec/properties':
2.4.3.1. Attempt to invoke the property function with the prior state(s), request,
result(s), and options.
2.4.3.2. If the property function returns `true', continue to the next property.
2.4.3.3. If the property function returns `false', fail and return details of
the executed sequence and error encountered.
2.4.3.4. If the property function lacks a function clause matching the call
the failure is ignored. This allows callers to easily define which
states are relevant for a given property simply with patterns and
guards in the function head.
2.4.4. Apply `Spec/next' to the state and model state, if applicable, resulting
in a new state and model state. If no `next' function is provided, the
result of the request stage is used in the next iteration of the loop.
3. Return `ok' if all properties were enforced successfully, otherwise return
details of the executed sequence and the error encountered.
'''
`*' markers above indicate that prior to the execution of a stage, the `rand'
module's PRNG is seeded with a value derived from the global seed (either
provided or generated at start time), the run number, the current request
count, and the current stage. This allows for reproducibility of the execution
sequences. See `Controlled Randomness' below for more details.
## Generators.
`hb_invariant' supports a number of different types of `generators', utilized
to derive each input in execution sequences. Supported generator forms are
as follows:
- Lists: Lists of generators of other forms, from which one one member is
randomly selected and executed as if it was provided directly.
- Functions: Arbitrary Erlang functions, invoked with a specific set of
arguments depending on the type of generator and the context.
- Explicit values: A simple constant value or message, used without execution.
Generators of these forms may be provided by the caller for each of the
keys listed below. Their names and function signatures are as follows:
- `opts(Spec)': A generator for the node message to use for a `run' of the
state machine.
- `Spec/states': A generator for initial (`Base') states, executed per `run'.
- `Spec/models': A generator for initial _model_ states, executed per `run'.
- `Spec/requests': Generator of `Request's in the state transformation sequence.
In all cases aside `Spec/requests', the generator is optional, using a
default value if not provided. Without a `requests' generator, no sensible
state transformation sequence can be generated. Subsequently, execution is
aborted with an error.
## Controlled Randomness.
In order to assist in the creation of generators and properties for
`hb_invariant', a number of helper functions are provided to quickly and
easily generate random inputs of a given type. `hb_invariant' seeds Erlang's
`rand' module with a value derived from a provided global seed, or a unique
value per invocation of the state machine executor. In event of errors, the
initial global seed is provided to the user such that issues that arose may
be reproduced.
Value generators for the following types are provided:
- `int/0': Generate a random integer between 0 and the maximum 'small'
(non-bignum) integer value.
- `int/1': Generate a random integer between 0 and the given maximum value.
- `int/2': Generate a random integer between the given values.
- `float/0': Generate a random float between 0 and the maximum float value.
- `float/1': Generate a random float between 0 and the given maximum value.
- `string/0': Generate a random string of a given length.
- `string/1': Generate a random string of a given length.
- `string/4': Generate a random string of a given length, with a give
minimum and maximum character values, and a list of forbidden characters.
feat: Invariant-based testing framework for AO-Core and HyperBEAM.
A small fix that corrects handling of `key_to_atom` in AO-Core's device key matching. This allows us to export device keys using HTTP `Title-Case` form, avoiding the need to use inconsistently formed endpoints (e.g `set_weight`).
fix: Allow exporting device keys with `-`.
fix: explicitly overwrite dedup trie rather than set over keys
fix: Link signed ID to `~scheduler@1.0` assignment pseudo-path.
impr: allow addition of user defined salt to `Nearest` routing strategy
fix: remove `commitment-ids` from recursive ID verification calls
impr: add `hashpath` as additional salt to `Nearest` strategy
This commit enables the use of inheritance of key resolvers recursively between devices. A device may now specify the `info/default` key in the form of a device name or module which the AO-Core engine will use to resolve keys that do not match in the primary device. AO-Core already performed a similar operation while defaulting from a given device to `message@1.0` when a key did not match, while this commit opens up that system to allow for standard OOP-style inheritance structures.
feat: Device Inheritance
feat: parallel execution store implementation
…etheus_metrics impr: Increase cowboy last bucket to more than 4
feat: introduce `on/request` name resolver and simplify `~name@1.0` API
…me deal as Arweave: No central enforcement, everyone chooses their own policy.
feat: Introduce `~blacklist@1.0` content policy device
…kers impr: Add global workers
Forgot to add the collector.
fix: Ranch prometheus metrics
- hb_store_lmdb: wrap elmdb:get/put with timer:tc; accumulate per-slot
read_count, read_us, write_count, write_us in process dictionary via
take_stats/0; push cumulative counters to Prometheus via hb_event
- hb_cache: add timed write_key clauses for <<"dedup">> and <<"balances">>
keys; expose take_cache_stats/0 so dev_process can read serialization time
- dev_process: reset LMDB accumulators before store_result; snapshot
dedup_entries/bytes and balances_entries/bytes from process state;
emit all new fields in compute_short event:
lmdb_reads, lmdb_read_us, lmdb_writes, lmdb_write_us,
dedup_entries, dedup_bytes, dedup_write_us,
balances_entries, balances_bytes, balances_write_us
- Add Loki/Promtail/Grafana monitoring stack with docker-compose.yml;
Grafana dashboard with per-action execution, LMDB, and trie panels
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ication
- Drop container_id from Promtail relabel_configs so container restarts
no longer fan out into separate Loki streams. The label had 4 distinct
values (one per restart), causing every panel that queries
{container="hyperbeam"} to return multiple series.
- Wrap all single-series stat/timeseries panel queries in max() or avg()
so they correctly collapse across any remaining stream dimensions
(action, event_type) into one series per metric.
- Commit the Grafana dashboard JSON (was volume-mounted but untracked).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Split the single combined LMDB stat bucket into two phase-separated
buckets (exec vs store) so execution overhead can be attributed correctly.
Also time the genesis-wasm HTTP roundtrip (wasm_cu_ms) via process dict
to expose the true CU compute time vs serialization/hb_ao overhead.
- dev_delegated_compute: wrap do_relay in timer:tc, store as wasm_cu_us
- dev_process: take_stats after run_as (exec phase), then again after
store_result (store phase); read wasm_cu_us from process dict
- compute_short event now emits: wasm_cu_ms, exec_lmdb_{reads,read_us,
writes,write_us}, store_lmdb_{reads,read_us,writes,write_us}
- Grafana dashboard: update all LMDB panel queries to new field names,
add store variants to panels 11/12, add wasm_cu_ms series to panel 10
- docker-compose.yml: add build section so docker compose build works
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New timeseries panel (id 34) at y=65, x=0, w=12 showing avg wasm_cu_ms broken down by action label — lets you see how much of execution time is actual genesis-wasm compute per action type vs overhead. Shifted Compute Errors (y=65→73) and Live Slot Log (y=71→79) to make room. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ipeline Add timed_normalize_keys/2 wrapper and take_normalize_stats/0 to hb_ao so per-slot normalize_keys call count and wall-clock µs are tracked in the process dictionary during slot computation. dev_process:compute_slot now calls hb_ao:take_normalize_stats() alongside the existing hb_cache and hb_store_lmdb stat collectors, and emits two new fields in the computed_slot log event: - normalize_keys_us total µs spent in normalize_keys across the slot - normalize_keys_count number of normalize_keys invocations (consistently 463) Profiling result: normalize_keys costs ~5 ms/slot (~0.3% of execution_ms), ruling it out as the source of the unaccounted ~790 ms overhead. Documented in docs/misc/performance-analysis.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…faster) Replace the O(trie_size) dedup implementation in dev_dedup.erl with O(1) flat LMDB key-value reads/writes. The old approach committed the entire dedup trie (HMAC re-sign + write all nodes) on every slot; the new approach writes a single key `dedup/<ProcID>/<SubjectID>` per new message. Measured results on a live node: - dedup_phase_ms: 1500ms → 15ms (88× speedup) - exec_lmdb_reads: 1400 → 279 (80% reduction) - exec_lmdb_writes: 700 → 40 (94% reduction) - patch_phase_ms: restored from 1700ms to 80ms (old trie was bloating M1) Key design points: - Namespace: dedup key includes process ID (from M1.process, stable across all slots) to avoid cross-process collisions. - Recursion guard: process-dictionary flag prevents infinite recursion when resolving the process key re-enters the dedup device handler. - Fallback: returns no_store when no LMDB store is available (unit tests, stack messages without a process key) → falls back to original trie path. - Migration: on first flat LMDB write, checks old trie in M1 for already- seen subjects. Strips old trie from M1 to eliminate state bloat. Also add delegated_phase_ms timing instrumentation to dev_genesis_wasm.erl and dev_process.erl to measure the genesis-wasm RPC phase per slot. Fix dev_delegated_compute.erl: decode JSON binary before passing to dev_json_iface:json_to_message (was passing raw binary, now passes decoded Erlang map). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two micro-optimisations that together halve run_as_setup_ms overhead
(253ms → ~80ms for a ~7MB process state):
1. hb_cache_control: when `hashpath => ignore` is set in Opts,
also force `lookup => false` in the cache-control map.
Previously only `store => false` was forced, so stage 2 of
hb_ao:resolve still called read_hashpath(State) →
dev_message:id(10MB_state) on every resolve — even when the
caller explicitly declared the hashpath irrelevant.
All three do_compute phases (dedup, delegated, patch) use
`hashpath => ignore`, so each slot was paying this cost 3×.
2. hb_persistent: short-circuit find_or_register to
{leader, ungrouped_exec} when both await_inprogress and
spawn_worker are false (the common case).
Previously group(Base, Req, Opts) was always called first,
which invokes default_grouper → erlang:phash2 over the full
process state — wasted work since the result was discarded
immediately when await_inprogress=false.
Post-fix baseline (~7MB state):
run_as_setup_ms: 8–150ms (was ~253ms for 10MB state)
exec_lmdb_reads: 14 (stable)
wasm_cu_ms: 80–85% of execution_ms (now the dominant bottleneck)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous test expected hashpath=>ignore to only force store=>false, but the fix in hb_cache_control also forces lookup=>false. Update the test name and assertion to reflect the correct intended behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new stat panel showing estimated time to catch up with the current slot lag, displayed as a human-readable duration (e.g. "5d 21h"). - Shrinks top row stat panels from w=4 to w=3, freeing space for ETA - ETA = slot_lag / slots_per_min * 60 (seconds), unit: dtdurations - Uses Grafana expression pipeline: two hidden Loki queries reduced to scalars, divided by a Math expression to avoid LogQL binary-op limits Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes genesis-wasm's handling of dryrun requests pinned to a historical slot via ?to=<slot>. Previously the CU always loaded the newest available checkpoint (LATEST), making historical evaluation impossible. ## Dockerfile patches (applied to genesis-wasm-server at build time) Fix 3 – loadMessageMeta: allow fetching from scheduler when the local message body is absent/invalid, instead of throwing unconditionally. Required for ?to=<slot> dryruns to resolve message metadata when the node hasn't stored every historical message body locally. Fix 4 – findLatestProcessMemory (maybeFile / maybeRecord): propagate the `before` ordinate into findFileCheckpointBefore / findRecordCheckpointBefore instead of hardcoding LATEST. Without this, the checkpoint search always returned the newest file/record checkpoint rather than the one at or before the requested slot. Fix 5 – maybeCheckpointFromArweave / determineLatestCheckpoint: add a `before` parameter (default LATEST for backward compat) and pass it through so Arweave checkpoint queries only consider checkpoints at or before the target evaluation point. Fix 6 – maybeCached: respect the `before` target. Previously the in-memory cached checkpoint was returned unconditionally even when it was newer than the requested slot, blocking fallthrough to the checkpoint search chain. Uses isLaterThan (strict) so a cached state at exactly the target ordinate is still accepted as a valid starting point. Fix 7 – Arweave checkpoint pagination: maybeCheckpointFromArweave previously fetched only 50 checkpoints sorted HEIGHT_DESC. When the target slot is far behind the live tip, the relevant checkpoint lies beyond the first page and genesis-wasm fell back to an expensive cold start. This fix paginates through Arweave checkpoints (up to 200 pages × 50 = 10,000 checkpoints) until it finds one with nonce ≤ before.ordinate, then stops. ## Erlang / HyperBEAM changes dev_genesis_wasm: expose PROCESS_CHECKPOINT_TRUSTED_OWNERS env var from the genesis_wasm_checkpoint_trusted_owners config key, enabling operators to configure which Arweave wallet addresses' checkpoints are trusted for historical dryrun resolution. hb_cache: skip-if-exists for content-addressed LMDB writes. Before writing a binary or map node, check whether its content-hash path is already present in the local store. If it is, skip the write (and for maps, skip the expensive calculate_all_ids HMAC re-signing). This turns O(trie_size) HMAC operations + LMDB writes per slot into O(changed nodes only), which is critical for large tries like the ARIO balance map (~1.3 MB, 6 000+ addresses, unchanged between epoch distributions). dev_process / dev_message / dev_patch / dev_process_lib / dev_trie: add fine-grained per-slot timing instrumentation (patch sub-phases, message set/merge, trie inner operations, run-as setup/exec/restore). These metrics are emitted on the compute_short event and are consumed by Grafana dashboards to identify per-slot bottlenecks. docker-compose: expose port 6363 for the genesis-wasm CU endpoint so the balance comparison tooling can reach it directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
?to=<slot>dryrun evaluation in genesis-wasm — previously the CU always resolved the newest checkpoint (LATEST), making any request pinned to a historical slot fall back to an expensive cold start or return wrong resultsPROCESS_CHECKPOINT_TRUSTED_OWNERSconfig — operators can now set which Arweave wallet addresses' checkpoints are trusted for historical dryrun resolution viagenesis_wasm_checkpoint_trusted_ownersinhb.confighb_cache) — avoids re-writing map/binary nodes already present in LMDB; turns O(trie_size) HMAC re-sign + write cost per slot into O(changed nodes only), critical for large tries like ARIO's balance map (~1.3 MB, 6,000+ addresses)compute_shortfor Grafana dashboardsDockerfile patches (applied to genesis-wasm-server at build time)
hb/index.jsloadMessageMeta: fetch from scheduler when body absent/invalid instead of throwingao-process.jsmaybeFile/maybeRecord: propagatebeforeordinate instead of hardcodingLATESTao-process.jsmaybeCheckpointFromArweave/determineLatestCheckpoint: passbeforethrough Arweave queryao-process.jsmaybeCached: reject cached checkpoint when it is strictly newer than the requested slotao-process.jsbefore !== LATESTValidation
Tested against the ARIO process (
qNvAoz0TgcH7DMg8BCVn8jF32QH5L6T29VjHxhHqqGE):Before these fixes, the slot-495,059 dryrun fell back to a cold start because the nearest Arweave checkpoint (nonce 494,941) was beyond the first 50 Arweave results. Fix 7 found it after 2 paginated pages and evaluated only 118 messages to reach the target slot.
Test plan
rebar3 as genesis_wasm releaseand verify all 7RUN node -epatches apply cleanly (no string-match failures)GET /<process-id>~process@1.0/compute/balances/<address>?slot=<historical>— confirm a value is returned without cold-startinghb_store:typeshort-circuit inhb_cache:do_write_messagedoes not break any existing cache tests:rebar3 eunit --module=hb_cachegenesis_wasm_checkpoint_trusted_ownerskey is picked up fromhb.configand forwarded to genesis-wasm asPROCESS_CHECKPOINT_TRUSTED_OWNERS🤖 Generated with Claude Code