feat: new GH smoke test by cdev412 · Pull Request #165 · princeton-pli/hal-harness

cdev412 · 2026-03-19T22:39:35Z

No description provided.

cdev412 · 2026-03-22T08:24:35Z

Fixes:

    # FIXME: this doesn't respect the --max_tasks flag
    # Expected: --max_tasks 2 only downloads 2 capsules
    # Actual: --max_tasks 2 downloads all capsules, then runs only 2
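
The expected behavior in the FIXME comes down to ordering: truncate the task list before the download loop rather than after it. A minimal sketch, assuming the loader sees the full task list before fetching anything; `load_capsules`, `download_capsule`, and the `capsule_id` key are hypothetical names, not the harness's actual API.

```python
from typing import Callable, Optional


def load_capsules(
    tasks: list[dict],
    download_capsule: Callable[[str], None],
    max_tasks: Optional[int] = None,
) -> list[dict]:
    """Truncate the task list first, then download only what will run."""
    if max_tasks is not None:
        tasks = tasks[:max_tasks]
    for task in tasks:
        download_capsule(task["capsule_id"])
    return tasks


# Usage: a fake downloader that records which capsules were requested.
requested: list[str] = []
all_tasks = [{"capsule_id": f"capsule-{i}"} for i in range(5)]
selected = load_capsules(all_tasks, requested.append, max_tasks=2)
print(requested)  # ['capsule-0', 'capsule-1']
```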

cdev412 · 2026-03-22T08:42:59Z

Fix:

"hal/utils/virtual_machine_runner.py
In process_task, the broad except used to log and return {task_id: "ERROR: ..."}, so hal-eval kept going, ran evaluation, and could still log “Evaluation completed successfully”.

It now logs, then re-raises a RuntimeError chained from the original exception. The existing finally still runs first, so delete_virtual_machine_by_name is still called for that vm_name (same as before), then the error propagates:

asyncio.gather fails
AgentRunner.run / asyncio.run in the CLI raises
hal/cli.py hits the outer handler and sys.exit(1) — no success line.
So Azure issues like InvalidResourceReference are treated as fatal for the run, while cleanup is still attempted in finally.

Unchanged: paths that return a value without raising (e.g. timeout string, missing output.json warning dict) still behave as before."
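
The log, re-raise, then clean-up ordering described above can be sketched as follows. This is a synchronous simplification under stated assumptions: `run_on_vm` and the recording version of `delete_virtual_machine_by_name` are stand-ins written for this example, not the code in hal/utils/virtual_machine_runner.py.

```python
import logging

logger = logging.getLogger(__name__)
deleted_vms: list[str] = []


def delete_virtual_machine_by_name(vm_name: str) -> None:
    """Stand-in for the real cleanup call; records the name only."""
    deleted_vms.append(vm_name)


def run_on_vm(task_id: str) -> dict:
    """Stand-in for provisioning and running; always fails here."""
    raise ValueError("InvalidResourceReference")


def process_task(task_id: str, vm_name: str) -> dict:
    try:
        return run_on_vm(task_id)
    except Exception as error:
        logger.error("Task %s failed: %s", task_id, error)
        # Re-raise instead of returning {task_id: "ERROR: ..."} so the run fails.
        raise RuntimeError(f"Task {task_id} failed on {vm_name}") from error
    finally:
        # Runs before the RuntimeError propagates to the caller.
        delete_virtual_machine_by_name(vm_name)


try:
    process_task("task-1", "vm-task-1")
except RuntimeError as error:
    print(type(error.__cause__).__name__, deleted_vms)
```

The caller sees a RuntimeError whose `__cause__` is the original exception, and the VM name has already been passed to the cleanup function by the time the error arrives.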

example failing version: https://github.com/princeton-pli/hal-harness/actions/runs/23399407657/attempts/1

cdev412 · 2026-03-22T18:33:37Z

fixed issue of previous run still loading: new commit blocks old PR from loading

cdev412 · 2026-03-22T21:29:23Z

Open questions:

does the entire file need to be copied over at the end of the run (maybe it could be mocked)? A: yes, done; the run went from 21m 59s to 18m 13s
should we eliminate some steps in the e2e test or add this as an argument?
should we choose a different option/type of run instead of corebench-hard (definitely not easy, to save time)?
Ask Andrew if the infra deploy is faster
do we want to have the corebench/benchmark files be set up on the localhost, then ported to the VM? Or are we maybe OK with having the VM have the files dumped on it directly? Pros: save cloud costs, save time; Cons: less observable, more complexity

Local runs (my internet is slow...)

can you summarize the time spent in each stage in results/corebench_easy/corebench_easy_coreagent_1774211987 ?

Here is a stage-by-stage breakdown from timestamps in corebench_easy_coreagent_1774211987.log (durations are wall-clock between adjacent log lines unless noted).

Stage	Start (approx.)	End (approx.)	Duration
Harness init (logging, CoreBench load, Weave init)	14:39:47.5	14:39:56.5	~9 s
Azure VM provision (SDK create until “created at …”)	14:39:56.5	14:40:28.1	~32 s
First-boot / cloud-init (wait until startup_complete)	14:40:28.1	14:43:26.9	178 s (~2m 58s), as logged
Brief gap before copy	14:43:26.9	14:43:29.7	~3 s
Temp task payload: local tar.gz	14:43:29.7	14:49:21.2	~352 s (~5m 52s)
Temp payload: SFTP upload (~709 MB)	14:49:21.2	15:13:00.6	~1,419 s (~23m 40s)
Temp payload: extract on VM	15:13:00.6	15:13:13.2	~13 s
Agent directory: tar + upload + extract	15:13:13.7	15:13:18.0	~4 s
.env + remote setup_vm.sh	15:13:20.2	15:15:10.5	~110 s (~1m 50s)
Agent run (exec dispatched → “task completed”)	15:15:10.5	15:15:49.5	~39 s
Copy results from VM (tar/get/extract locally)	15:15:49.5	15:31:37.5	~948 s (~15m 48s)
VM delete	15:31:37.5	15:32:21.4	~44 s
Evaluate / process / Weave fetch / finish	15:32:21.4	15:32:26.7	~5 s
End-to-end from first log line to “Evaluation completed successfully”: ~52m 40s (14:39:47 → 15:32:27).
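
The durations above are differences between adjacent timestamps. A small sketch of that arithmetic, using three rows copied from the table; the helper name `seconds_between` is illustrative, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Wall-clock seconds between two same-day HH:MM:SS.f timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


stages = [
    ("Temp payload: local tar.gz", "14:43:29.7", "14:49:21.2"),
    ("Temp payload: SFTP upload", "14:49:21.2", "15:13:00.6"),
    ("Copy results from VM", "15:15:49.5", "15:31:37.5"),
]
for name, start, end in stages:
    print(f"{name}: {seconds_between(start, end):.1f} s")
```

The three results round to the ~352 s, ~1,419 s, and ~948 s shown in the table.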

Dominant costs in this run: transferring the ~709 MB (~676 MiB) task tarball up to the VM (~24 min) and pulling results down from the VM (~16 min). Together those two transfers account for about 39 of the ~53 minutes of wall time. Cloud-init (~3 min), local tarball build (~6 min), and setup_vm.sh (~2 min) are smaller but still material.

Inside the agent (from the trace), logged step times sum to about 23 s (steps 1–4: 10.46 + 1.19 + 8.93 + 2.26 s); the extra ~16 s versus the ~39 s “agent run” bucket is startup, polling, and logging overhead before the first poll sees completion.

Note: The summary at the end reports failed_tasks: 1 even though the harness finished; that reflects evaluation/scoring, not that the VM agent failed to produce an answer.

and in results/corebench_hard/corebench_hard_coreagent_1774211360?

Here is the same style of breakdown for results/corebench_hard/corebench_hard_coreagent_1774211360/corebench_hard_coreagent_1774211360.log, from the harness/VM log lines at the start and end of the run.

Stage	Start (approx.)	End (approx.)	Duration
Harness init (logging, CoreBench, Weave)	14:29:20.8	14:29:27.7	~7 s
Azure VM provision (create until “created at …”)	14:29:27.7	14:29:54.4	~27 s
First-boot / cloud-init	14:29:54.4	14:33:08.3	193 s (~3m 13s), as logged
Brief gap before file copy	14:33:08.3	14:33:09.0	~1 s
Temp task payload: local tar.gz	14:33:09.0	14:34:45.5	~97 s (~1m 37s)
Temp payload: SFTP upload (~178 MB compressed)	14:34:45.5	14:44:54.8	~609 s (~10m 9s)
Temp payload: extract on VM	14:44:54.8	14:44:58.3	~3.5 s
Agent directory: tar + upload + extract	14:44:58.6	14:45:01.7	~3 s
.env + remote setup_vm.sh	14:45:03.7	14:46:40.3	~97 s (~1m 37s)
Agent run (start → “task completed”)	14:46:40.3	14:55:46.9	~547 s (~9m 7s)
Copy results from VM	14:55:46.9	15:02:09.8	~383 s (~6m 23s)
VM delete	15:02:09.8	15:02:53.7	~44 s
Evaluate / process / finish	15:02:53.7	15:03:01.3	~8 s
End-to-end (first log line → “Evaluation completed successfully”): ~33m 41s (14:29:20 → 15:03:01).

Compared to corebench_easy: hard used a smaller upload (~187 MB, which matches the table's ~178 figure if that one is read as MiB, vs ~709 MB), so upload + download were much shorter (~10m + ~6m vs ~24m + ~16m). Most of the wall time in this run is the agent itself (~9m) plus VM transfer/setup; easy was dominated by large transfers in both directions despite a short agent phase.

cdev412 · 2026-03-23T15:15:55Z

Ideas for reducing time to run:

Reduce the number of steps allotted to the test
Prebuild the ZIP file so it is transferred directly from the provider to the VM, instead of from provider to local and then from local to VM
Use smaller benchmark
Warm/hot VMs for testing
Run only when ready to review, not on every commit

Current e2e components measured:

all of the pipeline deploys, including wandb runs
output JSON matches schema
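
One way the "output JSON matches schema" check could look, as a stdlib-only sketch. The schema and field names below are hypothetical placeholders, not the contents of the PR's schema file, and the checker covers only `required` and top-level property `type`; a full validator such as the third-party `jsonschema` package would handle the rest of the JSON Schema vocabulary.

```python
import json

# JSON Schema type names mapped to Python types (top-level checks only).
JSON_TYPES = {"string": str, "number": (int, float), "object": dict, "array": list}


def schema_errors(document: dict, schema: dict) -> list[str]:
    """Return human-readable violations of `required` and property `type`."""
    errors = []
    for key in schema.get("required", []):
        if key not in document:
            errors.append(f"missing required key: {key}")
    for key, rule in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(rule.get("type"))
        if key in document and expected and not isinstance(document[key], expected):
            errors.append(f"{key}: expected {rule['type']}")
    return errors


# Hypothetical schema and output, for illustration only.
schema = {
    "required": ["config", "results"],
    "properties": {"config": {"type": "object"}, "results": {"type": "object"}},
}
output = json.loads('{"config": {"agent_name": "demo"}, "results": []}')
print(schema_errors(output, schema))  # ['results: expected object']
```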

TODO:

add tags for cost measurement
see how HuggingFace uploads might be useful to integrate here

schwartzadev

lgtm with comments - please add tests for new functionality and review the comments

nice work

schwartzadev · 2026-03-23T17:38:31Z

.github/workflows/e2e.yml

+    steps:
+      - uses: actions/checkout@v4
+
+      # Warm capsules avoid re-downloading from corebench.cs.princeton.edu every run.


please uncomment or delete

schwartzadev · 2026-03-23T17:40:33Z

docs/EXECUTION_FLOW.md

schwartzadev · 2026-03-23T17:43:15Z

hal/benchmarks/corebench.py

+            raise TypeError("core_test.json must contain a JSON array of tasks")
+
+        # Match hal-eval --max_tasks: only download/load capsules we will run (saves CI time).
+        if max_tasks is not None and max_tasks > 0 and max_tasks < len(dataset):


nit: this should just be if max_tasks is not None

with the max_tasks > 0 guard, max_tasks = 0 is silently ignored (expected: returns empty array; actual: returns the full array)
max_tasks < len(dataset) is redundant, see:

>>> l = [1,2,3,4,5,6]
>>> l[:0]
[]
>>> l[:10]
[1, 2, 3, 4, 5, 6]
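
A sketch of the simplified guard the review asks for; `limit_tasks` is an illustrative name, not a function from the PR. Note that a negative value slices from the end, a case the thread does not discuss, so callers may want to reject negative values upstream.

```python
from typing import Optional


def limit_tasks(dataset: list, max_tasks: Optional[int]) -> list:
    """Keep at most max_tasks items; None means no limit."""
    if max_tasks is not None:
        return dataset[:max_tasks]
    return dataset


tasks = ["a", "b", "c"]
print(limit_tasks(tasks, None))  # ['a', 'b', 'c']
print(limit_tasks(tasks, 0))     # []
print(limit_tasks(tasks, 10))    # ['a', 'b', 'c']
print(limit_tasks(tasks, -1))    # ['a', 'b'] (slices from the end)
```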

schwartzadev · 2026-03-23T17:44:12Z

hal/utils/virtual_machine_manager.py

+
+def _vm_env_var_from_host(name: str) -> str | None:
+    """Resolve a secret from the host process env or local .env (for VM payload files)."""
+    v = os.environ.get(name, "").strip()


please use more verbose variable names

schwartzadev · 2026-03-23T17:44:18Z

hal/utils/virtual_machine_manager.py

+        if raw:
+            s = str(raw).strip()
+            if s:
+                return s


please use more verbose variable names
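
One possible renaming for the two snippets above, sketched as a self-contained function. The .env parsing is a simplified stand-in (the diff does not show how the real code reads the file), the `dotenv_path` parameter is added only so the example can be run in isolation, and names such as `environment_value` are suggestions rather than the names adopted in the PR.

```python
import os
from pathlib import Path
from typing import Optional


def read_dotenv_values(dotenv_path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; a stand-in for a real .env parser."""
    values: dict[str, str] = {}
    if not dotenv_path.is_file():
        return values
    for line in dotenv_path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def vm_env_var_from_host(
    variable_name: str, dotenv_path: Path = Path(".env")
) -> Optional[str]:
    """Resolve a secret from the host process env, then from a local .env file."""
    environment_value = os.environ.get(variable_name, "").strip()
    if environment_value:
        return environment_value
    raw_dotenv_value = read_dotenv_values(dotenv_path).get(variable_name)
    if raw_dotenv_value:
        stripped_dotenv_value = str(raw_dotenv_value).strip()
        if stripped_dotenv_value:
            return stripped_dotenv_value
    return None


# Usage with a made-up variable name: the process environment wins.
os.environ["EXAMPLE_API_KEY"] = "  secret-from-env  "
print(vm_env_var_from_host("EXAMPLE_API_KEY"))  # secret-from-env
```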

schwartzadev · 2026-03-23T17:44:52Z

hal/utils/virtual_machine_manager.py

                f"/home/agent/{os.path.basename(destination_directory)}_back.tar.gz"
            )
            remote_home_directory = "/home/agent"
+            exclude = ""


Suggested change

exclude = ""

exclude_command_line_argument_string = ""

schwartzadev · 2026-03-23T17:45:57Z

hal/utils/virtual_machine_manager.py

                # Write run-specific env vars for static run_agent.py
                run_agent_env = f"RUN_ID={run_id}\nAGENT_FUNCTION={agent_function}\nTASK_ID={task_id}\n"
+                for name in _VM_RUN_AGENT_SECRET_NAMES:
+                    val = _vm_env_var_from_host(name)


nit: rename to 'secret_value' or similar

schwartzadev · 2026-03-23T17:47:29Z

tests/gh_actions/corebench_hard_upload.schema.json

what's this file doing? add a readme if this should be included in git

schwartzadev · 2026-03-23T17:48:01Z

tests/gh_actions/verify_corebench_hard_e2e.sh

i think this should be in .github?

feat: new GH smoke test

21c8586

cdev412 had a problem deploying to hal-corebench-vm March 19, 2026 22:39 — with GitHub Actions Failure

fix: new run

bc29944

cdev412 had a problem deploying to hal-corebench-vm March 20, 2026 20:10 — with GitHub Actions Failure

feat: added new auth

0b06a94

cdev412 had a problem deploying to hal-corebench-vm March 20, 2026 21:01 — with GitHub Actions Failure

feat: new auth

d8c46cf

cdev412 had a problem deploying to hal-corebench-vm March 20, 2026 22:14 — with GitHub Actions Failure

feat: new auth

943bf6a

cdev412 had a problem deploying to hal-corebench-vm March 20, 2026 22:15 — with GitHub Actions Failure

feat: new auth

c36db3b

cdev412 had a problem deploying to test-environment March 20, 2026 22:17 — with GitHub Actions Failure

redo pipeline

9427c79

cdev412 had a problem deploying to test-environment March 22, 2026 07:58 — with GitHub Actions Failure

fix: decrypt

31926d5

cdev412 temporarily deployed to test-environment March 22, 2026 08:10 — with GitHub Actions Inactive

feat: number of max tasks respected

2eb4520

cdev412 temporarily deployed to test-environment March 22, 2026 08:26 — with GitHub Actions Inactive

fix: failed silently

a595333

cdev412 had a problem deploying to test-environment March 22, 2026 08:42 — with GitHub Actions Failure

cdev412 had a problem deploying to test-environment March 22, 2026 17:36 — with GitHub Actions Failure

cdev412 had a problem deploying to test-environment March 22, 2026 17:40 — with GitHub Actions Failure

cdev412 had a problem deploying to test-environment March 22, 2026 17:57 — with GitHub Actions Error

enabled claude code access to env var

e71506e

cdev412 had a problem deploying to test-environment March 22, 2026 17:59 — with GitHub Actions Failure

cdev412 temporarily deployed to test-environment March 22, 2026 17:59 — with GitHub Actions Inactive

fix venv pull API keys

785cef6

cdev412 had a problem deploying to test-environment March 22, 2026 19:09 — with GitHub Actions Failure

cdev412 had a problem deploying to test-environment March 22, 2026 19:09 — with GitHub Actions Error

fix missing API key, fix max tasks

fce3cc7

cdev412 had a problem deploying to test-environment March 22, 2026 20:32 — with GitHub Actions Failure

cdev412 temporarily deployed to test-environment March 22, 2026 20:32 — with GitHub Actions Inactive

tests for end of run

ba3cc81

cdev412 had a problem deploying to test-environment March 22, 2026 21:35 — with GitHub Actions Failure

cdev412 temporarily deployed to test-environment March 22, 2026 21:35 — with GitHub Actions Inactive

feat: stop downloading the environment if not needed

121bb56

cdev412 had a problem deploying to test-environment March 22, 2026 22:04 — with GitHub Actions Failure

cdev412 temporarily deployed to test-environment March 22, 2026 22:04 — with GitHub Actions Inactive

cdev412 added 2 commits March 23, 2026 09:12

fix: ruff format

402e15b

Merge branch 'main' of https://github.com/princeton-pli/hal-harness into rob/ae-32-id-most-important-parts-of-the-code-base-to-test-and-test

41dde32

cdev412 temporarily deployed to test-environment March 23, 2026 15:13 — with GitHub Actions Inactive

cdev412 requested a review from schwartzadev March 23, 2026 15:16

schwartzadev approved these changes Mar 23, 2026

Merge branch 'main' into rob/ae-32-id-most-important-parts-of-the-code-base-to-test-and-test

d39bada

cdev412 deployed to test-environment April 1, 2026 15:15 — with GitHub Actions Active
