feat: new GH smoke test #165

Open
cdev412 wants to merge 18 commits into main from rob/ae-32-id-most-important-parts-of-the-code-base-to-test-and-test

Collaborator

cdev412 commented Mar 19, 2026

No description provided.

@cdev412 cdev412 temporarily deployed to test-environment March 22, 2026 08:10 — with GitHub Actions Inactive
Collaborator Author

cdev412 commented Mar 22, 2026

Fixes:

    # FIXME: this doesn't respect the --max_tasks flag
    # Expected: --max_tasks 2 only downloads 2 capsules
    # Actual: --max_tasks 2 downloads all capsules, then runs only 2
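
One way to get the expected behaviour is to truncate the task list before any capsule is fetched. The sketch below is illustrative only: `select_tasks`, `load_capsules`, the `download` callback, and the `capsule_id` field are hypothetical names, not the harness's actual code.

```python
from typing import Callable, List, Optional


def select_tasks(tasks: List[dict], max_tasks: Optional[int]) -> List[dict]:
    """Return only the tasks that will actually run (hypothetical helper)."""
    if max_tasks is None:
        return tasks
    return tasks[:max_tasks]


def load_capsules(
    tasks: List[dict],
    max_tasks: Optional[int],
    download: Callable[[str], None],
) -> List[dict]:
    """Select first, then download, so unused capsules are never fetched."""
    selected = select_tasks(tasks, max_tasks)
    for task in selected:
        download(task["capsule_id"])  # "capsule_id" is an assumed field name
    return selected


# Record which capsules get "downloaded" instead of touching the network.
downloaded: List[str] = []
tasks = [{"capsule_id": f"capsule-{index}"} for index in range(5)]
selected = load_capsules(tasks, 2, downloaded.append)
print(downloaded)
```

With `max_tasks=2`, only the first two capsules reach the `download` callback; the other three are never requested.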

@cdev412 cdev412 temporarily deployed to test-environment March 22, 2026 08:26 — with GitHub Actions Inactive
Collaborator Author

cdev412 commented Mar 22, 2026

Fix:

"hal/utils/virtual_machine_runner.py
In process_task, the broad except used to log the error and return {task_id: "ERROR: ..."}, so hal-eval kept going, ran evaluation, and could still log “Evaluation completed successfully”.

It now logs the error, then raises a RuntimeError chained from the original exception. The existing finally block still runs before the error leaves process_task, so delete_virtual_machine_by_name is still called for that vm_name (same as before). The error then propagates:

  • asyncio.gather fails
  • AgentRunner.run / asyncio.run in the CLI raises
  • hal/cli.py hits the outer handler and calls sys.exit(1), so no success line is logged

So Azure issues like InvalidResourceReference are now treated as fatal for the run, while cleanup is still attempted in finally.

Unchanged: paths that return a value without raising (e.g. the timeout string, the missing output.json warning dict) still behave as before."
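
The log, re-raise, and clean-up ordering described above can be sketched as follows. This is a simplified, synchronous stand-in rather than the real `process_task`: the `run_on_vm` and `delete_vm` callbacks and the `vm-<task_id>` naming are assumptions made for the example.

```python
import logging

logger = logging.getLogger("vm_runner_sketch")


def process_task(task_id, run_on_vm, delete_vm):
    """Run one task; on failure, log and re-raise, but always clean up the VM."""
    vm_name = f"vm-{task_id}"  # assumed naming scheme, for illustration only
    try:
        return {task_id: run_on_vm(vm_name)}
    except Exception as error:
        logger.error("Task %s failed on %s: %s", task_id, vm_name, error)
        # Chain the original exception so the traceback keeps the root cause.
        raise RuntimeError(f"Task {task_id} failed on {vm_name}") from error
    finally:
        # Runs on success and on failure, before the error reaches the caller.
        delete_vm(vm_name)


def failing_run(vm_name):
    raise ValueError("InvalidResourceReference")


deleted = []
try:
    process_task("t1", failing_run, deleted.append)
    outcome = "returned"
except RuntimeError as error:
    outcome = "raised"
    cause = error.__cause__
```

Here the caller sees a `RuntimeError` whose `__cause__` is the original exception, and `deleted` already contains the VM name by the time the error arrives.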

Example of a failing run: https://github.com/princeton-pli/hal-harness/actions/runs/23399407657/attempts/1

@cdev412 cdev412 temporarily deployed to test-environment March 22, 2026 17:59 — with GitHub Actions Inactive
Collaborator Author

cdev412 commented Mar 22, 2026

Fixed the issue of the previous run still loading: a new commit now blocks the old PR run from loading.

@cdev412 cdev412 temporarily deployed to test-environment March 22, 2026 20:32 — with GitHub Actions Inactive
Collaborator Author

cdev412 commented Mar 22, 2026

Open questions:

  • Does the entire file need to be copied over at the end of the run, or could it be mocked? A: yes, done; run time went from 21m 59s to 18m 13s.
  • Should we eliminate some steps in the e2e test, or make this an argument?
  • Should we choose a different option/type of run instead of corebench-hard? (Definitely not easy, to save time.)
  • Ask Andrew whether the infra deploy is faster.
  • Do we want the corebench/benchmark files set up on the localhost and then ported to the VM, or are we OK with having the files dumped on the VM directly? Pros: saves cloud costs and time. Cons: less observable, more complexity.

Local runs (my internet is slow...)

Can you summarize the time spent in each stage in results/corebench_easy/corebench_easy_coreagent_1774211987?

Here is a stage-by-stage breakdown from timestamps in corebench_easy_coreagent_1774211987.log (durations are wall-clock between adjacent log lines unless noted).

| Stage | Start (approx.) | End (approx.) | Duration |
| --- | --- | --- | --- |
| Harness init (logging, CoreBench load, Weave init) | 14:39:47.5 | 14:39:56.5 | ~9 s |
| Azure VM provision (SDK create until “created at …”) | 14:39:56.5 | 14:40:28.1 | ~32 s |
| First-boot / cloud-init (wait until startup_complete) | 14:40:28.1 | 14:43:26.9 | 178 s (~2m 58s), as logged |
| Brief gap before copy | 14:43:26.9 | 14:43:29.7 | ~3 s |
| Temp task payload: local tar.gz | 14:43:29.7 | 14:49:21.2 | ~352 s (~5m 52s) |
| Temp payload: SFTP upload (~709 MB) | 14:49:21.2 | 15:13:00.6 | ~1,419 s (~23m 40s) |
| Temp payload: extract on VM | 15:13:00.6 | 15:13:13.2 | ~13 s |
| Agent directory: tar + upload + extract | 15:13:13.7 | 15:13:18.0 | ~4 s |
| .env + remote setup_vm.sh | 15:13:20.2 | 15:15:10.5 | ~110 s (~1m 50s) |
| Agent run (exec dispatched → “task completed”) | 15:15:10.5 | 15:15:49.5 | ~39 s |
| Copy results from VM (tar/get/extract locally) | 15:15:49.5 | 15:31:37.5 | ~948 s (~15m 48s) |
| VM delete | 15:31:37.5 | 15:32:21.4 | ~44 s |
| Evaluate / process / Weave fetch / finish | 15:32:21.4 | 15:32:26.7 | ~5 s |

End-to-end from first log line to “Evaluation completed successfully”: ~52m 40s (14:39:47 → 15:32:27).

Dominant costs in this run: transferring the ~709 MB (~676 MiB) task tarball up to the VM (~24 min) and pulling results down from the VM (~16 min). Together, those two transfers account for most of the wall time (about 40 of the ~53 minutes). Cloud-init (~3 min), local tarball build (~6 min), and setup_vm.sh (~2 min) are smaller but still material.

Inside the agent (from the trace), logged step times sum to about 23 s (steps 1–4: 10.46 + 1.19 + 8.93 + 2.26 s); the extra ~16 s versus the ~39 s “agent run” bucket is startup, polling, and logging overhead before the first poll sees completion.
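
The arithmetic behind these figures can be re-derived from the timestamps quoted above. The snippet below is a small sketch, not part of the harness; the stage names in the dictionary are just labels chosen for the example.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Wall-clock seconds between two same-day HH:MM:SS.f timestamps."""
    time_format = "%H:%M:%S.%f"
    delta = datetime.strptime(end, time_format) - datetime.strptime(start, time_format)
    return delta.total_seconds()


# Start/end pairs copied from the corebench_easy breakdown.
stages = {
    "upload": ("14:49:21.2", "15:13:00.6"),
    "agent_run": ("15:15:10.5", "15:15:49.5"),
    "copy_results": ("15:15:49.5", "15:31:37.5"),
}
durations = {name: seconds_between(*span) for name, span in stages.items()}

total = seconds_between("14:39:47.5", "15:32:26.7")
transfer_share = (durations["upload"] + durations["copy_results"]) / total

# Logged agent step times versus the whole "agent run" bucket.
agent_steps = 10.46 + 1.19 + 8.93 + 2.26
overhead = durations["agent_run"] - agent_steps

print(f"transfers: {transfer_share:.0%} of wall time, agent overhead: {overhead:.1f} s")
```

On these numbers the two transfers come to roughly three quarters of the run, and the agent bucket carries about 16 s beyond the logged step times.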

Note: The summary at the end reports failed_tasks: 1 even though the harness finished; that reflects evaluation/scoring, not that the VM agent failed to produce an answer.

And in results/corebench_hard/corebench_hard_coreagent_1774211360?

Here is the same style of breakdown for results/corebench_hard/corebench_hard_coreagent_1774211360/corebench_hard_coreagent_1774211360.log, from the harness/VM log lines at the start and end of the run.

| Stage | Start (approx.) | End (approx.) | Duration |
| --- | --- | --- | --- |
| Harness init (logging, CoreBench, Weave) | 14:29:20.8 | 14:29:27.7 | ~7 s |
| Azure VM provision (create until “created at …”) | 14:29:27.7 | 14:29:54.4 | ~27 s |
| First-boot / cloud-init | 14:29:54.4 | 14:33:08.3 | 193 s (~3m 13s), as logged |
| Brief gap before file copy | 14:33:08.3 | 14:33:09.0 | ~1 s |
| Temp task payload: local tar.gz | 14:33:09.0 | 14:34:45.5 | ~97 s (~1m 37s) |
| Temp payload: SFTP upload (~187 MB, i.e. ~178 MiB, compressed) | 14:34:45.5 | 14:44:54.8 | ~609 s (~10m 9s) |
| Temp payload: extract on VM | 14:44:54.8 | 14:44:58.3 | ~3.5 s |
| Agent directory: tar + upload + extract | 14:44:58.6 | 14:45:01.7 | ~3 s |
| .env + remote setup_vm.sh | 14:45:03.7 | 14:46:40.3 | ~97 s (~1m 37s) |
| Agent run (start → “task completed”) | 14:46:40.3 | 14:55:46.9 | ~547 s (~9m 7s) |
| Copy results from VM | 14:55:46.9 | 15:02:09.8 | ~383 s (~6m 23s) |
| VM delete | 15:02:09.8 | 15:02:53.7 | ~44 s |
| Evaluate / process / finish | 15:02:53.7 | 15:03:01.3 | ~8 s |

End-to-end (first log line → “Evaluation completed successfully”): ~33m 41s (14:29:20 → 15:03:01).

Compared to corebench_easy: hard used a smaller upload (~187 MB vs ~709 MB), so upload + download were much shorter (~10m + ~6m vs ~24m + ~16m). Most of the wall time in this run is the agent itself (~9m) plus VM transfer/setup; easy was dominated by huge bidirectional transfers despite a short agent phase.
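
As a rough cross-check, the effective upload throughput can be estimated from the sizes quoted in this comparison and the upload durations in the two breakdowns. This is back-of-the-envelope arithmetic on approximate figures, so the results are indicative only.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in MB/s for a payload of size_mb sent in seconds."""
    return size_mb / seconds


# Approximate sizes from the comparison above; durations from the upload rows.
easy_upload = throughput_mb_per_s(709, 1419)
hard_upload = throughput_mb_per_s(187, 609)

print(f"easy: {easy_upload:.2f} MB/s, hard: {hard_upload:.2f} MB/s")
```

Both uploads average well under 1 MB/s, which fits the note that these were local runs on a slow connection; the same payloads would likely move much faster from a CI runner or directly between cloud hosts.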

@cdev412 cdev412 temporarily deployed to test-environment March 22, 2026 21:35 — with GitHub Actions Inactive
@cdev412 cdev412 temporarily deployed to test-environment March 22, 2026 22:04 — with GitHub Actions Inactive
cdev412 added 2 commits March 23, 2026 09:12
…nto rob/ae-32-id-most-important-parts-of-the-code-base-to-test-and-test
@cdev412 cdev412 temporarily deployed to test-environment March 23, 2026 15:13 — with GitHub Actions Inactive
Collaborator Author

cdev412 commented Mar 23, 2026

Ideas for reducing time to run:

  • Reduce the number of steps allotted to the test
  • Prebuild the ZIP file so it is not transferred from the provider to local and then from local to the VM; instead, transfer it directly from the provider to the VM
  • Use a smaller benchmark
  • Keep warm/hot VMs for testing
  • Run only when the PR is ready for review, not on every commit

Current e2e components measured:

  • all of the pipeline deploys, including wandb runs
  • output JSON matches schema
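
For the schema check, a minimal standard-library sketch might look like the following. The required field names (`config`, `results`) are placeholders invented for the example; the real output schema is whatever the smoke test defines.

```python
import json
from typing import List

# Hypothetical schema: top-level field name -> expected Python type.
REQUIRED_FIELDS = {"config": dict, "results": dict}


def validate_output(raw_json: str) -> List[str]:
    """Return a list of schema problems; an empty list means the payload passed."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


good = validate_output('{"config": {}, "results": {"accuracy": 0.0}}')
bad = validate_output('{"config": []}')
print(good, bad)
```

Returning a list of problems rather than raising on the first one makes a CI failure message show everything that is wrong with the output in a single run.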

TODO:

  • add tags for cost measurement
  • see how HuggingFace uploads might be useful to integrate here

@cdev412 cdev412 requested a review from schwartzadev March 23, 2026 15:16
Collaborator

@schwartzadev schwartzadev left a comment

lgtm with comments - please add tests for new functionality and review the comments

nice work

steps:
- uses: actions/checkout@v4

# Warm capsules avoid re-downloading from corebench.cs.princeton.edu every run.
Collaborator

please uncomment or delete

Collaborator

rm

raise TypeError("core_test.json must contain a JSON array of tasks")

# Match hal-eval --max_tasks: only download/load capsules we will run (saves CI time).
if max_tasks is not None and max_tasks > 0 and max_tasks < len(dataset):
Collaborator

nit: this should just be if max_tasks is not None

  • With the max_tasks > 0 check, max_tasks=0 is silently ignored (expected: returns an empty array; actual: returns the full array).
  • max_tasks < len(dataset) is redundant because slicing past the end of a list is safe, see:

>>> l = [1,2,3,4,5,6]
>>> l[:0]
[]
>>> l[:10]
[1, 2, 3, 4, 5, 6]
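
Applied to the loader, the simplified condition could look like the sketch below. `limit_tasks` is a hypothetical helper name, and the negative-value check is an addition that goes beyond the review suggestion: a negative slice bound counts from the end of the list (`l[:-1]` drops the last item), which is unlikely to be what a caller of `--max_tasks` wants.

```python
from typing import List, Optional


def limit_tasks(dataset: List[dict], max_tasks: Optional[int]) -> List[dict]:
    """Keep at most max_tasks tasks; None means no limit."""
    if max_tasks is None:
        return dataset
    if max_tasks < 0:
        # Extra guard, not part of the review suggestion.
        raise ValueError("max_tasks must be non-negative")
    # Slicing handles both 0 (empty list) and values larger than len(dataset).
    return dataset[:max_tasks]


dataset = [{"task": index} for index in range(6)]
none_selected = limit_tasks(dataset, 0)
two_selected = limit_tasks(dataset, 2)
all_selected = limit_tasks(dataset, 10)
```

With this shape, `max_tasks=0` yields an empty list instead of silently falling back to the full dataset.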
>>> 


def _vm_env_var_from_host(name: str) -> str | None:
    """Resolve a secret from the host process env or local .env (for VM payload files)."""
    v = os.environ.get(name, "").strip()
Collaborator

please use more verbose variable names

    if raw:
        s = str(raw).strip()
        if s:
            return s
Collaborator

please use more verbose variable names
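
As an illustration of the naming the review asks for, here is a self-contained sketch of the same lookup order (process environment first, then a local .env file). It is not the PR's implementation: `read_dotenv_value` is a deliberately tiny stand-in parser, and the `dotenv_path` parameter is an assumption added so the example can be exercised in isolation.

```python
import os
from pathlib import Path
from typing import Optional


def read_dotenv_value(dotenv_path: Path, variable_name: str) -> Optional[str]:
    """Minimal KEY=VALUE lookup; a stand-in for whatever .env loader the repo uses."""
    if not dotenv_path.is_file():
        return None
    for line in dotenv_path.read_text().splitlines():
        stripped_line = line.strip()
        if not stripped_line or stripped_line.startswith("#") or "=" not in stripped_line:
            continue
        key, _, raw_value = stripped_line.partition("=")
        if key.strip() == variable_name:
            # Drop surrounding quotes; treat an empty value as missing.
            return raw_value.strip().strip("\"'") or None
    return None


def vm_env_var_from_host(
    variable_name: str, dotenv_path: Path = Path(".env")
) -> Optional[str]:
    """Resolve a secret from the host process env, falling back to a local .env."""
    value_from_process_env = os.environ.get(variable_name, "").strip()
    if value_from_process_env:
        return value_from_process_env
    return read_dotenv_value(dotenv_path, variable_name)
```

Names such as `value_from_process_env` make it clear at each step which source a value came from, which is harder to see with `v`, `raw`, and `s`.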

f"/home/agent/{os.path.basename(destination_directory)}_back.tar.gz"
)
remote_home_directory = "/home/agent"
exclude = ""
Collaborator

Suggested change
- exclude = ""
+ exclude_command_line_argument_string = ""

# Write run-specific env vars for static run_agent.py
run_agent_env = f"RUN_ID={run_id}\nAGENT_FUNCTION={agent_function}\nTASK_ID={task_id}\n"
for name in _VM_RUN_AGENT_SECRET_NAMES:
    val = _vm_env_var_from_host(name)
Collaborator

nit: rename to 'secret_value' or similar

Collaborator

what's this file doing? add a readme if this should be included in git

Collaborator

i think this should be in .github?

@cdev412 cdev412 deployed to test-environment April 1, 2026 15:15 — with GitHub Actions Active