
feat: add autoharness worker #7

Open
rohitg00 wants to merge 7 commits into main from feat/autoharness

Conversation

@rohitg00 rohitg00 commented Apr 6, 2026

Summary

  • Adds autoharness/ — a self-improving agent harness worker on iii-engine
  • Meta-agent modifies an agent harness (system prompt, tools, orchestration), runs Harbor benchmarks, orchestrator auto-keeps or discards based on score
  • Adaptive search transitions between explore/exploit/combine/ablation based on experiment history
  • Inspired by kevinrgu/autoagent, built from scratch on iii-engine primitives (same relationship as karpathy/autoresearch → n-autoresearch)

What's included

  • orchestrator/orchestrator.py: Python worker with 26 functions and 26 HTTP triggers
  • orchestrator/test_orchestrator.py: 50 integration tests
  • agent.py: agent harness template (OpenAI Agents SDK)
  • program.md: meta-agent instructions for the experiment loop
  • bench.sh: one-command benchmark runner
  • plot_progress.py: multi-dataset progress chart generator
  • tasks/: 3 sample Harbor tasks (hello-world, fizzbuzz, file-organizer)
  • iii-config.yaml: engine config (State, REST API, PubSub, Cron, OTel)
  • Dockerfile.base: base container image for task execution

Function groups (26 total)

  • experiment::* (7) — lifecycle, keep/discard, crash tracking, near-miss detection
  • task::* (5) — Harbor benchmark execution with concurrency
  • search::* (4) — adaptive strategy (explore/exploit/combine/ablation)
  • harness::* (5) — read, diff, snapshot, restore
  • report::* (5) — summary, leaderboard, task-level diff, TSV export
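For orientation, here is a minimal client sketch against the local REST API on port 3111 (endpoint paths as in the sequence diagram further down; the payload field names are illustrative assumptions, not the real schema):

```python
import json
import urllib.request

API = "http://127.0.0.1:3111"  # REST API address from iii-config.yaml

def call(path: str, payload: dict) -> dict:
    """POST a JSON payload to an orchestrator endpoint and decode the reply."""
    req = urllib.request.Request(
        API + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Illustrative lifecycle (field names are assumptions):
# call("/api/experiment/setup", {"tag": "baseline"})
# call("/api/search/suggest", {"tag": "baseline"})
```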

Tested

  • 50/50 integration tests passing against live iii-engine
  • Real Harbor benchmark run: 3/3 tasks, Mean: 1.000 with claude-code agent
  • Progress chart generation verified

(progress chart: progress.png)

Test plan

  • iii --config iii-config.yaml starts engine
  • python3 orchestrator/orchestrator.py registers 26 functions
  • python3 orchestrator/test_orchestrator.py passes 50/50
  • ./bench.sh <tag> runs full benchmark cycle
  • harbor run -p tasks/ -a claude-code --env-file .env passes 3/3
  • python3 plot_progress.py --tag <tag> generates chart

Summary by CodeRabbit

  • New Features

    • Self-improving agent harness with orchestrator APIs, experiment lifecycle, adaptive search, task batching, benchmark runner, and progress plotting
  • New Content

    • Sample tasks added: hello-world, fizzbuzz, file-organizer
  • Documentation

    • Full README, program/workflow guide, and quick-start instructions
  • Tests

    • Integration tests for orchestrator and verifier tests for tasks
  • Chores

    • Base container config, ignore rules, packaging metadata, and Apache 2.0 license

Autonomous agent engineering worker on iii-engine. A meta-agent modifies
an agent harness (system prompt, tools, orchestration), runs Harbor
benchmarks, and the orchestrator auto-keeps or discards based on score.
Adaptive search transitions between explore/exploit/combine/ablation.

- 26 iii functions across 5 groups (experiment, task, search, harness, report)
- 26 HTTP triggers for REST API at localhost:3111
- Adaptive search strategy with near-miss detection
- Crash recovery with consecutive tracking and auto-abort
- Per-task failure analysis and experiment diffing
- Harness snapshots for instant save/restore
- Multi-dataset progress chart generator (plot_progress.py)
- 3 sample Harbor benchmark tasks (hello-world, fizzbuzz, file-organizer)
- 50 integration tests, all passing

Inspired by kevinrgu/autoagent, built from scratch on iii-engine
primitives (same relationship as karpathy/autoresearch → n-autoresearch).

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds an autoharness: a meta-agent harness, an orchestrator worker exposing REST/SDK endpoints, task fixtures and verifiers, CLI tooling (bench, plotting), runtime/config (iii-config, Docker base), docs/tests/packaging, and a small Rust test assertion tweak.

Changes

  • Repository metadata & container config (autoharness/.dockerignore, autoharness/.gitignore, autoharness/Dockerfile.base, autoharness/pyproject.toml, LICENSE): adds ignore rules, a base Docker image (non-root task user, apt/node tooling), Python packaging metadata and dependencies, and the Apache-2.0 license.
  • iii-engine runtime config (autoharness/iii-config.yaml): adds the iii-engine multi-module configuration (file-backed KV, REST API on 127.0.0.1:3111 with CORS, local PubSub, Cron, OTEL in-memory exporter).
  • Core agent harness (autoharness/agent.py): new meta-agent harness with editable config (SYSTEM_PROMPT, MODEL, MAX_TURNS), a shell tool run_shell, a runner adapter to an ATIF-like trajectory, a task-run entrypoint, and a CLI/__main__ entry.
  • Orchestrator & worker (autoharness/orchestrator/orchestrator.py): new orchestrator registering 26 HTTP/worker endpoints for experiment lifecycle, task run/batch/scores, search strategy/adapt/suggest, harness read/diff/snapshot/restore, reporting, KV helpers, crash tracking, and startup/shutdown flow.
  • Orchestrator tests (autoharness/orchestrator/test_orchestrator.py): integration test suite exercising orchestrator APIs: lifecycle flows, crash handling/abort, search adaptation, reporting, harness snapshot/restore, path-traversal and validation checks.
  • Benchmarking & plotting tools (autoharness/bench.sh, autoharness/plot_progress.py): adds a bench runner to register/run experiments against the local API and a matplotlib CLI to plot tagged experiment progress and running-best traces.
  • Program & README docs (autoharness/README.md, autoharness/program.md): adds operational docs describing the meta-agent loop, API surface, experiment lifecycle, search strategies, harness editing rules, Quick Start, and monitoring endpoints.
  • Tasks: fixtures, configs, verifiers (autoharness/tasks/hello-world/..., autoharness/tasks/fizzbuzz/..., autoharness/tasks/file-organizer/...): adds three sample tasks with instruction.md, task.toml, environment/Dockerfile (file-organizer), and verifier scripts that emit /logs/verifier/reward.txt.
  • Task tests/utilities (autoharness/tasks/.../tests/test.sh): adds task-specific Bash verifier scripts that check task outputs and write /logs/verifier/reward.txt.
  • Minor unrelated change (image-resize/src/manifest.rs): test updated to assert the crate version via env!("CARGO_PKG_VERSION") instead of a hardcoded version string.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant API as "iii-engine REST<br/>(HTTP)"
    participant Orch as "Orchestrator<br/>Worker"
    participant KV as "State KV"
    participant Agent as "agent.py<br/>Meta-Agent"
    participant Task as "Task Container"

    User->>API: POST /api/experiment/setup (tag)
    API->>Orch: dispatch setup
    Orch->>KV: persist tag metadata
    KV-->>Orch: ack

    User->>API: POST /api/search/suggest
    Orch->>KV: read history & failures
    KV-->>Orch: history
    Orch-->>API: suggestion

    User->>API: POST /api/experiment/register
    Orch->>KV: create experiment record
    KV-->>Orch: experiment_id

    User->>API: POST /api/task/batch (concurrency)
    Orch->>Task: run task N (concurrent)
    Task->>Agent: invoke harness (instruction.md)
    Agent-->>Task: trajectory + logs + score
    Task->>Orch: result + tails
    Orch->>KV: store task result
    KV-->>Orch: ack

    User->>API: POST /api/experiment/complete (results)
    Orch->>KV: compute best / near-miss / crash counters
    KV-->>Orch: previous best
    Orch-->>API: keep/discard/abort decision

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

"🐰
I tweak the prompt, I patch the code,
I hop through tests where errors sow—
Experiments run, the scores unfold,
A little rabbit reads the log.
Commit, snapshot, chart the climb."

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 3.66%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'feat: add autoharness worker' accurately describes the primary change, the addition of a complete autoharness worker system.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/autoharness

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 20

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@autoharness/agent.py`:
- Around line 41-47: run_shell currently runs subprocesses in the process CWD so
file ops hit the wrong place; update run_shell to execute commands in the task
workspace by passing the task directory as the subprocess cwd (e.g., accept an
optional cwd param or obtain the agent/task workspace and supply cwd=task_dir to
subprocess.run). Apply the same change to the other subprocess.run usages
referenced around lines 114-123 and 138-145 so all shell invocations (the
run_shell function and the other subprocess.run calls) use the task workspace as
cwd rather than the process CWD.

In `@autoharness/bench.sh`:
- Around line 32-42: The script currently attempts to start the orchestrator
from workers/orchestrator but the actual worker is at
orchestrator/orchestrator.py; update the bootstrapping sequence in bench.sh so
after cd "$(dirname "$0")" you change into the correct directory (relative path
from autoharness to the orchestrator, e.g. ../orchestrator) before launching
python3 orchestrator.py &, then capture its PID into ORCH_PID as before; ensure
the sleep and subsequent cd ../.. (or adjusted return path) are updated so the
script returns to the original working location and III_PID/ORCH_PID behavior is
unchanged.

In `@autoharness/Dockerfile.base`:
- Around line 15-18: The Dockerfile currently runs as root (uses RUN mkdir -p
/task/logs, WORKDIR /task, CMD ["sleep","infinity"]); create a dedicated
unprivileged user (e.g., group/user name like taskuser/taskgroup with a stable
non-root UID/GID such as 1000), chown the working directory and /task/logs to
that user, set USER to that unprivileged account, and ensure HOME is set
appropriately so generated scripts run with least privilege; update the
Dockerfile to add the user/group, change ownership of /task and /task/logs, and
switch to that user before the CMD.

In `@autoharness/iii-config.yaml`:
- Line 15: The CORS setting allowed_origins: ["*"] is too permissive for the
local control-plane API; update the configuration to list only trusted origins
(e.g., specific host(s) and ports) or remove/disable CORS if browser access
isn't needed. Locate the allowed_origins key in autoharness/iii-config.yaml and
replace the wildcard entry with a concrete array of allowed origins (or
comment/remove the setting), making sure the values match the exact origin
strings used by your UI or services.

In `@autoharness/orchestrator/orchestrator.py`:
- Around line 275-295: When writing near-miss entries, don't rely on the local
`best` captured at insert time; instead re-fetch the current best for the
experiment's tag and use that when calling `_is_near_miss` before `kv.set` into
`SCOPES["near_misses"]`. Update the logic around the `_is_near_miss(improved,
best, delta_passed, delta_score)` checks (and their counterparts near the other
occurrences) to call the KV to get the up-to-date best (e.g., `await
kv.get(SCOPES["best"], exp["tag"])`) and compute `delta_*` relative to that
current best so only true near-misses are appended to `SCOPES["near_misses"]`.
- Around line 438-459: Ensure we don't read stale artifacts by verifying the log
files were written during this run: when checking reward_file (task_path /
"logs" / "reward.txt") and traj_file (task_path / "logs" / "agent" /
"trajectory.json"), first confirm .exists() AND that their modification time
(e.g., Path.stat().st_mtime) is >= the run start timestamp (start) before
attempting to read/parse; if the mtime is older than start, treat the file as
absent so score/trajectory remain default. Apply this check where reward_file
and traj_file are used in orchestrator.py.
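A sketch of that freshness check (the helper name is mine; the real code would inline this where reward_file and traj_file are read):

```python
from pathlib import Path

def read_if_fresh(path: Path, run_start: float):
    """Return file contents only if the file exists and was modified during this
    run; stale artifacts left over from a previous run are treated as absent."""
    if path.exists() and path.stat().st_mtime >= run_start:
        return path.read_text()
    return None
```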
- Around line 611-620: The adapt function is taking an unordered slice
tag_exps[-10:] which is nondeterministic because kv.list(SCOPES["experiments"])
isn't ordered; instead, sort tag_exps by a reliable timestamp field (e.g.,
"created_at", "started_at", or "timestamp") in ascending chronological order (or
descending then take first 10) before computing recent and keep_rate. Update the
code around adapt / tag_exps / recent to perform a stable sort (falling back to
a default value for missing timestamps) and then compute recent =
sorted_tag_exps[-10:] (or equivalent) so keep_rate is computed from the actual
most-recent experiments.
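The intended shape, sketched (the timestamp field name is an assumption; as noted above it may be created_at, started_at, or timestamp):

```python
def most_recent(experiments: list, n: int = 10) -> list:
    """Stable-sort by creation time before slicing, since kv.list() gives no
    ordering guarantee; missing timestamps fall back to 0 (oldest)."""
    ordered = sorted(experiments, key=lambda e: e.get("created_at", 0))
    return ordered[-n:]
```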
- Around line 239-346: Both complete() and crash() currently mutate aggregate
state even if exp["status"] is already terminal, causing double-counting and
inconsistent best/crash records; add an early guard in both functions (before
entering async with _tag_lock and before mutating tag/best/crashes) that checks
exp["status"] against terminal states (e.g., "keep","discard","crash") and
returns immediately (no-op) if already finished. Ensure the guard uses the
loaded exp object from kv.get(SCOPES["experiments"], eid) and prevents any
updates to SCOPES["tags"], SCOPES["best"], or SCOPES["crashes"] when skipping,
so retries/concurrent calls don't change aggregates.
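The guard itself is small; a sketch using the terminal state names listed above:

```python
TERMINAL_STATES = {"keep", "discard", "crash"}

def already_finished(exp: dict) -> bool:
    """True if the experiment reached a terminal state, in which case complete()
    and crash() should return immediately without touching any aggregates."""
    return exp.get("status") in TERMINAL_STATES
```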

In `@autoharness/orchestrator/test_orchestrator.py`:
- Around line 21-25: The test helper api() currently constructs headers without
any Authorization, so tests for harness::snapshot and harness::restore cannot
authenticate when AUTOAGENT_AUTH_TOKEN is set; update api(path, data=None,
method="POST") to read os.environ.get("AUTOAGENT_AUTH_TOKEN") and, if present,
add an "Authorization": f"Bearer {token}" entry to the headers dict (keeping
existing Content-Type behavior), and ensure the same pattern is applied for
other api usages including the snapshot/restore test calls (functions/methods
invoking api for snapshot and restore).
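A sketch of the updated header construction (the helper name here is mine; the real helper is api() in test_orchestrator.py):

```python
import os

def build_headers(has_body: bool = True) -> dict:
    """Build request headers for the test helper, adding a bearer token when
    AUTOAGENT_AUTH_TOKEN is set so snapshot/restore endpoints can authenticate."""
    headers = {"Content-Type": "application/json"} if has_body else {}
    token = os.environ.get("AUTOAGENT_AUTH_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```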

In `@autoharness/plot_progress.py`:
- Around line 117-118: The code assumes fetch_json(..."/api/experiment/history")
returns experiments at the top level but the API responds with
{"statusCode":..., "body":...}; update the handling after fetch_json so you
unwrap the response body before reading experiments: retrieve the inner payload
from data.get("body") (and JSON-decode it if it's a string) and then set
experiments = payload if it's a list else payload.get("experiments", []); adjust
the variables around data and experiments (and keep using args.api and the
"/api/experiment/history" call) so the chart reads the actual experiments array.

In `@autoharness/program.md`:
- Around line 64-83: The documentation hardcodes "agent.py" in Steps 2 and 3
which breaks variant runs when HARNESS_PATH is set (e.g.,
HARNESS_PATH=agent-claude.py); update the instructions and example commands to
use the configured HARNESS_PATH variable instead of the literal agent.py so the
meta-agent edits and commits the actual harness file (reference symbols:
HARNESS_PATH, agent.py, COMMIT, the git add/commit and curl register snippet);
replace occurrences of agent.py in the editable-step and register payload with
the HARNESS_PATH variable or its shell interpolation ($HARNESS_PATH) so the loop
consistently targets the configured harness path.

In `@autoharness/README.md`:
- Around line 33-37: The README claims a `workers/orchestrator/` location and
"33 functions" but the code now registers 26 functions in
`autoharness/orchestrator/orchestrator.py`; update all README references
(including the occurrences around the previously called-out sections) to point
to `orchestrator/` (or the correct relative path) and change the advertised
route/function count from 33 to 26, and adjust any descriptive text that lists
or enumerates those endpoints (e.g., the blocks around lines referenced in the
review: the initial listing, the later summaries, and the detailed section at
392-405) so the directory path and counts match the actual implementation and
any example commands or links refer to
`autoharness/orchestrator/orchestrator.py`.
- Around line 29-30: The README's quick-start credential snippet only sets
ANTHROPIC_API_KEY but the default harness described is agent.py which uses the
gpt-5/OpenAI path; update the README so the credentials match the described
default: either change the quick-start to set the OpenAI/`gpt-5` provider key
(e.g., the variable name used by agent.py) or explicitly instruct readers to
switch to `agent-claude.py` before using ANTHROPIC_API_KEY; reference the files
`agent.py`, `agent-claude.py`, and the `gpt-5` provider in the updated text so
the credential guidance is unambiguous.

In `@autoharness/tasks/file-organizer/tests/test.sh`:
- Around line 17-29: The test currently only verifies destination existence via
the check calls and then writes reward based on SCORE/TOTAL; update test.sh to
also assert that the original root copies of the moved files no longer exist
before granting full score. Add absence checks for the original basenames
(report.pdf, photo.jpg, image.png, notes.txt, readme.txt, data.csv, budget.csv)
— e.g., after the existing check "documents/report.pdf" etc. verify that the
corresponding root paths are missing and decrement or fail the check (adjust
SCORE) if any original file still exists; ensure logic around SCORE, TOTAL and
the final reward write remains consistent so full reward only when both
destination exists and originals are removed.

In `@autoharness/tasks/fizzbuzz/environment/Dockerfile`:
- Line 1: Add a non-root user directive to ensure containers don't run as root:
update the Dockerfile (or the base Dockerfile used by this image) to create a
dedicated non-privileged user and set USER to that account (e.g., create a
user/group with adduser/useradd and then include USER nobody or USER
<username>). Ensure any files/directories the container needs are chowned or
have appropriate permissions for that user so runtime steps in this Dockerfile
(FROM autoagent-base:latest) continue to work without root.

In `@autoharness/tasks/fizzbuzz/instruction.md`:
- Line 3: Update the instruction text that currently instructs "When run with
`python fizzbuzz.py`" to use the same interpreter as the verifier by changing it
to "When run with `python3 fizzbuzz.py`"; locate the phrase in
autoharness/tasks/fizzbuzz/instruction.md (the sentence starting with "When run
with") and replace "python" with "python3" so wording is consistent with the
verifier.

In `@autoharness/tasks/fizzbuzz/tests/test.sh`:
- Line 2: The test.sh uses "set -e" which can abort the script before emitting
the reward if "python3 fizzbuzz.py" exits non-zero; change the script to remove
or override the immediate-exit behavior and add a trap or explicit capture of
the exit code around the "python3 fizzbuzz.py" invocation so the script always
writes the reward/result file (e.g., capture "$?" into a variable after running
python3 fizzbuzz.py, write the reward via the existing reward emission commands,
then exit with the captured status), and apply the same pattern for the other
occurrences referenced (lines around 12-13 and line 32) to ensure crash-safe
reward emission.
- Around line 22-30: The test currently only checks LINE1, LINE3, LINE5, LINE15
which is too weak and can be gamed; replace these spot checks with a full 1..100
verification: loop i from 1 to 100, extract the ith line from ACTUAL (e.g., via
sed -n "${i}p"), compute the expected FizzBuzz value for i ("Fizz" if i%3==0,
"Buzz" if i%5==0, "FizzBuzz" if both, otherwise the number), compare the
expected string to the extracted line, and on mismatch print a clear message
including i, expected, and actual and set PASS=0; ensure you stop or continue
based on existing test semantics and handle any trailing whitespace/newline
differences when comparing.
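The full check is easier to read in Python than in the sed loop the comment describes; a sketch of equivalent logic (the actual verifier is Bash):

```python
def expected_fizzbuzz(i: int) -> str:
    # Classic rules: multiples of 3 -> Fizz, of 5 -> Buzz, of both -> FizzBuzz.
    if i % 15 == 0:
        return "FizzBuzz"
    if i % 3 == 0:
        return "Fizz"
    if i % 5 == 0:
        return "Buzz"
    return str(i)

def verify_output(output: str) -> bool:
    """Verify all 100 lines, tolerating trailing whitespace on each line."""
    lines = output.splitlines()
    if len(lines) != 100:
        return False
    return all(lines[i - 1].rstrip() == expected_fizzbuzz(i) for i in range(1, 101))
```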

In `@autoharness/tasks/hello-world/instruction.md`:
- Around line 3-5: The fenced code block containing Hello, World! is missing a
language hint; update the opening fence to include a language (e.g., add "text"
after the triple backticks) so the block becomes ```text and the code block now
satisfies MD040 linting for autoharness/tasks/hello-world/instruction.md.

In `@autoharness/tasks/hello-world/tests/test.sh`:
- Around line 4-13: The verifier currently uses command substitution
(EXPECT/ACTUAL) which strips trailing newlines; instead create a byte-exact
expected file (e.g., write the exact "Hello, World!\n" into a temporary expected
file) and perform a binary comparison (use cmp -s or diff --no-deref) against
/task/hello.txt to enforce exact newline semantics; update the logic that sets
EXPECTED/ACTUAL and the if-condition to use the binary compare result to decide
writing 1/0 to /logs/verifier/reward.txt and printing PASS/FAIL.
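In Python terms, the byte-exact comparison looks like this (the real verifier is Bash using cmp; the /task/hello.txt path follows the task layout):

```python
from pathlib import Path

def verify_hello(path: str = "/task/hello.txt") -> int:
    """Compare raw bytes so the trailing newline is enforced, which command
    substitution in the shell version silently strips; returns reward 1 or 0."""
    expected = b"Hello, World!\n"
    try:
        actual = Path(path).read_bytes()
    except FileNotFoundError:
        return 0
    return 1 if actual == expected else 0
```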
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f4a36929-95d2-486d-a00a-35ffe0802876

📥 Commits

Reviewing files that changed from the base of the PR and between 2c8376d and 2ba8bd6.

⛔ Files ignored due to path filters (1)
  • autoharness/progress.png is excluded by !**/*.png
📒 Files selected for processing (24)
  • autoharness/.dockerignore
  • autoharness/.gitignore
  • autoharness/Dockerfile.base
  • autoharness/README.md
  • autoharness/agent.py
  • autoharness/bench.sh
  • autoharness/iii-config.yaml
  • autoharness/orchestrator/orchestrator.py
  • autoharness/orchestrator/test_orchestrator.py
  • autoharness/plot_progress.py
  • autoharness/program.md
  • autoharness/pyproject.toml
  • autoharness/tasks/file-organizer/environment/Dockerfile
  • autoharness/tasks/file-organizer/instruction.md
  • autoharness/tasks/file-organizer/task.toml
  • autoharness/tasks/file-organizer/tests/test.sh
  • autoharness/tasks/fizzbuzz/environment/Dockerfile
  • autoharness/tasks/fizzbuzz/instruction.md
  • autoharness/tasks/fizzbuzz/task.toml
  • autoharness/tasks/fizzbuzz/tests/test.sh
  • autoharness/tasks/hello-world/environment/Dockerfile
  • autoharness/tasks/hello-world/instruction.md
  • autoharness/tasks/hello-world/task.toml
  • autoharness/tasks/hello-world/tests/test.sh

Comment on lines +32 to +42
cd "$(dirname "$0")"

iii --config iii-config.yaml &
III_PID=$!
sleep 2

cd workers/orchestrator
python3 orchestrator.py &
ORCH_PID=$!
sleep 3
cd ../..

⚠️ Potential issue | 🔴 Critical

Bootstrap the orchestrator from the real path.

After cd "$(dirname "$0")", this script descends into workers/orchestrator, but the worker added in this PR lives under orchestrator/orchestrator.py. The auto-start path will fail before any benchmark runs.

💡 Suggested fix
-    cd workers/orchestrator
+    cd orchestrator
     python3 orchestrator.py &
     ORCH_PID=$!
     sleep 3
-    cd ../..
+    cd ..
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/bench.sh` around lines 32 - 42, The script currently attempts to
start the orchestrator from workers/orchestrator but the actual worker is at
orchestrator/orchestrator.py; update the bootstrapping sequence in bench.sh so
after cd "$(dirname "$0")" you change into the correct directory (relative path
from autoharness to the orchestrator, e.g. ../orchestrator) before launching
python3 orchestrator.py &, then capture its PID into ORCH_PID as before; ensure
the sleep and subsequent cd ../.. (or adjusted return path) are updated so the
script returns to the original working location and III_PID/ORCH_PID behavior is
unchanged.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
autoharness/README.md (1)

118-127: ⚠️ Potential issue | 🟠 Major

Clarify default provider credentials in Quick Start to avoid first-run failures.

Lines 125-127 only document ANTHROPIC_API_KEY, while the README's default harness description centers on agent.py. If agent.py follows the OpenAI Agents SDK path, users will hit a first-run failure with the wrong credential by default. Please either document the OpenAI key in Quick Start or explicitly instruct switching to the Anthropic harness before running claude.

Proposed README patch
 Requirements: Docker, Python 3.10+, [iii-engine](https://github.com/iii-hq/iii-engine), [Harbor](https://github.com/laude-institute/harbor), and whatever model-provider credentials your agent harness requires.
 
 ```bash
 uv tool install harbor
 
 cd autoharness
 
 cat > .env << 'EOF'
-ANTHROPIC_API_KEY=sk-ant-...
+# If using agent.py (OpenAI Agents SDK path):
+OPENAI_API_KEY=sk-...
+
+# If using a Claude-based harness:
+ANTHROPIC_API_KEY=sk-ant-...
 EOF
@@
 ```bash
-claude -p "Read program.md and let's kick off a new experiment!"
+claude -p "Read program.md and let's kick off a new experiment!"  # Claude harness path


Also applies to: 145-147

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @autoharness/README.md around lines 118 - 127, Update the Quick Start env
setup in README to include both OPENAI_API_KEY and ANTHROPIC_API_KEY examples
and/or add a short note instructing users to switch to the Anthropic harness
before running the "claude" command; specifically, modify the .env snippet
referenced near agent.py / the default harness and the subsequent example
invoking "claude" so it either shows OPENAI_API_KEY by default with a comment
that Claude requires ANTHROPIC_API_KEY, or include both keys with clear comments
indicating which harness (agent.py/OpenAI Agents SDK vs Claude/Anthropic) they
correspond to.


🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @autoharness/README.md:

  • Line 43: Multiple fenced code blocks in the README are missing language
    identifiers which triggers MD040; update each triple-backtick block shown in the
    diff by adding appropriate language tags (e.g., use "text" for plain ASCII-art
    or directory listings and "http" for HTTP request/response examples) so the
    blocks become text or http; search for the unlabeled ``` blocks around the
    examples referenced (ASCII header, tasks/my-task/, and the various API examples
    like POST /api/experiment/setup, GET /api/task/list, POST /api/search/suggest,
    GET /api/harness/read, POST /api/report/summary, and autoharness/) and add the
    corresponding language identifier to each opening fence.

Duplicate comments:
In @autoharness/README.md:

  • Around line 118-127: Update the Quick Start env setup in README to include
    both OPENAI_API_KEY and ANTHROPIC_API_KEY examples and/or add a short note
    instructing users to switch to the Anthropic harness before running the "claude"
    command; specifically, modify the .env snippet referenced near agent.py / the
    default harness and the subsequent example invoking "claude" so it either shows
    OPENAI_API_KEY by default with a comment that Claude requires ANTHROPIC_API_KEY,
    or include both keys with clear comments indicating which harness
    (agent.py/OpenAI Agents SDK vs Claude/Anthropic) they correspond to.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7b76cc75-66f0-44c3-9bc5-4ed261f6ca33

📥 Commits

Reviewing files that changed from the base of the PR and between 2ba8bd6b9a8166d8db348a7add3d1fa09b585593 and cabad42abe1daecd28e3589330cc028f8d508fc7.

⛔ Files ignored due to path filters (1)
  • image-resize/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)
  • LICENSE
  • autoharness/README.md
  • image-resize/src/manifest.rs

✅ Files skipped from review due to trivial changes (2)
  • image-resize/src/manifest.rs
  • LICENSE



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
autoharness/orchestrator/orchestrator.py (1)

864-864: Remove unnecessary f-string prefix.

This f-string has no placeholders.

💡 Proposed fix
-            return _err({"error": f"Tag not found"}, 404)
+            return _err({"error": "Tag not found"}, 404)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/orchestrator/orchestrator.py` at line 864, The return statement
currently uses an unnecessary f-string for a static message (the expression
_err({"error": f"Tag not found"}, 404)); change it to use a plain string literal
for the "Tag not found" message so the call to _err uses {"error": "Tag not
found"} with the same 404 status, locating the change in orchestrator.py where
the return statement is defined.
autoharness/plot_progress.py (2)

36-41: Consider adding URL scheme validation for security.

The fetch_json function opens arbitrary URLs without validating the scheme. While this is a CLI tool with user-provided --api, adding a check for http:// or https:// schemes would prevent accidental file:// access.

💡 Optional hardening

```diff
 def fetch_json(url, data=None):
+    if not url.startswith(("http://", "https://")):
+        raise ValueError(f"URL must use http or https scheme: {url}")
     body = json.dumps(data).encode() if data else None
     headers = {"Content-Type": "application/json"} if body else {}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/plot_progress.py` around lines 36 - 41, The fetch_json function
currently accepts arbitrary URL schemes which can allow dangerous accesses
(e.g., file://); update fetch_json to parse the provided url (use
urllib.parse.urlparse) at the start of the function and validate that
parsed.scheme is either "http" or "https", raising a ValueError (or similar) if
not allowed; keep the rest of the logic intact so only requests with http/https
proceed.

88-88: Add strict=True to zip() for defensive coding.

The zip(kept_x, kept_y, kept_labels) call assumes all three lists have equal length. Adding strict=True would catch any unexpected length mismatches during development.

💡 Proposed fix

```diff
-        for x, y, label in zip(kept_x, kept_y, kept_labels):
+        for x, y, label in zip(kept_x, kept_y, kept_labels, strict=True):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/plot_progress.py` at line 88, Add defensive checking to the
iteration by changing the zip call that iterates over kept_x, kept_y,
kept_labels to use zip(..., strict=True); locate the loop using the variables
kept_x, kept_y, kept_labels in plot_progress.py (the for x, y, label in zip(...)
loop) and update it to pass strict=True so a ValueError is raised if the lists
differ in length.
autoharness/agent.py (1)

51-54: Unused token/cost tracking placeholders.

total_tokens and estimated_cost are hardcoded to 0 and printed but never actually calculated. If token tracking is planned, consider either implementing it or removing these lines to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/agent.py` around lines 51 - 54, Remove or implement the unused
token/cost placeholders: the variables total_tokens and estimated_cost (and
their print statements) in autoharness/agent.py are hardcoded to 0 and never
updated; either delete these variables and the two print(...) calls to avoid
misleading output, or update the code path that handles LLM requests (e.g., the
function/method that calls the model/receives responses) to accumulate token
usage into total_tokens and compute estimated_cost (using your tokenizer/usage
metadata or response.usage) and then log those computed values instead of the
hardcoded ones; ensure you update all references to total_tokens/estimated_cost
so no unused variables remain.
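If token tracking is implemented rather than removed, one possible shape is sketched below. This assumes OpenAI-style `usage` metadata per response; the helper name, field names, and the per-1K-token rate are illustrative assumptions, not the actual Agents SDK API or real pricing.

```python
PRICE_PER_1K_TOKENS = 0.005  # hypothetical rate, not a real published price

def accumulate_usage(usage_records):
    """Sum token counts across per-response usage dicts and estimate cost.

    Each record is expected to look like OpenAI-style usage metadata,
    e.g. {"total_tokens": 512}; missing fields count as zero.
    """
    total_tokens = sum(rec.get("total_tokens", 0) for rec in usage_records)
    estimated_cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS
    return total_tokens, estimated_cost
```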
autoharness/orchestrator/test_orchestrator.py (1)

353-354: Remove unnecessary f-string prefix.

These strings contain no placeholders, so the f prefix is unnecessary.

💡 Proposed fix

```diff
-        print(f"  Start: iii --config iii-config.yaml")
-        print(f"  Then:  cd workers/orchestrator && uv run orchestrator.py")
+        print("  Start: iii --config iii-config.yaml")
+        print("  Then:  cd orchestrator && uv run orchestrator.py")
```

Note: The path in line 354 should also be `orchestrator`, not `workers/orchestrator`, to match the actual directory structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/orchestrator/test_orchestrator.py` around lines 353 - 354, Remove
the unnecessary f-string prefixes on the two print calls so they are plain
strings (i.e., change print(f"...") to print("...")), and correct the path in
the second print to use "orchestrator" instead of "workers/orchestrator" by
updating the print statement that currently reads print(f"  Then:  cd
workers/orchestrator && uv run orchestrator.py") accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@autoharness/agent.py`:
- Line 37: The MODEL constant currently uses an incomplete identifier "gpt-5";
update the MODEL value to a full, available gpt-5 variant (for example
"gpt-5.4", "gpt-5.4-mini", or "gpt-5.4-nano") by replacing the string assigned
to MODEL in autoharness.agent (symbol: MODEL) with the appropriate complete
model name for your usage.

In `@autoharness/orchestrator/orchestrator.py`:
- Around line 78-84: The _is_near_miss helper currently uses delta_score >
-NEAR_MISS_THRESHOLD which only catches scores slightly worse than best; update
the score check to capture experiments within the threshold in either direction
by replacing that clause with abs(delta_score) <= NEAR_MISS_THRESHOLD (keeping
the existing checks: not improved, best is not None, and abs(delta_passed) <= 1)
so near-misses include runs close to the current best whether slightly better or
worse.
- Around line 380-386: The near_misses handler (async def near_misses) currently
returns stored entries from kv.list(SCOPES["near_misses"]) without comparing
them to the current best, so stale entries can bias search::suggest_direction;
update near_misses to fetch the current best (e.g., from kv or best-record
function), compute each entry's delta relative to that best and re-filter/sort
using the live threshold before returning, or alternatively implement pruning
logic in the routine that records a new best (the function that writes to
SCOPES["near_misses"]) to remove entries whose recomputed delta_score exceeds
threshold—ensure you reference and update entries via the same keys returned by
kv.list and preserve the existing response shape {"near_misses": ..., "total":
...}.
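As a hedged sketch, the threshold correction described in the first inline comment above (around lines 78-84) might look like the following; the value of `NEAR_MISS_THRESHOLD` and the record field names are assumptions inferred from the review text, not the actual orchestrator code:

```python
NEAR_MISS_THRESHOLD = 0.05  # assumed value for illustration

def is_near_miss(result, best, improved):
    """A run is a near-miss when it did not improve on the current best
    but lands within the threshold of it in either direction."""
    if improved or best is None:
        return False
    delta_passed = result["passed"] - best["passed"]
    delta_score = result["aggregate_score"] - best["aggregate_score"]
    # abs() captures runs slightly better OR slightly worse than best,
    # unlike the original one-sided delta_score > -NEAR_MISS_THRESHOLD check
    return abs(delta_passed) <= 1 and abs(delta_score) <= NEAR_MISS_THRESHOLD
```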

---

Nitpick comments:
In `@autoharness/agent.py`:
- Around line 51-54: Remove or implement the unused token/cost placeholders: the
variables total_tokens and estimated_cost (and their print statements) in
autoharness/agent.py are hardcoded to 0 and never updated; either delete these
variables and the two print(...) calls to avoid misleading output, or update the
code path that handles LLM requests (e.g., the function/method that calls the
model/receives responses) to accumulate token usage into total_tokens and
compute estimated_cost (using your tokenizer/usage metadata or response.usage)
and then log those computed values instead of the hardcoded ones; ensure you
update all references to total_tokens/estimated_cost so no unused variables
remain.

In `@autoharness/orchestrator/orchestrator.py`:
- Line 864: The return statement currently uses an unnecessary f-string for a
static message (the expression _err({"error": f"Tag not found"}, 404)); change
it to use a plain string literal for the "Tag not found" message so the call to
_err uses {"error": "Tag not found"} with the same 404 status, locating the
change in orchestrator.py where the return statement is defined.

In `@autoharness/orchestrator/test_orchestrator.py`:
- Around line 353-354: Remove the unnecessary f-string prefixes on the two print
calls so they are plain strings (i.e., change print(f"...") to print("...")),
and correct the path in the second print to use "orchestrator" instead of
"workers/orchestrator" by updating the print statement that currently reads
print(f"  Then:  cd workers/orchestrator && uv run orchestrator.py")
accordingly.

In `@autoharness/plot_progress.py`:
- Around line 36-41: The fetch_json function currently accepts arbitrary URL
schemes which can allow dangerous accesses (e.g., file://); update fetch_json to
parse the provided url (use urllib.parse.urlparse) at the start of the function
and validate that parsed.scheme is either "http" or "https", raising a
ValueError (or similar) if not allowed; keep the rest of the logic intact so
only requests with http/https proceed.
- Line 88: Add defensive checking to the iteration by changing the zip call that
iterates over kept_x, kept_y, kept_labels to use zip(..., strict=True); locate
the loop using the variables kept_x, kept_y, kept_labels in plot_progress.py
(the for x, y, label in zip(...) loop) and update it to pass strict=True so a
ValueError is raised if the lists differ in length.

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `16d4926f-451c-4f2a-bc06-6e2d9307c6e9`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between cabad42 and aa4929b.

</details>

<details>
<summary>📒 Files selected for processing (14)</summary>

* `autoharness/Dockerfile.base`
* `autoharness/README.md`
* `autoharness/agent.py`
* `autoharness/bench.sh`
* `autoharness/iii-config.yaml`
* `autoharness/orchestrator/orchestrator.py`
* `autoharness/orchestrator/test_orchestrator.py`
* `autoharness/plot_progress.py`
* `autoharness/program.md`
* `autoharness/tasks/file-organizer/tests/test.sh`
* `autoharness/tasks/fizzbuzz/instruction.md`
* `autoharness/tasks/fizzbuzz/tests/test.sh`
* `autoharness/tasks/hello-world/instruction.md`
* `autoharness/tasks/hello-world/tests/test.sh`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (6)</summary>

* `autoharness/tasks/hello-world/instruction.md`
* `autoharness/tasks/fizzbuzz/instruction.md`
* `autoharness/Dockerfile.base`
* `autoharness/iii-config.yaml`
* `autoharness/tasks/fizzbuzz/tests/test.sh`
* `autoharness/program.md`

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (2)</summary>

* `autoharness/tasks/hello-world/tests/test.sh`
* `autoharness/tasks/file-organizer/tests/test.sh`

</details>

</details>

```python
- Verify your solution before finishing.
"""

MODEL = "gpt-5"
```

@coderabbitai coderabbitai bot Apr 6, 2026


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

What are the currently available OpenAI GPT models for the Agents SDK?

💡 Result:

For the OpenAI Agents SDK (which uses the Responses API), the currently available GPT models listed in the API docs are: gpt-5.4, gpt-5.4-mini, and gpt-5.4-nano. [1]

Source(s): [1]


🏁 Script executed:

cat -n autoharness/agent.py | sed -n '30,45p'

Repository: iii-hq/workers

Length of output: 529


Update model identifier to a complete, available version.

MODEL = "gpt-5" is an incomplete model identifier. Update to a complete gpt-5 variant like "gpt-5.4", "gpt-5.4-mini", or "gpt-5.4-nano", whichever is appropriate for your use case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/agent.py` at line 37, The MODEL constant currently uses an
incomplete identifier "gpt-5"; update the MODEL value to a full, available gpt-5
variant (for example "gpt-5.4", "gpt-5.4-mini", or "gpt-5.4-nano") by replacing
the string assigned to MODEL in autoharness.agent (symbol: MODEL) with the
appropriate complete model name for your usage.

Author


Intentional — MODEL lives in the editable section (the whole point of autoharness is that the meta-agent modifies this). "gpt-5" is a valid OpenAI alias that resolves to the latest gpt-5 variant. Pinning to a specific sub-version like "gpt-5.4" would make the baseline less portable. The meta-agent can (and should) experiment with model selection as part of the loop — it's one of the 12 tracked categories (model_selection).


@rohitg00, that's a well-reasoned design choice. OpenAI's own docs confirm that gpt-5 is a valid alias with snapshots available, so using the short alias rather than a pinned version makes the baseline more forward-compatible. The meta-agent's ability to experiment with model_selection as a tracked category is a great demonstration of the self-improving loop in action. Withdrawing my concern.

🐇✨


✏️ Learnings added
Learnt from: rohitg00
Repo: iii-hq/workers PR: 7
File: autoharness/agent.py:37-37
Timestamp: 2026-04-06T16:21:56.017Z
Learning: In `autoharness/agent.py`, `MODEL = "gpt-5"` in the editable section is intentional. The `gpt-5` string is a valid OpenAI alias that resolves to the latest gpt-5 variant. It is deliberately unpinned to keep the baseline portable; the meta-agent is expected to experiment with model selection (tracked under the `model_selection` category) as part of the autoharness experiment loop. Do not flag this as an incomplete or invalid model identifier.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
autoharness/orchestrator/orchestrator.py (1)

875-875: Remove unnecessary f-string prefix.

The string has no placeholders, so the f prefix is extraneous.

🧹 Suggested fix

```diff
-            return _err({"error": f"Tag not found"}, 404)
+            return _err({"error": "Tag not found"}, 404)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/orchestrator/orchestrator.py` at line 875, The return statement
is using an unnecessary f-string for a static message; update the call to _err
so the error value uses a plain string instead of an f-string (replace f"Tag not
found" with "Tag not found") in the return _err({"error": ...}, 404) expression
to remove the extraneous prefix.
autoharness/README.md (1)

15-21: Minor grammar polish (optional).

Static analysis suggests:

  • Line 15: "single file harness" → "single-file harness" (compound adjective)
  • Line 66: "auto keep/discard" → "auto-keep/discard"

The repetitive "It doesn't" pattern at line 21 is intentional emphasis and reads fine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/README.md` around lines 15 - 21, Update the README phrasing:
replace the phrase "single file harness" with the compound adjective
"single-file harness" and change "auto keep/discard" to "auto-keep/discard" to
fix the grammar; locate these exact strings in autoharness/README.md (the
occurrences currently shown as "single file harness" and "auto keep/discard")
and apply the hyphenation in those two places.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@autoharness/orchestrator/orchestrator.py`:
- Line 875: The return statement is using an unnecessary f-string for a static
message; update the call to _err so the error value uses a plain string instead
of an f-string (replace f"Tag not found" with "Tag not found") in the return
_err({"error": ...}, 404) expression to remove the extraneous prefix.

In `@autoharness/README.md`:
- Around line 15-21: Update the README phrasing: replace the phrase "single file
harness" with the compound adjective "single-file harness" and change "auto
keep/discard" to "auto-keep/discard" to fix the grammar; locate these exact
strings in autoharness/README.md (the occurrences currently shown as "single
file harness" and "auto keep/discard") and apply the hyphenation in those two
places.

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `24e2266a-b23d-4950-b4f1-5af38ac0c3c1`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between aa4929b and a036132.

</details>

<details>
<summary>📒 Files selected for processing (2)</summary>

* `autoharness/README.md`
* `autoharness/orchestrator/orchestrator.py`

</details>

</details>


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
autoharness/tasks/file-organizer/environment/Dockerfile (1)

1-1: Avoid mutable latest for the base image.

Using autoharness-base:latest makes task environments non-reproducible across runs. The base image (autoharness/Dockerfile.base) should be pinned to a specific version tag to ensure consistent behavior.

Proposed change

```diff
-FROM autoharness-base:latest
+ARG AUTOHARNESS_BASE_IMAGE=autoharness-base:2026-04-06
+FROM ${AUTOHARNESS_BASE_IMAGE}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/tasks/file-organizer/environment/Dockerfile` at line 1, The
Dockerfile uses an unpinned base image tag "autoharness-base:latest", which
makes builds non-reproducible; update the FROM instruction in the Dockerfile
(the line containing FROM autoharness-base:latest) to reference a specific,
immutable tag (e.g., autoharness-base:vX.Y.Z) or introduce a build ARG like
BASE_IMAGE_TAG and use FROM autoharness-base:${BASE_IMAGE_TAG} so the base image
is pinned and can be overridden during builds.
autoharness/orchestrator/test_orchestrator.py (1)

353-354: Remove extraneous f prefix from strings without placeholders.

These strings don't contain any format placeholders, so the f prefix is unnecessary.

✨ Fix f-string warnings

```diff
-        print(f"  Start: iii --config iii-config.yaml")
-        print(f"  Then:  cd workers/orchestrator && uv run orchestrator.py")
+        print("  Start: iii --config iii-config.yaml")
+        print("  Then:  cd orchestrator && uv run orchestrator.py")
```

Note: Also updated the path from workers/orchestrator to orchestrator to match the actual directory structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/orchestrator/test_orchestrator.py` around lines 353 - 354, The
two print statements using f-strings in test_orchestrator.py are unnecessary
since there are no placeholders; remove the leading "f" from the literals in the
print calls that output "  Start: iii --config iii-config.yaml" and "  Then:  cd
workers/orchestrator && uv run orchestrator.py" and also update the second
string to use the correct path ("orchestrator" instead of
"workers/orchestrator") so the prints read as plain string literals reflecting
the real directory; locate the calls by searching for the print invocations in
test_orchestrator.py (the lines that currently start with print(f"  Start:") and
print(f"  Then:")) and make the edits there.
autoharness/plot_progress.py (2)

36-41: Add error handling for HTTP failures.

fetch_json will raise an unhandled exception on HTTP errors (4xx/5xx) or network failures. Consider catching urllib.error.HTTPError and urllib.error.URLError to provide clearer error messages.

🛡️ Add error handling

```diff
 def fetch_json(url, data=None):
     body = json.dumps(data).encode() if data else None
     headers = {"Content-Type": "application/json"} if body else {}
     req = urllib.request.Request(url, data=body, headers=headers, method="POST" if body else "GET")
-    with urllib.request.urlopen(req, timeout=10) as resp:
-        return json.loads(resp.read())
+    try:
+        with urllib.request.urlopen(req, timeout=10) as resp:
+            return json.loads(resp.read())
+    except urllib.error.HTTPError as e:
+        raise SystemExit(f"HTTP {e.code} from {url}: {e.read().decode()[:200]}")
+    except urllib.error.URLError as e:
+        raise SystemExit(f"Cannot reach {url}: {e.reason}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/plot_progress.py` around lines 36 - 41, fetch_json currently lets
urllib exceptions propagate; wrap the urllib.request.urlopen call in a
try/except inside fetch_json and catch urllib.error.HTTPError and
urllib.error.URLError (optionally a generic Exception fallback). For HTTPError
(from urllib.error import HTTPError), include the status code, reason and any
response body when available in the raised/logged message; for URLError include
the underlying reason (e.g., timeout, DNS). Re-raise a clearer exception (or
raise RuntimeError) with context including the URL and method and attach the
original exception so callers can inspect it; update the fetch_json function to
perform this handling.

88-88: Add strict=True to zip() for safer iteration.

The zip() call iterates over kept_x, kept_y, and kept_labels which are constructed together and should always have matching lengths. Adding strict=True will catch any future bugs where lengths diverge.

✨ Proposed fix

```diff
-        for x, y, label in zip(kept_x, kept_y, kept_labels):
+        for x, y, label in zip(kept_x, kept_y, kept_labels, strict=True):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/plot_progress.py` at line 88, The loop that unpacks kept_x,
kept_y, and kept_labels using zip(kept_x, kept_y, kept_labels) should use
strict=True to assert all three sequences have identical lengths; update the
for-loop that iterates over kept_x/kept_y/kept_labels (the line with for x, y,
label in zip(...)) to call zip with strict=True to fail fast if lengths diverge.
autoharness/orchestrator/orchestrator.py (1)

875-875: Remove extraneous f prefix.

This f-string has no placeholders.

```diff
-            return _err({"error": f"Tag not found"}, 404)
+            return _err({"error": "Tag not found"}, 404)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/orchestrator/orchestrator.py` at line 875, The return statement
currently calls _err({"error": f"Tag not found"}, 404) with an unnecessary
f-string; update the call to use a plain string literal for the error message
(e.g., {"error": "Tag not found"}) so remove the leading f in the dictionary
value used in the _err(...) return.
autoharness/agent.py (1)

51-54: Debug prints always output zero values.

These print statements output hardcoded zeros for total_tokens and estimated_cost. If token/cost tracking is intended, this needs implementation; otherwise, consider removing these debug artifacts.

🧹 Remove placeholder debug prints

```diff
         output = result.stdout + result.stderr
-        total_tokens = 0
-        estimated_cost = 0.0
-        print(f"total_tokens:{total_tokens}")
-        print(f"estimated_cost:{estimated_cost}")
         return output[-10000:] if len(output) > 10000 else output
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/agent.py` around lines 51 - 54, The two debug print statements
printing hardcoded zeros for total_tokens and estimated_cost should be removed
(or replaced with real tracking). Delete the lines printing total_tokens and
estimated_cost in autoharness/agent.py, and if token/cost tracking is desired,
implement logic to compute and update the total_tokens and estimated_cost
variables where request/response tokens are counted (update the same
total_tokens and estimated_cost variables) and only log or print them after they
reflect actual values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@autoharness/orchestrator/orchestrator.py`:
- Around line 638-639: adapt() and suggest() are currently reading raw
near-misses via kv.list(SCOPES["near_misses"]) and only filtering by tag, so
they can return entries that no longer meet the active threshold; extract a
shared helper (e.g., get_valid_near_misses(best, tag) or
filter_near_misses_against_best) that takes the current best result and tag,
calls kv.list(SCOPES["near_misses"]) and re-applies the same threshold/distance
re-filtering logic used by the near_misses endpoint, then update adapt() and
suggest() to call this helper (instead of the simple tag-only filter) so both
functions only use near-misses that are still within the valid threshold
relative to the current best.

---

Nitpick comments:
In `@autoharness/agent.py`:
- Around line 51-54: The two debug print statements printing hardcoded zeros for
total_tokens and estimated_cost should be removed (or replaced with real
tracking). Delete the lines printing total_tokens and estimated_cost in
autoharness/agent.py, and if token/cost tracking is desired, implement logic to
compute and update the total_tokens and estimated_cost variables where
request/response tokens are counted (update the same total_tokens and
estimated_cost variables) and only log or print them after they reflect actual
values.

In `@autoharness/orchestrator/orchestrator.py`:
- Line 875: The return statement currently calls _err({"error": f"Tag not
found"}, 404) with an unnecessary f-string; update the call to use a plain
string literal for the error message (e.g., {"error": "Tag not found"}) so
remove the leading f in the dictionary value used in the _err(...) return.

In `@autoharness/orchestrator/test_orchestrator.py`:
- Around line 353-354: The two print statements using f-strings in
test_orchestrator.py are unnecessary since there are no placeholders; remove the
leading "f" from the literals in the print calls that output "  Start: iii
--config iii-config.yaml" and "  Then:  cd workers/orchestrator && uv run
orchestrator.py" and also update the second string to use the correct path
("orchestrator" instead of "workers/orchestrator") so the prints read as plain
string literals reflecting the real directory; locate the calls by searching for
the print invocations in test_orchestrator.py (the lines that currently start
with print(f"  Start:") and print(f"  Then:")) and make the edits there.

In `@autoharness/plot_progress.py`:
- Around line 36-41: fetch_json currently lets urllib exceptions propagate; wrap
the urllib.request.urlopen call in a try/except inside fetch_json and catch
urllib.error.HTTPError and urllib.error.URLError (optionally a generic Exception
fallback). For HTTPError (from urllib.error import HTTPError), include the
status code, reason and any response body when available in the raised/logged
message; for URLError include the underlying reason (e.g., timeout, DNS).
Re-raise a clearer exception (or raise RuntimeError) with context including the
URL and method and attach the original exception so callers can inspect it;
update the fetch_json function to perform this handling.
- Line 88: The loop that unpacks kept_x, kept_y, and kept_labels using
zip(kept_x, kept_y, kept_labels) should use strict=True to assert all three
sequences have identical lengths; update the for-loop that iterates over
kept_x/kept_y/kept_labels (the line with for x, y, label in zip(...)) to call
zip with strict=True to fail fast if lengths diverge.

In `@autoharness/tasks/file-organizer/environment/Dockerfile`:
- Line 1: The Dockerfile uses an unpinned base image tag
"autoharness-base:latest", which makes builds non-reproducible; update the FROM
instruction in the Dockerfile (the line containing FROM autoharness-base:latest)
to reference a specific, immutable tag (e.g., autoharness-base:vX.Y.Z) or
introduce a build ARG like BASE_IMAGE_TAG and use FROM
autoharness-base:${BASE_IMAGE_TAG} so the base image is pinned and can be
overridden during builds.

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `fbf124c8-cb3f-4e7a-a13a-34d6ef72eea6`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between a036132 and 7e0adfa.

</details>

<details>
<summary>📒 Files selected for processing (18)</summary>

* `autoharness/Dockerfile.base`
* `autoharness/README.md`
* `autoharness/agent.py`
* `autoharness/bench.sh`
* `autoharness/iii-config.yaml`
* `autoharness/orchestrator/orchestrator.py`
* `autoharness/orchestrator/test_orchestrator.py`
* `autoharness/plot_progress.py`
* `autoharness/program.md`
* `autoharness/tasks/file-organizer/environment/Dockerfile`
* `autoharness/tasks/file-organizer/task.toml`
* `autoharness/tasks/file-organizer/tests/test.sh`
* `autoharness/tasks/fizzbuzz/instruction.md`
* `autoharness/tasks/fizzbuzz/task.toml`
* `autoharness/tasks/fizzbuzz/tests/test.sh`
* `autoharness/tasks/hello-world/instruction.md`
* `autoharness/tasks/hello-world/task.toml`
* `autoharness/tasks/hello-world/tests/test.sh`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (10)</summary>

* `autoharness/tasks/hello-world/instruction.md`
* `autoharness/tasks/fizzbuzz/instruction.md`
* `autoharness/tasks/hello-world/tests/test.sh`
* `autoharness/Dockerfile.base`
* `autoharness/iii-config.yaml`
* `autoharness/tasks/hello-world/task.toml`
* `autoharness/tasks/file-organizer/tests/test.sh`
* `autoharness/tasks/file-organizer/task.toml`
* `autoharness/program.md`
* `autoharness/tasks/fizzbuzz/task.toml`

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (1)</summary>

* `autoharness/tasks/fizzbuzz/tests/test.sh`

</details>

</details>

Comment on lines +638 to +639
```python
all_nm = await kv.list(SCOPES["near_misses"])
near_misses = [n for n in all_nm if n.get("tag") == inp.get("tag")]
```


⚠️ Potential issue | 🟡 Minor

Internal near-miss consumers bypass threshold re-filtering.

While the near_misses endpoint (lines 380-397) correctly re-filters against the current best, both adapt() (lines 638-639) and suggest() (lines 677-678) fetch raw near-misses filtered only by tag. This means combine-mode selection and suggestion generation may use stale entries that have drifted outside the threshold.

🔧 Proposed fix — extract a shared helper

```diff
+async def _get_filtered_near_misses(kv, tag):
+    """Get near-misses re-filtered against current best threshold."""
+    current_best = await kv.get(SCOPES["best"], tag)
+    all_nm = await kv.list(SCOPES["near_misses"])
+    result = []
+    for n in all_nm:
+        if n.get("tag") != tag:
+            continue
+        if current_best:
+            delta_p = n.get("passed", 0) - current_best["passed"]
+            delta_s = n.get("aggregate_score", 0) - current_best["aggregate_score"]
+            if abs(delta_p) > 1 or abs(delta_s) > NEAR_MISS_THRESHOLD:
+                continue
+        result.append(n)
+    return result


 # In adapt():
-        all_nm = await kv.list(SCOPES["near_misses"])
-        near_misses = [n for n in all_nm if n.get("tag") == inp.get("tag")]
+        near_misses = await _get_filtered_near_misses(kv, inp.get("tag"))

 # In suggest():
-        all_nm = await kv.list(SCOPES["near_misses"])
-        near_misses = [n for n in all_nm if n.get("tag") == inp.get("tag")]
+        near_misses = await _get_filtered_near_misses(kv, inp.get("tag"))
```

Also applies to: 677-678

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@autoharness/orchestrator/orchestrator.py` around lines 638 - 639, adapt() and
suggest() are currently reading raw near-misses via
kv.list(SCOPES["near_misses"]) and only filtering by tag, so they can return
entries that no longer meet the active threshold; extract a shared helper (e.g.,
get_valid_near_misses(best, tag) or filter_near_misses_against_best) that takes
the current best result and tag, calls kv.list(SCOPES["near_misses"]) and
re-applies the same threshold/distance re-filtering logic used by the
near_misses endpoint, then update adapt() and suggest() to call this helper
(instead of the simple tag-only filter) so both functions only use near-misses
that are still within the valid threshold relative to the current best.
