Autonomous agent engineering worker on iii-engine. A meta-agent modifies an agent harness (system prompt, tools, orchestration), runs Harbor benchmarks, and the orchestrator auto-keeps or discards based on score. Adaptive search transitions between explore/exploit/combine/ablation.

- 26 iii functions across 5 groups (experiment, task, search, harness, report)
- 26 HTTP triggers for a REST API at localhost:3111
- Adaptive search strategy with near-miss detection
- Crash recovery with consecutive-crash tracking and auto-abort
- Per-task failure analysis and experiment diffing
- Harness snapshots for instant save/restore
- Multi-dataset progress chart generator (plot_progress.py)
- 3 sample Harbor benchmark tasks (hello-world, fizzbuzz, file-organizer)
- 50 integration tests, all passing

Inspired by kevinrgu/autoagent, built from scratch on iii-engine primitives (the same relationship as karpathy/autoresearch → n-autoresearch).
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
📝 Walkthrough

Adds an autoharness: a meta-agent harness, an orchestrator worker exposing REST/SDK endpoints, task fixtures and verifiers, CLI tooling (bench, plotting), runtime/config (iii-config, Docker base), docs/tests/packaging, and a small Rust test assertion tweak.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant API as "iii-engine REST<br/>(HTTP)"
    participant Orch as "Orchestrator<br/>Worker"
    participant KV as "State KV"
    participant Agent as "agent.py<br/>Meta-Agent"
    participant Task as "Task Container"
    User->>API: POST /api/experiment/setup (tag)
    API->>Orch: dispatch setup
    Orch->>KV: persist tag metadata
    KV-->>Orch: ack
    User->>API: POST /api/search/suggest
    Orch->>KV: read history & failures
    KV-->>Orch: history
    Orch-->>API: suggestion
    User->>API: POST /api/experiment/register
    Orch->>KV: create experiment record
    KV-->>Orch: experiment_id
    User->>API: POST /api/task/batch (concurrency)
    Orch->>Task: run task N (concurrent)
    Task->>Agent: invoke harness (instruction.md)
    Agent-->>Task: trajectory + logs + score
    Task->>Orch: result + tails
    Orch->>KV: store task result
    KV-->>Orch: ack
    User->>API: POST /api/experiment/complete (results)
    Orch->>KV: compute best / near-miss / crash counters
    KV-->>Orch: previous best
    Orch-->>API: keep/discard/abort decision
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
- 📝 Generate docstrings
- 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
…ences, update counts
Actionable comments posted: 20
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autoharness/agent.py`:
- Around line 41-47: run_shell currently runs subprocesses in the process CWD so
file ops hit the wrong place; update run_shell to execute commands in the task
workspace by passing the task directory as the subprocess cwd (e.g., accept an
optional cwd param or obtain the agent/task workspace and supply cwd=task_dir to
subprocess.run). Apply the same change to the other subprocess.run usages
referenced around lines 114-123 and 138-145 so all shell invocations (the
run_shell function and the other subprocess.run calls) use the task workspace as
cwd rather than the process CWD.
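The change can be sketched as follows. The helper name `run_shell` and the `cwd` idea come from the comment above; the exact signature and return convention are illustrative, while `cwd=` itself is the standard `subprocess.run` argument:

```python
import subprocess
from pathlib import Path

def run_shell(cmd, cwd, timeout=60):
    """Run a shell command inside the task workspace, not the process CWD."""
    result = subprocess.run(
        cmd,
        shell=True,
        cwd=str(cwd),            # pins every file operation to the task directory
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

Applying the same `cwd=` argument to the other `subprocess.run` call sites keeps all shell invocations rooted in the task workspace.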
In `@autoharness/bench.sh`:
- Around line 32-42: The script currently attempts to start the orchestrator
from workers/orchestrator but the actual worker is at
orchestrator/orchestrator.py; update the bootstrapping sequence in bench.sh so
after cd "$(dirname "$0")" you change into the correct directory (relative path
from autoharness to the orchestrator, e.g. ../orchestrator) before launching
python3 orchestrator.py &, then capture its PID into ORCH_PID as before; ensure
the sleep and subsequent cd ../.. (or adjusted return path) are updated so the
script returns to the original working location and III_PID/ORCH_PID behavior is
unchanged.
In `@autoharness/Dockerfile.base`:
- Around line 15-18: The Dockerfile currently runs as root (uses RUN mkdir -p
/task/logs, WORKDIR /task, CMD ["sleep","infinity"]); create a dedicated
unprivileged user (e.g., group/user name like taskuser/taskgroup with a stable
non-root UID/GID such as 1000), chown the working directory and /task/logs to
that user, set USER to that unprivileged account, and ensure HOME is set
appropriately so generated scripts run with least privilege; update the
Dockerfile to add the user/group, change ownership of /task and /task/logs, and
switch to that user before the CMD.
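A hedged sketch of that Dockerfile change (the `python:3.11-slim` base and the `taskuser`/`taskgroup` names are placeholders; only `/task/logs`, `WORKDIR /task`, and the `sleep infinity` CMD come from the review):

```dockerfile
FROM python:3.11-slim

# Create an unprivileged user/group with a stable non-root UID/GID.
RUN groupadd --gid 1000 taskgroup \
    && useradd --uid 1000 --gid taskgroup --create-home taskuser \
    && mkdir -p /task/logs \
    && chown -R taskuser:taskgroup /task

WORKDIR /task
USER taskuser
ENV HOME=/home/taskuser
CMD ["sleep", "infinity"]
```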
In `@autoharness/iii-config.yaml`:
- Line 15: The CORS setting allowed_origins: ["*"] is too permissive for the
local control-plane API; update the configuration to list only trusted origins
(e.g., specific host(s) and ports) or remove/disable CORS if browser access
isn't needed. Locate the allowed_origins key in autoharness/iii-config.yaml and
replace the wildcard entry with a concrete array of allowed origins (or
comment/remove the setting), making sure the values match the exact origin
strings used by your UI or services.
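A sketch of a tightened config. Only the `allowed_origins` key comes from the review; the surrounding nesting and the concrete origins are assumptions that should be matched to the actual iii-config.yaml layout and the origins the UI really uses:

```yaml
# iii-config.yaml: allow only the origins the local control plane actually serves
server:
  cors:
    allowed_origins:
      - "http://localhost:3111"
      - "http://127.0.0.1:3111"
```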
In `@autoharness/orchestrator/orchestrator.py`:
- Around line 275-295: When writing near-miss entries, don't rely on the local
`best` captured at insert time; instead re-fetch the current best for the
experiment's tag and use that when calling `_is_near_miss` before `kv.set` into
`SCOPES["near_misses"]`. Update the logic around the `_is_near_miss(improved,
best, delta_passed, delta_score)` checks (and their counterparts near the other
occurrences) to call the KV to get the up-to-date best (e.g., `await
kv.get(SCOPES["best"], exp["tag"])`) and compute `delta_*` relative to that
current best so only true near-misses are appended to `SCOPES["near_misses"]`.
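A minimal sketch of that fix, assuming the async `kv.get`/`kv.set` API and the `SCOPES` mapping referenced above; the `passed`/`score` field names and the 0.05 threshold are illustrative, and the `_is_near_miss` condition is inlined:

```python
import asyncio

NEAR_MISS_THRESHOLD = 0.05  # illustrative value

async def record_result(kv, scopes, exp, result):
    """Write a near-miss entry only relative to the *current* best for the tag."""
    best = await kv.get(scopes["best"], exp["tag"])  # re-fetch at write time
    if best is None:
        return False
    delta_passed = result["passed"] - best["passed"]
    delta_score = result["score"] - best["score"]
    improved = delta_score > 0
    # Inlined version of the _is_near_miss check described in the review.
    if not improved and abs(delta_passed) <= 1 and abs(delta_score) <= NEAR_MISS_THRESHOLD:
        await kv.set(scopes["near_misses"], exp["id"], {
            "experiment_id": exp["id"],
            "delta_passed": delta_passed,
            "delta_score": delta_score,
        })
        return True
    return False
```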
- Around line 438-459: Ensure we don't read stale artifacts by verifying the log
files were written during this run: when checking reward_file (task_path /
"logs" / "reward.txt") and traj_file (task_path / "logs" / "agent" /
"trajectory.json"), first confirm .exists() AND that their modification time
(e.g., Path.stat().st_mtime) is >= the run start timestamp (start) before
attempting to read/parse; if the mtime is older than start, treat the file as
absent so score/trajectory remain default. Apply this check where reward_file
and traj_file are used in orchestrator.py.
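A freshness guard along these lines would work for both `reward_file` and `traj_file` (the helper name and return convention are ours; the stale-means-absent behavior is from the comment above):

```python
import time
from pathlib import Path

def read_if_fresh(path, run_start):
    """Return file contents only if the file was (re)written during this run.

    run_start is the run's start timestamp (time.time()); artifacts whose
    mtime predates it are leftovers from a previous run and treated as absent.
    """
    if not path.exists():
        return None
    if path.stat().st_mtime < run_start:
        return None  # stale artifact: leave score/trajectory at their defaults
    return path.read_text()
```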
- Around line 611-620: The adapt function is taking an unordered slice
tag_exps[-10:] which is nondeterministic because kv.list(SCOPES["experiments"])
isn't ordered; instead, sort tag_exps by a reliable timestamp field (e.g.,
"created_at", "started_at", or "timestamp") in ascending chronological order (or
descending then take first 10) before computing recent and keep_rate. Update the
code around adapt / tag_exps / recent to perform a stable sort (falling back to
a default value for missing timestamps) and then compute recent =
sorted_tag_exps[-10:] (or equivalent) so keep_rate is computed from the actual
most-recent experiments.
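Sketch of the stable sort (`created_at` is one of the candidate field names the comment lists; missing timestamps fall back to epoch 0 so they can never displace genuinely recent runs):

```python
def most_recent(tag_exps, n=10):
    """kv.list() gives no ordering guarantee, so sort by timestamp before slicing."""
    ordered = sorted(tag_exps, key=lambda e: e.get("created_at", 0))
    return ordered[-n:]  # the n chronologically latest experiments
```

`keep_rate` would then be computed over `most_recent(tag_exps)` instead of a raw `tag_exps[-10:]`.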
- Around line 239-346: Both complete() and crash() currently mutate aggregate
state even if exp["status"] is already terminal, causing double-counting and
inconsistent best/crash records; add an early guard in both functions (before
entering async with _tag_lock and before mutating tag/best/crashes) that checks
exp["status"] against terminal states (e.g., "keep","discard","crash") and
returns immediately (no-op) if already finished. Ensure the guard uses the
loaded exp object from kv.get(SCOPES["experiments"], eid) and prevents any
updates to SCOPES["tags"], SCOPES["best"], or SCOPES["crashes"] when skipping,
so retries/concurrent calls don't change aggregates.
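A minimal idempotency sketch, assuming the async kv API; the terminal status names come from the comment above, while the response shape and the elided scoring logic are illustrative:

```python
import asyncio

TERMINAL_STATES = {"keep", "discard", "crash"}

async def complete(kv, scopes, eid):
    """Finish an experiment; a retried or concurrent call must be a no-op."""
    exp = await kv.get(scopes["experiments"], eid)
    if exp["status"] in TERMINAL_STATES:
        # Already finished: skip all tag/best/crash aggregate updates.
        return {"experiment_id": eid, "status": exp["status"], "skipped": True}
    # ... score results, update tag/best aggregates, decide keep vs. discard ...
    exp["status"] = "keep"
    await kv.set(scopes["experiments"], eid, exp)
    return {"experiment_id": eid, "status": "keep", "skipped": False}
```

The same guard, checked before taking `_tag_lock`, applies to `crash()`.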
In `@autoharness/orchestrator/test_orchestrator.py`:
- Around line 21-25: The test helper api() currently constructs headers without
any Authorization, so tests for harness::snapshot and harness::restore cannot
authenticate when AUTOAGENT_AUTH_TOKEN is set; update api(path, data=None,
method="POST") to read os.environ.get("AUTOAGENT_AUTH_TOKEN") and, if present,
add an "Authorization": f"Bearer {token}" entry to the headers dict (keeping
existing Content-Type behavior), and ensure the same pattern is applied for
other api usages including the snapshot/restore test calls (functions/methods
invoking api for snapshot and restore).
In `@autoharness/plot_progress.py`:
- Around line 117-118: The code assumes fetch_json(..."/api/experiment/history")
returns experiments at the top level but the API responds with
{"statusCode":..., "body":...}; update the handling after fetch_json so you
unwrap the response body before reading experiments: retrieve the inner payload
from data.get("body") (and JSON-decode it if it's a string) and then set
experiments = payload if it's a list else payload.get("experiments", []); adjust
the variables around data and experiments (and keep using args.api and the
"/api/experiment/history" call) so the chart reads the actual experiments array.
In `@autoharness/program.md`:
- Around line 64-83: The documentation hardcodes "agent.py" in Steps 2 and 3
which breaks variant runs when HARNESS_PATH is set (e.g.,
HARNESS_PATH=agent-claude.py); update the instructions and example commands to
use the configured HARNESS_PATH variable instead of the literal agent.py so the
meta-agent edits and commits the actual harness file (reference symbols:
HARNESS_PATH, agent.py, COMMIT, the git add/commit and curl register snippet);
replace occurrences of agent.py in the editable-step and register payload with
the HARNESS_PATH variable or its shell interpolation ($HARNESS_PATH) so the loop
consistently targets the configured harness path.
In `@autoharness/README.md`:
- Around line 33-37: The README claims a `workers/orchestrator/` location and
"33 functions" but the code now registers 26 functions in
`autoharness/orchestrator/orchestrator.py`; update all README references
(including the occurrences around the previously called-out sections) to point
to `orchestrator/` (or the correct relative path) and change the advertised
route/function count from 33 to 26, and adjust any descriptive text that lists
or enumerates those endpoints (e.g., the blocks around lines referenced in the
review: the initial listing, the later summaries, and the detailed section at
392-405) so the directory path and counts match the actual implementation and
any example commands or links refer to
`autoharness/orchestrator/orchestrator.py`.
- Around line 29-30: The README's quick-start credential snippet only sets
ANTHROPIC_API_KEY but the default harness described is agent.py which uses the
gpt-5/OpenAI path; update the README so the credentials match the described
default: either change the quick-start to set the OpenAI/`gpt-5` provider key
(e.g., the variable name used by agent.py) or explicitly instruct readers to
switch to `agent-claude.py` before using ANTHROPIC_API_KEY; reference the files
`agent.py`, `agent-claude.py`, and the `gpt-5` provider in the updated text so
the credential guidance is unambiguous.
In `@autoharness/tasks/file-organizer/tests/test.sh`:
- Around line 17-29: The test currently only verifies destination existence via
the check calls and then writes reward based on SCORE/TOTAL; update test.sh to
also assert that the original root copies of the moved files no longer exist
before granting full score. Add absence checks for the original basenames
(report.pdf, photo.jpg, image.png, notes.txt, readme.txt, data.csv, budget.csv)
— e.g., after the existing check "documents/report.pdf" etc. verify that the
corresponding root paths are missing and decrement or fail the check (adjust
SCORE) if any original file still exists; ensure logic around SCORE, TOTAL and
the final reward write remains consistent so full reward only when both
destination exists and originals are removed.
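In `test.sh` terms, the dual check could be sketched like this (a hedged sketch: the destination directories shown for each file are assumptions, and the final reward write is left to the script's existing SCORE/TOTAL logic):

```shell
#!/usr/bin/env bash
SCORE=0
TOTAL=0

# $1 = expected destination path, $2 = original root path that must be gone
check_moved() {
    TOTAL=$((TOTAL + 1))
    if [ -f "$1" ] && [ ! -e "$2" ]; then
        SCORE=$((SCORE + 1))
    else
        echo "FAIL: $1 (destination missing, or original '$2' still present)"
    fi
}

check_moved documents/report.pdf report.pdf
check_moved images/photo.jpg photo.jpg
check_moved documents/notes.txt notes.txt
echo "$SCORE/$TOTAL moved correctly"
# ... existing logic writes the SCORE/TOTAL-based reward ...
```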
In `@autoharness/tasks/fizzbuzz/environment/Dockerfile`:
- Line 1: Add a non-root user directive to ensure containers don't run as root:
update the Dockerfile (or the base Dockerfile used by this image) to create a
dedicated non-privileged user and set USER to that account (e.g., create a
user/group with adduser/useradd and then include USER nobody or USER
<username>). Ensure any files/directories the container needs are chowned or
have appropriate permissions for that user so runtime steps in this Dockerfile
(FROM autoagent-base:latest) continue to work without root.
In `@autoharness/tasks/fizzbuzz/instruction.md`:
- Line 3: Update the instruction text that currently instructs "When run with
`python fizzbuzz.py`" to use the same interpreter as the verifier by changing it
to "When run with `python3 fizzbuzz.py`"; locate the phrase in
autoharness/tasks/fizzbuzz/instruction.md (the sentence starting with "When run
with") and replace "python" with "python3" so wording is consistent with the
verifier.
In `@autoharness/tasks/fizzbuzz/tests/test.sh`:
- Line 2: The test.sh uses "set -e" which can abort the script before emitting
the reward if "python3 fizzbuzz.py" exits non-zero; change the script to remove
or override the immediate-exit behavior and add a trap or explicit capture of
the exit code around the "python3 fizzbuzz.py" invocation so the script always
writes the reward/result file (e.g., capture "$?" into a variable after running
python3 fizzbuzz.py, write the reward via the existing reward emission commands,
then exit with the captured status), and apply the same pattern for the other
occurrences referenced (lines around 12-13 and line 32) to ensure crash-safe
reward emission.
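A crash-safe shape for the verifier could look like this (hedged: the reward path is parameterized here rather than hard-coded to the task's real `/logs` location, and the success branch is a placeholder for the actual output checks):

```shell
#!/usr/bin/env bash
# Deliberately no `set -e`: the verifier must emit a reward even when the
# solution crashes.
REWARD_FILE="${REWARD_FILE:-reward.txt}"
REWARD=0

OUTPUT=$(python3 fizzbuzz.py 2>&1)
STATUS=$?

if [ "$STATUS" -eq 0 ]; then
    REWARD=1    # placeholder: real checks validate OUTPUT line by line
else
    echo "fizzbuzz.py failed (exit $STATUS): $OUTPUT"
fi

echo "$REWARD" > "$REWARD_FILE"    # always reached, crash or not
```

The real script can additionally `exit "$STATUS"` after the reward write if callers inspect the exit code.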
- Around line 22-30: The test currently only checks LINE1, LINE3, LINE5, LINE15
which is too weak and can be gamed; replace these spot checks with a full 1..100
verification: loop i from 1 to 100, extract the ith line from ACTUAL (e.g., via
sed -n "${i}p"), compute the expected FizzBuzz value for i ("Fizz" if i%3==0,
"Buzz" if i%5==0, "FizzBuzz" if both, otherwise the number), compare the
expected string to the extracted line, and on mismatch print a clear message
including i, expected, and actual and set PASS=0; ensure you stop or continue
based on existing test semantics and handle any trailing whitespace/newline
differences when comparing.
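The full 1..100 verification can be wrapped in a function like this (the function name and the stderr convention for mismatch messages are ours; the modular-arithmetic expectations and the `sed -n "${i}p"` extraction are from the comment above):

```shell
#!/usr/bin/env bash
fizz_verify() {
    local actual pass=1 i expected line
    actual=$(python3 fizzbuzz.py 2>/dev/null)
    for i in $(seq 1 100); do
        if [ $((i % 15)) -eq 0 ]; then expected="FizzBuzz"
        elif [ $((i % 3)) -eq 0 ]; then expected="Fizz"
        elif [ $((i % 5)) -eq 0 ]; then expected="Buzz"
        else expected="$i"; fi
        # Strip whitespace so trailing spaces/newlines don't cause false failures.
        line=$(printf '%s\n' "$actual" | sed -n "${i}p" | tr -d '[:space:]')
        if [ "$line" != "$expected" ]; then
            echo "line $i: expected '$expected', got '$line'" >&2
            pass=0
        fi
    done
    echo "$pass"
}
```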
In `@autoharness/tasks/hello-world/instruction.md`:
- Around line 3-5: The fenced code block containing Hello, World! is missing a
language hint; update the opening fence to include a language (e.g., add "text"
after the triple backticks) so the block becomes ```text and the code block now
satisfies MD040 linting for autoharness/tasks/hello-world/instruction.md.
In `@autoharness/tasks/hello-world/tests/test.sh`:
- Around line 4-13: The verifier currently uses command substitution
(EXPECT/ACTUAL) which strips trailing newlines; instead create a byte-exact
expected file (e.g., write the exact "Hello, World!\n" into a temporary expected
file) and perform a binary comparison (use cmp -s or diff --no-deref) against
/task/hello.txt to enforce exact newline semantics; update the logic that sets
EXPECTED/ACTUAL and the if-condition to use the binary compare result to decide
writing 1/0 to /logs/verifier/reward.txt and printing PASS/FAIL.
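The byte-exact comparison can be factored as below (the function name is ours; `cmp -s` and the exact `"Hello, World!\n"` expectation are from the comment above, and the real script would redirect the function's output into `/logs/verifier/reward.txt`):

```shell
#!/usr/bin/env bash
verify_hello() {  # $1 = candidate file; prints 1 (pass) or 0 (fail)
    local expected
    expected=$(mktemp)
    printf 'Hello, World!\n' > "$expected"   # byte-exact, trailing newline included
    if cmp -s "$expected" "$1"; then
        echo 1
    else
        echo 0
    fi
    rm -f "$expected"
}

# In test.sh: verify_hello /task/hello.txt > /logs/verifier/reward.txt
```

Unlike `ACTUAL=$(cat ...)`, `cmp` never strips trailing newlines, so a file missing the final newline correctly fails.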
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f4a36929-95d2-486d-a00a-35ffe0802876
⛔ Files ignored due to path filters (1)
autoharness/progress.png is excluded by !**/*.png
📒 Files selected for processing (24)
- autoharness/.dockerignore
- autoharness/.gitignore
- autoharness/Dockerfile.base
- autoharness/README.md
- autoharness/agent.py
- autoharness/bench.sh
- autoharness/iii-config.yaml
- autoharness/orchestrator/orchestrator.py
- autoharness/orchestrator/test_orchestrator.py
- autoharness/plot_progress.py
- autoharness/program.md
- autoharness/pyproject.toml
- autoharness/tasks/file-organizer/environment/Dockerfile
- autoharness/tasks/file-organizer/instruction.md
- autoharness/tasks/file-organizer/task.toml
- autoharness/tasks/file-organizer/tests/test.sh
- autoharness/tasks/fizzbuzz/environment/Dockerfile
- autoharness/tasks/fizzbuzz/instruction.md
- autoharness/tasks/fizzbuzz/task.toml
- autoharness/tasks/fizzbuzz/tests/test.sh
- autoharness/tasks/hello-world/environment/Dockerfile
- autoharness/tasks/hello-world/instruction.md
- autoharness/tasks/hello-world/task.toml
- autoharness/tasks/hello-world/tests/test.sh
autoharness/bench.sh (outdated)

```sh
cd "$(dirname "$0")"

iii --config iii-config.yaml &
III_PID=$!
sleep 2

cd workers/orchestrator
python3 orchestrator.py &
ORCH_PID=$!
sleep 3
cd ../..
```
Bootstrap the orchestrator from the real path.
After cd "$(dirname "$0")", this script descends into workers/orchestrator, but the worker added in this PR lives under orchestrator/orchestrator.py. The auto-start path will fail before any benchmark runs.
💡 Suggested fix

```diff
-cd workers/orchestrator
+cd orchestrator
 python3 orchestrator.py &
 ORCH_PID=$!
 sleep 3
-cd ../..
+cd ..
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
cd "$(dirname "$0")"
iii --config iii-config.yaml &
III_PID=$!
sleep 2
cd orchestrator
python3 orchestrator.py &
ORCH_PID=$!
sleep 3
cd ..
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@autoharness/bench.sh` around lines 32 - 42, The script currently attempts to
start the orchestrator from workers/orchestrator but the actual worker is at
orchestrator/orchestrator.py; update the bootstrapping sequence in bench.sh so
after cd "$(dirname "$0")" you change into the correct directory (relative path
from autoharness to the orchestrator, e.g. ../orchestrator) before launching
python3 orchestrator.py &, then capture its PID into ORCH_PID as before; ensure
the sleep and subsequent cd ../.. (or adjusted return path) are updated so the
script returns to the original working location and III_PID/ORCH_PID behavior is
unchanged.
Actionable comments posted: 1
♻️ Duplicate comments (1)
autoharness/README.md (1)
118-127: ⚠️ Potential issue | 🟠 Major

Clarify default provider credentials in Quick Start to avoid first-run failures.

Lines 125-127 only document ANTHROPIC_API_KEY, while the README's default harness description centers `agent.py`. If `agent.py` follows the OpenAI Agents SDK path, users will run with the wrong credential by default. Please either document the OpenAI key in Quick Start or explicitly instruct switching to the Anthropic harness before running `claude`.

Proposed README patch
````diff
 Requirements: Docker, Python 3.10+, [iii-engine](https://github.com/iii-hq/iii-engine), [Harbor](https://github.com/laude-institute/harbor), and whatever model-provider credentials your agent harness requires.

 ```bash
 uv tool install harbor
 cd autoharness
 cat > .env << 'EOF'
-ANTHROPIC_API_KEY=sk-ant-...
+# If using agent.py (OpenAI Agents SDK path):
+OPENAI_API_KEY=sk-...
+
+# If using a Claude-based harness:
+ANTHROPIC_API_KEY=sk-ant-...
 EOF
 ```
@@
 ```bash
-claude -p "Read program.md and let's kick off a new experiment!"
+claude -p "Read program.md and let's kick off a new experiment!"  # Claude harness path
 ```
````

Also applies to: 145-147

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@autoharness/README.md` around lines 118 - 127, Update the Quick Start env
setup in README to include both OPENAI_API_KEY and ANTHROPIC_API_KEY examples
and/or add a short note instructing users to switch to the Anthropic harness
before running the "claude" command; specifically, modify the .env snippet
referenced near agent.py / the default harness and the subsequent example
invoking "claude" so it either shows OPENAI_API_KEY by default with a comment
that Claude requires ANTHROPIC_API_KEY, or include both keys with clear comments
indicating which harness (agent.py/OpenAI Agents SDK vs Claude/Anthropic) they
correspond to.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autoharness/README.md`:
- Line 43: Multiple fenced code blocks in the README are missing language
identifiers which triggers MD040; update each triple-backtick block shown in the
diff by adding appropriate language tags (e.g., use "text" for plain ASCII-art
or directory listings and "http" for HTTP request/response examples) so the
blocks become ```text or ```http; search for the unlabeled ``` blocks around the
examples referenced (ASCII header, tasks/my-task/, and the various API examples
like POST /api/experiment/setup, GET /api/task/list, POST /api/search/suggest,
GET /api/harness/read, POST /api/report/summary, and autoharness/) and add the
corresponding language identifier to each opening fence.
Duplicate comments:
In `@autoharness/README.md`:
- Around line 118-127: Update the Quick Start env setup in README to include
both OPENAI_API_KEY and ANTHROPIC_API_KEY examples and/or add a short note
instructing users to switch to the Anthropic harness before running the "claude"
command; specifically, modify the .env snippet referenced near agent.py / the
default harness and the subsequent example invoking "claude" so it either shows
OPENAI_API_KEY by default with a comment that Claude requires ANTHROPIC_API_KEY,
or include both keys with clear comments indicating which harness
(agent.py/OpenAI Agents SDK vs Claude/Anthropic) they correspond to.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7b76cc75-66f0-44c3-9bc5-4ed261f6ca33

📥 Commits
Reviewing files that changed from the base of the PR and between 2ba8bd6b9a8166d8db348a7add3d1fa09b585593 and cabad42abe1daecd28e3589330cc028f8d508fc7.

⛔ Files ignored due to path filters (1)
- image-resize/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)
- LICENSE
- autoharness/README.md
- image-resize/src/manifest.rs

✅ Files skipped from review due to trivial changes (2)
- image-resize/src/manifest.rs
- LICENSE
Actionable comments posted: 3
🧹 Nitpick comments (5)
autoharness/orchestrator/orchestrator.py (1)
864-864: Remove unnecessary f-string prefix. This f-string has no placeholders.
💡 Proposed fix
```diff
- return _err({"error": f"Tag not found"}, 404)
+ return _err({"error": "Tag not found"}, 404)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@autoharness/orchestrator/orchestrator.py` at line 864, The return statement currently uses an unnecessary f-string for a static message (the expression `_err({"error": f"Tag not found"}, 404)`); change it to use a plain string literal for the "Tag not found" message so the call to _err uses `{"error": "Tag not found"}` with the same 404 status, locating the change in orchestrator.py where the return statement is defined.

autoharness/plot_progress.py (2)
36-41: Consider adding URL scheme validation for security. The `fetch_json` function opens arbitrary URLs without validating the scheme. While this is a CLI tool with user-provided `--api`, adding a check for `http://` or `https://` schemes would prevent accidental `file://` access.

💡 Optional hardening
```diff
 def fetch_json(url, data=None):
+    if not url.startswith(("http://", "https://")):
+        raise ValueError(f"URL must use http or https scheme: {url}")
     body = json.dumps(data).encode() if data else None
     headers = {"Content-Type": "application/json"} if body else {}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/plot_progress.py` around lines 36 - 41, The fetch_json function currently accepts arbitrary URL schemes which can allow dangerous accesses (e.g., file://); update fetch_json to parse the provided url (use urllib.parse.urlparse) at the start of the function and validate that parsed.scheme is either "http" or "https", raising a ValueError (or similar) if not allowed; keep the rest of the logic intact so only requests with http/https proceed.
88-88: Add `strict=True` to `zip()` for defensive coding. The `zip(kept_x, kept_y, kept_labels)` call assumes all three lists have equal length. Adding `strict=True` would catch any unexpected length mismatches during development.

💡 Proposed fix
```diff
- for x, y, label in zip(kept_x, kept_y, kept_labels):
+ for x, y, label in zip(kept_x, kept_y, kept_labels, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/plot_progress.py` at line 88, Add defensive checking to the iteration by changing the zip call that iterates over kept_x, kept_y, kept_labels to use zip(..., strict=True); locate the loop using the variables kept_x, kept_y, kept_labels in plot_progress.py (the for x, y, label in zip(...) loop) and update it to pass strict=True so a ValueError is raised if the lists differ in length.

autoharness/agent.py (1)
51-54: Unused token/cost tracking placeholders. `total_tokens` and `estimated_cost` are hardcoded to 0 and printed but never actually calculated. If token tracking is planned, consider either implementing it or removing these lines to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/agent.py` around lines 51 - 54, Remove or implement the unused token/cost placeholders: the variables total_tokens and estimated_cost (and their print statements) in autoharness/agent.py are hardcoded to 0 and never updated; either delete these variables and the two print(...) calls to avoid misleading output, or update the code path that handles LLM requests (e.g., the function/method that calls the model/receives responses) to accumulate token usage into total_tokens and compute estimated_cost (using your tokenizer/usage metadata or response.usage) and then log those computed values instead of the hardcoded ones; ensure you update all references to total_tokens/estimated_cost so no unused variables remain.

autoharness/orchestrator/test_orchestrator.py (1)
353-354: Remove unnecessary f-string prefix. These strings contain no placeholders, so the `f` prefix is unnecessary.

💡 Proposed fix

```diff
- print(f"  Start: iii --config iii-config.yaml")
- print(f"  Then:  cd workers/orchestrator && uv run orchestrator.py")
+ print("  Start: iii --config iii-config.yaml")
+ print("  Then:  cd orchestrator && uv run orchestrator.py")
```

Note: The path in line 354 should also be `orchestrator`, not `workers/orchestrator`, to match the actual directory structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/orchestrator/test_orchestrator.py` around lines 353 - 354, Remove the unnecessary f-string prefixes on the two print calls so they are plain strings (i.e., change print(f"...") to print("...")), and correct the path in the second print to use "orchestrator" instead of "workers/orchestrator" by updating the print statement that currently reads print(f" Then: cd workers/orchestrator && uv run orchestrator.py") accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autoharness/agent.py`:
- Line 37: The MODEL constant currently uses an incomplete identifier "gpt-5";
update the MODEL value to a full, available gpt-5 variant (for example
"gpt-5.4", "gpt-5.4-mini", or "gpt-5.4-nano") by replacing the string assigned
to MODEL in autoharness.agent (symbol: MODEL) with the appropriate complete
model name for your usage.
In `@autoharness/orchestrator/orchestrator.py`:
- Around line 78-84: The _is_near_miss helper currently uses delta_score >
-NEAR_MISS_THRESHOLD which only catches scores slightly worse than best; update
the score check to capture experiments within the threshold in either direction
by replacing that clause with abs(delta_score) <= NEAR_MISS_THRESHOLD (keeping
the existing checks: not improved, best is not None, and abs(delta_passed) <= 1)
so near-misses include runs close to the current best whether slightly better or
worse.
- Around line 380-386: The near_misses handler (async def near_misses) currently
returns stored entries from kv.list(SCOPES["near_misses"]) without comparing
them to the current best, so stale entries can bias search::suggest_direction;
update near_misses to fetch the current best (e.g., from kv or best-record
function), compute each entry's delta relative to that best and re-filter/sort
using the live threshold before returning, or alternatively implement pruning
logic in the routine that records a new best (the function that writes to
SCOPES["near_misses"]) to remove entries whose recomputed delta_score exceeds
threshold—ensure you reference and update entries via the same keys returned by
kv.list and preserve the existing response shape {"near_misses": ..., "total":
...}.
---
Nitpick comments:
In `@autoharness/agent.py`:
- Around line 51-54: Remove or implement the unused token/cost placeholders: the
variables total_tokens and estimated_cost (and their print statements) in
autoharness/agent.py are hardcoded to 0 and never updated; either delete these
variables and the two print(...) calls to avoid misleading output, or update the
code path that handles LLM requests (e.g., the function/method that calls the
model/receives responses) to accumulate token usage into total_tokens and
compute estimated_cost (using your tokenizer/usage metadata or response.usage)
and then log those computed values instead of the hardcoded ones; ensure you
update all references to total_tokens/estimated_cost so no unused variables
remain.
In `@autoharness/orchestrator/orchestrator.py`:
- Line 864: The return statement currently uses an unnecessary f-string for a
static message (the expression _err({"error": f"Tag not found"}, 404)); change
it to use a plain string literal for the "Tag not found" message so the call to
_err uses {"error": "Tag not found"} with the same 404 status, locating the
change in orchestrator.py where the return statement is defined.
In `@autoharness/orchestrator/test_orchestrator.py`:
- Around line 353-354: Remove the unnecessary f-string prefixes on the two print
calls so they are plain strings (i.e., change print(f"...") to print("...")),
and correct the path in the second print to use "orchestrator" instead of
"workers/orchestrator" by updating the print statement that currently reads
print(f" Then: cd workers/orchestrator && uv run orchestrator.py")
accordingly.
In `@autoharness/plot_progress.py`:
- Around line 36-41: The fetch_json function currently accepts arbitrary URL
schemes which can allow dangerous accesses (e.g., file://); update fetch_json to
parse the provided url (use urllib.parse.urlparse) at the start of the function
and validate that parsed.scheme is either "http" or "https", raising a
ValueError (or similar) if not allowed; keep the rest of the logic intact so
only requests with http/https proceed.
- Line 88: Add defensive checking to the iteration by changing the zip call that
iterates over kept_x, kept_y, kept_labels to use zip(..., strict=True); locate
the loop using the variables kept_x, kept_y, kept_labels in plot_progress.py
(the for x, y, label in zip(...) loop) and update it to pass strict=True so a
ValueError is raised if the lists differ in length.
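The scheme guard suggested for `fetch_json` above could be sketched as follows, assuming the function's existing urllib-based body; the `urlparse` check at the top is the only addition.

```python
# Sketch: reject non-http(s) schemes before any request is made. The body
# assumes fetch_json's existing urllib-based implementation; only the
# urlparse guard is new.
import json
import urllib.parse
import urllib.request

def fetch_json(url, data=None):
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
    body = json.dumps(data).encode() if data else None
    headers = {"Content-Type": "application/json"} if body else {}
    req = urllib.request.Request(
        url, data=body, headers=headers, method="POST" if body else "GET"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())

# A file:// URL now fails fast instead of reaching the local filesystem.
try:
    fetch_json("file:///etc/passwd")
except ValueError as e:
    print(e)
```

The guard raises before any request object is built, so no I/O occurs for disallowed schemes.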
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 16d4926f-451c-4f2a-bc06-6e2d9307c6e9
📒 Files selected for processing (14)
- autoharness/Dockerfile.base
- autoharness/README.md
- autoharness/agent.py
- autoharness/bench.sh
- autoharness/iii-config.yaml
- autoharness/orchestrator/orchestrator.py
- autoharness/orchestrator/test_orchestrator.py
- autoharness/plot_progress.py
- autoharness/program.md
- autoharness/tasks/file-organizer/tests/test.sh
- autoharness/tasks/fizzbuzz/instruction.md
- autoharness/tasks/fizzbuzz/tests/test.sh
- autoharness/tasks/hello-world/instruction.md
- autoharness/tasks/hello-world/tests/test.sh
✅ Files skipped from review due to trivial changes (6)
- autoharness/tasks/hello-world/instruction.md
- autoharness/tasks/fizzbuzz/instruction.md
- autoharness/Dockerfile.base
- autoharness/iii-config.yaml
- autoharness/tasks/fizzbuzz/tests/test.sh
- autoharness/program.md
🚧 Files skipped from review as they are similar to previous changes (2)
- autoharness/tasks/hello-world/tests/test.sh
- autoharness/tasks/file-organizer/tests/test.sh
- Verify your solution before finishing.
"""

MODEL = "gpt-5"
🧩 Analysis chain
🌐 Web query:
What are the currently available OpenAI GPT models for the Agents SDK?
💡 Result:
For the OpenAI Agents SDK (which uses the Responses API), the currently available GPT models listed in the API docs are: gpt-5.4, gpt-5.4-mini, and gpt-5.4-nano. [1]
Source(s): [1]
🏁 Script executed:
cat -n autoharness/agent.py | sed -n '30,45p'
Repository: iii-hq/workers
Length of output: 529
Update model identifier to a complete, available version.
MODEL = "gpt-5" is an incomplete model identifier. Update to a complete gpt-5 variant like "gpt-5.4", "gpt-5.4-mini", or "gpt-5.4-nano", whichever is appropriate for your use case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@autoharness/agent.py` at line 37, The MODEL constant currently uses an
incomplete identifier "gpt-5"; update the MODEL value to a full, available gpt-5
variant (for example "gpt-5.4", "gpt-5.4-mini", or "gpt-5.4-nano") by replacing
the string assigned to MODEL in autoharness.agent (symbol: MODEL) with the
appropriate complete model name for your usage.
Intentional — MODEL lives in the editable section (the whole point of autoharness is that the meta-agent modifies this). "gpt-5" is a valid OpenAI alias that resolves to the latest gpt-5 variant. Pinning to a specific sub-version like "gpt-5.4" would make the baseline less portable. The meta-agent can (and should) experiment with model selection as part of the loop — it's one of the 12 tracked categories (model_selection).
@rohitg00, that's a well-reasoned design choice. OpenAI's own docs confirm that gpt-5 is a valid alias with snapshots available, so using the short alias rather than a pinned version makes the baseline more forward-compatible. The meta-agent's ability to experiment with model_selection as a tracked category is a great demonstration of the self-improving loop in action. Withdrawing my concern.
🐇✨
✏️ Learnings added
Learnt from: rohitg00
Repo: iii-hq/workers PR: 7
File: autoharness/agent.py:37-37
Timestamp: 2026-04-06T16:21:56.017Z
Learning: In `autoharness/agent.py`, `MODEL = "gpt-5"` in the editable section is intentional. The `gpt-5` string is a valid OpenAI alias that resolves to the latest gpt-5 variant. It is deliberately unpinned to keep the baseline portable; the meta-agent is expected to experiment with model selection (tracked under the `model_selection` category) as part of the autoharness experiment loop. Do not flag this as an incomplete or invalid model identifier.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧹 Nitpick comments (2)
autoharness/orchestrator/orchestrator.py (1)
875-875: Remove unnecessary f-string prefix. The string has no placeholders, so the `f` prefix is extraneous.
🧹 Suggested fix
- return _err({"error": f"Tag not found"}, 404)
+ return _err({"error": "Tag not found"}, 404)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/orchestrator/orchestrator.py` at line 875, The return statement is using an unnecessary f-string for a static message; update the call to _err so the error value uses a plain string instead of an f-string (replace f"Tag not found" with "Tag not found") in the return _err({"error": ...}, 404) expression to remove the extraneous prefix.
autoharness/README.md (1)
15-21: Minor grammar polish (optional). Static analysis suggests:
- Line 15: "single file harness" → "single-file harness" (compound adjective)
- Line 66: "auto keep/discard" → "auto-keep/discard"
The repetitive "It doesn't" pattern at line 21 is intentional emphasis and reads fine.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/README.md` around lines 15 - 21, Update the README phrasing: replace the phrase "single file harness" with the compound adjective "single-file harness" and change "auto keep/discard" to "auto-keep/discard" to fix the grammar; locate these exact strings in autoharness/README.md (the occurrences currently shown as "single file harness" and "auto keep/discard") and apply the hyphenation in those two places.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@autoharness/orchestrator/orchestrator.py`:
- Line 875: The return statement is using an unnecessary f-string for a static
message; update the call to _err so the error value uses a plain string instead
of an f-string (replace f"Tag not found" with "Tag not found") in the return
_err({"error": ...}, 404) expression to remove the extraneous prefix.
In `@autoharness/README.md`:
- Around line 15-21: Update the README phrasing: replace the phrase "single file
harness" with the compound adjective "single-file harness" and change "auto
keep/discard" to "auto-keep/discard" to fix the grammar; locate these exact
strings in autoharness/README.md (the occurrences currently shown as "single
file harness" and "auto keep/discard") and apply the hyphenation in those two
places.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 24e2266a-b23d-4950-b4f1-5af38ac0c3c1
📒 Files selected for processing (2)
- autoharness/README.md
- autoharness/orchestrator/orchestrator.py
Force-pushed a036132 to 7e0adfa (compare)
Actionable comments posted: 1
🧹 Nitpick comments (6)
autoharness/tasks/file-organizer/environment/Dockerfile (1)
1-1: Avoid mutable `latest` for the base image. Using `autoharness-base:latest` makes task environments non-reproducible across runs. The base image (autoharness/Dockerfile.base) should be pinned to a specific version tag to ensure consistent behavior.
Proposed change
-FROM autoharness-base:latest
+ARG AUTOHARNESS_BASE_IMAGE=autoharness-base:2026-04-06
+FROM ${AUTOHARNESS_BASE_IMAGE}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/tasks/file-organizer/environment/Dockerfile` at line 1, The Dockerfile uses an unpinned base image tag "autoharness-base:latest", which makes builds non-reproducible; update the FROM instruction in the Dockerfile (the line containing FROM autoharness-base:latest) to reference a specific, immutable tag (e.g., autoharness-base:vX.Y.Z) or introduce a build ARG like BASE_IMAGE_TAG and use FROM autoharness-base:${BASE_IMAGE_TAG} so the base image is pinned and can be overridden during builds.
autoharness/orchestrator/test_orchestrator.py (1)
353-354: Remove extraneous `f` prefix from strings without placeholders. These strings don't contain any format placeholders, so the `f` prefix is unnecessary.
✨ Fix f-string warnings
- print(f"  Start: iii --config iii-config.yaml")
- print(f"  Then: cd workers/orchestrator && uv run orchestrator.py")
+ print("  Start: iii --config iii-config.yaml")
+ print("  Then: cd orchestrator && uv run orchestrator.py")
Note: Also updated the path from `workers/orchestrator` to `orchestrator` to match the actual directory structure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/orchestrator/test_orchestrator.py` around lines 353 - 354, The two print statements using f-strings in test_orchestrator.py are unnecessary since there are no placeholders; remove the leading "f" from the literals in the print calls that output " Start: iii --config iii-config.yaml" and " Then: cd workers/orchestrator && uv run orchestrator.py" and also update the second string to use the correct path ("orchestrator" instead of "workers/orchestrator") so the prints read as plain string literals reflecting the real directory; locate the calls by searching for the print invocations in test_orchestrator.py (the lines that currently start with print(f" Start:") and print(f" Then:")) and make the edits there.
autoharness/plot_progress.py (2)
36-41: Add error handling for HTTP failures. `fetch_json` will raise an unhandled exception on HTTP errors (4xx/5xx) or network failures. Consider catching `urllib.error.HTTPError` and `urllib.error.URLError` to provide clearer error messages.
🛡️ Add error handling
 def fetch_json(url, data=None):
     body = json.dumps(data).encode() if data else None
     headers = {"Content-Type": "application/json"} if body else {}
     req = urllib.request.Request(url, data=body, headers=headers, method="POST" if body else "GET")
-    with urllib.request.urlopen(req, timeout=10) as resp:
-        return json.loads(resp.read())
+    try:
+        with urllib.request.urlopen(req, timeout=10) as resp:
+            return json.loads(resp.read())
+    except urllib.error.HTTPError as e:
+        raise SystemExit(f"HTTP {e.code} from {url}: {e.read().decode()[:200]}")
+    except urllib.error.URLError as e:
+        raise SystemExit(f"Cannot reach {url}: {e.reason}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/plot_progress.py` around lines 36 - 41, fetch_json currently lets urllib exceptions propagate; wrap the urllib.request.urlopen call in a try/except inside fetch_json and catch urllib.error.HTTPError and urllib.error.URLError (optionally a generic Exception fallback). For HTTPError (from urllib.error import HTTPError), include the status code, reason and any response body when available in the raised/logged message; for URLError include the underlying reason (e.g., timeout, DNS). Re-raise a clearer exception (or raise RuntimeError) with context including the URL and method and attach the original exception so callers can inspect it; update the fetch_json function to perform this handling.
88-88: Add `strict=True` to `zip()` for safer iteration. The `zip()` call iterates over `kept_x`, `kept_y`, and `kept_labels`, which are constructed together and should always have matching lengths. Adding `strict=True` will catch any future bugs where lengths diverge.
✨ Proposed fix
- for x, y, label in zip(kept_x, kept_y, kept_labels):
+ for x, y, label in zip(kept_x, kept_y, kept_labels, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/plot_progress.py` at line 88, The loop that unpacks kept_x, kept_y, and kept_labels using zip(kept_x, kept_y, kept_labels) should use strict=True to assert all three sequences have identical lengths; update the for-loop that iterates over kept_x/kept_y/kept_labels (the line with for x, y, label in zip(...)) to call zip with strict=True to fail fast if lengths diverge.
autoharness/orchestrator/orchestrator.py (1)
875-875: Remove extraneous `f` prefix. This f-string has no placeholders.
- return _err({"error": f"Tag not found"}, 404)
+ return _err({"error": "Tag not found"}, 404)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/orchestrator/orchestrator.py` at line 875, The return statement currently calls _err({"error": f"Tag not found"}, 404) with an unnecessary f-string; update the call to use a plain string literal for the error message (e.g., {"error": "Tag not found"}) and remove the leading f in the dictionary value used in the _err(...) return.
autoharness/agent.py (1)
51-54: Debug prints always output zero values. These print statements output hardcoded zeros for `total_tokens` and `estimated_cost`. If token/cost tracking is intended, this needs implementation; otherwise, consider removing these debug artifacts.
🧹 Remove placeholder debug prints
     output = result.stdout + result.stderr
-    total_tokens = 0
-    estimated_cost = 0.0
-    print(f"total_tokens:{total_tokens}")
-    print(f"estimated_cost:{estimated_cost}")
     return output[-10000:] if len(output) > 10000 else output
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autoharness/agent.py` around lines 51 - 54, The two debug print statements printing hardcoded zeros for total_tokens and estimated_cost should be removed (or replaced with real tracking). Delete the lines printing total_tokens and estimated_cost in autoharness/agent.py, and if token/cost tracking is desired, implement logic to compute and update the total_tokens and estimated_cost variables where request/response tokens are counted (update the same total_tokens and estimated_cost variables) and only log or print them after they reflect actual values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autoharness/orchestrator/orchestrator.py`:
- Around line 638-639: adapt() and suggest() are currently reading raw
near-misses via kv.list(SCOPES["near_misses"]) and only filtering by tag, so
they can return entries that no longer meet the active threshold; extract a
shared helper (e.g., get_valid_near_misses(best, tag) or
filter_near_misses_against_best) that takes the current best result and tag,
calls kv.list(SCOPES["near_misses"]) and re-applies the same threshold/distance
re-filtering logic used by the near_misses endpoint, then update adapt() and
suggest() to call this helper (instead of the simple tag-only filter) so both
functions only use near-misses that are still within the valid threshold
relative to the current best.
---
Nitpick comments:
In `@autoharness/agent.py`:
- Around line 51-54: The two debug print statements printing hardcoded zeros for
total_tokens and estimated_cost should be removed (or replaced with real
tracking). Delete the lines printing total_tokens and estimated_cost in
autoharness/agent.py, and if token/cost tracking is desired, implement logic to
compute and update the total_tokens and estimated_cost variables where
request/response tokens are counted (update the same total_tokens and
estimated_cost variables) and only log or print them after they reflect actual
values.
In `@autoharness/orchestrator/orchestrator.py`:
- Line 875: The return statement currently calls _err({"error": f"Tag not
found"}, 404) with an unnecessary f-string; update the call to use a plain
string literal for the error message (e.g., {"error": "Tag not found"}) and
remove the leading f in the dictionary value used in the _err(...) return.
In `@autoharness/orchestrator/test_orchestrator.py`:
- Around line 353-354: The two print statements using f-strings in
test_orchestrator.py are unnecessary since there are no placeholders; remove the
leading "f" from the literals in the print calls that output " Start: iii
--config iii-config.yaml" and " Then: cd workers/orchestrator && uv run
orchestrator.py" and also update the second string to use the correct path
("orchestrator" instead of "workers/orchestrator") so the prints read as plain
string literals reflecting the real directory; locate the calls by searching for
the print invocations in test_orchestrator.py (the lines that currently start
with print(f" Start:") and print(f" Then:")) and make the edits there.
In `@autoharness/plot_progress.py`:
- Around line 36-41: fetch_json currently lets urllib exceptions propagate; wrap
the urllib.request.urlopen call in a try/except inside fetch_json and catch
urllib.error.HTTPError and urllib.error.URLError (optionally a generic Exception
fallback). For HTTPError (from urllib.error import HTTPError), include the
status code, reason and any response body when available in the raised/logged
message; for URLError include the underlying reason (e.g., timeout, DNS).
Re-raise a clearer exception (or raise RuntimeError) with context including the
URL and method and attach the original exception so callers can inspect it;
update the fetch_json function to perform this handling.
- Line 88: The loop that unpacks kept_x, kept_y, and kept_labels using
zip(kept_x, kept_y, kept_labels) should use strict=True to assert all three
sequences have identical lengths; update the for-loop that iterates over
kept_x/kept_y/kept_labels (the line with for x, y, label in zip(...)) to call
zip with strict=True to fail fast if lengths diverge.
In `@autoharness/tasks/file-organizer/environment/Dockerfile`:
- Line 1: The Dockerfile uses an unpinned base image tag
"autoharness-base:latest", which makes builds non-reproducible; update the FROM
instruction in the Dockerfile (the line containing FROM autoharness-base:latest)
to reference a specific, immutable tag (e.g., autoharness-base:vX.Y.Z) or
introduce a build ARG like BASE_IMAGE_TAG and use FROM
autoharness-base:${BASE_IMAGE_TAG} so the base image is pinned and can be
overridden during builds.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fbf124c8-cb3f-4e7a-a13a-34d6ef72eea6
📒 Files selected for processing (18)
- autoharness/Dockerfile.base
- autoharness/README.md
- autoharness/agent.py
- autoharness/bench.sh
- autoharness/iii-config.yaml
- autoharness/orchestrator/orchestrator.py
- autoharness/orchestrator/test_orchestrator.py
- autoharness/plot_progress.py
- autoharness/program.md
- autoharness/tasks/file-organizer/environment/Dockerfile
- autoharness/tasks/file-organizer/task.toml
- autoharness/tasks/file-organizer/tests/test.sh
- autoharness/tasks/fizzbuzz/instruction.md
- autoharness/tasks/fizzbuzz/task.toml
- autoharness/tasks/fizzbuzz/tests/test.sh
- autoharness/tasks/hello-world/instruction.md
- autoharness/tasks/hello-world/task.toml
- autoharness/tasks/hello-world/tests/test.sh
✅ Files skipped from review due to trivial changes (10)
- autoharness/tasks/hello-world/instruction.md
- autoharness/tasks/fizzbuzz/instruction.md
- autoharness/tasks/hello-world/tests/test.sh
- autoharness/Dockerfile.base
- autoharness/iii-config.yaml
- autoharness/tasks/hello-world/task.toml
- autoharness/tasks/file-organizer/tests/test.sh
- autoharness/tasks/file-organizer/task.toml
- autoharness/program.md
- autoharness/tasks/fizzbuzz/task.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- autoharness/tasks/fizzbuzz/tests/test.sh
all_nm = await kv.list(SCOPES["near_misses"])
near_misses = [n for n in all_nm if n.get("tag") == inp.get("tag")]
Internal near-miss consumers bypass threshold re-filtering.
While the near_misses endpoint (lines 380-397) correctly re-filters against the current best, both adapt() (lines 638-639) and suggest() (lines 677-678) fetch raw near-misses filtered only by tag. This means combine-mode selection and suggestion generation may use stale entries that have drifted outside the threshold.
🔧 Proposed fix — extract a shared helper
+async def _get_filtered_near_misses(kv, tag):
+ """Get near-misses re-filtered against current best threshold."""
+ current_best = await kv.get(SCOPES["best"], tag)
+ all_nm = await kv.list(SCOPES["near_misses"])
+ result = []
+ for n in all_nm:
+ if n.get("tag") != tag:
+ continue
+ if current_best:
+ delta_p = n.get("passed", 0) - current_best["passed"]
+ delta_s = n.get("aggregate_score", 0) - current_best["aggregate_score"]
+ if abs(delta_p) > 1 or abs(delta_s) > NEAR_MISS_THRESHOLD:
+ continue
+ result.append(n)
+ return result
# In adapt():
- all_nm = await kv.list(SCOPES["near_misses"])
- near_misses = [n for n in all_nm if n.get("tag") == inp.get("tag")]
+ near_misses = await _get_filtered_near_misses(kv, inp.get("tag"))
# In suggest():
- all_nm = await kv.list(SCOPES["near_misses"])
- near_misses = [n for n in all_nm if n.get("tag") == inp.get("tag")]
+ near_misses = await _get_filtered_near_misses(kv, inp.get("tag"))
Also applies to: 677-678
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@autoharness/orchestrator/orchestrator.py` around lines 638 - 639, adapt() and
suggest() are currently reading raw near-misses via
kv.list(SCOPES["near_misses"]) and only filtering by tag, so they can return
entries that no longer meet the active threshold; extract a shared helper (e.g.,
get_valid_near_misses(best, tag) or filter_near_misses_against_best) that takes
the current best result and tag, calls kv.list(SCOPES["near_misses"]) and
re-applies the same threshold/distance re-filtering logic used by the
near_misses endpoint, then update adapt() and suggest() to call this helper
(instead of the simple tag-only filter) so both functions only use near-misses
that are still within the valid threshold relative to the current best.
Summary
autoharness/ — a self-improving agent harness worker on iii-engine
What's included
- orchestrator/orchestrator.py
- orchestrator/test_orchestrator.py
- agent.py
- program.md
- bench.sh
- plot_progress.py
- tasks/
- iii-config.yaml
- Dockerfile.base
Function groups (26 total)
Tested
- claude-code agent
Test plan
- iii --config iii-config.yaml starts engine
- python3 orchestrator/orchestrator.py registers 26 functions
- python3 orchestrator/test_orchestrator.py passes 50/50
- ./bench.sh <tag> runs full benchmark cycle
- harbor run -p tasks/ -a claude-code --env-file .env passes 3/3
- python3 plot_progress.py --tag <tag> generates chart
Summary by CodeRabbit
New Features
New Content
Documentation
Tests
Chores