Commit 7e0adfa

docs: add one-liner description below heading
1 parent 1aa86c9 commit 7e0adfa

18 files changed

Lines changed: 154 additions & 82 deletions

autoharness/Dockerfile.base

Lines changed: 7 additions & 1 deletion
@@ -12,7 +12,13 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     npm \
     && rm -rf /var/lib/apt/lists/*

-RUN mkdir -p /task/logs
+RUN groupadd -g 1000 taskgroup && \
+    useradd -u 1000 -g taskgroup -m taskuser && \
+    mkdir -p /task/logs && \
+    chown -R taskuser:taskgroup /task
+
 WORKDIR /task
+USER taskuser
+ENV HOME=/home/taskuser

 CMD ["sleep", "infinity"]

autoharness/README.md

Lines changed: 25 additions & 20 deletions
@@ -1,16 +1,18 @@
 # autoharness

-> The forge where agent harnesses are shaped. Autonomous agent engineering on [iii-engine](https://github.com/iii-hq/iii-engine) — structured experiment tracking, adaptive search, and real-time monitoring through Worker/Function/Trigger primitives.
+Self-improving agent harness on [iii-engine](https://github.com/iii-hq/iii-engine).
+
+> The forge where agent harnesses are shaped. Autonomous agent engineering with structured experiment tracking, adaptive search, and real-time monitoring through Worker/Function/Trigger primitives.

 Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats. Every experiment is tracked in a structured state store, the search strategy adapts automatically based on what's working, failures are diagnosed across runs, and you get 26 REST endpoints for live monitoring.

-Inspired by [kevinrgu/autoagent](https://github.com/kevinrgu/autoagent). Built from scratch on iii-engine primitives — same relationship as karpathy/autoresearch to [n-autoresearch](https://github.com/iii-hq/n-autoresearch).
+Inspired by the [autoagent](https://github.com/kevinrgu/autoagent) concept. Built from scratch on iii-engine primitives — same relationship as karpathy/autoresearch to [n-autoresearch](https://github.com/iii-hq/n-autoresearch).

 ![progress](progress.png)

 ## Why This Exists

-The autoagent concept is a great idea executed simply: single file harness, `results.tsv`, hill-climbing, Docker isolation. It works. But after watching it run overnight you notice the gaps:
+The original concept is a great idea executed simply: single file harness, `results.tsv`, hill-climbing, Docker isolation. It works. But after watching it run overnight you notice the gaps:

 **You can't query experiment history.** The TSV is append-only. Want to find all experiments that touched the system prompt and improved? Grep through a flat file. Want the keep rate for the last 10 runs? Count lines manually.

@@ -38,7 +40,7 @@ The metric is total **score** produced by the benchmark's task test suites. The

 ## Architecture

-```
+```text
 +------------------------------------------------------------+
 | Meta-Agent (Claude, GPT-5, Codex, etc.) |
 | Reads program.md, modifies agent.py, runs experiments |
@@ -65,7 +67,7 @@ The orchestrator connects to iii-engine over WebSocket and registers 26 function
 | `task::*` | 5 | Benchmark execution. List available tasks, run individual tasks via Harbor, batch-run all tasks with configurable concurrency, retrieve per-task scores, surface failures with stdout/stderr tails. |
 | `search::*` | 4 | Adaptive strategy. Get the current search mode, override it manually, auto-adapt based on keep rate / crash rate / plateau detection / near-miss availability, suggest concrete next directions with category stats and failure patterns. |
 | `harness::*` | 5 | Harness management. Read the current agent.py with editable-region detection, diff against previous commit, save named snapshots to the KV store, restore any snapshot to disk (auth-protected), list all snapshots. |
-| `report::*` | 5 | Monitoring and export. Full summary with stats and score progression, TSV export compatible with autoagent format, per-task diff between any two experiments showing regressions and improvements, top-N leaderboard, list all tags. |
+| `report::*` | 5 | Monitoring and export. Full summary with stats and score progression, TSV export, per-task diff between any two experiments showing regressions and improvements, top-N leaderboard, list all tags. |

 ## The Experiment Loop

@@ -121,10 +123,13 @@ uv tool install harbor
 cd autoharness

 cat > .env << 'EOF'
-ANTHROPIC_API_KEY=sk-ant-...
+# Default harness (agent.py) uses gpt-5 via OpenAI Agents SDK
+OPENAI_API_KEY=sk-...
+# To use agent-claude.py instead, set HARNESS_PATH=agent-claude.py and:
+# ANTHROPIC_API_KEY=sk-ant-...
 EOF

-docker build -t autoagent-base -f Dockerfile.base .
+docker build -t autoharness-base -f Dockerfile.base .

 # Terminal 1
 iii --config iii-config.yaml
@@ -150,14 +155,14 @@ The meta-agent reads `program.md`, inspects the current harness, runs the benchm

 Each task is a self-contained directory under `tasks/` following [Harbor's task format](https://harborframework.com/docs/tasks):

-```
+```text
 tasks/my-task/
   task.toml -- config (timeouts, resources, metadata)
   instruction.md -- the prompt sent to the agent
   tests/
     test.sh -- verifier entry point, writes reward to /logs/verifier/reward.txt
   environment/ -- optional, only if task needs custom setup
-    Dockerfile -- task container (FROM autoagent-base)
+    Dockerfile -- task container (FROM autoharness-base)
 ```

 The `task.toml` configuration controls timeouts, resource limits, network access, and environment variables:
@@ -242,7 +247,7 @@ All endpoints at `http://localhost:3111`. POST endpoints accept JSON bodies. GET

 ### Experiment Lifecycle

-```
+```http
 POST /api/experiment/setup {"tag": "apr06"}
 POST /api/experiment/register {"tag", "hypothesis", "description", "category", "commit_sha", "diff_summary"}
 POST /api/experiment/complete {"experiment_id", "passed", "total_tasks", "aggregate_score", "task_scores", "duration_seconds", "tokens_used", "estimated_cost"}
@@ -254,7 +259,7 @@ POST /api/experiment/near-misses {"tag", "limit"?}

 ### Task Execution

-```
+```http
 GET /api/task/list
 POST /api/task/run {"task_name", "experiment_id", "timeout"?}
 POST /api/task/batch {"experiment_id", "concurrency"?, "timeout"?, "tasks"?}
@@ -264,7 +269,7 @@ POST /api/task/failures {"experiment_id"}

 ### Search Strategy

-```
+```http
 POST /api/search/suggest {"tag"}
 POST /api/search/strategy {"tag"}
 POST /api/search/set-strategy {"tag", "mode", "reason"}
@@ -273,7 +278,7 @@ POST /api/search/adapt {"tag"}

 ### Harness Management

-```
+```http
 GET /api/harness/read
 GET /api/harness/diff
 POST /api/harness/snapshot {"name", "commit_sha"?, "experiment_id"?}
@@ -283,7 +288,7 @@ GET /api/harness/snapshots

 ### Reports

-```
+```http
 POST /api/report/summary {"tag"}
 POST /api/report/leaderboard {"tag", "limit"?}
 POST /api/report/diff {"experiment_a", "experiment_b"}
@@ -293,7 +298,7 @@ GET /api/report/tags

 ## Security

-The orchestrator supports bearer token authentication via the `AUTOAGENT_AUTH_TOKEN` environment variable. When set, write operations (harness snapshot/restore) require the token in the `Authorization: Bearer <token>` header. When not set, all endpoints are open — suitable for local development.
+The orchestrator supports bearer token authentication via the `AUTOHARNESS_AUTH_TOKEN` environment variable. When set, write operations (harness snapshot/restore) require the token in the `Authorization: Bearer <token>` header. When not set, all endpoints are open — suitable for local development.

 Additional security measures:
 - Path traversal protection on task names (regex validation + resolve + prefix check)
@@ -307,7 +312,7 @@ All configuration is via environment variables:
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `III_WS_URL` | `ws://localhost:49134` | iii-engine WebSocket URL |
-| `AUTOAGENT_AUTH_TOKEN` | (empty) | Bearer token for write endpoints |
+| `AUTOHARNESS_AUTH_TOKEN` | (empty) | Bearer token for write endpoints |
 | `HARNESS_PATH` | `../../agent.py` | Path to the agent harness file |
 | `TASKS_DIR` | `../../tasks` | Path to the benchmark tasks directory |
 | `HARBOR_TIMEOUT` | `600` | Per-task timeout in seconds |
@@ -316,9 +321,9 @@ All configuration is via environment variables:
 | `NEAR_MISS_THRESHOLD` | `0.02` | Score threshold for near-miss detection |
 | `MAX_EXPERIMENTS` | `200` | Budget cap per tag |

-## What Changed from autoagent
+## What Changed from the Original

-| Concern | autoagent | autoharness |
+| Concern | Original | autoharness |
 |---------|-----------|-------------|
 | State management | `results.tsv` flat file, append-only | Structured KV store with scoped queries via iii-engine |
 | Task execution | Serial, one at a time | Parallel with configurable concurrency via Harbor |
@@ -331,11 +336,11 @@ All configuration is via environment variables:
 | Cost tracking | None | Token count and cost estimation per experiment |
 | Observability | None | OpenTelemetry tracing and metrics via iii-engine |
 | Change classification | None | 12 categories with per-category yield tracking |
-| Progress visualization | Single-dataset chart | Multi-dataset overlay chart matching autoagent style |
+| Progress visualization | Single-dataset chart | Multi-dataset overlay chart with multiple datasets |

 ## Project Structure

-```
+```text
 autoharness/
   orchestrator/
     orchestrator.py -- Python worker: 26 functions, 26 triggers

autoharness/agent.py

Lines changed: 6 additions & 4 deletions
@@ -1,5 +1,5 @@
 """
-autoagent-iii harness — the file that the meta-agent modifies.
+autoharness — the agent harness file that the meta-agent modifies.

 Everything above the FIXED ADAPTER line is fair game for the meta-agent.
 Everything below is the Harbor integration and must not be modified
@@ -42,8 +42,10 @@
 def run_shell(command: str) -> str:
     """Execute a shell command and return combined stdout+stderr."""
     try:
+        task_cwd = os.environ.get("TASK_DIR", "/task")
         result = subprocess.run(
-            command, shell=True, capture_output=True, text=True, timeout=120
+            command, shell=True, capture_output=True, text=True, timeout=120,
+            cwd=task_cwd,
         )
         output = result.stdout + result.stderr
         total_tokens = 0
@@ -61,7 +63,7 @@ def create_tools():

 def create_agent():
     return Agent(
-        name="autoagent",
+        name="harness-agent",
         instructions=SYSTEM_PROMPT,
         model=MODEL,
         tools=create_tools(),
@@ -108,7 +110,7 @@ def to_atif(result: RunResult, duration: float, instruction: str) -> dict:
     }


-class AutoAgent:
+class HarnessAgent:
     """Harbor BaseAgent adapter."""

     async def run(self, task_path: str) -> dict:
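Review note on the `run_shell` hunk: the change pins every shell command to `TASK_DIR` (default `/task`) instead of inheriting the harness process's working directory. A minimal standalone sketch of that pattern follows. Only the `subprocess.run` arguments and the `TASK_DIR` lookup come from the diff; the `timeout` parameter and the error strings are assumptions made for illustration, since the rest of the function is not visible in this hunk.

```python
import os
import subprocess


def run_shell(command: str, timeout: int = 120) -> str:
    """Run a shell command in the task directory and return stdout+stderr.

    TASK_DIR and its /task default are taken from the diff; the error
    handling below is hypothetical.
    """
    task_cwd = os.environ.get("TASK_DIR", "/task")
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout, cwd=task_cwd,
        )
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout}s"
    except OSError as exc:
        # Raised when task_cwd does not exist or is not a directory.
        return f"error: {exc}"
    return result.stdout + result.stderr
```

One consequence worth checking in review: with `cwd` set, a missing task directory now fails the call before the command runs, so the directory has to exist in every environment where the harness executes.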

autoharness/bench.sh

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ API="http://localhost:3111"
 TAG="${1:-$(date +%b%d | tr '[:upper:]' '[:lower:]')}"

 echo "=========================================="
-echo " autoagent-iii benchmark runner"
+echo " autoharness benchmark runner"
 echo " tag: $TAG"
 echo "=========================================="

@@ -35,11 +35,11 @@ if ! curl -sf "$API/api/report/tags" >/dev/null 2>&1; then
     III_PID=$!
     sleep 2

-    cd workers/orchestrator
+    cd orchestrator
     python3 orchestrator.py &
     ORCH_PID=$!
     sleep 3
-    cd ../..
+    cd ..

     echo "Started iii-engine (PID $III_PID) + orchestrator (PID $ORCH_PID)"


autoharness/iii-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ modules:
       port: 3111
       host: 127.0.0.1
       cors:
-        allowed_origins: ["*"]
+        allowed_origins: ["http://localhost:3111", "http://127.0.0.1:3111"]

   - class: modules::pubsub::PubSubModule
     config:

autoharness/orchestrator/orchestrator.py

Lines changed: 23 additions & 7 deletions
@@ -16,15 +16,15 @@

 VERSION = "0.1.0"
 WS_URL = os.environ.get("III_WS_URL", "ws://localhost:49134")
-WORKER_NAME = "autoagent-orchestrator"
+WORKER_NAME = "autoharness-orchestrator"
 MAX_CONSECUTIVE_CRASHES = int(os.environ.get("MAX_CONSECUTIVE_CRASHES", "3"))
 NEAR_MISS_THRESHOLD = float(os.environ.get("NEAR_MISS_THRESHOLD", "0.02"))
 MAX_EXPERIMENTS = int(os.environ.get("MAX_EXPERIMENTS", "200"))
 HARNESS_PATH = os.environ.get("HARNESS_PATH", os.path.join(os.path.dirname(__file__), "..", "..", "agent.py"))
 TASKS_DIR = os.environ.get("TASKS_DIR", os.path.join(os.path.dirname(__file__), "..", "..", "tasks"))
 HARBOR_TIMEOUT = int(os.environ.get("HARBOR_TIMEOUT", "600"))
 HARBOR_CONCURRENCY = int(os.environ.get("HARBOR_CONCURRENCY", "4"))
-AUTH_TOKEN = os.environ.get("AUTOAGENT_AUTH_TOKEN", "")
+AUTH_TOKEN = os.environ.get("AUTOHARNESS_AUTH_TOKEN", "")
 VALID_STRATEGIES = {"explore", "exploit", "combine", "ablation"}

 _SAFE_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$")
@@ -80,7 +80,7 @@ def _is_near_miss(improved, best, delta_passed, delta_score):
         not improved
         and best is not None
         and abs(delta_passed) <= 1
-        and delta_score > -NEAR_MISS_THRESHOLD
+        and abs(delta_score) <= NEAR_MISS_THRESHOLD
     )


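Review note on the `_is_near_miss` hunk: the score condition changes from a one-sided lower bound to a symmetric band around the current best. The sketch below puts both predicates side by side using the default `NEAR_MISS_THRESHOLD` of `0.02`. The `_old` and `_new` suffixes are made up here for comparison, and how `improved` is computed lies outside this hunk, so it is passed in as a plain flag.

```python
NEAR_MISS_THRESHOLD = 0.02  # default value from the configuration table


def is_near_miss_old(improved, best, delta_passed, delta_score):
    """Pre-commit predicate: any score above best - threshold qualifies."""
    return (
        not improved
        and best is not None
        and abs(delta_passed) <= 1
        and delta_score > -NEAR_MISS_THRESHOLD
    )


def is_near_miss_new(improved, best, delta_passed, delta_score):
    """Post-commit predicate: the score must lie within +/- threshold of best."""
    return (
        not improved
        and best is not None
        and abs(delta_passed) <= 1
        and abs(delta_score) <= NEAR_MISS_THRESHOLD
    )


best = {"passed": 5, "aggregate_score": 0.60}  # illustrative record

# A non-improving run whose score is far above best: accepted before, rejected now.
far_above = (is_near_miss_old(False, best, -1, 0.5), is_near_miss_new(False, best, -1, 0.5))

# A run exactly at the lower edge: excluded by the strict ">" before, included by "<=" now.
at_edge = (is_near_miss_old(False, best, 0, -0.02), is_near_miss_new(False, best, 0, -0.02))

print(far_above, at_edge)
```

Two behaviors change: large positive score deltas no longer count as near misses, and the lower boundary becomes inclusive.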
@@ -181,7 +181,7 @@ async def setup(data):

     tag_data = {
         "name": tag,
-        "branch": f"autoagent/{tag}",
+        "branch": f"autoharness/{tag}",
         "created_at": datetime.now(timezone.utc).isoformat(),
         "best_passed": 0,
         "best_score": 0.0,
@@ -242,6 +242,8 @@ async def complete(data):
     exp = await kv.get(SCOPES["experiments"], eid)
     if not exp:
         return _err({"error": f"Experiment {eid} not found"}, 404)
+    if exp.get("status") in ("keep", "discard", "crash"):
+        return _ok({"experiment_id": eid, "status": exp["status"], "action": "no_op", "reason": "already terminal"})

     async with _tag_lock(exp["tag"]):
         best = await kv.get(SCOPES["best"], exp["tag"])
@@ -327,6 +329,8 @@ async def crash(data):
     exp = await kv.get(SCOPES["experiments"], eid)
     if not exp:
         return _err({"error": f"Experiment {eid} not found"}, 404)
+    if exp.get("status") in ("keep", "discard", "crash"):
+        return _ok({"experiment_id": eid, "status": exp["status"], "action": "no_op", "reason": "already terminal"})

     exp["status"] = "crash"
     exp["error"] = inp.get("error", "unknown")
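Review note on the `complete` and `crash` hunks: both handlers now return early when the experiment already has a terminal status, which makes repeated calls idempotent. The sketch below shows that guard against a plain dictionary standing in for the KV store. `mark_crash`, the `status_code` field, and the `"crashed"` action value are hypothetical names for illustration; the terminal status tuple and the `no_op` response fields follow the diff.

```python
TERMINAL_STATUSES = ("keep", "discard", "crash")


def mark_crash(store: dict, experiment_id: str, error: str = "unknown") -> dict:
    """Record a crash unless the experiment is already finalized.

    `store` is an in-memory stand-in for the orchestrator's KV store.
    """
    exp = store.get(experiment_id)
    if exp is None:
        return {"status_code": 404, "error": f"Experiment {experiment_id} not found"}
    if exp.get("status") in TERMINAL_STATUSES:
        # Already finalized: report the stored status and change nothing.
        return {
            "status_code": 200,
            "experiment_id": experiment_id,
            "status": exp["status"],
            "action": "no_op",
            "reason": "already terminal",
        }
    exp["status"] = "crash"
    exp["error"] = error
    return {
        "status_code": 200,
        "experiment_id": experiment_id,
        "status": "crash",
        "action": "crashed",  # hypothetical value, not from the diff
    }


store = {"exp-1": {"status": "running"}}
first = mark_crash(store, "exp-1", "harbor timeout")
second = mark_crash(store, "exp-1", "retry after network error")
print(first["action"], second["action"], store["exp-1"]["error"])
```

The second call leaves the stored error untouched, which is the property a retrying client relies on.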
@@ -375,8 +379,19 @@ async def best(data):

 async def near_misses(data):
     inp = _unwrap_input(data)
+    tag = inp.get("tag")
+    current_best = await kv.get(SCOPES["best"], tag)
     all_nm = await kv.list(SCOPES["near_misses"])
-    tag_nm = [n for n in all_nm if n.get("tag") == inp.get("tag")]
+    tag_nm = []
+    for n in all_nm:
+        if n.get("tag") != tag:
+            continue
+        if current_best:
+            delta_p = n.get("passed", 0) - current_best["passed"]
+            delta_s = n.get("aggregate_score", 0) - current_best["aggregate_score"]
+            if abs(delta_p) > 1 or abs(delta_s) > NEAR_MISS_THRESHOLD:
+                continue
+        tag_nm.append(n)
     filtered = sorted(tag_nm, key=lambda n: n.get("delta_score", 0), reverse=True)
     limit = inp.get("limit", 20)
     return _ok({"near_misses": filtered[:limit], "total": len(filtered)})
@@ -444,15 +459,15 @@ async def run_single(data):

     score = 0.0
     reward_file = task_path / "logs" / "reward.txt"
-    if reward_file.exists():
+    if reward_file.exists() and reward_file.stat().st_mtime >= start:
         try:
             score = float(reward_file.read_text().strip())
         except ValueError:
             pass

     trajectory = None
     traj_file = task_path / "logs" / "agent" / "trajectory.json"
-    if traj_file.exists():
+    if traj_file.exists() and traj_file.stat().st_mtime >= start:
         try:
             trajectory = json.loads(traj_file.read_text())
         except json.JSONDecodeError:
@@ -612,6 +627,7 @@ async def adapt(data):
     inp = _unwrap_input(data)
     all_exps = await kv.list(SCOPES["experiments"])
     tag_exps = [e for e in all_exps if e.get("tag") == inp.get("tag") and e.get("status") != "running"]
+    tag_exps.sort(key=lambda e: e.get("started_at", ""))

     if len(tag_exps) < 5:
         return _ok({"mode": "explore", "reason": "too few experiments to adapt"})
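Review note on the `run_single` hunk in this file: `reward.txt` and `trajectory.json` are now read only when their modification time is at or after `start`, so output left behind by an earlier run is not scored as if it were new. The sketch below isolates that staleness check. It assumes `start` is a `time.time()` timestamp captured before the task launches, which the diff does not show, and `read_fresh_score` is a hypothetical helper name.

```python
import os
import tempfile
import time
from pathlib import Path


def read_fresh_score(reward_file: Path, start: float) -> float:
    """Return the score in reward_file, or 0.0 if it is missing, stale, or malformed."""
    if reward_file.exists() and reward_file.stat().st_mtime >= start:
        try:
            return float(reward_file.read_text().strip())
        except ValueError:
            pass
    return 0.0


with tempfile.TemporaryDirectory() as tmp:
    reward = Path(tmp) / "reward.txt"
    reward.write_text("0.75\n")
    start = time.time()

    # Backdate the file to simulate a leftover from a previous run.
    os.utime(reward, (start - 60, start - 60))
    stale = read_fresh_score(reward, start)

    # Move the timestamp forward to simulate the verifier rewriting it.
    os.utime(reward, (start + 1, start + 1))
    fresh = read_fresh_score(reward, start)

print(stale, fresh)
```

One caveat to weigh: on filesystems with coarse timestamp resolution, a file written shortly after `start` can carry a truncated modification time that compares as older, so a fresh result could be skipped.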

autoharness/orchestrator/test_orchestrator.py

Lines changed: 6 additions & 2 deletions
@@ -1,5 +1,5 @@
 """
-Integration tests for autoagent-iii orchestrator.
+Integration tests for autoharness orchestrator.

 Requires iii-engine running at localhost:3111 (REST) / localhost:49134 (WS).
 Start with: iii --config iii-config.yaml
@@ -8,6 +8,7 @@
 """

 import json
+import os
 import sys
 import time
 import urllib.request
@@ -22,6 +23,9 @@ def api(path, data=None, method="POST"):
     url = f"{BASE}{path}"
     body = json.dumps(data).encode() if data else None
     headers = {"Content-Type": "application/json"} if body else {}
+    token = os.environ.get("AUTOHARNESS_AUTH_TOKEN")
+    if token:
+        headers["Authorization"] = f"Bearer {token}"
     req = urllib.request.Request(url, data=body, headers=headers, method=method)
     try:
         with urllib.request.urlopen(req, timeout=30) as resp:
@@ -339,7 +343,7 @@ def test_harness_restore():
 def main():
     global PASS, FAIL

-    print(f"autoagent-iii orchestrator tests (tag={TAG})")
+    print(f"autoharness orchestrator tests (tag={TAG})")
     print(f"API: {BASE}")

     try:
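Review note on the `api` helper hunk: the test client now forwards `AUTOHARNESS_AUTH_TOKEN` as a bearer token when the variable is set, so the suite can run against an orchestrator with authentication enabled. The header logic in isolation looks roughly like the sketch below; `build_headers` is a hypothetical name, not a function in the test file.

```python
import os


def build_headers(has_body: bool) -> dict:
    """Build request headers, adding a bearer token only when one is configured."""
    headers = {"Content-Type": "application/json"} if has_body else {}
    token = os.environ.get("AUTOHARNESS_AUTH_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


os.environ.pop("AUTOHARNESS_AUTH_TOKEN", None)
open_mode = build_headers(has_body=False)

os.environ["AUTOHARNESS_AUTH_TOKEN"] = "example-token"  # placeholder value
auth_mode = build_headers(has_body=True)

print(open_mode, auth_mode)
```

Because an empty or unset variable adds no header, the same tests keep working in the open local-development mode the README describes.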
