autoharness/README.md
Lines changed: 25 additions & 20 deletions
@@ -1,16 +1,18 @@
 # autoharness

-> The forge where agent harnesses are shaped. Autonomous agent engineering on [iii-engine](https://github.com/iii-hq/iii-engine) — structured experiment tracking, adaptive search, and real-time monitoring through Worker/Function/Trigger primitives.
+Self-improving agent harness on [iii-engine](https://github.com/iii-hq/iii-engine).
+
+> The forge where agent harnesses are shaped. Autonomous agent engineering with structured experiment tracking, adaptive search, and real-time monitoring through Worker/Function/Trigger primitives.

 Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats. Every experiment is tracked in a structured state store, the search strategy adapts automatically based on what's working, failures are diagnosed across runs, and you get 26 REST endpoints for live monitoring.

-Inspired by [kevinrgu/autoagent](https://github.com/kevinrgu/autoagent). Built from scratch on iii-engine primitives — same relationship as karpathy/autoresearch to [n-autoresearch](https://github.com/iii-hq/n-autoresearch).
+Inspired by the [autoagent](https://github.com/kevinrgu/autoagent) concept. Built from scratch on iii-engine primitives — same relationship as karpathy/autoresearch to [n-autoresearch](https://github.com/iii-hq/n-autoresearch).



 ## Why This Exists

-The autoagent concept is a great idea executed simply: single file harness, `results.tsv`, hill-climbing, Docker isolation. It works. But after watching it run overnight you notice the gaps:
+The original concept is a great idea executed simply: single file harness, `results.tsv`, hill-climbing, Docker isolation. It works. But after watching it run overnight you notice the gaps:

 **You can't query experiment history.** The TSV is append-only. Want to find all experiments that touched the system prompt and improved? Grep through a flat file. Want the keep rate for the last 10 runs? Count lines manually.

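The two queries named above (experiments that touched the system prompt and improved, and the keep rate for the last 10 runs) become short functions once experiments are structured records rather than TSV rows. The sketch below is illustrative only: the `Experiment` fields (`category`, `kept`, `score_delta`) and the sample data are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    # Hypothetical record shape; field names are assumptions for illustration.
    experiment_id: str
    category: str       # e.g. "system_prompt" or "tools"
    kept: bool          # True if the change was kept after benchmarking
    score_delta: float  # score change relative to the previous best


def keep_rate(history: list[Experiment], last_n: int = 10) -> float:
    """Fraction of the most recent `last_n` experiments that were kept."""
    recent = history[-last_n:]
    if not recent:
        return 0.0
    return sum(e.kept for e in recent) / len(recent)


def improved_in_category(history: list[Experiment], category: str) -> list[Experiment]:
    """Experiments in `category` that were kept and raised the score."""
    return [
        e for e in history
        if e.category == category and e.kept and e.score_delta > 0
    ]


# Made-up sample history, oldest first.
history = [
    Experiment("exp-1", "system_prompt", True, 0.05),
    Experiment("exp-2", "tools", False, -0.02),
    Experiment("exp-3", "system_prompt", False, 0.0),
    Experiment("exp-4", "system_prompt", True, 0.03),
]

print(keep_rate(history))
print([e.experiment_id for e in improved_in_category(history, "system_prompt")])
```

Against a flat file, each of these would mean re-parsing every line; with records in a state store they are ordinary filters.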
@@ -38,7 +40,7 @@ The metric is total **score** produced by the benchmark's task test suites. The
@@ -65,7 +67,7 @@ The orchestrator connects to iii-engine over WebSocket and registers 26 function
 |`task::*`| 5 | Benchmark execution. List available tasks, run individual tasks via Harbor, batch-run all tasks with configurable concurrency, retrieve per-task scores, surface failures with stdout/stderr tails. |
 |`search::*`| 4 | Adaptive strategy. Get the current search mode, override it manually, auto-adapt based on keep rate / crash rate / plateau detection / near-miss availability, suggest concrete next directions with category stats and failure patterns. |
 |`harness::*`| 5 | Harness management. Read the current agent.py with editable-region detection, diff against previous commit, save named snapshots to the KV store, restore any snapshot to disk (auth-protected), list all snapshots. |
-|`report::*`| 5 | Monitoring and export. Full summary with stats and score progression, TSV export compatible with autoagent format, per-task diff between any two experiments showing regressions and improvements, top-N leaderboard, list all tags. |
+|`report::*`| 5 | Monitoring and export. Full summary with stats and score progression, TSV export, per-task diff between any two experiments showing regressions and improvements, top-N leaderboard, list all tags. |

 ## The Experiment Loop

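The loop described in the introduction (modify the harness, run the benchmark, check the score, keep or discard the change, repeat) is a greedy hill climb. A minimal sketch of that control flow, assuming hypothetical `propose_change` and `run_benchmark` callables that stand in for the agent's edit step and the benchmark run; neither name is part of the project's API, and the real orchestrator also records each experiment and adapts its search strategy:

```python
import random
from typing import Callable


def hill_climb(
    harness: str,
    propose_change: Callable[[str], str],   # hypothetical: returns a modified harness
    run_benchmark: Callable[[str], float],  # hypothetical: returns an aggregate score
    iterations: int,
) -> tuple[str, float, list[bool]]:
    """Greedy keep-or-discard loop: keep a candidate only if it beats the best score."""
    best_score = run_benchmark(harness)
    decisions: list[bool] = []
    for _ in range(iterations):
        candidate = propose_change(harness)
        score = run_benchmark(candidate)
        keep = score > best_score
        if keep:
            # Keep: the candidate becomes the new baseline.
            harness, best_score = candidate, score
        # Discard: the baseline is left unchanged.
        decisions.append(keep)
    return harness, best_score, decisions


# Toy stand-ins: the "harness" is a string and its score is its number of "x" characters.
rng = random.Random(0)
final, score, decisions = hill_climb(
    harness="x",
    propose_change=lambda h: h + rng.choice("xy"),
    run_benchmark=lambda h: float(h.count("x")),
    iterations=20,
)
print(final, score, sum(decisions))
```

In the toy run, appending `"y"` never raises the score, so those candidates are discarded and the final harness contains only kept changes.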
@@ -121,10 +123,13 @@ uv tool install harbor
 cd autoharness

 cat > .env << 'EOF'
-ANTHROPIC_API_KEY=sk-ant-...
+# Default harness (agent.py) uses gpt-5 via OpenAI Agents SDK
+OPENAI_API_KEY=sk-...
+# To use agent-claude.py instead, set HARNESS_PATH=agent-claude.py and:
The `task.toml` configuration controls timeouts, resource limits, network access, and environment variables:
@@ -242,7 +247,7 @@ All endpoints at `http://localhost:3111`. POST endpoints accept JSON bodies. GET

 ### Experiment Lifecycle

-```
+```http
 POST /api/experiment/setup {"tag": "apr06"}
 POST /api/experiment/register {"tag", "hypothesis", "description", "category", "commit_sha", "diff_summary"}
 POST /api/experiment/complete {"experiment_id", "passed", "total_tasks", "aggregate_score", "task_scores", "duration_seconds", "tokens_used", "estimated_cost"}
@@ -254,7 +259,7 @@ POST /api/experiment/near-misses {"tag", "limit"?}

 ### Task Execution

-```
+```http
 GET /api/task/list
 POST /api/task/run {"task_name", "experiment_id", "timeout"?}
 POST /api/task/batch {"experiment_id", "concurrency"?, "timeout"?, "tasks"?}
@@ -264,7 +269,7 @@ POST /api/task/failures {"experiment_id"}

 ### Search Strategy

-```
+```http
 POST /api/search/suggest {"tag"}
 POST /api/search/strategy {"tag"}
 POST /api/search/set-strategy {"tag", "mode", "reason"}
@@ -273,7 +278,7 @@ POST /api/search/adapt {"tag"}

 ### Harness Management

-```
+```http
 GET /api/harness/read
 GET /api/harness/diff
 POST /api/harness/snapshot {"name", "commit_sha"?, "experiment_id"?}
@@ -283,7 +288,7 @@ GET /api/harness/snapshots

 ### Reports

-```
+```http
 POST /api/report/summary {"tag"}
 POST /api/report/leaderboard {"tag", "limit"?}
 POST /api/report/diff {"experiment_a", "experiment_b"}
@@ -293,7 +298,7 @@ GET /api/report/tags

 ## Security

-The orchestrator supports bearer token authentication via the `AUTOAGENT_AUTH_TOKEN` environment variable. When set, write operations (harness snapshot/restore) require the token in the `Authorization: Bearer <token>` header. When not set, all endpoints are open — suitable for local development.
+The orchestrator supports bearer token authentication via the `AUTOHARNESS_AUTH_TOKEN` environment variable. When set, write operations (harness snapshot/restore) require the token in the `Authorization: Bearer <token>` header. When not set, all endpoints are open — suitable for local development.
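A client might attach the token along these lines. This is a sketch under stated assumptions, not the project's client: the helper name `build_request` and the snapshot name `"baseline"` are hypothetical, while the base URL, the `/api/harness/snapshot` path, and its `name` field come from the endpoint listing in this README. The request is only constructed here; sending it needs a running orchestrator.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:3111"  # base URL given in the REST section of this README


def build_request(path: str, payload: dict, token: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST request, adding a bearer token header only when one is provided."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Read the token from the environment variable named in the Security section.
token = os.environ.get("AUTOHARNESS_AUTH_TOKEN")

# "baseline" is a made-up snapshot name used only for illustration.
req = build_request("/api/harness/snapshot", {"name": "baseline"}, token)
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen`. When the variable is unset the header is omitted, which matches the open local-development mode described above.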