iii-hq
diff --git a/autoharness/Dockerfile.base b/autoharness/Dockerfile.base
Lines changed: 7 additions & 1 deletion
diff --git a/autoharness/README.md b/autoharness/README.md
Lines changed: 25 additions & 20 deletions
diff --git a/autoharness/agent.py b/autoharness/agent.py
Lines changed: 6 additions & 4 deletions
diff --git a/autoharness/bench.sh b/autoharness/bench.sh
Lines changed: 3 additions & 3 deletions
diff --git a/autoharness/iii-config.yaml b/autoharness/iii-config.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/autoharness/orchestrator/orchestrator.py b/autoharness/orchestrator/orchestrator.py
Lines changed: 23 additions & 7 deletions
diff --git a/autoharness/orchestrator/test_orchestrator.py b/autoharness/orchestrator/test_orchestrator.py
Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,13 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     npm \
     && rm -rf /var/lib/apt/lists/*
 
-RUN mkdir -p /task/logs
+RUN groupadd -g 1000 taskgroup && \
+    useradd -u 1000 -g taskgroup -m taskuser && \
+    mkdir -p /task/logs && \
+    chown -R taskuser:taskgroup /task
+
 WORKDIR /task
+USER taskuser
+ENV HOME=/home/taskuser
 
 CMD ["sleep", "infinity"]
@@ -1,16 +1,18 @@
 # autoharness
 
-> The forge where agent harnesses are shaped. Autonomous agent engineering on [iii-engine](https://github.com/iii-hq/iii-engine) — structured experiment tracking, adaptive search, and real-time monitoring through Worker/Function/Trigger primitives.
+Self-improving agent harness on [iii-engine](https://github.com/iii-hq/iii-engine).
+
+> The forge where agent harnesses are shaped. Autonomous agent engineering with structured experiment tracking, adaptive search, and real-time monitoring through Worker/Function/Trigger primitives.
 
 Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats. Every experiment is tracked in a structured state store, the search strategy adapts automatically based on what's working, failures are diagnosed across runs, and you get 26 REST endpoints for live monitoring.
 
-Inspired by [kevinrgu/autoagent](https://github.com/kevinrgu/autoagent). Built from scratch on iii-engine primitives — same relationship as karpathy/autoresearch to [n-autoresearch](https://github.com/iii-hq/n-autoresearch).
+Inspired by the [autoagent](https://github.com/kevinrgu/autoagent) concept. Built from scratch on iii-engine primitives — same relationship as karpathy/autoresearch to [n-autoresearch](https://github.com/iii-hq/n-autoresearch).
 
 ![progress](progress.png)
 
 ## Why This Exists
 
-The autoagent concept is a great idea executed simply: single file harness, `results.tsv`, hill-climbing, Docker isolation. It works. But after watching it run overnight you notice the gaps:
+The original concept is a great idea executed simply: single file harness, `results.tsv`, hill-climbing, Docker isolation. It works. But after watching it run overnight you notice the gaps:
 
 **You can't query experiment history.** The TSV is append-only. Want to find all experiments that touched the system prompt and improved? Grep through a flat file. Want the keep rate for the last 10 runs? Count lines manually.
 
@@ -38,7 +40,7 @@ The metric is total **score** produced by the benchmark's task test suites. The
 
 ## Architecture
 
-```
+```text
 +------------------------------------------------------------+
 |  Meta-Agent (Claude, GPT-5, Codex, etc.)                   |
 |  Reads program.md, modifies agent.py, runs experiments     |
@@ -65,7 +67,7 @@ The orchestrator connects to iii-engine over WebSocket and registers 26 function
 | `task::*` | 5 | Benchmark execution. List available tasks, run individual tasks via Harbor, batch-run all tasks with configurable concurrency, retrieve per-task scores, surface failures with stdout/stderr tails. |
 | `search::*` | 4 | Adaptive strategy. Get the current search mode, override it manually, auto-adapt based on keep rate / crash rate / plateau detection / near-miss availability, suggest concrete next directions with category stats and failure patterns. |
 | `harness::*` | 5 | Harness management. Read the current agent.py with editable-region detection, diff against previous commit, save named snapshots to the KV store, restore any snapshot to disk (auth-protected), list all snapshots. |
-| `report::*` | 5 | Monitoring and export. Full summary with stats and score progression, TSV export compatible with autoagent format, per-task diff between any two experiments showing regressions and improvements, top-N leaderboard, list all tags. |
+| `report::*` | 5 | Monitoring and export. Full summary with stats and score progression, TSV export, per-task diff between any two experiments showing regressions and improvements, top-N leaderboard, list all tags. |
 
 ## The Experiment Loop
 
@@ -121,10 +123,13 @@ uv tool install harbor
 cd autoharness
 
 cat > .env << 'EOF'
-ANTHROPIC_API_KEY=sk-ant-...
+# Default harness (agent.py) uses gpt-5 via OpenAI Agents SDK
+OPENAI_API_KEY=sk-...
+# To use agent-claude.py instead, set HARNESS_PATH=agent-claude.py and:
+# ANTHROPIC_API_KEY=sk-ant-...
 EOF
 
-docker build -t autoagent-base -f Dockerfile.base .
+docker build -t autoharness-base -f Dockerfile.base .
 
 # Terminal 1
 iii --config iii-config.yaml
@@ -150,14 +155,14 @@ The meta-agent reads `program.md`, inspects the current harness, runs the benchm
 
 Each task is a self-contained directory under `tasks/` following [Harbor's task format](https://harborframework.com/docs/tasks):
 
-```
+```text
 tasks/my-task/
   task.toml           -- config (timeouts, resources, metadata)
   instruction.md      -- the prompt sent to the agent
   tests/
     test.sh           -- verifier entry point, writes reward to /logs/verifier/reward.txt
   environment/        -- optional, only if task needs custom setup
-    Dockerfile        -- task container (FROM autoagent-base)
+    Dockerfile        -- task container (FROM autoharness-base)
 ```
 
 The `task.toml` configuration controls timeouts, resource limits, network access, and environment variables:
@@ -242,7 +247,7 @@ All endpoints at `http://localhost:3111`. POST endpoints accept JSON bodies. GET
 
 ### Experiment Lifecycle
 
-```
+```http
 POST /api/experiment/setup        {"tag": "apr06"}
 POST /api/experiment/register     {"tag", "hypothesis", "description", "category", "commit_sha", "diff_summary"}
 POST /api/experiment/complete     {"experiment_id", "passed", "total_tasks", "aggregate_score", "task_scores", "duration_seconds", "tokens_used", "estimated_cost"}
@@ -254,7 +259,7 @@ POST /api/experiment/near-misses  {"tag", "limit"?}
 
 ### Task Execution
 
-```
+```http
 GET  /api/task/list
 POST /api/task/run                {"task_name", "experiment_id", "timeout"?}
 POST /api/task/batch              {"experiment_id", "concurrency"?, "timeout"?, "tasks"?}
@@ -264,7 +269,7 @@ POST /api/task/failures           {"experiment_id"}
 
 ### Search Strategy
 
-```
+```http
 POST /api/search/suggest          {"tag"}
 POST /api/search/strategy         {"tag"}
 POST /api/search/set-strategy     {"tag", "mode", "reason"}
@@ -273,7 +278,7 @@ POST /api/search/adapt            {"tag"}
 
 ### Harness Management
 
-```
+```http
 GET  /api/harness/read
 GET  /api/harness/diff
 POST /api/harness/snapshot        {"name", "commit_sha"?, "experiment_id"?}
@@ -283,7 +288,7 @@ GET  /api/harness/snapshots
 
 ### Reports
 
-```
+```http
 POST /api/report/summary          {"tag"}
 POST /api/report/leaderboard      {"tag", "limit"?}
 POST /api/report/diff             {"experiment_a", "experiment_b"}
@@ -293,7 +298,7 @@ GET  /api/report/tags
 
 ## Security
 
-The orchestrator supports bearer token authentication via the `AUTOAGENT_AUTH_TOKEN` environment variable. When set, write operations (harness snapshot/restore) require the token in the `Authorization: Bearer <token>` header. When not set, all endpoints are open — suitable for local development.
+The orchestrator supports bearer token authentication via the `AUTOHARNESS_AUTH_TOKEN` environment variable. When set, write operations (harness snapshot/restore) require the token in the `Authorization: Bearer <token>` header. When not set, all endpoints are open — suitable for local development.
 
 Additional security measures:
 - Path traversal protection on task names (regex validation + resolve + prefix check)
@@ -307,7 +312,7 @@ All configuration is via environment variables:
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `III_WS_URL` | `ws://localhost:49134` | iii-engine WebSocket URL |
-| `AUTOAGENT_AUTH_TOKEN` | (empty) | Bearer token for write endpoints |
+| `AUTOHARNESS_AUTH_TOKEN` | (empty) | Bearer token for write endpoints |
 | `HARNESS_PATH` | `../../agent.py` | Path to the agent harness file |
 | `TASKS_DIR` | `../../tasks` | Path to the benchmark tasks directory |
 | `HARBOR_TIMEOUT` | `600` | Per-task timeout in seconds |
@@ -316,9 +321,9 @@ All configuration is via environment variables:
 | `NEAR_MISS_THRESHOLD` | `0.02` | Score threshold for near-miss detection |
 | `MAX_EXPERIMENTS` | `200` | Budget cap per tag |
 
-## What Changed from autoagent
+## What Changed from the Original
 
-| Concern | autoagent | autoharness |
+| Concern | Original | autoharness |
 |---------|-----------|-------------|
 | State management | `results.tsv` flat file, append-only | Structured KV store with scoped queries via iii-engine |
 | Task execution | Serial, one at a time | Parallel with configurable concurrency via Harbor |
@@ -331,11 +336,11 @@ All configuration is via environment variables:
 | Cost tracking | None | Token count and cost estimation per experiment |
 | Observability | None | OpenTelemetry tracing and metrics via iii-engine |
 | Change classification | None | 12 categories with per-category yield tracking |
-| Progress visualization | Single-dataset chart | Multi-dataset overlay chart matching autoagent style |
+| Progress visualization | Single-dataset chart | Multi-dataset overlay chart |
 
 ## Project Structure
 
-```
+```text
 autoharness/
   orchestrator/
     orchestrator.py               -- Python worker: 26 functions, 26 triggers
 
@@ -1,5 +1,5 @@
 """
-autoagent-iii harness — the file that the meta-agent modifies.
+autoharness — the agent harness file that the meta-agent modifies.
 
 Everything above the FIXED ADAPTER line is fair game for the meta-agent.
 Everything below is the Harbor integration and must not be modified
@@ -42,8 +42,10 @@
 def run_shell(command: str) -> str:
     """Execute a shell command and return combined stdout+stderr."""
     try:
+        task_cwd = os.environ.get("TASK_DIR", "/task")
         result = subprocess.run(
-            command, shell=True, capture_output=True, text=True, timeout=120
+            command, shell=True, capture_output=True, text=True, timeout=120,
+            cwd=task_cwd,
         )
         output = result.stdout + result.stderr
         total_tokens = 0
@@ -61,7 +63,7 @@ def create_tools():
 
 def create_agent():
     return Agent(
-        name="autoagent",
+        name="harness-agent",
         instructions=SYSTEM_PROMPT,
         model=MODEL,
         tools=create_tools(),
@@ -108,7 +110,7 @@ def to_atif(result: RunResult, duration: float, instruction: str) -> dict:
     }
 
 
-class AutoAgent:
+class HarnessAgent:
     """Harbor BaseAgent adapter."""
 
     async def run(self, task_path: str) -> dict:
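
The `cwd=task_cwd` change in `run_shell` makes relative paths in agent-issued commands resolve against the task directory rather than wherever the harness process was started. A minimal sketch of that effect, assuming a POSIX shell is available; `run_in_dir` is a hypothetical helper written for illustration and is not part of `agent.py`:

```python
import os
import subprocess
import tempfile


def run_in_dir(command: str, cwd: str) -> str:
    """Run a shell command with an explicit working directory (illustrative helper)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=120,
        cwd=cwd,
    )
    return result.stdout + result.stderr


with tempfile.TemporaryDirectory() as task_dir:
    # The relative path out.txt resolves against task_dir, not the caller's cwd.
    run_in_dir("echo hello > out.txt", task_dir)
    created = os.path.exists(os.path.join(task_dir, "out.txt"))
    echoed = run_in_dir("echo hi", task_dir).strip()
```

In the diff the directory comes from the `TASK_DIR` environment variable with `/task` as the fallback, which matches the `WORKDIR /task` set in `Dockerfile.base`.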
 
@@ -5,7 +5,7 @@ API="http://localhost:3111"
 TAG="${1:-$(date +%b%d | tr '[:upper:]' '[:lower:]')}"
 
 echo "=========================================="
-echo "  autoagent-iii benchmark runner"
+echo "  autoharness benchmark runner"
 echo "  tag: $TAG"
 echo "=========================================="
 
@@ -35,11 +35,11 @@ if ! curl -sf "$API/api/report/tags" >/dev/null 2>&1; then
     III_PID=$!
     sleep 2
 
-    cd workers/orchestrator
+    cd orchestrator
     python3 orchestrator.py &
     ORCH_PID=$!
     sleep 3
-    cd ../..
+    cd ..
 
     echo "Started iii-engine (PID $III_PID) + orchestrator (PID $ORCH_PID)"
 
 
@@ -12,7 +12,7 @@ modules:
       port: 3111
       host: 127.0.0.1
       cors:
-        allowed_origins: ["*"]
+        allowed_origins: ["http://localhost:3111", "http://127.0.0.1:3111"]
 
   - class: modules::pubsub::PubSubModule
     config:
 
@@ -16,15 +16,15 @@
 
 VERSION = "0.1.0"
 WS_URL = os.environ.get("III_WS_URL", "ws://localhost:49134")
-WORKER_NAME = "autoagent-orchestrator"
+WORKER_NAME = "autoharness-orchestrator"
 MAX_CONSECUTIVE_CRASHES = int(os.environ.get("MAX_CONSECUTIVE_CRASHES", "3"))
 NEAR_MISS_THRESHOLD = float(os.environ.get("NEAR_MISS_THRESHOLD", "0.02"))
 MAX_EXPERIMENTS = int(os.environ.get("MAX_EXPERIMENTS", "200"))
 HARNESS_PATH = os.environ.get("HARNESS_PATH", os.path.join(os.path.dirname(__file__), "..", "..", "agent.py"))
 TASKS_DIR = os.environ.get("TASKS_DIR", os.path.join(os.path.dirname(__file__), "..", "..", "tasks"))
 HARBOR_TIMEOUT = int(os.environ.get("HARBOR_TIMEOUT", "600"))
 HARBOR_CONCURRENCY = int(os.environ.get("HARBOR_CONCURRENCY", "4"))
-AUTH_TOKEN = os.environ.get("AUTOAGENT_AUTH_TOKEN", "")
+AUTH_TOKEN = os.environ.get("AUTOHARNESS_AUTH_TOKEN", "")
 VALID_STRATEGIES = {"explore", "exploit", "combine", "ablation"}
 
 _SAFE_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$")
@@ -80,7 +80,7 @@ def _is_near_miss(improved, best, delta_passed, delta_score):
         not improved
         and best is not None
         and abs(delta_passed) <= 1
-        and delta_score > -NEAR_MISS_THRESHOLD
+        and abs(delta_score) <= NEAR_MISS_THRESHOLD
     )
 
 
@@ -181,7 +181,7 @@ async def setup(data):
 
         tag_data = {
             "name": tag,
-            "branch": f"autoagent/{tag}",
+            "branch": f"autoharness/{tag}",
             "created_at": datetime.now(timezone.utc).isoformat(),
             "best_passed": 0,
             "best_score": 0.0,
@@ -242,6 +242,8 @@ async def complete(data):
         exp = await kv.get(SCOPES["experiments"], eid)
         if not exp:
             return _err({"error": f"Experiment {eid} not found"}, 404)
+        if exp.get("status") in ("keep", "discard", "crash"):
+            return _ok({"experiment_id": eid, "status": exp["status"], "action": "no_op", "reason": "already terminal"})
 
         async with _tag_lock(exp["tag"]):
             best = await kv.get(SCOPES["best"], exp["tag"])
@@ -327,6 +329,8 @@ async def crash(data):
         exp = await kv.get(SCOPES["experiments"], eid)
         if not exp:
             return _err({"error": f"Experiment {eid} not found"}, 404)
+        if exp.get("status") in ("keep", "discard", "crash"):
+            return _ok({"experiment_id": eid, "status": exp["status"], "action": "no_op", "reason": "already terminal"})
 
         exp["status"] = "crash"
         exp["error"] = inp.get("error", "unknown")
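
The guard added to `complete` and `crash` makes both handlers idempotent: once an experiment has reached `keep`, `discard`, or `crash`, a repeated or late call returns the stored status instead of overwriting it. A rough in-memory sketch of the same pattern, assuming a plain dict in place of the KV store; `mark_crash`, the `status_code` field, and the `"crashed"` action value are hypothetical names used only for this illustration:

```python
TERMINAL_STATUSES = ("keep", "discard", "crash")


def mark_crash(store: dict, eid: str, error: str = "unknown") -> dict:
    """Mark an experiment as crashed unless it already has a terminal status."""
    exp = store.get(eid)
    if not exp:
        return {"status_code": 404, "error": f"Experiment {eid} not found"}
    if exp.get("status") in TERMINAL_STATUSES:
        # Already finished: report the existing status and change nothing.
        return {"status_code": 200, "experiment_id": eid,
                "status": exp["status"], "action": "no_op"}
    exp["status"] = "crash"
    exp["error"] = error
    return {"status_code": 200, "experiment_id": eid,
            "status": "crash", "action": "crashed"}


store = {"exp-1": {"status": "keep"}, "exp-2": {"status": "running"}}
first = mark_crash(store, "exp-2", "timeout")
second = mark_crash(store, "exp-2", "late duplicate report")
kept = mark_crash(store, "exp-1")
```

The second call leaves the original `error` value intact, which is the property the guard is there to protect.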
@@ -375,8 +379,19 @@ async def best(data):
 
     async def near_misses(data):
         inp = _unwrap_input(data)
+        tag = inp.get("tag")
+        current_best = await kv.get(SCOPES["best"], tag)
         all_nm = await kv.list(SCOPES["near_misses"])
-        tag_nm = [n for n in all_nm if n.get("tag") == inp.get("tag")]
+        tag_nm = []
+        for n in all_nm:
+            if n.get("tag") != tag:
+                continue
+            if current_best:
+                delta_p = n.get("passed", 0) - current_best["passed"]
+                delta_s = n.get("aggregate_score", 0) - current_best["aggregate_score"]
+                if abs(delta_p) > 1 or abs(delta_s) > NEAR_MISS_THRESHOLD:
+                    continue
+            tag_nm.append(n)
         filtered = sorted(tag_nm, key=lambda n: n.get("delta_score", 0), reverse=True)
         limit = inp.get("limit", 20)
         return _ok({"near_misses": filtered[:limit], "total": len(filtered)})
@@ -444,15 +459,15 @@ async def run_single(data):
 
             score = 0.0
             reward_file = task_path / "logs" / "reward.txt"
-            if reward_file.exists():
+            if reward_file.exists() and reward_file.stat().st_mtime >= start:
                 try:
                     score = float(reward_file.read_text().strip())
                 except ValueError:
                     pass
 
             trajectory = None
             traj_file = task_path / "logs" / "agent" / "trajectory.json"
-            if traj_file.exists():
+            if traj_file.exists() and traj_file.stat().st_mtime >= start:
                 try:
                     trajectory = json.loads(traj_file.read_text())
                 except json.JSONDecodeError:
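
The `st_mtime >= start` checks above are meant to stop a reward or trajectory file left over from an earlier run being read as the result of the current one. The sketch below reproduces the reward check in isolation. It assumes `start` is a wall-clock timestamp such as `time.time()`; the hunk does not show how `start` is set, and a monotonic clock would not be comparable to `st_mtime`. `read_fresh_score` is a hypothetical helper name:

```python
import os
import tempfile
import time
from pathlib import Path


def read_fresh_score(reward_file: Path, start: float) -> float:
    """Return the reward only if the file was modified at or after `start`."""
    score = 0.0
    if reward_file.exists() and reward_file.stat().st_mtime >= start:
        try:
            score = float(reward_file.read_text().strip())
        except ValueError:
            pass
    return score


with tempfile.TemporaryDirectory() as tmp:
    reward = Path(tmp) / "reward.txt"

    # A leftover file from an earlier run, backdated by one hour.
    reward.write_text("1.0\n")
    stale_time = time.time() - 3600
    os.utime(reward, (stale_time, stale_time))

    start = time.time()
    stale_score = read_fresh_score(reward, start)

    # The current run rewrites the file; set the mtime explicitly so the
    # example does not depend on filesystem timestamp granularity.
    reward.write_text("0.75\n")
    os.utime(reward, (start + 1, start + 1))
    fresh_score = read_fresh_score(reward, start)
```

A stale file yields the default score of `0.0`, the same value as a missing file, so a task that never wrote a reward is not credited with an old one.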
@@ -612,6 +627,7 @@ async def adapt(data):
         inp = _unwrap_input(data)
         all_exps = await kv.list(SCOPES["experiments"])
         tag_exps = [e for e in all_exps if e.get("tag") == inp.get("tag") and e.get("status") != "running"]
+        tag_exps.sort(key=lambda e: e.get("started_at", ""))
 
         if len(tag_exps) < 5:
             return _ok({"mode": "explore", "reason": "too few experiments to adapt"})
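
The added sort in `adapt` orders experiments by `started_at` before any windowed statistics are computed. The sketch below shows why a plain string sort is enough, assuming the timestamps are ISO 8601 strings in UTC like the `created_at` value produced with `datetime.now(timezone.utc).isoformat()` elsewhere in the file, and assuming the KV listing does not guarantee any order. The experiment records are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

base = datetime(2025, 4, 6, 12, 0, tzinfo=timezone.utc)

# Records as they might come back from an unordered listing.
experiments = [
    {"id": "c", "started_at": (base + timedelta(minutes=2)).isoformat()},
    {"id": "a", "started_at": base.isoformat()},
    {"id": "legacy"},  # no timestamp: the "" default sorts it first
    {"id": "b", "started_at": (base + timedelta(minutes=1)).isoformat()},
]

# Same key as in the diff: same-offset ISO 8601 strings sort chronologically.
experiments.sort(key=lambda e: e.get("started_at", ""))
order = [e["id"] for e in experiments]

# A "last N experiments" window is only meaningful after sorting.
recent = order[-3:]
```

Records without `started_at` land at the front, so they fall out of any most-recent window first.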
 
@@ -1,5 +1,5 @@
 """
-Integration tests for autoagent-iii orchestrator.
+Integration tests for autoharness orchestrator.
 
 Requires iii-engine running at localhost:3111 (REST) / localhost:49134 (WS).
 Start with: iii --config iii-config.yaml
@@ -8,6 +8,7 @@
 """
 
 import json
+import os
 import sys
 import time
 import urllib.request
@@ -22,6 +23,9 @@ def api(path, data=None, method="POST"):
     url = f"{BASE}{path}"
     body = json.dumps(data).encode() if data else None
     headers = {"Content-Type": "application/json"} if body else {}
+    token = os.environ.get("AUTOHARNESS_AUTH_TOKEN")
+    if token:
+        headers["Authorization"] = f"Bearer {token}"
     req = urllib.request.Request(url, data=body, headers=headers, method=method)
     try:
         with urllib.request.urlopen(req, timeout=30) as resp:
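
The test helper now forwards `AUTOHARNESS_AUTH_TOKEN` as a bearer token so the suite can run against an orchestrator with authentication enabled. A minimal sketch of just the request construction, with no network call; `build_request` is a hypothetical function and the token value is a placeholder:

```python
import json
import urllib.request


def build_request(url, data=None, method="POST", token=None):
    """Build a urllib request, adding a bearer token only when one is given."""
    body = json.dumps(data).encode() if data else None
    headers = {"Content-Type": "application/json"} if body else {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


authed = build_request(
    "http://localhost:3111/api/harness/snapshot",
    {"name": "baseline"},
    token="example-token",  # placeholder, not a real credential
)
anonymous = build_request("http://localhost:3111/api/report/tags", method="GET")

auth_header = authed.get_header("Authorization")
anon_header = anonymous.get_header("Authorization")
```

Leaving the header out when no token is set keeps the tests working against a local orchestrator where all endpoints are open.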
@@ -339,7 +343,7 @@ def test_harness_restore():
 def main():
     global PASS, FAIL
 
-    print(f"autoagent-iii orchestrator tests (tag={TAG})")
+    print(f"autoharness orchestrator tests (tag={TAG})")
     print(f"API: {BASE}")
 
     try: