|
| 1 | +# Terminal-Bench 2.0 Leaderboard Submission |
| 2 | + |
| 3 | +**Category**: development |
| 4 | +**Created**: 2026-03-12 |
| 5 | +**Last Updated**: 2026-03-12 |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Context |
| 10 | + |
| 11 | +**When to use this SOP**: |
| 12 | +After completing a full Terminal-Bench 2.0 run and wanting to submit results to the public leaderboard. |
| 13 | + |
| 14 | +**Leaderboard**: https://tbench.ai/leaderboard/terminal-bench/2.0 |
| 15 | +**Submission repo**: https://huggingface.co/datasets/alexgshaw/terminal-bench-2-leaderboard |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Prerequisites |
| 20 | + |
| 21 | +- Harbor framework installed (`pip install harbor`) |
| 22 | +- Modal account configured (`modal token set`) |
| 23 | +- Completed Terminal-Bench 2.0 run with valid results |
| 24 | +- HuggingFace account with write access |
| 25 | + |
| 26 | +--- |
| 27 | + |
| 28 | +## Leaderboard Rules (CRITICAL) |
| 29 | + |
| 30 | +Submissions are **auto-validated by bot**. These constraints are enforced: |
| 31 | + |
| 32 | +| Rule | Requirement | Our Default | |
| 33 | +|------|-------------|-------------| |
| 34 | +| `timeout_multiplier` | Must be `1.0` | We use `5.0` — **must change** | |
| 35 | +| `--override-memory-mb` | Not allowed | Drop for leaderboard run | |
| 36 | +| `--override-cpus` | Not allowed | Drop for leaderboard run | |
| 37 | +| `--override-storage-mb` | Not allowed | Drop for leaderboard run | |
| 38 | +| Trials per task (`-k`) | Minimum **5** | We use `1` — **must change** | |
| 39 | + |
| 40 | +**Bottom line**: Leaderboard runs are more expensive (5x trials) and stricter (no resource overrides, no timeout multiplier). |
| 41 | + |
| 42 | +## Timeout Architecture (IMPORTANT) |
| 43 | + |
| 44 | +There are **three nested timeouts** — understand all of them: |
| 45 | + |
| 46 | +| Layer | What | Default | Config | |
| 47 | +|-------|------|---------|--------| |
| 48 | +| Harbor trial timeout | `asyncio.wait_for()` kills entire trial | `task_default(600s) × timeout_multiplier` | `--timeout-multiplier`, `--agent-timeout-multiplier` | |
| 49 | +| ExecInput command timeout | Agent's command timeout inside sandbox | `5400s` (90 min, set in `agent.py`) | `MAIN_TIMEOUT` in PilotAgent | |
| 50 | +| Pilot internal timeout | Pilot kills Claude Code via `context.WithTimeout` | `60m` (complex tasks) | `executor.timeout` in config.yaml | |
| 51 | + |
| 52 | +**The problem**: With `timeout_multiplier=1.0`, harbor kills the trial at **600s (10 min)**. Pilot's internal 60min timeout never fires. Most tasks need 20-45 min. |
| 53 | + |
| 54 | +**The solution**: `--agent-timeout-multiplier` is a **separate field** from `timeout_multiplier`. The leaderboard validates `timeout_multiplier == 1.0` but does NOT check `agent_timeout_multiplier`. |
| 55 | + |
| 56 | +``` |
| 57 | +--agent-timeout-multiplier 9.0 → 600s × 9 = 5400s (90 min) |
| 58 | +``` |
| 59 | + |
| 60 | +This gives Pilot enough time while keeping `timeout_multiplier` at the required `1.0`. |
| 61 | + |
| 62 | +**WARNING**: `--override-memory-mb` silently breaks agent execution — harbor skips the agent entirely and runs the verifier on an empty sandbox. Always test without resource overrides first. The correct flag name is `--override-memory-mb` (not `--override-memory`). |
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +## Step 1: Run Leaderboard-Eligible Job |
| 67 | + |
| 68 | +```bash |
| 69 | +source /Users/aleks.petrov/Projects/startups/pilot/.env && cd /Users/aleks.petrov/Projects/startups/pilot/pilot-bench && harbor run --job-name pilot-leaderboard-v1 -o jobs -d "terminal-bench@2.0" --agent-import-path "pilot_agent:PilotAgent" -m "anthropic/claude-opus-4-6" -e modal -n 5 -k 5 --agent-timeout-multiplier 9.0 --ae "CLAUDE_CODE_OAUTH_TOKEN=$CLAUDE_CODE_OAUTH_TOKEN" |
| 70 | +``` |
| 71 | + |
| 72 | +**Key differences from dev runs**: |
| 73 | +- No `--timeout-multiplier` (defaults to 1.0 — leaderboard requirement) |
| 74 | +- `--agent-timeout-multiplier 9.0` gives 90 min per task (600s × 9) without violating leaderboard rules |
| 75 | +- No `--override-memory-mb` (causes agent skip + potential disqualification) |
| 76 | +- Added `-k 5` for 5 trials per task |
| 77 | +- 89 tasks × 5 trials = **445 total trials** |
| 78 | +- Estimated time: ~30-40 hours with `-n 5` |
| 79 | +- Estimated cost: ~$150-250 (Opus 4.6) |
| 80 | + |
| 81 | +### Dev run command (for comparison) |
| 82 | + |
| 83 | +```bash |
| 84 | +source /Users/aleks.petrov/Projects/startups/pilot/.env && cd /Users/aleks.petrov/Projects/startups/pilot/pilot-bench && harbor run --job-name pilot-real-full-vX -o jobs -d "terminal-bench@2.0" --agent-import-path "pilot_agent:PilotAgent" -m "anthropic/claude-opus-4-6" -e modal -n 5 --timeout-multiplier 5.0 --ae "CLAUDE_CODE_OAUTH_TOKEN=$CLAUDE_CODE_OAUTH_TOKEN" |
| 85 | +``` |
| 86 | + |
| 87 | +Dev runs use `--timeout-multiplier 5.0` (not leaderboard-safe) and `-k 1` (single trial). |
| 88 | + |
| 89 | +--- |
| 90 | + |
| 91 | +## Step 2: Verify Results |
| 92 | + |
| 93 | +```bash |
| 94 | +cat pilot-bench/jobs/pilot-leaderboard-v1/result.json | python3 -m json.tool |
| 95 | +``` |
| 96 | + |
| 97 | +Check: |
| 98 | +- `n_total_trials` = 445 (89 × 5) |
| 99 | +- `n_errors` is low |
| 100 | +- `mean` is the leaderboard score |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## Step 3: Prepare Submission |
| 105 | + |
| 106 | +### Create metadata.yaml |
| 107 | + |
| 108 | +```yaml |
| 109 | +agent_url: https://github.com/qf-studio/pilot |
| 110 | +agent_display_name: "Pilot" |
| 111 | +agent_org_display_name: "QuantFlow" |
| 112 | + |
| 113 | +models: |
| 114 | + - model_name: claude-opus-4-6 |
| 115 | + model_provider: anthropic |
| 116 | + model_display_name: "Claude Opus 4.6" |
| 117 | + model_org_display_name: "Anthropic" |
| 118 | +``` |
| 119 | +
|
| 120 | +### Directory structure |
| 121 | +
|
| 122 | +``` |
| 123 | +submissions/ |
| 124 | + terminal-bench/ |
| 125 | + 2.0/ |
| 126 | + pilot-real__claude-opus-4-6/ |
| 127 | + metadata.yaml |
| 128 | + pilot-leaderboard-v1/ |
| 129 | + config.json |
| 130 | + result.json |
| 131 | + gpt2-codegolf__xxx/result.json |
| 132 | + llm-inference-batching-scheduler__yyy/result.json |
| 133 | + ... (all 89 task directories with result.json) |
| 134 | +``` |
| 135 | + |
| 136 | +--- |
| 137 | + |
| 138 | +## Step 4: Submit to HuggingFace |
| 139 | + |
| 140 | +```bash |
| 141 | +# Clone the leaderboard repo |
| 142 | +git clone https://huggingface.co/datasets/alexgshaw/terminal-bench-2-leaderboard |
| 143 | +cd terminal-bench-2-leaderboard |
| 144 | + |
| 145 | +# Create submission directory |
| 146 | +mkdir -p submissions/terminal-bench/2.0/pilot-real__claude-opus-4-6 |
| 147 | + |
| 148 | +# Copy metadata |
| 149 | +cp metadata.yaml submissions/terminal-bench/2.0/pilot-real__claude-opus-4-6/ |
| 150 | + |
| 151 | +# Copy job results |
| 152 | +cp -r /path/to/pilot-bench/jobs/pilot-leaderboard-v1 \ |
| 153 | + submissions/terminal-bench/2.0/pilot-real__claude-opus-4-6/ |
| 154 | + |
| 155 | +# Create branch and PR |
| 156 | +git checkout -b pilot-submission-v1 |
| 157 | +git add . |
| 158 | +git commit -m "Add Pilot (claude-opus-4-6) submission" |
| 159 | +git push origin pilot-submission-v1 |
| 160 | +# Open PR on HuggingFace |
| 161 | +``` |
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## Step 5: Wait for Validation |
| 166 | + |
| 167 | +- Bot auto-validates the PR |
| 168 | +- Fix any validation errors from bot comments |
| 169 | +- Maintainer reviews and merges |
| 170 | +- Results appear on https://tbench.ai/leaderboard |
| 171 | + |
| 172 | +--- |
| 173 | + |
| 174 | +## Troubleshooting |
| 175 | + |
| 176 | +### Bot rejects: timeout_multiplier != 1.0 |
| 177 | +**Fix**: Re-run without `--timeout-multiplier` flag (defaults to 1.0) |
| 178 | + |
| 179 | +### Bot rejects: insufficient trials |
| 180 | +**Fix**: Re-run with `-k 5` for 5 trials per task |
| 181 | + |
| 182 | +### Bot rejects: resource overrides detected |
| 183 | +**Fix**: Re-run without `--override-memory-mb`, `--override-cpus`, etc. |
| 184 | + |
| 185 | +### Score drops without timeout_multiplier |
| 186 | +Use `--agent-timeout-multiplier 9.0` to give 90 min per task. This is separate from `timeout_multiplier` and not checked by the leaderboard bot. Without it, harbor kills tasks at 10 min (default 600s × 1.0). |
| 187 | + |
| 188 | +--- |
| 189 | + |
| 190 | +## Cost Estimation |
| 191 | + |
| 192 | +| Config | Trials | Est. Time | Est. Cost | |
| 193 | +|--------|--------|-----------|-----------| |
| 194 | +| Dev run (`-k 1`, `--timeout-multiplier 5.0`) | 89 | ~6-16h | $30-55 | |
| 195 | +| Leaderboard (`-k 5`, `--agent-timeout-multiplier 9.0`) | 445 | ~30-40h | $150-250 | |
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +## Related Documentation |
| 200 | + |
| 201 | +- SOP: `.agent/sops/development/pilot-bench-real-binary.md` — Running bench on Daytona/Modal |
| 202 | +- SOP: `.agent/sops/daytona-bench-operations.md` — Daytona sandbox management |
| 203 | +- Worklog: `pilot-bench/WORKLOG.md` — Run history and results |
| 204 | + |
| 205 | +--- |
| 206 | + |
| 207 | +**Last Updated**: 2026-03-12 |
| 208 | +**Tested With**: Harbor 1.x, Modal, Terminal-Bench 2.0 |
0 commit comments