
TPI-004 eval-first monkey model Runpod evidence pass #3

Draft
gb250e wants to merge 8 commits into exp/eval-first-003 from exp/eval-first-004

Conversation

gb250e (Owner) commented Mar 21, 2026

Purpose

This is the TPI-004 internal review PR.

TPI-003 closed the environment-selection question and selected the runpod_parameter_golf path as the smallest runnable route. TPI-004 is the first actual evidence pass for the existing eval-first monkey-model branch.

Thesis

In the selected Runpod Parameter Golf environment, the current eval-first monkey-model branch should be tested with one real baseline/candidate pair before any further branching or mechanism expansion.

Scope

  • keep the existing eval-first mechanism unchanged
  • use the selected Runpod path
  • fetch the published assets if needed
  • run one baseline/candidate pair
  • capture runtime and val_bpb
  • preserve monkey model framing in public-facing artifacts

What is included

  • Runpod execution plan
  • results placeholder for the first evidence pair
  • run-notes surface for the actual Runpod pass
  • PR summary placeholder for the next update

What is intentionally not included yet

  • new model mechanism
  • tokenizer changes
  • final non-record submission folder
  • final evidence numbers

Expected next step

  1. provision or attach the chosen Runpod pod
  2. fetch sp1024 assets with --train-shards 1
  3. run baseline with EVAL_STRIDE=1024
  4. run candidate with EVAL_STRIDE=128
  5. capture runtime + val_bpb
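The five steps above can be sketched as a small driver script. This is a hypothetical sketch only: the script names (`fetch_assets.py`, `train.py`) and the `--run-name` flag are assumptions, not verified repo contents; only `EVAL_STRIDE` and `--train-shards 1` come from this PR body.

```python
import os
import subprocess

def build_run(name: str, eval_stride: int) -> tuple[list[str], dict]:
    """Return the command and environment for one training run.

    Baseline and candidate differ only in EVAL_STRIDE (1024 vs 128),
    per the contract stated in this PR.
    """
    env = dict(os.environ, EVAL_STRIDE=str(eval_stride))
    cmd = ["python", "train.py", "--run-name", name]  # assumed entry point
    return cmd, env

def main() -> None:
    # Step 2: fetch sp1024 assets with a single training shard.
    subprocess.run(
        ["python", "fetch_assets.py", "--train-shards", "1"], check=True
    )
    # Steps 3-4: run the baseline/candidate pair back to back.
    for name, stride in [("baseline", 1024), ("candidate_128", 128)]:
        cmd, env = build_run(name, stride)
        subprocess.run(cmd, env=env, check=True)
    # Step 5: runtime and val_bpb are then read from each run's stdout log.

if __name__ == "__main__":
    main()
```

The point of the single `build_run` helper is that the pair shares everything except `EVAL_STRIDE`, which keeps the comparison contract auditable in one place.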

Public-facing safety

This branch uses monkey model framing only and is intended to avoid exposing proprietary architecture language.

gb250e (Owner, Author) commented Mar 21, 2026

LLM review checkpoint for TPI-004

Current assessment

  • The execution contract is now fixed enough that model-side ambiguity is no longer the blocker.
  • The failure is operational and sharply localized: pod attachment / access handoff.
  • move_to_failure_ledger is the correct classification for this turn.

Why this is a good failure

  • It preserves a precise blocker instead of blurring it into generic sharpening.
  • It shows the eval-first monkey-model policy itself has not yet been falsified.
  • It keeps the next loop focused on access handoff rather than new mechanisms or environment re-selection.

Required next step

Open TPI-005 as an attachability / access-handoff pass.
The first objective should be to obtain and exercise a concrete pod attach command or SSH path that lands in /workspace/parameter-golf.

Update note for this PR

The PR body should highlight:

  • selected Runpod path remains valid
  • blocker is missing pod access, not model behavior
  • next-turn first command should be the concrete pod attach / SSH invocation
  • baseline/candidate contract remains unchanged (EVAL_STRIDE=1024 vs EVAL_STRIDE=128)

gb250e (Owner, Author) commented Mar 21, 2026

TPI-004 evidence update:

  • concrete SSH handoff worked and /workspace/parameter-golf is now reachable
  • exp/eval-first-004 baseline completed on Runpod
  • baseline result: val_bpb=2.4957 at step:149, train_time=600533ms
  • candidate-128 reached eval_policy:sliding_window and completed warmup, so the path is real
  • candidate-128 did not emit a comparable first metric in this turn because the initial sliding evaluation path remained materially slower than baseline
  • first candidate retry also revealed a false OOM if stale baseline processes are not cleaned up before rerun

The branch is no longer blocked on access. The active blocker is candidate runtime overhead and clean GPU/process state between runs.
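The false-OOM point above suggests a pre-run hygiene check: list leftover GPU compute processes and terminate them before starting the next run. A minimal sketch, assuming the output shape of `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` (one `pid, name` line per compute process); the `keep` parameter is a hypothetical allowlist for processes that should survive:

```python
import os
import signal

def stale_pids(nvidia_smi_csv: str, keep=frozenset()) -> list[int]:
    """Parse `nvidia-smi --query-compute-apps=pid,process_name
    --format=csv,noheader` output and return compute PIDs to clean up."""
    pids = []
    for line in nvidia_smi_csv.strip().splitlines():
        pid_field = line.split(",")[0].strip()
        if pid_field.isdigit() and int(pid_field) not in keep:
            pids.append(int(pid_field))
    return pids

def kill_stale(pids: list[int]) -> None:
    """Send SIGTERM to each stale process (escalate to SIGKILL if needed)."""
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
```

Running this between baseline and candidate would have avoided the false OOM by ensuring the candidate starts against a clean GPU.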

gb250e (Owner, Author) commented Mar 21, 2026

H100 rerun update:

  • New H100 SSH route worked after pod recreation: ssh root@157.66.254.40 -p 19285 -i ~/.ssh/id_ed25519
  • Existing /workspace/parameter-golf storage, dataset, and tokenizer were present on the new pod
  • The fresh pod needed Python environment repair: datasets, sentencepiece, and torch 2.9.1+cu128
  • Baseline completed on H100:
    • step:1750
    • train_time:600135ms
    • val_bpb:1.3449
    • final_int8_zlib_roundtrip_exact val_bpb:1.34594409
  • Candidate EVAL_STRIDE=128 also completed on H100:
    • step:1757
    • train_time:600231ms
    • val_bpb:1.3114
    • final_int8_zlib_roundtrip_exact val_bpb:1.31271009
  • H100 conclusion:
    • the old RTX-side candidate startup/runtime blocker no longer reproduces
    • candidate improves score by about 0.0332 val_bpb on the same wallclock budget (1.34594 vs 1.31271 on the int8 roundtrip check)
    • final sliding-window eval is still materially heavier than baseline (115208ms vs 11309ms)
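The two H100 conclusions above are simple arithmetic on the numbers already quoted; spelling them out makes the claims reproducible:

```python
# Sanity arithmetic on the H100 figures quoted in this PR.
baseline_bpb = 1.34594409   # final int8 zlib roundtrip, baseline
candidate_bpb = 1.31271009  # final int8 zlib roundtrip, candidate-128

delta = baseline_bpb - candidate_bpb  # improvement at equal wallclock
overhead = 115208 / 11309             # final-eval wallclock ratio, ms/ms

print(f"delta={delta:.4f} eval_overhead={overhead:.1f}x")
# prints delta=0.0332 eval_overhead=10.2x
```

So the candidate's final sliding-window eval is roughly 10x heavier than the baseline eval, which is the remaining runtime-overhead blocker named earlier in this thread.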

Remote artifacts on the H100 pod:

  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_baseline.stdout.log
  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_baseline.commit.txt
  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_candidate_128.stdout.log
  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_candidate_128.commit.txt
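The metrics reported in this thread can be recovered from those stdout logs. A sketch under an assumed log-line shape inferred from the figures quoted above (`step:1750 train_time:600135ms val_bpb:1.3449`); the actual format in the logs has not been verified:

```python
import re

# Assumed metric-line shape, inferred from values quoted in this PR:
#   ... step:1750 train_time:600135ms val_bpb:1.3449 ...
METRIC_RE = re.compile(r"step:(\d+)\s+train_time:(\d+)ms\s+val_bpb:([\d.]+)")

def last_metrics(log_text: str):
    """Return (step, train_time_ms, val_bpb) from the last matching
    line of a run's stdout log, or None if no metric line is found."""
    hits = METRIC_RE.findall(log_text)
    if not hits:
        return None
    step, ms, bpb = hits[-1]
    return int(step), int(ms), float(bpb)
```

Applied to the four artifact logs, this would regenerate the baseline/candidate table for the PR summary without hand transcription.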
