
TPI-004 eval-first monkey model Runpod evidence pass #3

Draft
gb250e wants to merge 8 commits into exp/eval-first-003 from exp/eval-first-004

Conversation

gb250e (Owner) commented Mar 21, 2026

Purpose

This is the TPI-004 internal review PR.

TPI-003 closed the environment-selection question and selected the runpod_parameter_golf path as the smallest runnable route. TPI-004 is the first actual evidence pass for the existing eval-first monkey-model branch.

Thesis

In the selected Runpod Parameter Golf environment, the current eval-first monkey-model branch should be tested with one real baseline/candidate pair before any further branching or mechanism expansion.

Scope

  • keep the existing eval-first mechanism unchanged
  • use the selected Runpod path
  • fetch the published assets if needed
  • run one baseline/candidate pair
  • capture runtime and val_bpb
  • preserve monkey model framing in public-facing artifacts

What is included

  • Runpod execution plan
  • results placeholder for the first evidence pair
  • run-notes surface for the actual Runpod pass
  • PR summary placeholder for the next update

What is intentionally not included yet

  • new model mechanism
  • tokenizer changes
  • final non-record submission folder
  • final evidence numbers

Expected next step

  1. provision or attach the chosen Runpod pod
  2. fetch sp1024 assets with --train-shards 1
  3. run baseline with EVAL_STRIDE=1024
  4. run candidate with EVAL_STRIDE=128
  5. capture runtime + val_bpb
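The five steps above can be sketched as a small driver script. This is a hypothetical sketch only: the script names (`fetch_assets.py`, `train.py`) and the `--run-name` flag are assumptions, not verified repo contents; only `EVAL_STRIDE` and `--train-shards 1` come from this PR body.

```python
import os
import subprocess

def build_run(name: str, eval_stride: int) -> tuple[list[str], dict]:
    """Return the command and environment for one training run.

    Baseline and candidate differ only in EVAL_STRIDE (1024 vs 128),
    per the contract stated in this PR.
    """
    env = dict(os.environ, EVAL_STRIDE=str(eval_stride))
    cmd = ["python", "train.py", "--run-name", name]  # assumed entry point
    return cmd, env

def main() -> None:
    # Step 2: fetch sp1024 assets with a single training shard.
    subprocess.run(
        ["python", "fetch_assets.py", "--train-shards", "1"], check=True
    )
    # Steps 3-4: run the baseline/candidate pair back to back.
    for name, stride in [("baseline", 1024), ("candidate_128", 128)]:
        cmd, env = build_run(name, stride)
        subprocess.run(cmd, env=env, check=True)
    # Step 5: runtime and val_bpb are then read from each run's stdout log.

if __name__ == "__main__":
    main()
```

The point of the single `build_run` helper is that the pair shares everything except `EVAL_STRIDE`, which keeps the comparison contract auditable in one place.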

Public-facing safety

This branch uses monkey model framing only and is intended to avoid exposing proprietary architecture language.

gb250e (Owner, Author) commented Mar 21, 2026

LLM review checkpoint for TPI-004

Current assessment

  • The execution contract is now fixed enough that model-side ambiguity is no longer the blocker.
  • The failure is operational and sharply localized: pod attachment / access handoff.
  • move_to_failure_ledger is the correct classification for this turn.

Why this is a good failure

  • It preserves a precise blocker instead of blurring it into generic sharpening.
  • It shows the eval-first monkey-model policy itself has not yet been falsified.
  • It keeps the next loop focused on access handoff rather than new mechanisms or environment re-selection.

Required next step

Open TPI-005 as an attachability / access-handoff pass.
The first objective should be to obtain and exercise a concrete pod attach command or SSH path that lands in /workspace/parameter-golf.

Update note for this PR

The PR body should highlight:

  • selected Runpod path remains valid
  • blocker is missing pod access, not model behavior
  • next-turn first command should be the concrete pod attach / SSH invocation
  • baseline/candidate contract remains unchanged (EVAL_STRIDE=1024 vs EVAL_STRIDE=128)

gb250e (Owner, Author) commented Mar 21, 2026

TPI-004 evidence update:

  • concrete SSH handoff worked and /workspace/parameter-golf is now reachable
  • exp/eval-first-004 baseline completed on Runpod
  • baseline result: val_bpb=2.4957 at step:149, train_time=600533ms
  • candidate-128 reached eval_policy:sliding_window and completed warmup, so the path is real
  • candidate-128 did not emit a comparable first metric in this turn because the initial sliding evaluation path remained materially slower than baseline
  • first candidate retry also revealed a false OOM if stale baseline processes are not cleaned up before rerun

The branch is no longer blocked on access. The active blocker is candidate runtime overhead and clean GPU/process state between runs.
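The false-OOM point above suggests a pre-run hygiene check: list leftover GPU compute processes and terminate them before starting the next run. A minimal sketch, assuming the output shape of `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` (one `pid, name` line per compute process); the `keep` parameter is a hypothetical allowlist for processes that should survive:

```python
import os
import signal

def stale_pids(nvidia_smi_csv: str, keep=frozenset()) -> list[int]:
    """Parse `nvidia-smi --query-compute-apps=pid,process_name
    --format=csv,noheader` output and return compute PIDs to clean up."""
    pids = []
    for line in nvidia_smi_csv.strip().splitlines():
        pid_field = line.split(",")[0].strip()
        if pid_field.isdigit() and int(pid_field) not in keep:
            pids.append(int(pid_field))
    return pids

def kill_stale(pids: list[int]) -> None:
    """Send SIGTERM to each stale process (escalate to SIGKILL if needed)."""
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
```

Running this between baseline and candidate would have avoided the false OOM by ensuring the candidate starts against a clean GPU.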

gb250e (Owner, Author) commented Mar 21, 2026

H100 rerun update:

  • New H100 SSH route worked after pod recreation: ssh root@157.66.254.40 -p 19285 -i ~/.ssh/id_ed25519
  • Existing /workspace/parameter-golf storage, dataset, and tokenizer were present on the new pod
  • The fresh pod needed Python environment repair: datasets, sentencepiece, and torch 2.9.1+cu128
  • Baseline completed on H100:
    • step:1750
    • train_time:600135ms
    • val_bpb:1.3449
    • final_int8_zlib_roundtrip_exact val_bpb:1.34594409
  • Candidate EVAL_STRIDE=128 also completed on H100:
    • step:1757
    • train_time:600231ms
    • val_bpb:1.3114
    • final_int8_zlib_roundtrip_exact val_bpb:1.31271009
  • H100 conclusion:
    • the old RTX-side candidate startup/runtime blocker no longer reproduces
    • candidate improves score by about 0.0332 val_bpb on the same wallclock budget (1.34594 vs 1.31271 on the int8 roundtrip check)
    • final sliding-window eval is still materially heavier than baseline (115208ms vs 11309ms)
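The two H100 conclusions above are simple arithmetic on the numbers already quoted; spelling them out makes the claims reproducible:

```python
# Sanity arithmetic on the H100 figures quoted in this PR.
baseline_bpb = 1.34594409   # final int8 zlib roundtrip, baseline
candidate_bpb = 1.31271009  # final int8 zlib roundtrip, candidate-128

delta = baseline_bpb - candidate_bpb  # improvement at equal wallclock
overhead = 115208 / 11309             # final-eval wallclock ratio, ms/ms

print(f"delta={delta:.4f} eval_overhead={overhead:.1f}x")
# prints delta=0.0332 eval_overhead=10.2x
```

So the candidate's final sliding-window eval is roughly 10x heavier than the baseline eval, which is the remaining runtime-overhead blocker named earlier in this thread.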

Remote artifacts on the H100 pod:

  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_baseline.stdout.log
  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_baseline.commit.txt
  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_candidate_128.stdout.log
  • /workspace/parameter-golf/runs/TPI-004/tpi004_h100_candidate_128.commit.txt
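The metrics reported in this thread can be recovered from those stdout logs. A sketch under an assumed log-line shape inferred from the figures quoted above (`step:1750 train_time:600135ms val_bpb:1.3449`); the actual format in the logs has not been verified:

```python
import re

# Assumed metric-line shape, inferred from values quoted in this PR:
#   ... step:1750 train_time:600135ms val_bpb:1.3449 ...
METRIC_RE = re.compile(r"step:(\d+)\s+train_time:(\d+)ms\s+val_bpb:([\d.]+)")

def last_metrics(log_text: str):
    """Return (step, train_time_ms, val_bpb) from the last matching
    line of a run's stdout log, or None if no metric line is found."""
    hits = METRIC_RE.findall(log_text)
    if not hits:
        return None
    step, ms, bpb = hits[-1]
    return int(step), int(ms), float(bpb)
```

Applied to the four artifact logs, this would regenerate the baseline/candidate table for the PR summary without hand transcription.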
