TPI-003 eval-first monkey model runnable environment pass #2

Draft
gb250e wants to merge 4 commits into exp/eval-first-002 from exp/eval-first-003

Conversation


gb250e commented Mar 21, 2026

Purpose

This is the TPI-003 internal review PR.

TPI-002 confirmed that the current blocker is operational rather than conceptual. The eval-first monkey-model branch still lacks runtime and val_bpb evidence because the local environment cannot execute the comparison contract.

Thesis

For the current monkey-model eval-first policy, the next highest-value move is to secure the smallest runnable environment that can produce one baseline/candidate evidence pair under Parameter Golf-compatible conditions.

Scope

  • keep the same eval-first mechanism
  • add no new model mechanism
  • select the smallest runnable environment path
  • fix the minimum asset set and command contract
  • preserve monkey-model framing in public-facing artifacts

What is included

  • environment selection plan
  • execution contract for one baseline/candidate evidence pair
  • explicit environment path comparison

What is intentionally not included yet

  • new model mechanism
  • tokenizer changes
  • records submission folder
  • final runtime/val_bpb evidence

Expected next step

Choose one runnable environment path (preferably the smallest path that can actually execute the branch), then run one real baseline/candidate pair and record:

  1. eval runtime
  2. val_bpb
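For clarity on item 2: val_bpb presumably means validation bits per byte. A minimal sketch of the standard conversion from mean cross-entropy loss (in nats per token) to bits per byte follows; the function name and the assumption that token and byte counts are available are illustrative, and the branch's actual metric definition may differ:

```python
import math

def val_bpb(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean validation cross-entropy (nats/token) to bits per byte.

    Uses the standard definition: bpb = total_nats / (ln(2) * total_bytes).
    This is a sketch, not the branch's confirmed metric code.
    """
    total_nats = mean_loss_nats * total_tokens
    return total_nats / (math.log(2) * total_bytes)

# Example: 1.0 nat/token over 1000 tokens spanning 4000 bytes
print(val_bpb(1.0, 1000, 4000))
```

Because bpb normalizes by bytes rather than tokens, it stays comparable across tokenizer variants, which is why it pairs naturally with an eval-stride comparison.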

Public-facing safety

This branch uses monkey-model framing only and is intended to avoid exposing proprietary architecture language.

Copy link
Copy Markdown
Owner Author

gb250e commented Mar 21, 2026

LLM review checkpoint for TPI-003

Current assessment

  • The branch made the right move: it converted a generic blocker into a concrete environment choice.
  • runpod_parameter_golf is the correct primary path because it is both documented and challenge-aligned.
  • Scope discipline remains good: no new model mechanism, tokenizer change, or architecture branch was introduced.

What changed materially

  • the environment debate is now closed
  • the baseline/candidate contract is concrete enough to execute
  • the next turn can focus on one real evidence pair instead of further environment speculation

Decision reading

continue_sharpening is acceptable, but this branch is effectively at the boundary of an execution pass.

Required next step

Open TPI-004 as the actual evidence pass in the chosen Runpod environment.
The minimum acceptable outcome for that turn is one baseline/candidate pair with:

  1. eval runtime
  2. val_bpb
  3. preserved monkey-model framing

Update note for this PR

The PR body should highlight:

  • chosen path: runpod_parameter_golf
  • rejected alternatives and why
  • first command: python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
  • baseline/candidate contract: EVAL_STRIDE=1024 vs EVAL_STRIDE=128
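Taken together, the bullets above imply a command sequence along these lines. The data-prep command and the EVAL_STRIDE values are quoted from this PR; the eval entrypoint is shown as a placeholder because the branch's actual script is not named here:

```shell
# Prepare the cached FineWeb data (first command, as stated in this PR)
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1

# Baseline/candidate evidence pair: identical invocation, only the eval
# stride differs. <eval_entrypoint> is a placeholder, not a real script
# name from the branch.
EVAL_STRIDE=1024 python3 <eval_entrypoint>   # baseline
EVAL_STRIDE=128  python3 <eval_entrypoint>   # candidate
```

Recording eval runtime and val_bpb from each of the two runs yields the single evidence pair TPI-004 requires.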
