Record: MTP-2 Funnel + LeakyReLU(0.75)² + Legal TTT + Parallel Muon by michaelwinczuk · Pull Request #1031 · openai/parameter-golf

michaelwinczuk · 2026-03-28T18:38:55Z

Summary

Added Multi-Token Prediction (MTP_NUM_HEADS=2, MTP_LOSS_WEIGHT=0.1) as an auxiliary training signal
MTP forces the backbone to learn richer representations by predicting 2 tokens ahead
MTP heads are discarded at export, so they add nothing to the 16MB artifact and no eval overhead
Validated a 0.0037 BPB improvement (delta of -0.0037) on a test pod (apples-to-apples, same hardware)
A lighter weight (0.1 vs. the default 0.2) avoids stealing gradient from the main CE loss
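
As a rough illustration of the summary above, here is a minimal pure-Python sketch of how an MTP auxiliary loss might be mixed into the main cross-entropy loss. It is not the code from train_gpt.py: the function names, the assumption that head k predicts the token at offset k+2, and the averaging over heads are all hypothetical choices made for the example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood (in nats) of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def mtp_training_loss(main_logits, head_logits, tokens, mtp_loss_weight=0.1):
    """Combine next-token CE with an auxiliary multi-token-prediction loss.

    Assumptions (hypothetical, not taken from the PR's code):
      * main_logits[t] predicts tokens[t + 1]
      * head_logits[k][t] predicts tokens[t + 2 + k]
      * the auxiliary loss is the mean over heads
    """
    n = len(tokens)
    main_terms = [cross_entropy(main_logits[t], tokens[t + 1]) for t in range(n - 1)]
    main = sum(main_terms) / len(main_terms)

    aux = 0.0
    for k, logits_k in enumerate(head_logits):
        offset = k + 2
        terms = [cross_entropy(logits_k[t], tokens[t + offset]) for t in range(n - offset)]
        aux += sum(terms) / len(terms)
    aux /= max(len(head_logits), 1)

    # Only `main` matters at eval time; the heads that produce `aux`
    # would be dropped before export.
    return main + mtp_loss_weight * aux, main, aux


# Toy usage: uniform logits over a 4-token vocabulary.
tokens = [0, 1, 2, 3, 0]
uniform = [[0.0] * 4 for _ in tokens]
total, main, aux = mtp_training_loss(uniform, [uniform, uniform], tokens)
```

With the weight at 0.1, the auxiliary term contributes a tenth as much as the main loss when both are of similar magnitude, which matches the stated intent of not letting MTP dominate the gradient.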

Changes from prior submission (val_bpb 1.1185)

MTP_NUM_HEADS: 0 → 2
MTP_LOSS_WEIGHT: 0.2 → 0.1

Changes from SOTA baseline

negative_slope: 0.5 → 0.75
MATRIX_LR: 0.025 → 0.027
WARMDOWN_ITERS: 3500 → 3700
MTP_NUM_HEADS: 0 → 2
MTP_LOSS_WEIGHT: 0.2 → 0.1
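
For readers unfamiliar with the LeakyReLU(0.75)² activation named in the title, a scalar sketch follows. It assumes the activation is a leaky ReLU followed by squaring; the function name is hypothetical, and the real train_gpt.py presumably operates on tensors rather than floats.

```python
def leaky_relu_sq(x, negative_slope=0.75):
    """Leaky ReLU followed by squaring (scalar sketch, hypothetical helper).

    Positive inputs map to x**2; negative inputs map to (negative_slope * x)**2,
    so the output is always non-negative and the slope only scales the
    negative branch.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y


# Effect of the negative_slope change on a negative input:
before = leaky_relu_sq(-2.0, negative_slope=0.5)   # (-1.0)**2
after = leaky_relu_sq(-2.0, negative_slope=0.75)   # (-1.5)**2
```

Under this reading, raising negative_slope from 0.5 to 0.75 increases the negative-branch output by a factor of (0.75/0.5)² = 2.25.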

Research methodology

8 swarm missions + external cross-validation identified MTP as highest-ROI unexplored lever. The "training funnel" concept: MTP auxiliary loss focuses gradient signal on structurally important tokens without adding parameters to the final checkpoint.

🤖 Generated with Claude Code

One-line activation change (negative_slope 0.5→0.75) + minor LR/warmdown tuning. Discovered via multi-agent think tank swarm research system.

3-seed results with legal TTT:
Seed 1337: 1.1183 BPB (15.96MB)
Seed 42: 1.1194 BPB (15.96MB)
Seed 2024: 1.1179 BPB (15.95MB)
Mean: 1.1185 BPB
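
The reported mean can be checked directly from the three per-seed numbers. The snippet below uses only the values quoted in the commit message; the sample standard deviation it computes is not stated in the source.

```python
from statistics import mean, stdev

# Per-seed val_bpb values quoted in the commit message.
results = {1337: 1.1183, 42: 1.1194, 2024: 1.1179}

values = list(results.values())
mean_bpb = mean(values)
std_bpb = stdev(values)  # sample standard deviation across the 3 seeds

print(f"mean = {mean_bpb:.4f} BPB, std = {std_bpb:.4f}")
```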

Added Multi-Token Prediction (MTP_NUM_HEADS=2, MTP_LOSS_WEIGHT=0.1) as auxiliary training signal. MTP forces the backbone to learn richer representations by predicting 2 tokens ahead during training. Heads are discarded at export: zero 16MB impact, zero eval overhead. Validated -0.0037 BPB improvement on test pod (apples-to-apples comparison). Lighter MTP weight (0.1 vs default 0.2) avoids gradient stealing from main CE.

Changes from prior submission (1.1185 BPB):
- MTP_NUM_HEADS: 0 -> 2
- MTP_LOSS_WEIGHT: 0.2 -> 0.1

Changes from SOTA baseline:
- negative_slope: 0.5 -> 0.75
- MATRIX_LR: 0.025 -> 0.027
- WARMDOWN_ITERS: 3500 -> 3700
- MTP_NUM_HEADS: 0 -> 2
- MTP_LOSS_WEIGHT: 0.2 -> 0.1

Research: 8 TTS swarm missions + Grok + Gemini cross-validation. MTP identified as "training funnel": every gradient counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


MatoTeziTanka · 2026-04-11T20:02:38Z

Community Review — Record: MTP-2 Funnel + LeakyReLU(0.75)² + Legal TTT + Parallel Muon

BPB: 0.0037 (cache parse — may be delta/std, not val_bpb; check PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 02eb7a8d204b, file records/track_10min_16mb/2026-03-27_LeakyReLU075_LegalTTT_ParallelMuon_TunedLR/train_gpt.py):

The TTT path at line 1074 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
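
The score-first-per-chunk ordering described above can be sketched with a toy model. This is a hedged illustration only: the real code reportedly uses torch.no_grad() / inference_mode() for scoring and SGD for adaptation, whereas the example below substitutes a hypothetical count-based unigram model (UnigramModel, score_first_ttt are invented names) and reports bits per token rather than BPB.

```python
import math


class UnigramModel:
    """Toy stand-in for the adapted model: Laplace-smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.counts = [1] * vocab_size

    def nll_bits(self, token):
        return -math.log2(self.counts[token] / sum(self.counts))

    def adapt(self, chunk):
        # Stand-in for the SGD adaptation pass on an already-scored chunk.
        for tok in chunk:
            self.counts[tok] += 1


def score_first_ttt(model, tokens, chunk_size):
    """Score each chunk before adapting on it; never adapt on the last chunk."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    total_bits = 0.0
    for ci, chunk in enumerate(chunks):
        is_last_chunk = ci == len(chunks) - 1
        # 1) Score chunk ci with a model adapted only on chunks 0..ci-1.
        total_bits += sum(model.nll_bits(t) for t in chunk)
        # 2) Only then adapt on chunk ci (skipped for the final chunk).
        if not is_last_chunk:
            model.adapt(chunk)
    return total_bits / len(tokens)


model = UnigramModel(vocab_size=2)
bits_per_token = score_first_ttt(model, [0, 0, 0, 0], chunk_size=2)
```

The property that matters for legality is the ordering inside the loop: swapping steps 1 and 2 would score each token with a model that had already trained on it.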

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=89459 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=89459 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

michaelwinczuk and others added 2 commits March 27, 2026 12:36

notapplica mentioned this pull request Mar 29, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

demouo added a commit to demouo/parameter-golf that referenced this pull request Mar 30, 2026

experiment: MTP-2 + LeakyReLU(0.75)^2 + GPTQ-lite (from PR openai#1031 …

1ea2c9e

…openai#549)

michaelwinczuk mentioned this pull request Apr 11, 2026

Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011) #1094

Open

3 tasks


michaelwinczuk wants to merge 2 commits into openai:main from michaelwinczuk:submission/mtp2-funnel-leakyrelu075
