Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) by Christopher-Lee-McClendon · Pull Request #537 · openai/parameter-golf

Christopher-Lee-McClendon · 2026-03-23T15:02:29Z

Non-Record: 11L LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)

val_bpb = 1.13872 (best seed) | Mean: 1.13936 ± 0.0008 | Pre-TTT mean: 1.1574 | Artifact: 15.36 MB

Non-record unlimited-compute submission (trained on 4×A100-40GB, ~42 min; eval ~3690s on 1×A100).

What Changed vs PR #526 (BPB 1.14252)

LeakyReLU(0.5)² (from PR Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518, sofiabod): Replace ReLU² → −0.0035 mean pre-TTT improvement. This accounts for essentially all of the final BPB gain.
Per-layer LR for TTT (from PR Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds) #481, mrdavtan): mlp.proj 3×, mlp.fc 0.5× learning rate
Intra-chunk cosine LR (from PR Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518): Cosine decay within each chunk's 30 TTT epochs

Note on TTT modifications: Per-layer LR and intra-chunk cosine were adopted from other PRs but showed no measurable TTT improvement in this data — TTT gain went from −0.0184 (PR #526) to −0.0182 (this PR), a slight regression. The entire final BPB improvement comes from the better pre-TTT model via LeakyReLU. These TTT modifications require further ablation.

3-Seed Validation

Seed	Pre-TTT BPB	Final BPB	Δ vs PR #526
1337	1.1572	1.13912	−0.00340
42	1.1580	1.14024	−0.00228
7	1.1569	1.13872	−0.00380
Mean	1.1574	1.13936	−0.00316

Architecture

11L depth-recurrence (10 unique BlockCores), d=512, 8 heads, 4 KV heads
LeakyReLU(0.5)² MLP, Partial RoPE (16/64), Value Embeddings (128d, layers 9-10)
SmearGate (input embeddings), BigramHash(2048), XSA(last 4), U-Net skips, LN Scale
SWA, Late QAT, int6+zstd quantization
15,357,245 bytes total (4.0% headroom under 16MB limit)

TTT Protocol (Legal)

Score-first: torch.inference_mode() during scoring, then train
SGD(lr=0.002, momentum=0.9), 30 epochs/chunk, freeze first 2 blocks
Per-layer LR: mlp.proj 3×, mlp.fc 0.5×, intra-chunk cosine decay

Credits

This submission integrates work from many contributors:

LeakyReLU(0.5)² — PR Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518 (sofiabod)
Per-layer LR for TTT — PR Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds) #481 (mrdavtan)
Intra-chunk cosine LR — PR Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518 (sofiabod)
30-epoch legal TTT + score-first protocol — Our prior work (PRs Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461, Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252) #526)
11L depth recurrence — PRs Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299) #455, Record: 11L EMA + AdamW TTT 10ep (mean val_bpb=1.1027) #442
Partial RoPE, VE128, LN Scale — PRs Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246) #374, Record: 11L Tight SWA + VE128 + XSA4 + TTT (3-seed mean val_bpb=1.1299) #455
SmearGate, BigramHash, XSA — Community contributions across multiple PRs
Muon optimizer — PR Record: 11L + Tight SWA + Shared VE128 + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1246) #374 and descendants
SWA + Late QAT + int6/zstd — Evolved across many PRs

- Replace ReLU² with LeakyReLU(0.5)² activation (-0.004 BPB pre-TTT) - Add per-layer LR groups: mlp.proj 3x, mlp.fc 0.5x for TTT - Add intra-chunk cosine LR schedule for TTT epochs - 3-seed validation: 1.13912, 1.14024, 1.13872 (mean 1.13936) - Score-first legal TTT with SGD momentum, 30 epochs, freeze-2 - Best seed (7): BPB 1.13872, artifact 15.36 MB

Novel: TTT adapts ONLY scalar/control parameters (q_gain, attn_scale, mlp_scale, resid_mix, RMSNorm weights, skip_weights, skip_gates). Matrix weights (c_q/c_k/c_v/proj/MLP/tok_emb) stay frozen. This is mechanistically different from full-model TTT (openai#1413, openai#537): the model retunes its existing control knobs rather than learning new weight directions. Higher LR (0.01) since scalars need bigger steps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-04-11T20:07:55Z

Community Review — Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)

BPB: 1.13872 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 3b619c756cd8, file records/track_non_record_16mb/2026-03-23_11L_LeakyReLU_PerLayerLR_LegalTTT/train_gpt.py):

The TTT path at line 849 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=74210 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=74210 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Christopher-Lee-McClendon changed the title ~~BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)~~ Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) Mar 23, 2026

Christopher-Lee-McClendon force-pushed the submission/11L-leaky-relu-perlayer-lr-legal-ttt branch from e15a447 to 3b619c7 Compare March 23, 2026 15:18

notapplica mentioned this pull request Mar 23, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

Christopher-Lee-McClendon mentioned this pull request Mar 24, 2026

Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT #598

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)#537

Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)#537
Christopher-Lee-McClendon wants to merge 1 commit intoopenai:mainfrom
Christopher-Lee-McClendon:submission/11L-leaky-relu-perlayer-lr-legal-ttt

Christopher-Lee-McClendon commented Mar 23, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Christopher-Lee-McClendon commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Non-Record: 11L LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)

What Changed vs PR #526 (BPB 1.14252)

3-Seed Validation

Architecture

TTT Protocol (Legal)

Credits

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Christopher-Lee-McClendon commented Mar 23, 2026 •

edited

Loading