Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)#537

Open
Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/11L-leaky-relu-perlayer-lr-legal-ttt

Conversation

@Christopher-Lee-McClendon

@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Mar 23, 2026

Non-Record: 11L LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)

val_bpb = 1.13872 (best seed) | Mean: 1.13936 ± 0.0008 | Pre-TTT mean: 1.1574 | Artifact: 15.36 MB

Non-record unlimited-compute submission (trained on 4×A100-40GB, ~42 min; eval ~3690s on 1×A100).

What Changed vs PR #526 (BPB 1.14252)

  1. LeakyReLU(0.5)² (from PR #518, sofiabod; Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep, val_bpb=1.0622): replace ReLU² with LeakyReLU(0.5)², giving a −0.0035 mean pre-TTT improvement. This accounts for essentially all of the final BPB gain.
  2. Per-layer LR for TTT (from PR #481, mrdavtan; Record: Cosine TTT scheduling with per-layer lr, mean val_bpb=1.0970 over 3 seeds): mlp.proj at 3× and mlp.fc at 0.5× the base learning rate.
  3. Intra-chunk cosine LR (from PR #518): cosine decay of the TTT learning rate within each chunk's 30 epochs.

Note on TTT modifications: per-layer LR and intra-chunk cosine were adopted from other PRs but showed no measurable TTT improvement here — the TTT gain moved from −0.0184 (PR #526) to −0.0182 (this PR), a slight regression. The entire final BPB improvement comes from the better pre-TTT model via LeakyReLU(0.5)². These TTT modifications need further ablation.
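For readers unfamiliar with the activation swap above, here is a minimal sketch of what LeakyReLU(0.5)² computes (plain Python mirroring `F.leaky_relu(x, 0.5) ** 2`; the actual module name in the submission's train_gpt.py is not shown in this PR text):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring.

    Positive inputs behave exactly like ReLU^2 (x^2); negative inputs
    contribute (negative_slope * x)^2 instead of the hard zero of
    ReLU^2, so gradient signal survives on the negative side.
    """
    y = x if x > 0 else negative_slope * x
    return y * y
```

With slope 0.5, an input of −2.0 maps to 1.0 rather than the 0.0 that ReLU² would produce, which is the only behavioral difference from the previous activation.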

3-Seed Validation

| Seed | Pre-TTT BPB | Final BPB | Δ vs PR #526 |
|------|-------------|-----------|--------------|
| 1337 | 1.1572 | 1.13912 | −0.00340 |
| 42   | 1.1580 | 1.14024 | −0.00228 |
| 7    | 1.1569 | 1.13872 | −0.00380 |
| Mean | 1.1574 | 1.13936 | −0.00316 |

Architecture

  • 11L depth-recurrence (10 unique BlockCores), d=512, 8 heads, 4 KV heads
  • LeakyReLU(0.5)² MLP, Partial RoPE (16/64), Value Embeddings (128d, layers 9-10)
  • SmearGate (input embeddings), BigramHash(2048), XSA(last 4), U-Net skips, LN Scale
  • SWA, Late QAT, int6+zstd quantization
  • 15,357,245 bytes total (4.0% headroom under 16MB limit)

TTT Protocol (Legal)

  • Score-first: torch.inference_mode() during scoring, then train
  • SGD(lr=0.002, momentum=0.9), 30 epochs/chunk, freeze first 2 blocks
  • Per-layer LR: mlp.proj 3×, mlp.fc 0.5×, intra-chunk cosine decay
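The per-layer multipliers and intra-chunk cosine decay above can be sketched as a single effective-LR function (the base lr, multipliers, and epoch count come from the protocol; the function and parameter names are illustrative assumptions, not the submission's code):

```python
import math

BASE_LR = 0.002                              # SGD base lr from the protocol
LR_MULT = {"mlp.proj": 3.0, "mlp.fc": 0.5}   # per-layer multipliers

def lr_for(param_name, epoch, total_epochs=30):
    """Effective lr for one TTT epoch of one parameter: base lr,
    scaled by the matching per-layer multiplier (default 1x),
    decayed by a cosine over the chunk's TTT epochs."""
    mult = 1.0
    for key, m in LR_MULT.items():
        if key in param_name:
            mult = m
            break
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return BASE_LR * mult * cosine
```

In a real PyTorch setup the multipliers would typically be expressed as separate optimizer param groups, with a scheduler applying the cosine factor per epoch.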

Credits

This submission integrates work from many contributors:

@Christopher-Lee-McClendon Christopher-Lee-McClendon changed the title BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds) Mar 23, 2026
- Replace ReLU² with LeakyReLU(0.5)² activation (-0.004 BPB pre-TTT)
- Add per-layer LR groups: mlp.proj 3x, mlp.fc 0.5x for TTT
- Add intra-chunk cosine LR schedule for TTT epochs
- 3-seed validation: 1.13912, 1.14024, 1.13872 (mean 1.13936)
- Score-first legal TTT with SGD momentum, 30 epochs, freeze-2
- Best seed (7): BPB 1.13872, artifact 15.36 MB
@Christopher-Lee-McClendon Christopher-Lee-McClendon force-pushed the submission/11L-leaky-relu-perlayer-lr-legal-ttt branch from e15a447 to 3b619c7 Compare March 23, 2026 15:18
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Novel: TTT adapts ONLY scalar/control parameters (q_gain, attn_scale,
mlp_scale, resid_mix, RMSNorm weights, skip_weights, skip_gates).
Matrix weights (c_q/c_k/c_v/proj/MLP/tok_emb) stay frozen.

This is mechanistically different from full-model TTT (openai#1413, openai#537):
the model retunes its existing control knobs rather than learning
new weight directions. Higher LR (0.01) since scalars need bigger steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
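The scalar-only split described in that commit message can be sketched as a name filter over the model's parameters (the scalar/control names are taken from the commit message; the helper and the exact parameter-name spellings are assumptions, not code from either repository):

```python
# Scalar/control knobs listed in the commit message; anything not
# matching stays frozen (c_q/c_k/c_v/proj/MLP/tok_emb matrices).
SCALAR_KEYS = ("q_gain", "attn_scale", "mlp_scale", "resid_mix",
               "rmsnorm", "skip_weight", "skip_gate")

def split_ttt_params(param_names):
    """Partition parameter names: scalar/control parameters are
    trainable during TTT, matrix weights are frozen."""
    trainable, frozen = [], []
    for name in param_names:
        if any(key in name.lower() for key in SCALAR_KEYS):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen
```

The commit's higher TTT learning rate (0.01 vs 0.002 here) follows from this split: scalar gains need larger steps than full weight matrices.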
@MatoTeziTanka

Community Review — Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)

BPB: 1.13872 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 3b619c756cd8, file records/track_non_record_16mb/2026-03-23_11L_LeakyReLU_PerLayerLR_LegalTTT/train_gpt.py):

The TTT path at line 849 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
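The score-first ordering described above can be reduced to a toy loop (`score` and `adapt` are placeholders, not the submission's code; the point is only the ordering constraint):

```python
def score_first_ttt(chunks, score, adapt):
    """Score each chunk under the current weights, *then* adapt on it.

    `score(chunk)` evaluates under frozen weights (the real code wraps
    this in torch.inference_mode()); `adapt(chunk)` runs the SGD TTT
    epochs. Chunk i is therefore always scored with weights trained
    only on chunks 0..i-1, and the final chunk gets no adaptation pass
    (the is_last_chunk guard).
    """
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score(chunk))    # eval before any update on this chunk
        if i < len(chunks) - 1:        # is_last_chunk guard
            adapt(chunk)               # train on chunk i for later chunks only
    return losses
```

Any variant that adapts on a chunk before scoring it, or that revisits already-scored tokens across multiple epochs, would fall outside this legal pattern.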

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=74210 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
