Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523
Pull request overview
Adds a new Track B (10min/16mb) record submission under records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA, documenting the claimed SOTA result and including training/eval logs plus multiple training script variants.
Changes:
- Add record metadata (`README.md`, `submission.json`) and reproduction scripts (packed `train_gpt.py` + several readable/experimental variants).
- Add 3-seed training/eval logs and sweep logs supporting the reported BPB.
- Add additional experimental notes/modules (e.g., parallel Muon notes, ECT module) and a saved `.ptz` artifact.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/README.md | Record write-up, reported metrics, compliance and reproduction instructions |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/submission.json | Machine-readable submission metadata (seeds, BPB, techniques) |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt.py | Packed submission entrypoint (lzma+base85 exec) |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt_banked_fixed.py | Readable “banked” training script variant (banking + parallel Muon) |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt_readable.py | Readable non-banked script variant |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt_focal.py | Script variant with focal-weighted TTT |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt_lora_ttt.py | Script variant adding LoRA adapters for eval-time adaptation |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt_ect.py | Script variant with entropy-constrained training hooks |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_gpt_ect2.py | Additional ECT variant |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/ect_module.py | Standalone ECT controller module (documentation + prototype) |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/parallel_muon_notes.md | Implementation notes for parameter banking / parallel Muon |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_seed42.log | Seed 42 training/eval log for the claimed run |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_seed1337.log | Seed 1337 training/eval log for the claimed run |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/train_seed2024.log | Seed 2024 training/eval log for the claimed run |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/run_lr010.log | Eval-only log for TTT LR=0.01 sweep run |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/sweep_results.txt | Summary of LR/hash sweep results |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/sweep_results_pod.txt | Placeholder sweep output |
| records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/saved_model.ptz | Saved quantized model artifact included in the repo |
> # Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean)
>
> **val_bpb = 1.0778** (3-seed mean, std 0.0008) | **~15.99 MB** | 8xH100 SXM
The README claims an artifact size of "~15.99 MB", but the training logs in this folder report total submission sizes of ~16.02 MB (e.g., 16,032,371 bytes on seed 42). Please reconcile these numbers and clarify whether the cap is 16,000,000 bytes or 16 MiB, and ensure the reported artifact fits the enforced limit.
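The two candidate caps give different verdicts for the logged size. A quick check, using the 16,032,371-byte figure from the seed-42 log:

```python
log_bytes = 16_032_371  # total submission size reported in the seed-42 log

mb = log_bytes / 1_000_000   # decimal megabytes
mib = log_bytes / 2**20      # binary mebibytes

print(f"{mb:.2f} MB")   # over a 16,000,000-byte cap
print(f"{mib:.2f} MiB") # under a 16 MiB (16,777,216-byte) cap
```

So the artifact passes a 16 MiB limit but fails a 16,000,000-byte one; the README's "~15.99 MB" is consistent with neither reading of the log figure.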
> ## Reproduction
>
> ```bash
> pip install brotli sentencepiece
> MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
> SEED=1337 TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 \
> torchrun --standalone --nproc_per_node=8 train_gpt.py
> ```
>
> Requires: CUTLASS 3.x for EVT backward fusion (optional, falls back to standard PyTorch).
This record directory includes saved_model.ptz (~15MB). Committing large binary artifacts directly into the repo can significantly bloat clone size and slow CI; consider removing it from the PR (or using Git LFS / external artifact hosting) if it isn't strictly required for record verification.
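If the artifact must stay for verification, one conventional route is Git LFS; after running `git lfs track "*.ptz"`, the repo's `.gitattributes` would contain a line like:

```
*.ptz filter=lfs diff=lfs merge=lfs -text
```

This keeps the binary out of regular Git history while leaving the path checked out as usual.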
…nking HIGH priority

Key findings from daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131→0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
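The scan mixes nats and bpb; assuming both are per-byte quantities, they differ only by a log-base factor (a minimal sketch, not repo code):

```python
import math

LN2 = math.log(2)

def nats_to_bpb(nats_per_byte: float) -> float:
    """Bits-per-byte is nats-per-byte rescaled by 1/ln(2)."""
    return nats_per_byte / LN2

def bpb_to_nats(bpb: float) -> float:
    return bpb * LN2

# Under this reading, a 0.005-nat margin is roughly 0.0072 bpb.
print(round(nats_to_bpb(0.005), 4))
```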
Reverted BPB-weighted loss (caused torch.compile slowdown, timed out 2x). Clean forward with standard mean CE.

Stacking two proven improvements:
- Muon momentum 0.97 (measured -0.00129 in R20v10)
- TTT LR 0.01 (measured -0.0003 in PR openai#1523)
Hi, the ideas look cool, but the result is unreproducible for me. First, the script won't even run before changing to:

```python
import lzma as L, base64 as B, linecache as C
S = L.decompress(B.b85decode('<payload>')).decode()
F = __file__ + '.__decompressed__.py'
C.cache[F] = (len(S), None, S.splitlines(True), F)
exec(compile(S, F, 'exec'))
```

Second, you didn't include the needed dependencies for backward EVT fusion to work. After resolving these, I was able to reproduce your results.
20 virtual layers from 11 physical (was 17 with NUM_LOOPS=2). Layers 3-5 looped 4x total. May be slower per step but deeper model. PR openai#1523 reports this as the biggest single improvement.
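The looping arithmetic checks out: with 11 physical blocks and blocks 3-5 each run NUM_LOOPS + 1 = 4 times, the forward pass traverses 8 + 3*4 = 20 virtual layers. A minimal sketch (function and parameter names are illustrative, not from the PR):

```python
def virtual_layer_schedule(n_physical=11, loop_lo=3, loop_hi=5, num_loops=3):
    """Sequence of physical block indices executed in one forward pass.

    Blocks in [loop_lo, loop_hi] run num_loops + 1 times; the rest run once.
    """
    schedule = []
    for i in range(n_physical):
        reps = num_loops + 1 if loop_lo <= i <= loop_hi else 1
        schedule.extend([i] * reps)
    return schedule

print(len(virtual_layer_schedule()))             # 20 virtual layers
print(len(virtual_layer_schedule(num_loops=2)))  # 17, matching the previous config
```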
…rrence)

Extracted from PR openai#1523 diff. Hardcoded the correct params that openai#1523 runs with: Muon 0.97, TTT LR 0.01, NUM_LOOPS 3 (these were env-var overrides that don't forward to the GPU). 659 lines, 63KB code, 20KB packed.
…tch) v21 used FORMAT_RAW but openai#1523 originally uses standard lzma. Repacked with matching format for compatibility.
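The format mismatch is easy to reproduce with Python's `lzma` module (illustrative sketch, not the packer's actual code):

```python
import lzma

data = b"packed training script" * 64

# Standard containers (FORMAT_XZ by default) are self-describing:
blob = lzma.compress(data)
assert lzma.decompress(blob) == data

# FORMAT_RAW strips the container; the decompressor must be handed the exact
# same filter chain, and a standard-format unpacker cannot read raw output.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9}]
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
assert lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=filters) == data
```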
Ah, I think I removed the wrong file during cleanup in a messy PR.
Porting all openai#1523 hyperparams that differ from openai#1493:
- EMA_DECAY: 0.9965 -> 0.997 (stronger smoothing)
- WARMDOWN_FRAC: 0.72 -> 0.667 (shorter warmdown)
- Muon 0.97 (kept from previous best)
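For context on the EMA_DECAY bump: the effective averaging horizon of an EMA is roughly 1/(1 - decay) steps, so 0.9965 -> 0.997 lengthens the window from ~286 to ~333 steps (sketch; helper names are illustrative):

```python
def ema_update(ema, value, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * value."""
    return decay * ema + (1.0 - decay) * value

def ema_horizon(decay):
    """Rough effective averaging window, in steps."""
    return 1.0 / (1.0 - decay)

print(round(ema_horizon(0.9965)))  # ~286 steps
print(round(ema_horizon(0.997)))   # ~333 steps
```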
Banking architecture may cause torch.compile graph breaks with fullgraph=True. Disabled fullgraph for all 4 compile calls. Params: Muon 0.97, TTT LR 0.01, NUM_LOOPS 3.
All torch.compile removed (including the @torch.compile decorator). All Triton/CUTLASS disabled. Fused MLP off. Pure eager mode. If this works, the crash is from torch.compile; if it fails, the crash is elsewhere.
Happy to help!
…cluster healthy

Cluster heimdall-dev has 3/4 nodes unhealthy. Previous failures (v21-v28) were from degraded nodes, NOT code bugs. This version keeps all openai#1523 features enabled (torch.compile, Triton TMA, banking) for full performance.
…h exec)

ROOT CAUSE: @triton.jit requires source code from a real .py file. Packed code runs via exec(lzma.decompress(...)), which has no source file. Fix: force _HAS_TRITON_TMA=False; fused MLP uses the PyTorch fallback. Banking + all other features still enabled.
Instead of exec(decompress(...)), writes to _train_impl.py then runs via runpy.run_path. This gives Triton a real source file for @jit inspection. All features enabled including fused MLP Triton kernel. Muon 0.97, TTT LR 0.01, NUM_LOOPS=3.
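The write-then-`runpy` approach can be sketched as follows (hypothetical helper; assumes the payload is base85-encoded LZMA, as in the packed entrypoint):

```python
import base64
import lzma
import pathlib
import runpy
import tempfile

def run_packed(payload_b85: str, run_name: str = "__main__") -> dict:
    """Execute packed source from a real .py file instead of a bare exec().

    Writing the decompressed source to disk gives inspect/linecache a file to
    read, which @triton.jit needs when it retrieves kernel source code.
    Returns the executed module's globals, as runpy.run_path does.
    """
    src = lzma.decompress(base64.b85decode(payload_b85)).decode()
    impl = pathlib.Path(tempfile.mkdtemp()) / "_train_impl.py"
    impl.write_text(src)
    return runpy.run_path(str(impl), run_name=run_name)
```

Unlike `exec(compile(...))`, the executed module's `__file__` points at a genuine source file, so source inspection works.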
…ch=1 — val_bpb 1.07636 (3-seed mean)

3-seed mean 1.07636 BPB (std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493. Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires at 21% of positions), improved parallel residuals from openai#1529, parameter banking from openai#1523.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closing in favor of #1561 — clean resubmission with fixed LZMA wrapper.
Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97
val_bpb = 1.0778 (3-seed mean, std 0.0008) | ~15.99 MB | 8xH100 SXM
3-Seed Results
Merged SOTA (PR #1493): 1.0810 BPB. Delta: -0.0032 BPB.
Key Technical Contributions
Compliance (Track B)
Reproduction
Credits
PR #1420 @abaybektursun, PR #1394 @clarkkev, PR #1471 @X-Abhishek-X, PR #1477 @aryanbhosale, PR #1460 @resouer, PR #399 @abaybektursun, PR #1514 @dexhunter