Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean)#1523

Closed
EthanYangTW wants to merge 2 commits into openai:main from EthanYangTW:submission/triple-recurrence-banking-muon97

Conversation

@EthanYangTW

Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97

val_bpb = 1.0778 (3-seed mean, std 0.0008) | ~15.99 MB | 8xH100 SXM

3-Seed Results

| Seed | Sliding BPB | TTT BPB |
|------|-------------|---------|
| 1337 | 1.0786 | 1.0771 |
| 42   | 1.0792 | 1.0776 |
| 2024 | 1.0798 | 1.0787 |
| Mean | 1.0792 | 1.0778 |

Merged SOTA (PR #1493): 1.0810 BPB. Delta: -0.0032 BPB.

Key Technical Contributions

  1. Parameter Banking + Parallel Muon — 66 matrices grouped into 4 contiguous banks; batched Newton-Schulz gives a 15x faster optimizer step and +3.8% throughput
  2. Fused MLP Triton TMA Kernel — fc → LeakyReLU(0.5) → square fused into one kernel, +2% throughput (+5.2% total combined with banking)
  3. Muon Momentum 0.97 — reduced from 0.99, -0.0004 BPB
  4. Triple Depth Recurrence — 17 virtual layers from 11 physical (layers 3-5 looped 3x, enabled at 35% of training)
  5. Eval-Time Hash Embedding — zero-initialized 16384×512 bigram hash table, trained during the TTT loop
  6. TTT LR = 0.01 — tuned up from the default 0.005, -0.0003 BPB
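The banking idea in item 1 can be sketched independently of the training code: same-shaped weight matrices are stacked into one contiguous bank, and a single batched Newton-Schulz loop orthogonalizes all of them at once instead of running 66 separate small iterations. The sketch below is a minimal NumPy illustration using the classical cubic iteration; the actual optimizer is in PyTorch and likely uses a tuned higher-order polynomial, so the function and variable names here are illustrative only.

```python
import numpy as np

def batched_newton_schulz(G, steps=30):
    """Orthogonalize a bank of matrices with one batched loop.

    G: (B, n, m) array -- one "bank" of B same-shaped weight matrices.
    Classical cubic Newton-Schulz: X <- 1.5*X - 0.5*(X X^T) X, which
    drives every singular value of every matrix in the bank toward 1.
    All matmuls are batched over the leading B axis, so each step costs
    one large GEMM instead of B small ones.
    """
    # Frobenius normalization puts the spectral norm at or below 1,
    # inside the iteration's convergence region.
    X = G / (np.linalg.norm(G, axis=(1, 2), keepdims=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.transpose(0, 2, 1)   # (B, n, n), batched Gram matrices
        X = 1.5 * X - 0.5 * (A @ X)    # batched cubic update
    return X
```

Replacing many independent small GEMMs per optimizer step with one batched GEMM is where a speedup of the kind claimed (15x) would come from: GPUs are badly underutilized by small matrices.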

Compliance (Track B)

  • Score-first TTT: every chunk scored under no_grad() before gradient update
  • No SLOT, no pre-quant TTT, no n-gram caches, no Tap-In
  • All 4 conditions from Issue #1017 ("A Field Guide to Valid Submissions") satisfied
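"Score-first TTT" is prequential evaluation: each chunk's loss is recorded under the model state from before that chunk, and only afterwards is the model adapted on it, so no byte is ever scored by a model that has already trained on it. A toy stand-in (Laplace-smoothed byte unigram counts in place of the network and its no_grad()/gradient-step pair; illustrative only, not the submission's loop):

```python
import math
from collections import Counter

def score_first_ttt(chunks, vocab=256):
    """Prequential bits-per-byte over byte chunks.

    For each chunk: (1) score every byte under the current, frozen
    model state -- the analogue of scoring under torch.no_grad();
    (2) only then update the model on that same chunk -- the analogue
    of the TTT gradient step.
    """
    counts, total, bits = Counter(), 0, 0.0
    for chunk in chunks:
        for b in chunk:                      # 1) score under frozen state
            p = (counts[b] + 1) / (total + vocab)
            bits -= math.log2(p)
        counts.update(chunk)                 # 2) adapt only afterwards
        total += len(chunk)
    return bits / sum(len(c) for c in chunks)   # bits per byte
```

On the first chunk the model is uniform (8.0 bits/byte); later chunks benefit from adaptation, but never from their own bytes.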

Reproduction

```bash
pip install brotli sentencepiece
SEED=1337 TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

PR #1420 @abaybektursun, PR #1394 @clarkkev, PR #1471 @X-Abhishek-X, PR #1477 @aryanbhosale, PR #1460 @resouer, PR #399 @abaybektursun, PR #1514 @dexhunter

@EthanYangTW EthanYangTW marked this pull request as ready for review April 10, 2026 09:36
Copilot AI review requested due to automatic review settings April 10, 2026 09:36
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Track B (10min/16mb) record submission under records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA, documenting the claimed SOTA result and including training/eval logs plus multiple training script variants.

Changes:

  • Add record metadata (README.md, submission.json) and reproduction scripts (packed train_gpt.py + several readable/experimental variants).
  • Add 3-seed training/eval logs and sweep logs supporting the reported BPB.
  • Add additional experimental notes/modules (e.g., parallel Muon notes, ECT module) and a saved .ptz artifact.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 2 comments.

| File (records/track_10min_16mb/2026-04-09_SP8192_LegalSOTA/) | Description |
|---|---|
| README.md | Record write-up, reported metrics, compliance and reproduction instructions |
| submission.json | Machine-readable submission metadata (seeds, BPB, techniques) |
| train_gpt.py | Packed submission entrypoint (lzma+base85 exec) |
| train_gpt_banked_fixed.py | Readable "banked" training script variant (banking + parallel Muon) |
| train_gpt_readable.py | Readable non-banked script variant |
| train_gpt_focal.py | Script variant with focal-weighted TTT |
| train_gpt_lora_ttt.py | Script variant adding LoRA adapters for eval-time adaptation |
| train_gpt_ect.py | Script variant with entropy-constrained training hooks |
| train_gpt_ect2.py | Additional ECT variant |
| ect_module.py | Standalone ECT controller module (documentation + prototype) |
| parallel_muon_notes.md | Implementation notes for parameter banking / parallel Muon |
| train_seed42.log | Seed 42 training/eval log for the claimed run |
| train_seed1337.log | Seed 1337 training/eval log for the claimed run |
| train_seed2024.log | Seed 2024 training/eval log for the claimed run |
| run_lr010.log | Eval-only log for TTT LR=0.01 sweep run |
| sweep_results.txt | Summary of LR/hash sweep results |
| sweep_results_pod.txt | Placeholder sweep output |
| saved_model.ptz | Saved quantized model artifact included in the repo |


Comment on lines +1 to +4
# Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean)

**val_bpb = 1.0778** (3-seed mean, std 0.0008) | **~15.99 MB** | 8xH100 SXM


Copilot AI Apr 10, 2026


The README claims an artifact size of "~15.99 MB", but the training logs in this folder report total submission sizes of ~16.02 MB (e.g., 16,032,371 bytes on seed 42). Please reconcile these numbers and clarify whether the cap is 16,000,000 bytes or 16 MiB, and ensure the reported artifact fits the enforced limit.

Comment on lines +67 to +76
## Reproduction

```bash
pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
SEED=1337 TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Requires: CUTLASS 3.x for EVT backward fusion (optional, falls back to standard PyTorch).

Copilot AI Apr 10, 2026


This record directory includes saved_model.ptz (~15MB). Committing large binary artifacts directly into the repo can significantly bloat clone size and slow CI; consider removing it from the PR (or using Git LFS / external artifact hosting) if it isn't strictly required for record verification.

@EthanYangTW
Author

The training logs report Code size: 63396 because the runs used the uncompressed source. The submitted train_gpt.py is LZMA+base85 compressed at 20,555 bytes. Model sizes from the 3 seeds: 15,962,818 / 15,968,266 / 15,966,393 bytes. Total artifact with compressed code: ~15.98 MB (under 16,000,000 bytes on all seeds). The 16 MB cap is 16,000,000 decimal bytes per the competition rules.
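The totals quoted above can be sanity-checked in a few lines (byte counts copied from this comment; the cap taken as the stated 16,000,000 decimal bytes):

```python
CODE_PACKED = 20_555                                 # lzma+base85 train_gpt.py
MODEL_SIZES = [15_962_818, 15_968_266, 15_966_393]   # seeds 1337 / 42 / 2024
CAP = 16_000_000                                     # decimal bytes, per the rules

totals = [CODE_PACKED + m for m in MODEL_SIZES]
assert all(t < CAP for t in totals)   # every seed's artifact fits under the cap
print(max(totals))                    # worst case: 15,988,821 bytes
```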

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 10, 2026
…nking HIGH priority

Key findings from daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131→0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
Reverted BPB-weighted loss (caused torch.compile slowdown, timed out 2x).
Clean forward with standard mean CE. Stacking two proven improvements:
- Muon momentum 0.97 (measured -0.00129 in R20v10)
- TTT LR 0.01 (measured -0.0003 in PR openai#1523)
@msisovic
Contributor

Hi, the ideas look cool, but the result is unreproducible for me.

First, the script won't even run before changing to:

```python
import lzma as L, base64 as B, linecache as C
S = L.decompress(B.b85decode('<payload>')).decode()
F = __file__ + '.__decompressed__.py'
C.cache[F] = (len(S), None, S.splitlines(True), F)
exec(compile(S, F, 'exec'))
```

Second, you didn't include the needed dependencies for backward EVT fusion to work.

After resolving these, I was able to reproduce your results.

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
20 virtual layers from 11 physical (was 17 with NUM_LOOPS=2).
Layers 3-5 looped 4x total. May be slower per step but deeper model.
PR openai#1523 reports this as the biggest single improvement.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
…rrence)

Extracted from PR openai#1523 diff. Hardcoded correct params that openai#1523 runs with:
Muon 0.97, TTT LR 0.01, NUM_LOOPS 3. These were env-var overrides that
don't forward to GPU. 659 lines, 63KB code, 20KB packed.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
…tch)

v21 used FORMAT_RAW but openai#1523 originally uses standard lzma.
Repacked with matching format for compatibility.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
@EthanYangTW
Author

> Hi, the ideas look cool, but the result is unreproducible for me.
>
> First, the script won't even run before changing to:
>
> ```python
> import lzma as L, base64 as B, linecache as C
> S = L.decompress(B.b85decode('<payload>')).decode()
> F = __file__ + '.__decompressed__.py'
> C.cache[F] = (len(S), None, S.splitlines(True), F)
> exec(compile(S, F, 'exec'))
> ```
>
> Second, you didn't include the needed dependencies for backward EVT fusion to work.
>
> After resolving these, I was able to reproduce your results.

Ah, I think I removed the wrong file during cleanup in this messy PR. I'll close this when I'm home. Again, thank you for noticing!

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
Porting all openai#1523 hyperparams that differ from openai#1493:
- EMA_DECAY: 0.9965 -> 0.997 (stronger smoothing)
- WARMDOWN_FRAC: 0.72 -> 0.667 (shorter warmdown)
- Muon 0.97 (kept from previous best)
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
Banking architecture may cause torch.compile graph breaks with fullgraph=True.
Disabled fullgraph for all 4 compile calls. Params: Muon 0.97, TTT LR 0.01, NUM_LOOPS 3.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
All torch.compile removed (including @torch.compile decorator).
All Triton/CUTLASS disabled. Fused MLP off. Pure eager mode.
If this works: crash is from torch.compile. If fails: crash is elsewhere.
@msisovic
Contributor

> Ah, I think I removed the wrong file during cleanup in this messy PR. I'll close this when I'm home. Again, thank you for noticing!

Happy to help!

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
…cluster healthy

Cluster heimdall-dev has 3/4 nodes unhealthy. Previous failures (v21-v28)
were from degraded nodes, NOT code bugs. This version keeps all openai#1523
features enabled (torch.compile, Triton TMA, banking) for full performance.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
…h exec)

ROOT CAUSE: @triton.jit requires source code from a real .py file.
Packed code runs via exec(lzma.decompress(...)) which has no source file.
Fix: force _HAS_TRITON_TMA=False, fused MLP uses PyTorch fallback.
Banking + all other features still enabled.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
Instead of exec(decompress(...)), writes to _train_impl.py then runs via
runpy.run_path. This gives Triton a real source file for @jit inspection.
All features enabled including fused MLP Triton kernel.
Muon 0.97, TTT LR 0.01, NUM_LOOPS=3.
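The fix described in that commit needs nothing beyond the standard library: decompress the payload to a real .py file and execute it with runpy, so that source-inspecting decorators such as @triton.jit can re-read their source. Only the mechanism below matches the commit description; the demo payload and the _train_impl.py default are illustrative.

```python
import base64, lzma, pathlib, runpy

def run_packed(payload: str, path: str = "_train_impl.py") -> dict:
    """Unpack an lzma+base85 payload to a real file and run it.

    exec(lzma.decompress(...)) leaves no source file on disk, which
    breaks tools that re-read source at runtime (e.g. @triton.jit);
    runpy.run_path on a real file avoids that.
    """
    src = lzma.decompress(base64.b85decode(payload)).decode()
    p = pathlib.Path(path)
    p.write_text(src)              # give the code an inspectable source file
    return runpy.run_path(str(p))  # returns the executed module's globals

# Stand-in payload for a round-trip demo:
demo = base64.b85encode(lzma.compress(b"answer = 6 * 7\n")).decode()
```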
andrewbaggio1 added a commit to andrewbaggio1/parameter-golf that referenced this pull request Apr 11, 2026
…ch=1 — val_bpb 1.07636 (3-seed mean)

3-seed mean 1.07636 BPB (std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493.
Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires 21% of positions),
improved parallel residuals from openai#1529, parameter banking from openai#1523.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EthanYangTW
Author

Closing in favor of #1561 — clean resubmission with fixed LZMA wrapper.
