Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399
abaybektursun wants to merge 4 commits into openai:main
Conversation
Systems optimization built on PR openai#315 by @jfprincz (11L XSA4+EMA, 1.1248 bpb). Same architecture, same hyperparameters, only optimizer changed. 82.14ms/step vs 84.76ms baseline = 7,306 steps vs 7,079 in 600s. Pre-quant val_bpb 1.1421 (identical to baseline). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1.1248) Unbank state dict before quantization so int6 per-row scales match baseline. Rebank after dequantization for roundtrip eval. Results: 82.13ms/step, 7,306 steps, int6 sliding window val_bpb 1.1238. Artifact: 16.06MB (int6+zstd). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
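For context on the "int6 per-row scales" mentioned above: this refers to symmetric per-row quantization of each weight matrix. A minimal sketch — our reconstruction for illustration only; the PR's actual pipeline, rounding mode, and range handling may differ:

```python
import numpy as np

# Sketch of symmetric per-row int6 quantization (illustrative reconstruction,
# not the PR's code). One scale per output row; signed int6 symmetric
# range taken here as [-31, 31].
def quantize_int6_per_row(w, eps=1e-12):
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0 + eps
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
q, scale = quantize_int6_per_row(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by scale/2 per row
```

Because the scales are computed per row, the commit above unbanks the state dict first so that row boundaries line up with the baseline's per-matrix layout.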
Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5f4d141 to
4db0057
Compare
Replaced Polar Express with standard Newton-Schulz + switched to lzma compression. 3-seed results: 81.87ms/step mean, 1.1247 sliding bpb mean, all artifacts ~15.8MB. Seed 1337: 7331 steps, 1.1241 bpb, 15,830,960 bytes Seed 42: 7328 steps, 1.1253 bpb, 15,819,728 bytes Seed 2025: 7330 steps, 1.1247 bpb, 15,796,052 bytes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with Parameter Banking + Parallel Muon (first introduced in PR openai#399). Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128. Every token scored BEFORE model adapts (inference_mode enforced). SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d mean) Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9): Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB Mean: 1.1214 (std 0.0009) All artifacts under 16MB. All eval times under 600s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Seed 1337: pending (log will be added) Mean: 1.1195 (std 0.0008) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a non-record submission documenting a stack that runs without Flash Attention 3 (the runpod default pytorch:2.4.0 image lacks flash_attn_3). 1-seed result: val_bpb 1.1854, beating the OpenAI baseline (1.2244) by -0.039 BPB.

Stack:
- 11L d=512 SP1024
- XSA-all + BigramHash 3072x112 (from PR openai#1019)
- Parallel Muon (from PR openai#399)
- Step-based warmdown=2000/3500 (documents trigger bug)
- Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired, ~100 LOC pipeline)
- Sliding-window eval stride=32, temperature=0.90

No SLOT, no TTT, no validation data accessed during eval. Eval: 322s wall on 8×H100 (within the 600s budget). Single seed only (the record track requires a 3-seed mean).
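A toy sketch of the sliding-window eval bookkeeping (our illustration of the "sliding-window eval stride=32" item; window and stride are shrunk for clarity, and the submission's exact indexing may differ):

```python
# Toy bookkeeping for sliding-window evaluation (illustrative only).
def scored_spans(n, window, stride):
    """Which token positions each window scores: the first window scores all
    of its tokens; every later window scores only its newest `stride` tokens,
    so each token is scored once with up to window-1 tokens of left context."""
    spans = [(0, min(window, n))]
    end = min(window, n)
    while end < n:
        nxt = min(end + stride, n)
        spans.append((end, nxt))
        end = nxt
    return spans

spans = scored_spans(n=100, window=16, stride=4)
# every token is covered exactly once, in order
covered = [t for a, b in spans for t in range(a, b)]
```

Smaller strides score each token with more context at the cost of more forward passes, which is why stride is a speed/quality knob here.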
Thesis: the speed path is the most underutilized section of openai/parameter-golf. The quality path has 170+ PRs; the speed path has maybe 30, with 2-3 genuine novelties. Our 13× per-GPU gap vs comp records is almost entirely soft — most of it collapses under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3× speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in plain sight: we're paying 8× kernel-launch overhead because grad_accum was inherited from an 8×GPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: batched eval + streaming KV cache. The current sliding-window eval is 625K sequential forwards at B=1, stride=64; 97% of each window's context is shared with the previous one. Streaming KV (StreamingLLM, arXiv 2309.17453) gives a 5-15× eval speedup, saving 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025, arXiv 2503.15450). 22% throughput + 1-3.7% quality. Already in the Mac SETUP §35 backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved). Replaces the 220s AR self-gen with 14s, buying +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the fastest step on the leaderboard at 69.6 ms/step).
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4 contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms → 1.3 ms (15×). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity (PRs openai#1105, openai#1420). The identity itself looks world-novel.
- Shots 13-14: eval-path wins (Triton KV-cache backend, fused softcap+CE megakernel). Combined eval speedup ~5× on top of Shot 0b.
TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels; nobody has built one for TRAINING. 1.3µs × ~600 launches per step = 16% of our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel).
- Megadream 3: **Fuzzy LR bandit per microbatch** — the user's "dial-in" hint operationalized. Thompson sampling from {0.5×, 1×, 2×} × base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — the user's "CPU while GPU" hint operationalized. A background thread pre-computes n-gram hash tensors. 50 LOC.
- Megadream 5: **GPU-resident successive halving** — the user's "GPU tests" hint operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick the winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min compile cold-start permanently.

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1×H100**)

10000 steps on 1×H100 = 4× more per-GPU training than the comp's 20000 on 8×H100. That's where val_bpb drops BELOW comp records.

Key finding: the eval path, not training, currently holds the biggest speed wins. Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2 Shots 13-14 save 5-8 min per eval pass — more than any single training-side patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed), /tmp/phase2_world_speed_research.md (12 research areas surveyed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents what's actually in the repo now.

SHIPPED:
- phase2/README.md, bootstrap.sh, metrics.py, warm_compile_cache.py, run.sh
- submission/run.sh: Inductor patch + CUDA allocator expandable segments
- submission/train.py ShuffledSequenceLoader: prefetch thread + pinned RAM + prefill during pretime
- All gated by env vars with sensible defaults on

NOT SHIPPED (future work):
- Shot 2: FA3 sourcing (not on PyPI)
- Shot 9: FA3 varlen + window attention (PR openai#1212)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399)
- Shot 14: Training megakernel (world-first)
- Shot 0b: batched + streaming KV sliding eval
- Shot 17: fuzzy LR bandit
- Shot 19: GPU-resident successive halving

HONEST SKIPS:
- grad_accum 8→1: the research agent missed the memory math; it would OOM.
- CPU n-gram precompute: the research agent missed that GPU HBM is 60× faster than the CPU→GPU PCIe path for gather ops. Pivoted to prefetch prefill instead.

Tasks 7-12 complete (metrics, free env wins, prefetch loader, compile cache warmup, prefill during pretime, bootstrap wiring). Phase 2 Tier 0 is mechanically shipped. The bigger shots remain a plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)

BPB: 1.1247 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=76436 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
Novel Contribution: Parameter Banking + Parallel Muon
This submission introduces Parameter Banking, a weight layout restructuring that enables batched optimizer operations, combined with an adapted Parallel Muon communication strategy. Together, these provide a 3.4% training throughput improvement that is architecture-agnostic and composes with any Muon-based training stack. The approach has since been adopted by subsequent competition submissions (e.g., PR #549).
Pure systems optimization — model architecture and hyperparameters are unchanged.
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)
Technical Approach
1. Parameter Banking (novel)
We restructure 66 separate `nn.Linear` weight matrices into 4 contiguous 3D `nn.Parameter` tensors, grouped by shape:

- `qo_bank`: (22, 512, 512) — Q + Out projections
- `kv_bank`: (22, 256, 512) — K + V projections
- `mlp_up_bank`: (11, 1536, 512) — MLP up
- `mlp_down_bank`: (11, 512, 1536) — MLP down

The forward pass uses `F.linear(x, bank[layer_idx])`, which compiles identically to `nn.Linear` under `torch.compile`. Verified: banked forward+backward = 72.33ms vs baseline 72.59ms.

The key benefit: Newton-Schulz orthogonalization (used by Muon) becomes a single `torch.bmm` over the batch dimension, replacing 66 sequential small GEMMs. This cuts optimizer time from 19.7ms to 1.3ms (15× faster).

2. Parallel Muon (adapted from arXiv:2511.07464)
Standard DDP is incompatible with parameter banking: bank gradients aggregate across all 11 layers and only become available at the end of the backward pass, destroying compute-communication overlap (a +4ms regression).
Our solution removes DDP for banked parameters and schedules communication explicitly:
1. `reduce_scatter` for all banks (biggest first)
2. `all_reduce` + Adam step on small replicated params (while the bank reduce_scatters are in flight)
3. `all_gather`

This follows the DDP-free communication pattern from modded-nanogpt, adapted to work with our banking structure.
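To make the optimizer-side win concrete, here is a minimal single-process sketch of the banked Newton-Schulz step described above, with NumPy's batched `@` standing in for `torch.bmm`. The quintic coefficients are the ones popularized by modded-nanogpt — an assumption for illustration, not taken from this PR's code:

```python
import numpy as np

# Batched Newton-Schulz orthogonalization over a 3D parameter bank: the shape
# of the "66 GEMMs -> one bmm" replacement described above. Coefficients are
# the modded-nanogpt quintic (assumed; this PR's exact values may differ).
def newton_schulz_banked(g, steps=5, eps=1e-7):
    """Approximately orthogonalize each (out, in) slice of an (L, out, in)
    bank using one batched matmul chain instead of L sequential small GEMMs."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g, axis=(1, 2), keepdims=True) + eps)
    tall = x.shape[1] > x.shape[2]
    if tall:                               # iterate in the wide orientation
        x = x.transpose(0, 2, 1)
    for _ in range(steps):
        xxt = x @ x.transpose(0, 2, 1)     # all layers in one batched GEMM
        x = a * x + (b * xxt + c * (xxt @ xxt)) @ x
    return x.transpose(0, 2, 1) if tall else x

rng = np.random.default_rng(0)
bank_grad = rng.standard_normal((22, 64, 64))   # toy stand-in for a (22, 512, 512) bank
ortho = newton_schulz_banked(bank_grad)
sv = np.linalg.svd(ortho, compute_uv=False)     # singular values pushed toward 1
```

Because the iteration touches every layer's slice through one broadcast matmul chain, kernel-launch count per optimizer step drops from O(layers) to O(1), which is the claimed source of the 19.7ms → 1.3ms reduction.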
Engineering notes
Compatibility analysis
Key finding: The throughput advantage translates to quality gains exclusively for EMA-based models, where every additional step monotonically refines the exponential moving average.
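A toy illustration of that claim — our construction, not code from the PR: when the weights converge toward an optimum, every additional step strictly shrinks the EMA's distance to it, so extra steps per budget translate directly into a better EMA checkpoint.

```python
# Toy model of the EMA argument above (illustrative only; decay and step
# size are arbitrary values, not this submission's hyperparameters).
w_star = 1.0          # the optimum the weights converge toward
decay = 0.99          # EMA decay
w, ema = 0.0, 0.0
dists = []
for _ in range(2000):
    w += 0.01 * (w_star - w)              # stand-in for one optimizer step
    ema = decay * ema + (1 - decay) * w   # EMA update after each step
    dists.append(w_star - ema)            # EMA's remaining distance to optimum
```

In this setting the distance sequence is strictly decreasing, which is the sense in which every extra step "monotonically refines" an EMA-evaluated model.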
Credits
🤖 Generated with Claude Code