Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399
abaybektursun wants to merge 4 commits into openai:main
Conversation
Systems optimization built on PR openai#315 by @jfprincz (11L XSA4+EMA, 1.1248 bpb). Same architecture, same hyperparameters, only optimizer changed. 82.14ms/step vs 84.76ms baseline = 7,306 steps vs 7,079 in 600s. Pre-quant val_bpb 1.1421 (identical to baseline). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1.1248) Unbank state dict before quantization so int6 per-row scales match baseline. Rebank after dequantization for roundtrip eval. Results: 82.13ms/step, 7,306 steps, int6 sliding window val_bpb 1.1238. Artifact: 16.06MB (int6+zstd). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
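For context on the "int6 per-row scales" mentioned above: this refers to symmetric per-row quantization of each weight matrix. A minimal sketch — our reconstruction for illustration only; the PR's actual pipeline, rounding mode, and range handling may differ:

```python
import numpy as np

# Sketch of symmetric per-row int6 quantization (illustrative reconstruction,
# not the PR's code). One scale per output row; signed int6 symmetric
# range taken here as [-31, 31].
def quantize_int6_per_row(w, eps=1e-12):
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0 + eps
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
q, scale = quantize_int6_per_row(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by scale/2 per row
```

Because the scales are computed per row, the commit above unbanks the state dict first so that row boundaries line up with the baseline's per-matrix layout.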
Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5f4d141 to
4db0057
Compare
Replaced Polar Express with standard Newton-Schulz + switched to lzma compression. 3-seed results: 81.87ms/step mean, 1.1247 sliding bpb mean, all artifacts ~15.8MB. Seed 1337: 7331 steps, 1.1241 bpb, 15,830,960 bytes Seed 42: 7328 steps, 1.1253 bpb, 15,819,728 bytes Seed 2025: 7330 steps, 1.1247 bpb, 15,796,052 bytes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with Parameter Banking + Parallel Muon (first introduced in PR openai#399). Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128. Every token scored BEFORE model adapts (inference_mode enforced). SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d mean) Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9): Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB Mean: 1.1214 (std 0.0009) All artifacts under 16MB. All eval times under 600s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Seed 1337: pending (log will be added) Mean: 1.1195 (std 0.0008) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a non-record submission documenting a stack that runs without Flash Attention 3 (the runpod default pytorch:2.4.0 image lacks flash_attn_3). 1-seed result: val_bpb 1.1854, beating the OpenAI baseline (1.2244) by -0.039 BPB.

Stack:
- 11L d=512 SP1024
- XSA-all + BigramHash 3072x112 (from PR openai#1019)
- Parallel Muon (from PR openai#399)
- Step-based warmdown=2000/3500 (documents trigger bug)
- Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired, ~100 LOC pipeline)
- Sliding-window eval stride=32, temperature=0.90

No SLOT, no TTT, no validation data accessed during eval. Eval: 322s wall on 8×H100 (within the 600s budget). Single seed only (the record track requires a 3-seed mean).
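A toy sketch of the sliding-window eval bookkeeping (our illustration of the "sliding-window eval stride=32" item; window and stride are shrunk for clarity, and the submission's exact indexing may differ):

```python
# Toy bookkeeping for sliding-window evaluation (illustrative only).
def scored_spans(n, window, stride):
    """Which token positions each window scores: the first window scores all
    of its tokens; every later window scores only its newest `stride` tokens,
    so each token is scored once with up to window-1 tokens of left context."""
    spans = [(0, min(window, n))]
    end = min(window, n)
    while end < n:
        nxt = min(end + stride, n)
        spans.append((end, nxt))
        end = nxt
    return spans

spans = scored_spans(n=100, window=16, stride=4)
# every token is covered exactly once, in order
covered = [t for a, b in spans for t in range(a, b)]
```

Smaller strides score each token with more context at the cost of more forward passes, which is why stride is a speed/quality knob here.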
Thesis: the speed path is the most underutilized section of openai/parameter-golf. The quality path has 170+ PRs; the speed path has maybe 30, with 2-3 genuine novelties. Our 13× per-GPU gap vs comp records is almost entirely soft — most of it collapses under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3× speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in plain sight: we're paying 8× kernel-launch overhead because grad_accum was inherited from an 8×GPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: batched eval + streaming KV cache. The current sliding-window eval is 625K sequential forwards at B=1, stride=64; 97% of each window's context is shared with the previous one. Streaming KV (StreamingLLM, arXiv 2309.17453) gives a 5-15× eval speedup, saving 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025, arXiv 2503.15450). 22% throughput + 1-3.7% quality. Already in the Mac SETUP §35 backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved). Replaces the 220s AR self-gen with 14s, buying +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the fastest step on the leaderboard at 69.6 ms/step).
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4 contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms → 1.3 ms (15×). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity (PRs openai#1105, openai#1420). The identity itself looks world-novel.
- Shots 13-14: eval-path wins (Triton KV-cache backend, fused softcap+CE megakernel). Combined eval speedup ~5× on top of Shot 0b.
TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels; nobody has built one for TRAINING. 1.3µs × ~600 launches per step = 16% of our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel).
- Megadream 3: **Fuzzy LR bandit per microbatch** — the user's "dial-in" hint operationalized. Thompson sampling from {0.5×, 1×, 2×} × base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — the user's "CPU while GPU" hint operationalized. A background thread pre-computes n-gram hash tensors. 50 LOC.
- Megadream 5: **GPU-resident successive halving** — the user's "GPU tests" hint operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick the winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min compile cold-start permanently.

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1×H100**)

10000 steps on 1×H100 = 4× more per-GPU training than the comp's 20000 on 8×H100. That's where val_bpb drops BELOW comp records.

Key finding: the eval path, not training, currently holds the biggest speed wins. Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2 Shots 13-14 save 5-8 min per eval pass — more than any single training-side patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed), /tmp/phase2_world_speed_research.md (12 research areas surveyed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents what's actually in the repo now.

SHIPPED:
- phase2/README.md, bootstrap.sh, metrics.py, warm_compile_cache.py, run.sh
- submission/run.sh: Inductor patch + CUDA allocator expandable segments
- submission/train.py ShuffledSequenceLoader: prefetch thread + pinned RAM + prefill during pretime
- All gated by env vars with sensible defaults on

NOT SHIPPED (future work):
- Shot 2: FA3 sourcing (not on PyPI)
- Shot 9: FA3 varlen + window attention (PR openai#1212)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399)
- Shot 14: Training megakernel (world-first)
- Shot 0b: batched + streaming KV sliding eval
- Shot 17: fuzzy LR bandit
- Shot 19: GPU-resident successive halving

HONEST SKIPS:
- grad_accum 8→1: the research agent missed the memory math; it would OOM.
- CPU n-gram precompute: the research agent missed that GPU HBM is 60× faster than the CPU→GPU PCIe path for gather ops. Pivoted to prefetch prefill instead.

Tasks 7-12 complete (metrics, free env wins, prefetch loader, compile cache warmup, prefill during pretime, bootstrap wiring). Phase 2 Tier 0 is mechanically shipped. The bigger shots remain a plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)

BPB: 1.1247 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=76436 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
Novel Contribution: Parameter Banking + Parallel Muon
This submission introduces Parameter Banking, a weight layout restructuring that enables batched optimizer operations, combined with an adapted Parallel Muon communication strategy. Together, these provide a 3.4% training throughput improvement that is architecture-agnostic and composes with any Muon-based training stack. The approach has since been adopted by subsequent competition submissions (e.g., PR #549).
Pure systems optimization — model architecture and hyperparameters are unchanged.
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)
Technical Approach
1. Parameter Banking (novel)
We restructure 66 separate `nn.Linear` weight matrices into 4 contiguous 3D `nn.Parameter` tensors, grouped by shape:

- `qo_bank`: (22, 512, 512) — Q + Out projections
- `kv_bank`: (22, 256, 512) — K + V projections
- `mlp_up_bank`: (11, 1536, 512) — MLP up
- `mlp_down_bank`: (11, 512, 1536) — MLP down

The forward pass uses `F.linear(x, bank[layer_idx])`, which compiles identically to `nn.Linear` under `torch.compile`. Verified: banked forward+backward = 72.33ms vs baseline 72.59ms.

The key benefit: Newton-Schulz orthogonalization (used by Muon) becomes a single `torch.bmm` over the batch dimension, replacing 66 sequential small GEMMs. This cuts optimizer time from 19.7ms to 1.3ms (15× faster).

2. Parallel Muon (adapted from arXiv:2511.07464)
Standard DDP is incompatible with parameter banking: bank gradients aggregate across all 11 layers and only become available at the end of the backward pass, destroying compute-communication overlap (a +4ms regression).
Our solution removes DDP for banked parameters and schedules communication explicitly:
1. `reduce_scatter` for all banks (biggest first)
2. `all_reduce` + Adam step on small replicated params (while the bank reduce_scatters are in flight)
3. `all_gather`

This follows the DDP-free communication pattern from modded-nanogpt, adapted to work with our banking structure.
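To make the optimizer-side win concrete, here is a minimal single-process sketch of the banked Newton-Schulz step described above, with NumPy's batched `@` standing in for `torch.bmm`. The quintic coefficients are the ones popularized by modded-nanogpt — an assumption for illustration, not taken from this PR's code:

```python
import numpy as np

# Batched Newton-Schulz orthogonalization over a 3D parameter bank: the shape
# of the "66 GEMMs -> one bmm" replacement described above. Coefficients are
# the modded-nanogpt quintic (assumed; this PR's exact values may differ).
def newton_schulz_banked(g, steps=5, eps=1e-7):
    """Approximately orthogonalize each (out, in) slice of an (L, out, in)
    bank using one batched matmul chain instead of L sequential small GEMMs."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g, axis=(1, 2), keepdims=True) + eps)
    tall = x.shape[1] > x.shape[2]
    if tall:                               # iterate in the wide orientation
        x = x.transpose(0, 2, 1)
    for _ in range(steps):
        xxt = x @ x.transpose(0, 2, 1)     # all layers in one batched GEMM
        x = a * x + (b * xxt + c * (xxt @ xxt)) @ x
    return x.transpose(0, 2, 1) if tall else x

rng = np.random.default_rng(0)
bank_grad = rng.standard_normal((22, 64, 64))   # toy stand-in for a (22, 512, 512) bank
ortho = newton_schulz_banked(bank_grad)
sv = np.linalg.svd(ortho, compute_uv=False)     # singular values pushed toward 1
```

Because the iteration touches every layer's slice through one broadcast matmul chain, kernel-launch count per optimizer step drops from O(layers) to O(1), which is the claimed source of the 19.7ms → 1.3ms reduction.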
Engineering notes
Compatibility analysis
Key finding: The throughput advantage translates to quality gains exclusively for EMA-based models, where every additional step monotonically refines the exponential moving average.
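A toy illustration of that claim — our construction, not code from the PR: when the weights converge toward an optimum, every additional step strictly shrinks the EMA's distance to it, so extra steps per budget translate directly into a better EMA checkpoint.

```python
# Toy model of the EMA argument above (illustrative only; decay and step
# size are arbitrary values, not this submission's hyperparameters).
w_star = 1.0          # the optimum the weights converge toward
decay = 0.99          # EMA decay
w, ema = 0.0, 0.0
dists = []
for _ in range(2000):
    w += 0.01 * (w_star - w)              # stand-in for one optimizer step
    ema = decay * ema + (1 - decay) * w   # EMA update after each step
    dists.append(w_star - ema)            # EMA's remaining distance to optimum
```

In this setting the distance sequence is strictly decreasing, which is the sense in which every extra step "monotonically refines" an EMA-evaluated model.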
Credits
🤖 Generated with Claude Code