Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean) by dexhunter · Pull Request #1285 · openai/parameter-golf

dexhunter · 2026-04-03T05:33:56Z

Summary

val_bpb = 1.0912 (3-seed mean, std 0.0009) | 2.5106 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
WD-quantization synergy: with higher weight decay (0.090) the weights compress about 5% better, which leaves room to quantize ALL 66 layers at int6
All seeds under 16MB with 32K+ margins
No SLOT, no TTT, no eval-time adaptation, fully legal
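The two headline numbers (1.0912 BPB and 2.5106 nats) can be related with a short calculation. This is a sketch under an assumption the PR does not state: that the nats figure is the mean loss per token, so that BPB equals bits per token divided by average bytes per token. The helper name is hypothetical.

```python
import math

# Headline numbers from the summary line above.
VAL_LOSS_NATS = 2.5106  # assumed to be mean loss per token, in nats
VAL_BPB = 1.0912        # bits per byte

def implied_bytes_per_token(loss_nats: float, bpb: float) -> float:
    """Convert nats/token to bits/token, then divide by bits/byte."""
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token / bpb

bytes_per_token = implied_bytes_per_token(VAL_LOSS_NATS, VAL_BPB)
print(f"implied bytes per token: {bytes_per_token:.3f}")
```

Under that assumption the two figures imply roughly 3.3 bytes per token for the validation set.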

Key Innovation: WD-Quantization Synergy

Higher WD (0.090 vs 0.085) → smaller weights → 5% better brotli compression → enough headroom to keep ALL 66 layers at int6 precision. The gain in quantization quality exceeds the BPB cost of the higher WD:

Config	WD	N_INT6	Artifact (bytes)	val_bpb (seed 42)
PR #1260	0.085	60	15,981K	1.09217
PR #1279	0.085	61	15,997K	1.09170
This	0.090	66	15,967K	1.09057
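The chain "smaller weights → better compression" can be illustrated with a toy experiment. This is only a sketch, not the PR's pipeline: it assumes Gaussian weights, a fixed quantization step shared by both runs (the PR's GPTQ scaling may differ), a hypothetical 0.8 shrink factor standing in for the effect of higher weight decay, and `zlib` as a stand-in for brotli.

```python
import random
import zlib

def quantize_int6(weights, step):
    # Round to a fixed grid and clip to the signed 6-bit range [-32, 31].
    return [max(-32, min(31, round(w / step))) for w in weights]

def compressed_size(levels):
    # Shift levels to 0..63 so each fits in one byte, then compress.
    # zlib is a stand-in here; the PR uses brotli.
    return len(zlib.compress(bytes(q + 32 for q in levels), 9))

rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
STEP = 0.1    # hypothetical fixed quantization step
SHRINK = 0.8  # hypothetical shrink factor, not derived from the PR

size_large = compressed_size(quantize_int6(base, STEP))
size_small = compressed_size(quantize_int6([SHRINK * w for w in base], STEP))
print(size_large, size_small, round(size_small / size_large, 3))
```

With a fixed grid, the shrunken weights occupy fewer quantization levels, so their symbol distribution has lower entropy and the compressed output is smaller.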

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	Steps	ms/step	Sliding BPB	val_loss (nats)	Artifact (bytes)
42	5,540	106.5	1.0906	2.50910	15,967,483
0	5,536	106.6	1.0908	2.50973	15,962,242
1337	5,538	106.6	1.0923	2.51309	15,959,253
Mean	5,538	106.6	1.0912	2.51064	15,962,993
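The Mean row and the reported standard deviation can be recomputed from the three seed rows. The sketch below assumes the reported std of 0.0009 is the sample standard deviation (n-1 denominator), which matches the table values.

```python
import statistics

# Per-seed values copied from the table above.
sliding_bpb = {42: 1.0906, 0: 1.0908, 1337: 1.0923}
val_loss = {42: 2.50910, 0: 2.50973, 1337: 2.51309}
artifact_bytes = {42: 15_967_483, 0: 15_962_242, 1337: 15_959_253}

mean_bpb = statistics.mean(sliding_bpb.values())
std_bpb = statistics.stdev(sliding_bpb.values())  # sample std (n-1)
mean_loss = statistics.mean(val_loss.values())
mean_artifact = statistics.mean(artifact_bytes.values())

print(f"mean BPB  {mean_bpb:.4f}  std {std_bpb:.4f}")
print(f"mean loss {mean_loss:.5f} nats")
print(f"mean artifact {mean_artifact:,.0f} bytes")
```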

Changes from PR #1218

	PR #1218	This
val_bpb	1.09785	1.09124 (-0.00661)
Weight decay	0.085	0.090
Optimizer	Muon	MuonEq-R
Depth recurrence	None	Layers 4,5 repeated
Quantization	Mixed	All int6 (66/66)
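"Layers 4,5 repeated" means some blocks run more than once per forward pass while their weights are stored only once. The sketch below shows one possible execution order. The block count (11), the number of passes (2), and the back-to-back ordering are assumptions for illustration, not details taken from the PR.

```python
def recurrent_schedule(num_blocks, repeated, passes=2):
    """Return the order in which block indices run.

    `repeated` must be a contiguous run of block indices; those blocks
    are executed `passes` times in a row, reusing the same weights.
    """
    first, last = repeated[0], repeated[-1]
    schedule = list(range(first))                   # blocks before the loop
    schedule += list(repeated) * passes             # e.g. 4, 5, 4, 5
    schedule += list(range(last + 1, num_blocks))   # blocks after the loop
    return schedule

# Hypothetical 11-block model with blocks 4 and 5 repeated once.
schedule = recurrent_schedule(num_blocks=11, repeated=(4, 5))
print(schedule)
print("effective depth:", len(schedule), "stored blocks:", len(set(schedule)))
```

The point of the design is that effective depth grows while the stored parameter count, and so the artifact size, stays the same.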

Credits

@clarkkev for PR #1218, "Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)" (4096-Vocab + high-WD architecture)
@abaybektursun for PR #1019, "Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)" (GPTQ + XSA + BigramHash baseline)
@msisovic for PR #1204, "Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA" (depth recurrence concept)
@dexhunter for PR #1260, "Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)", and PR #1279, "Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean)" (MuonEq-R + recurrence + mixed quant)

Test plan

3-seed verification (42, 0, 1337) — all pass
All under 16MB (min margin: 32,517)
Fourth seed tested (seed 7 also fits, at 15,970,676 bytes)
No TTT, no SLOT
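The size check in the test plan is simple arithmetic. The sketch below assumes the 16MB limit means 16,000,000 bytes (decimal), which is consistent with the stated minimum margin of 32,517.

```python
LIMIT_BYTES = 16_000_000  # assumed decimal definition of the 16MB cap

# Artifact sizes for the three record seeds, from the results table.
artifacts = {42: 15_967_483, 0: 15_962_242, 1337: 15_959_253}
SEED7_BYTES = 15_970_676  # extra seed mentioned in the test plan

margins = {seed: LIMIT_BYTES - size for seed, size in artifacts.items()}
min_margin = min(margins.values())
seed7_margin = LIMIT_BYTES - SEED7_BYTES

for seed, margin in margins.items():
    print(f"seed {seed}: {margin:,} bytes under the limit")
print(f"min margin over record seeds: {min_margin:,}")
print(f"seed 7 margin: {seed7_margin:,}")
```

Note that the 32,517 figure covers the three record seeds. Seed 7 also fits but with a smaller margin.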

@clarkkev

….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the BPB cost of the higher WD.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.

@clarkkev

…b 1.0900 (3-seed mean)

3-layer depth recurrence (layers 3,4,5) with WD-LR synergy: higher WD (0.095) compresses for all-int6 headroom, higher MLR (0.022) recovers quality. All 66 layers at int6 precision.

3-seed mean: 1.0900 BPB / 2.5077 nats (seeds 42, 0, 7)
All seeds under 16MB with 36K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Improves PR openai#1285 (1.0912) by 0.0013 BPB. Beats PR openai#1218 by 0.0079. Built on PR openai#1218 by @clarkkev.

Hypothesis: Polar Express 4-step minimax NS on top of full PR openai#1334 stack
Expected delta: ~-0.001 to -0.002 BPB from 1.0897 baseline
Key changes vs PR openai#1334:
- Polar Express Newton-Schulz (4-step minimax coefficients, arXiv:2505.16932)
- MATRIX_LR=0.022 (validated for WD=0.090)
- MUON_WD=0.090 (PR openai#1285/1334 optimal for 2-layer recurrence)
- NoPE explicitly disabled (nope_every_n=0) after critique
- Trackio experiment tracking added
Stack: SP4096 vocab + MLP 4x + WD=0.090 + MuonEq-R + QK-Gain 5.0 + Depth recurrence L4-5 (step 3000) + Parallel residuals L7+ + Brotli

…el spectral test-time training
Base: clarkkev openai#1218 (1.0974 BPB, 4096 vocab, brotli, 34M params)
Added: depth recurrence L4,5 (from openai#1285), MuonEq-R, WD=0.09
Novel: Spectral TTT, which adapts singular values at eval time (8192 params)
Target: ~1.085 BPB

…1285 base

dexhunter mentioned this pull request Apr 4, 2026

Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 — val_bpb 1.0900 (3-seed mean) #1331

Open

This was referenced Apr 4, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean) #1333

Open

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) #1334

Open

himanshudongre mentioned this pull request Apr 4, 2026

Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation #1341

Open

AnubhavBharadwaaj added a commit to AnubhavBharadwaaj/parameter-golf that referenced this pull request Apr 6, 2026

1.0898 BPB: Pre-Quant TTT + ETLB (Eval-Time Logit Bias) on PR openai#…

f154b6f

…1285 base

AnubhavBharadwaaj mentioned this pull request Apr 6, 2026

Record: Pre-Quant TTT + ETLB: Eval-Time Logit Bias for Neural Language Model Compression 1.0898 BPB on PR #1285 base #1399

Open

dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-wd090-allint6
