Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)#1296

Open
aryanbhosale wants to merge 4 commits into openai:main from aryanbhosale:submission/sp4096-depth-recurrence-muoneqr
Conversation

aryanbhosale commented on Apr 3, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0

val_bpb = 1.0897 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
| --- | --- | --- |
| 42 | 1.0894 | 15,999,165 |
| 314 | 1.0898 | 15,997,318 |
| 999 | 1.0899 | 15,990,607 |
| **Mean** | **1.0897** | |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0250 BPB.

Key Techniques

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 (@clarkkev), PR #1285 (@dexhunter)
  2. Depth Recurrence (layers 4,5) — PR #1204 (@msisovic), PR #1260 (@dexhunter)
  3. Parallel Residuals (from layer 7) — PR #1204 (@msisovic), PR #1289 (@MatoTeziTanka)
  4. MuonEq-R — arXiv:2603.28254, PR #1260 (@dexhunter)
  5. QK-Gain 5.0 — PR #1217 (@bigbag, non-record)
  6. Full GPTQ int6 + Brotli + compressed wrapper — LZMA self-extracting (~25 KB code)
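The self-extracting wrapper in item 6 can be sketched as follows. This is a minimal stdlib-only illustration of the idea (LZMA-compress the submission code and ship a tiny stub that decompresses and executes it at load time); the actual ~25 KB wrapper format used in this PR is not shown here, and `make_self_extracting` is a hypothetical name.

```python
import base64
import lzma

def make_self_extracting(payload: bytes) -> str:
    """Wrap arbitrary source bytes in a small Python stub that
    decompresses and runs them on import (illustrative sketch only)."""
    blob = base64.b85encode(lzma.compress(payload, preset=9)).decode()
    return (
        "import base64, lzma\n"
        f"CODE = lzma.decompress(base64.b85decode({blob!r}))\n"
        "exec(compile(CODE, '<wrapped>', 'exec'))\n"
    )

# Repetitive source compresses well, so the stub undercuts the raw file
source = b"print('hello from the unwrapped model code')\n" * 50
stub = make_self_extracting(source)
```

The byte budget freed this way (57 KB per the commit below) can then be spent on weight precision instead of code.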

Compliance

Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
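The `RECUR_LAYERS` / `RECUR_START_STEP` knobs in the command above can be read as a virtual layer schedule: after the start step, the listed layers are visited twice per forward pass with shared weights, buying depth without adding parameters. A minimal sketch, assuming a hypothetical 12-layer stack (the real wiring in `train_gpt.py` may differ):

```python
def virtual_layer_schedule(n_layers, recur_layers, step, recur_start_step):
    """Expand physical layers into the virtual forward-pass order.
    Before recur_start_step each layer runs once; afterwards the layers
    in recur_layers are repeated immediately, reusing the same weights."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if step >= recur_start_step and i in recur_layers:
            order.append(i)  # second pass through the same block
    return order

# RECUR_LAYERS=4,5 and RECUR_START_STEP=3000, at training step 3500
print(virtual_layer_schedule(12, {4, 5}, step=3500, recur_start_step=3000))
# → [0, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11]
```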

Credits

…0940 (3-seed mean)

4096-vocab + MLP 4x + WD 0.090 + depth recurrence (layers 4,5) + MuonEq-R
+ full GPTQ int6 + brotli + selective pruning.

3-seed mean: 1.0940 BPB, beating merged SOTA (PR openai#1019, 1.1147 BPB) by 0.0208 BPB.
…d mean)

LZMA self-extracting code wrapper (24KB vs 81KB) frees 57KB for model precision.
No pruning needed. 3-seed mean improves from 1.0940 to 1.0926.
aryanbhosale changed the title from "Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0940 (3-seed mean)" to "Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)" on Apr 3, 2026
Added parallel residuals from layer 7+ (separate attn/MLP lanes).
3-seed mean improves from 1.0926 to 1.0904.
aryanbhosale changed the title from "Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)" to "Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean)" on Apr 3, 2026
QK-Gain from 4.0 to 5.0 plus parallel residuals and depth recurrence.
3-seed mean: 1.0897 BPB (std 0.0003), delta -0.0250 vs merged SOTA.
aryanbhosale changed the title from "Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean)" to "Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)" on Apr 3, 2026
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
Port depth recurrence from PR openai#1290 and parallel residuals from PR openai#1296.
- Depth recurrence: layers 3,4 repeated in forward pass via virtual layer mapping
- Parallel residuals: attn+mlp computed in parallel from layer 6 onward
- Configurable via RECUR_LAYERS, RECUR_START_STEP, PARALLEL_START_LAYER env vars
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
Ports parallel residuals from PR openai#1296 to openai#1290 base:
- Block.__init__ accepts parallel flag
- Block.forward() computes attn+mlp in parallel when parallel=True
- GPT.__init__ passes parallel_start_layer to Block constructors
- Layers 7-10 run parallel, layers 0-6 sequential (default PARALLEL_START_LAYER=7)
- Both base_model and eval_model wired up
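The sequential-vs-parallel lane distinction in this port can be shown with a toy scalar block. This is a hypothetical sketch, not the actual `Block.forward()` from the repo: `attn`, `mlp`, and `norm` stand in for the real sublayers, and the point is only that the parallel path feeds both lanes the same pre-block residual stream.

```python
def block_forward(x, attn, mlp, norm, parallel):
    """One transformer-block step (toy scalar version).
    Sequential: the MLP reads the stream *after* attention was added.
    Parallel: attention and MLP both read the pre-block stream and
    their outputs are summed into the residual together."""
    if parallel:
        return x + attn(norm(x)) + mlp(norm(x))
    h = x + attn(norm(x))
    return h + mlp(norm(h))

# Toy sublayers: attn doubles its input, mlp adds one, norm is identity
attn = lambda v: 2 * v
mlp = lambda v: v + 1
norm = lambda v: v

print(block_forward(1.0, attn, mlp, norm, parallel=False))  # → 7.0
print(block_forward(1.0, attn, mlp, norm, parallel=True))   # → 5.0
```

The two paths diverge exactly because the parallel MLP never sees the attention output, which is what lets both lanes be computed concurrently.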
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
- QK_GAIN_INIT: 1.5 -> 5.0 (matches openai#1296 proven config)
- WARMDOWN_ITERS: already 4000 (matches openai#1290 run command)
- MULTIRES_ENABLED: 1 -> 0 (multi-res failed: only 1.13x speedup)
- BIGRAM: revert to 2048x128 (3072x112 exceeded 16MB artifact limit)
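One plausible reading of the `QK_GAIN_INIT` knob above is a learnable multiplier on the attention logits, on top of the usual 1/sqrt(d) temperature; a larger init (1.5 → 5.0 here) sharpens attention early in training. The exact placement of the gain in `train_gpt.py` is not shown in this PR, so treat this as an assumption; `qk_logit` is a hypothetical name.

```python
import math

def qk_logit(q, k, qk_gain, d_head):
    """Scaled dot-product attention logit with an extra gain factor
    (sketch of one possible QK_GAIN_INIT semantics, not the repo's code)."""
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return qk_gain * dot / math.sqrt(d_head)

q = [0.5, -0.25, 0.1, 0.2]
k = [0.4, 0.3, -0.2, 0.1]
print(round(qk_logit(q, k, qk_gain=5.0, d_head=4), 4))  # → 0.3125
```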
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 5, 2026
