Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)#1296

Open
aryanbhosale wants to merge 4 commits into openai:main from aryanbhosale:submission/sp4096-depth-recurrence-muoneqr
Conversation

aryanbhosale commented on Apr 3, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0

val_bpb = 1.0897 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
| --- | --- | --- |
| 42 | 1.0894 | 15,999,165 |
| 314 | 1.0898 | 15,997,318 |
| 999 | 1.0899 | 15,990,607 |
| **Mean** | **1.0897** | |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0250 BPB.

Key Techniques

  1. 4096-Vocab + MLP 4x + WD 0.090 — PR #1218 (@clarkkev), PR #1285 (@dexhunter)
  2. Depth Recurrence (layers 4,5) — PR #1204 (@msisovic), PR #1260 (@dexhunter)
  3. Parallel Residuals (from layer 7) — PR #1204 (@msisovic), PR #1289 (@MatoTeziTanka)
  4. MuonEq-R — arXiv:2603.28254, PR #1260 (@dexhunter)
  5. QK-Gain 5.0 — PR #1217 (@bigbag, non-record)
  6. Full GPTQ int6 + Brotli + compressed wrapper — LZMA self-extracting (~25 KB code)
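The self-extracting wrapper in item 6 can be sketched as follows. This is a minimal stdlib-only illustration of the idea (LZMA-compress the submission code and ship a tiny stub that decompresses and executes it at load time); the actual ~25 KB wrapper format used in this PR is not shown here, and `make_self_extracting` is a hypothetical name.

```python
import base64
import lzma

def make_self_extracting(payload: bytes) -> str:
    """Wrap arbitrary source bytes in a small Python stub that
    decompresses and runs them on import (illustrative sketch only)."""
    blob = base64.b85encode(lzma.compress(payload, preset=9)).decode()
    return (
        "import base64, lzma\n"
        f"CODE = lzma.decompress(base64.b85decode({blob!r}))\n"
        "exec(compile(CODE, '<wrapped>', 'exec'))\n"
    )

# Repetitive source compresses well, so the stub undercuts the raw file
source = b"print('hello from the unwrapped model code')\n" * 50
stub = make_self_extracting(source)
```

The byte budget freed this way (57 KB per the commit below) can then be spent on weight precision instead of code.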

Compliance

Reproduction

```bash
pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
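The `RECUR_LAYERS` / `RECUR_START_STEP` knobs in the command above can be read as a virtual layer schedule: after the start step, the listed layers are visited twice per forward pass with shared weights, buying depth without adding parameters. A minimal sketch, assuming a hypothetical 12-layer stack (the real wiring in `train_gpt.py` may differ):

```python
def virtual_layer_schedule(n_layers, recur_layers, step, recur_start_step):
    """Expand physical layers into the virtual forward-pass order.
    Before recur_start_step each layer runs once; afterwards the layers
    in recur_layers are repeated immediately, reusing the same weights."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if step >= recur_start_step and i in recur_layers:
            order.append(i)  # second pass through the same block
    return order

# RECUR_LAYERS=4,5 and RECUR_START_STEP=3000, at training step 3500
print(virtual_layer_schedule(12, {4, 5}, step=3500, recur_start_step=3000))
# → [0, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11]
```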

Credits

…0940 (3-seed mean)

4096-vocab + MLP 4x + WD 0.090 + depth recurrence (layers 4,5) + MuonEq-R
+ full GPTQ int6 + brotli + selective pruning.

3-seed mean: 1.0940 BPB, beating merged SOTA (PR openai#1019, 1.1147 BPB) by 0.0208 BPB.
…d mean)

LZMA self-extracting code wrapper (24KB vs 81KB) frees 57KB for model precision.
No pruning needed. 3-seed mean improves from 1.0940 to 1.0926.
aryanbhosale changed the title from "Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0940 (3-seed mean)" to "Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)" on Apr 3, 2026
Added parallel residuals from layer 7+ (separate attn/MLP lanes).
3-seed mean improves from 1.0926 to 1.0904.
aryanbhosale changed the title from "Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)" to "Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean)" on Apr 3, 2026
QK-Gain from 4.0 to 5.0 plus parallel residuals and depth recurrence.
3-seed mean: 1.0897 BPB (std 0.0003), delta -0.0250 vs merged SOTA.
aryanbhosale changed the title from "Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean)" to "Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)" on Apr 3, 2026
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
Port depth recurrence from PR openai#1290 and parallel residuals from PR openai#1296.
- Depth recurrence: layers 3,4 repeated in forward pass via virtual layer mapping
- Parallel residuals: attn+mlp computed in parallel from layer 6 onward
- Configurable via RECUR_LAYERS, RECUR_START_STEP, PARALLEL_START_LAYER env vars
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 3, 2026
Ports parallel residuals from PR openai#1296 to openai#1290 base:
- Block.__init__ accepts parallel flag
- Block.forward() computes attn+mlp in parallel when parallel=True
- GPT.__init__ passes parallel_start_layer to Block constructors
- Layers 7-10 run parallel, layers 0-6 sequential (default PARALLEL_START_LAYER=7)
- Both base_model and eval_model wired up
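The sequential-vs-parallel lane distinction in this port can be shown with a toy scalar block. This is a hypothetical sketch, not the actual `Block.forward()` from the repo: `attn`, `mlp`, and `norm` stand in for the real sublayers, and the point is only that the parallel path feeds both lanes the same pre-block residual stream.

```python
def block_forward(x, attn, mlp, norm, parallel):
    """One transformer-block step (toy scalar version).
    Sequential: the MLP reads the stream *after* attention was added.
    Parallel: attention and MLP both read the pre-block stream and
    their outputs are summed into the residual together."""
    if parallel:
        return x + attn(norm(x)) + mlp(norm(x))
    h = x + attn(norm(x))
    return h + mlp(norm(h))

# Toy sublayers: attn doubles its input, mlp adds one, norm is identity
attn = lambda v: 2 * v
mlp = lambda v: v + 1
norm = lambda v: v

print(block_forward(1.0, attn, mlp, norm, parallel=False))  # → 7.0
print(block_forward(1.0, attn, mlp, norm, parallel=True))   # → 5.0
```

The two paths diverge exactly because the parallel MLP never sees the attention output, which is what lets both lanes be computed concurrently.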
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 4, 2026
- QK_GAIN_INIT: 1.5 -> 5.0 (matches openai#1296 proven config)
- WARMDOWN_ITERS: already 4000 (matches openai#1290 run command)
- MULTIRES_ENABLED: 1 -> 0 (multi-res failed: only 1.13x speedup)
- BIGRAM: revert to 2048x128 (3072x112 exceeded 16MB artifact limit)
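One plausible reading of the `QK_GAIN_INIT` knob above is a learnable multiplier on the attention logits, on top of the usual 1/sqrt(d) temperature; a larger init (1.5 → 5.0 here) sharpens attention early in training. The exact placement of the gain in `train_gpt.py` is not shown in this PR, so treat this as an assumption; `qk_logit` is a hypothetical name.

```python
import math

def qk_logit(q, k, qk_gain, d_head):
    """Scaled dot-product attention logit with an extra gain factor
    (sketch of one possible QK_GAIN_INIT semantics, not the repo's code)."""
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return qk_gain * dot / math.sqrt(d_head)

q = [0.5, -0.25, 0.1, 0.2]
k = [0.4, 0.3, -0.2, 0.1]
print(round(qk_logit(q, k, qk_gain=5.0, d_head=4), 4))  # → 0.3125
```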
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 5, 2026
