
Record: 12L RecycledCore Int5 — val_bpb 1.1464 (seed 1337)#1573

Open
shivangbaveja wants to merge 1 commit into openai:main from shivangbaveja:sbaveja/submission/11L-4xMLP-RecycledCore-ParallelResid

Conversation

@shivangbaveja

Summary

Single-seed (1337) run on 8xH100 SXM, 600s wallclock. 12-layer model with recycled-core depth (layers 3,4 replayed once, giving 14 virtual layers), gated attention, value residual, and int5 quantization on both MLP and attention weights to fit under 16 MB.

| Metric | Value |
| --- | --- |
| Int5 sliding-window BPB | 1.1464 |
| Int5 roundtrip BPB | 1.1699 |
| Pre-quant BPB | 1.1329 |
| Post-EMA BPB | 1.1333 |
| Steps | 5,449 |
| ms/step | 110.14 |
| Wallclock | 600s |
| Artifact size | 15,925,822 bytes (15.93 MB) |

Key Techniques

  • 12L recycled-core (default NUM_LAYERS=12): layers 3,4 replayed once for 14 virtual layers from 12 physical, with per-application learned modulation (scale+bias)
  • LeakyReLU(0.5)^2 MLP activation with 3x expansion (512->1536->512)
  • Gated attention + value residual: per-head sigmoid gate on attention output, first-layer values blended into all subsequent layers
  • Int5 quantization on both MLP and attention weights (GPTQ-lite per-row, pack 8 values -> 5 bytes) + LZMA preset=9
  • max-autotune compile with cudagraph_trees=False for tied embeddings
  • SWA (every 50 steps, 13 snapshots) + EMA (decay=0.997)
  • XSA last 4 layers, partial RoPE (16/64 dims), LN scale, U-Net skips, BigramHash (2048), SmearGate, value embeddings (layers 10,11)
  • Muon optimizer (lr=0.025, WD=0.04) + Adam (embeddings/scalars, WD=0.04)
  • Step-based warmdown over last 3500 steps
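The recycled-core depth above can be sketched as a layer schedule: 12 physical blocks are traversed with layers 3 and 4 applied a second time, yielding 14 virtual layers. The placement of the replay and the helper names below are assumptions for illustration, not the record's actual code:

```python
def build_layer_schedule(num_layers=12, recycled=(3, 4)):
    """Return the order in which physical layers are applied.

    Assumption: the recycled block (layers 3, 4) is replayed immediately
    after its first pass; the record only specifies that these layers run
    twice, not where the replay sits.
    """
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == recycled[-1]:
            schedule.extend(recycled)  # second application of the recycled core
    return schedule

schedule = build_layer_schedule()  # 14 entries; layers 3 and 4 appear twice
```

Per the bullet above, each application of a recycled layer would carry its own learned modulation (scale+bias), so the two passes through the same weights are not identical functions.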
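The MLP activation is LeakyReLU with slope 0.5, squared. A minimal scalar sketch (the real model applies this elementwise to tensors):

```python
def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring.

    Squaring keeps the output non-negative while the 0.5 slope
    preserves gradient signal for negative inputs.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```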
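The int5 bit-packing (8 five-bit codes into 5 bytes, i.e. 40 bits) can be sketched in pure Python. The quantization step itself (GPTQ-lite per-row) is omitted, and the little-endian bit layout is an assumption:

```python
def pack_int5(codes):
    """Pack 8 quantized codes in [0, 31] into 5 bytes (8 * 5 = 40 bits)."""
    assert len(codes) == 8 and all(0 <= c < 32 for c in codes)
    bits = 0
    for i, c in enumerate(codes):
        bits |= c << (5 * i)          # assumed little-endian bit layout
    return bits.to_bytes(5, "little")

def unpack_int5(raw):
    """Inverse of pack_int5: recover the 8 five-bit codes."""
    bits = int.from_bytes(raw, "little")
    return [(bits >> (5 * i)) & 0x1F for i in range(8)]
```

Per the bullet above, the packed byte stream would then be LZMA-compressed (preset=9) to produce the sub-16 MB artifact.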
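The step-based warmdown keeps the learning rate flat and then decays it over the final 3,500 steps. A linear decay to zero is an assumption here; the record only states the warmdown window:

```python
def lr_multiplier(step, total_steps=5449, warmdown_steps=3500):
    """LR scale factor: 1.0 until the warmdown window, then linear to 0."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return remaining / warmdown_steps
```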

Results

| Seed | step_avg | Steps | Pre-quant BPB | Int5 roundtrip BPB | Int5 sliding BPB | Artifact (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| 1337 | 110.14 ms | 5,449 | 1.1329 | 1.1699 | 1.1464 | 15,925,822 |

Single-seed run. Seeds 42 and 2025 not yet completed.

Reproduction

```shell
# From repo root (all params are defaults at commit bf17df6):
SEED=1337 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-12_11L_4xMLP_RecycledCore_ParallelResid_Int5/train_gpt.py
```

Credits

