
[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866#1471

Open
X-Abhishek-X wants to merge 1 commit into openai:main from X-Abhishek-X:record/v6-sp8192-sdclip-1.0866

Conversation

@X-Abhishek-X

Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866

val_bpb: 1.0866 (3-seed mean, std 0.0007) | ~15.98 MB | 8×H100 SXM, 590s

3-Seed Results (8×H100 80GB SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Pruning | Artifact |
|------|---------------|-------------------|---------|----------|
| 42 | 1.0874 | **1.0873** | None | 15,981,300 B |
| 1337 | 1.0865 | **1.0866** | None | 15,978,870 B |
| 2024 | | **1.0859** | None | |
| Mean | | **1.0866** (std 0.0007) | Zero | |

Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0281 BPB.

Key Changes (over PR #1445, this author)

| Change | PR #1445 | This | Impact |
|--------|----------|------|--------|
| Tokenizer | SP4096 | SP8192 | Larger vocab, better context |
| Quantization | Percentile search | SDClip (c = k·std) | Zero pruning, better rate-distortion |

SDClip Quantization

Replaces multi-percentile clip search with clip = k · std(row) (PR #1394):

  • k=12.85 for int6 matrices, k=20.0 for int8 embeddings
  • Directly accounts for compressed size, not just reconstruction error
  • One GPTQ pass per matrix instead of 5
  • Result: zero selective pruning — model fits natively under 16MB
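The per-row SDClip rule can be sketched as follows. This is an illustrative reconstruction of the idea (clip each row at k·std, then quantize symmetrically), assuming symmetric signed-integer quantization; the function names are hypothetical and not the ones in train_gpt.py:

```python
import torch


def sdclip_quantize_per_row(w: torch.Tensor, k: float = 12.85, bits: int = 6):
    """Clip each row at k * std(row), then quantize symmetrically to signed ints.

    Hypothetical sketch of the SDClip rule described in the PR; the actual
    train_gpt.py implementation may differ in details.
    """
    qmax = 2 ** (bits - 1) - 1                        # e.g. 31 for int6
    clip = k * w.std(dim=1, keepdim=True)             # per-row clip threshold
    scale = (clip / qmax).clamp_min(1e-12)            # per-row scale factor
    clipped = w.clamp(-clip, clip)
    q = torch.round(clipped / scale).to(torch.int8)   # int6 values stored in int8
    return q, scale


def sdclip_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

Because a single statistic (the row std) replaces the percentile search, only one clip candidate per matrix needs a GPTQ pass, which is where the 5×-to-1 pass reduction above comes from.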

Full Stack

| Parameter | Value | Source |
|-----------|-------|--------|
| Tokenizer | SP8192 | This work |
| SDClip k (matrices/embed) | 12.85 / 20.0 | PR #1394, this work |
| Recurrence layers | 3,4,5 (14 virtual) | PR #1331 |
| Weight decay | 0.095 | PR #1331 |
| Matrix LR | 0.022 | PR #1331 |
| EMA decay | 0.9965 | PR #1421 (this author) |
| Recurrence start step | 2000 | PR #1445 (this author) |
| Warmdown fraction | 0.72 | PR #1445 (this author) |
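One way to read the 0.72 warmdown fraction is a trapezoidal learning-rate schedule: full LR for the first 28% of steps, then linear decay to zero over the final 72%. A hypothetical sketch under that assumption (the actual schedule in train_gpt.py may differ):

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    """Assumed trapezoidal schedule: constant LR until the warmdown window
    opens, then linear decay to zero over the final `warmdown_frac` of
    training. Illustrative only, not the script's actual schedule."""
    warmdown_start = round(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```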

Architecture

  • 11L, 512-dim, 8 heads (4 KV), depth recurrence (3,4,5), 14 virtual layers
  • Skip gates, parallel residuals from layer 7, QK-Gain 5.0
  • XSA all 11 layers, LeakyReLU(0.5)², VE128 (layers 9,10)
  • Tied embeddings, logit softcap=30.0
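The logit softcap listed above is commonly implemented as a scaled tanh, which bounds logits to (−cap, cap) while staying near-identity for small values. A minimal sketch, assuming the standard tanh formulation rather than the exact code in train_gpt.py:

```python
import torch


def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Standard tanh soft-capping: smooth, bounded to (-cap, cap),
    # approximately identity when |logits| << cap.
    return cap * torch.tanh(logits / cap)
```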

Training

  • FlashAttention 3, Muon (lr=0.022, WD=0.095), Adam/AdamW (fused=True)
  • Warmdown: 72%, EMA=0.9965, Wallclock: 590s
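The EMA at decay 0.9965 is a standard exponential moving average over the weights. A plain-dict sketch of one update step, for illustration only (the script's actual EMA bookkeeping over tensors may differ):

```python
def ema_update(ema_params: dict, model_params: dict, decay: float = 0.9965) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    Illustrative sketch over a plain dict of values; real implementations
    typically hold a shadow copy of the model's parameter tensors.
    """
    for name, p in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```

At decay 0.9965 the effective averaging window is roughly 1/(1−0.9965) ≈ 286 steps.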

Quantization

  • Full Hessian GPTQ + Cholesky + actorder
  • SDClip (c = k·std) — int6 matrices, int8 embeddings
  • Brotli compression, zero selective pruning

Run Command

```shell
SEED=42 VOCAB_SIZE=8192 \
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
RECUR_START_STEP=2000 WARMDOWN_FRAC=0.72 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

…l_bpb 1.0866

3-seed mean: 1.0866 BPB (sliding window stride=64)
Beats merged SOTA (1.1147) by 0.0281 BPB.

SP8192 tokenizer with SDClip quantization (c=k*std),
3-layer recurrence (3,4,5), EMA 0.9965, WD=0.095,
early recurrence (step 2000), extended warmdown (72%).
Zero selective pruning across all seeds.

Seeds: 42 (1.0873), 1337 (1.0866), 2024 (1.0859)
All artifacts under 16MB. 8xH100 SXM, 590s training.
Copilot AI review requested due to automatic review settings April 8, 2026 10:20

Copilot AI left a comment


Pull request overview

Adds a new Track A (10min / 16MB) record snapshot for the “SP8192 + SDClip + 3-layer depth recurrence + EMA 0.9965” run, including the exact training script, seed logs, and leaderboard metadata for reproducibility.

Changes:

  • Add a standalone train_gpt.py implementing SP8192 + SDClip quantization + 3-layer depth recurrence + EMA=0.9965 configuration.
  • Add training logs for the reported seeds (42 / 1337 / 2024) plus a canonical train.log.
  • Add record metadata (submission.json) and a human-readable writeup (README.md).

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 8 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_gpt.py | Standalone training/eval/quantization script for the record configuration |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train.log | Canonical run log (seed 42) capturing hyperparams, training, GPTQ, and final metrics |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed42.log | Full seed-42 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed1337.log | Full seed-1337 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed2024.log | Full seed-2024 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/submission.json | Leaderboard metadata (mean metrics + artifact size) |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/README.md | Record summary, reported results table, and reproduction command |


Comment on lines +155 to +163
```python
def log(msg, console: bool = True) -> None:
    if _logger_hparams is None:
        print(msg)
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```

Copilot AI Apr 8, 2026


log() prints when _logger_hparams is None but then immediately dereferences _logger_hparams.is_main_process, which would raise AttributeError if log() is ever called before set_logging_hparams() (or if set_logging_hparams() fails). Add an early return after the fallback print, or guard the second if with an else.
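The early-return fix the review asks for could look like the following sketch (the `_logger_hparams = None` default is assumed here to make the example self-contained):

```python
_logger_hparams = None  # assumed module-level default before set_logging_hparams()


def log(msg, console: bool = True) -> None:
    # Fallback path: logging not configured yet. The early return prevents
    # the AttributeError from dereferencing _logger_hparams below.
    if _logger_hparams is None:
        print(msg)
        return
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```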

Comment on lines +1280 to +1288
```python
def serialize(h: Hyperparameters, base_model: torch.nn.Module, code: str) -> int:
    model_bytes = None
    code_bytes = len(code.encode("utf-8"))
    if h.is_main_process:
        torch.save(base_model.state_dict(), h.model_path)
        model_bytes = os.path.getsize(h.model_path)
        log(f"Serialized model: {model_bytes} bytes")
        log(f"Code size: {code_bytes} bytes")
```


Copilot AI Apr 8, 2026


serialize() is annotated as returning int but never returns a value. Either return something meaningful (e.g., bytes_total) or change the annotation to -> None to match behavior.
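A sketch of the reviewer's first option, returning the combined byte count so the `-> int` annotation matches behavior. The log calls are dropped here to keep the example self-contained, and the `Hyperparameters` fields follow the quoted snippet rather than the full script:

```python
import os
import torch


def serialize(h, base_model: torch.nn.Module, code: str) -> int:
    """Save the model state dict and return model + code bytes total.
    Sketch of the suggested fix; field names follow the quoted snippet."""
    code_bytes = len(code.encode("utf-8"))
    model_bytes = 0
    if h.is_main_process:
        torch.save(base_model.state_dict(), h.model_path)
        model_bytes = os.path.getsize(h.model_path)
    return model_bytes + code_bytes
```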

Comment on lines +1711 to +1719
```python
def train_model(h: Hyperparameters, device: torch.device, val_data: ValidationData) -> None:
    # Set up model
    base_model = GPT(h).to(device).bfloat16()
    restore_fp32_params(base_model)
    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
    if h.distributed:
        model = DDP(compiled_model, device_ids=[h.local_rank], broadcast_buffers=False)
    else:
        model = compiled_model
```

Copilot AI Apr 8, 2026


train_model() is annotated as returning None, but it returns (base_model, compiled_model). Update the return annotation (and ideally the docstring/comment) to reflect the actual return type to avoid confusion and make static tooling happier.

Comment on lines +1124 to +1168
```python
) -> tuple[dict[str, Tensor], dict[str, object]]:
    """Mixed quantization using full GPTQ for layers with Hessians, fallback to clip-search."""
    result: dict[str, Tensor] = {}
    meta: dict[str, object] = {}
    gptq_count = 0
    fallback_count = 0

    for name, tensor in state_dict.items():
        t = tensor.detach().cpu().contiguous()
        cat = classify_param(name)

        if not t.is_floating_point() or t.numel() <= 65536:
            result[name] = t.to(torch.float16) if t.is_floating_point() else t
            meta[name] = "passthrough"
            continue

        if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
            result[name] = t.float()
            meta[name] = "passthrough_ctrl"
            continue

        if cat in int6_cats and t.ndim == 2:
            if name in hessians:
                q, s = gptq_quantize_weight(t, hessians[name], sdclip_k=sdclip_k)
                gptq_count += 1
                meta[name] = {"type": "int6", "method": "gptq_sdclip"}
            else:
                q, s = quantize_int6_per_row(t, sdclip_k=sdclip_k)
                fallback_count += 1
                meta[name] = {"type": "int6", "method": "sdclip"}
            result[name + ".q"] = q
            result[name + ".scale"] = s
        elif cat in int6_cats and t.ndim >= 1:
            q, s = quantize_int6_per_row(t, sdclip_k=sdclip_k)
            result[name + ".q"] = q
            result[name + ".scale"] = s
            meta[name] = {"type": "int6"}
        else:
            q, s = quantize_float_tensor(t, sdclip_k_embed=sdclip_k_embed)
            result[name + ".q"] = q
            result[name + ".scale"] = s
            meta[name] = {"type": "int8_sdclip"}

    log(f"GPTQ quantization: {gptq_count} layers with full GPTQ, {fallback_count} fallback to clip-search")
    return result, meta
```

Copilot AI Apr 8, 2026


The docstring and log message say the non-GPTQ path is a “fallback to clip-search”, but the code actually falls back to SDClip (quantize_int6_per_row(..., sdclip_k=...)). Please update the wording so logs/metadata accurately reflect what happened during quantization.

```python
    _HAS_BROTLI = True
except ImportError:
    _HAS_BROTLI = False
```


Copilot AI Apr 8, 2026


_HAS_BROTLI is set in the import guard but never used (compression always does an inline import). Either remove _HAS_BROTLI or use it to raise a clearer error when COMPRESSOR=brotli but the module isn’t installed.

Suggested change

```diff
+if os.environ.get("COMPRESSOR", "").strip().lower() == "brotli" and not _HAS_BROTLI:
+    raise ImportError(
+        "COMPRESSOR is set to 'brotli' but the optional 'brotli' module is not installed. "
+        "Install 'brotli' or choose a different compressor."
+    )
```

```
|------|---------------|-------------------|---------|----------|
| 42 | 1.0874 | **1.0873** | None | 15,981,300 B |
| 1337 | 1.0865 | **1.0866** | None | 15,978,870 B |
| 2024 | — | **1.0859** | None | — |
```

Copilot AI Apr 8, 2026


The seed 2024 row is missing the pre-quant BPB and artifact size, but train_seed2024.log includes both (pre-quantization post-ema val_bpb: 1.08623375 and Total submission size ...: 15,975,819 bytes). Please fill these in (or remove the column) so the README matches the provided logs.

Suggested change

```diff
-| 2024 | | **1.0859** | None | |
+| 2024 | 1.0862 | **1.0859** | None | 15,975,819 B |
```

Comment on lines +89 to +90
```shell
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
```

Copilot AI Apr 8, 2026


The run command uses DATA_PATH and TOKENIZER_PATH, but this train_gpt.py reads DATA_DIR and constructs datasets_dir/tokenizer_path internally (it does not read DATA_PATH/TOKENIZER_PATH). Update the command to use the env vars the script actually consumes so reproduction works when data isn’t under ./data/.

Suggested change

```diff
-DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
-TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
+DATA_DIR=./data/ \
```

```json
"date": "2026-04-08T00:00:00Z",
"val_loss": 2.80668370,
"val_bpb": 1.08655472,
"bytes_total": 15978870
```

Copilot AI Apr 8, 2026


This track’s submission.json files consistently include bytes_code alongside bytes_total (e.g., records/track_10min_16mb/2026-03-17_NaiveBaseline/submission.json:10). Consider adding bytes_code here as well to match the established schema and make size breakdowns comparable.

Suggested change

```diff
-"bytes_total": 15978870
+"bytes_total": 15978870,
+"bytes_code": 15978870
```

taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 8, 2026
…IP), L05_NORM_PCT still world-novel, flagged L11_DYN_LYAPUNOV vs PR openai#1471 SDClip

Re-audit findings:
- OPT_RIEMANNIAN_GRAM_QKV → comp-novel (Tilde Gram-Space Manifold Muon + arXiv:2603.09697 Mousse, just shipped 14 min before demotion)
- L06_ASYMMETRIC_SKIP_INIT → comp-novel (Nick Ryan May 2024 blog explicitly tests 0.5 half-init schedule, 2-yr prior art)
- L05_NORM_PCT_DROPOUT → STILL world-novel (0 hits on norm-percentile feature dropout)

Comp PR audit (last 3h, openai#1467-openai#1473): PR openai#1471 introduces SDClip — flagged for L11 DYN_LYAPUNOV adjacency review next C180.

Verified world-novels now 4 (down from claimed 8): L05 NORM_PCT_DROPOUT (validated), L09 NGR_LOG_FREQ_INV (shipped), L09 CTX_PARTITIONED_TAB (shipped), L10 CMP_QUANT_VALUE_DEDUP (shipped).

Spend: ~$13.60 / $25 soft cap NORMAL.
Win likelihood: 25% (down from 30%).

LESSON: 3rd time this session conflated 'novel sublayer slice' with 'world-novel'. New rule: pre-ship audit must demand 0 hits on UNDERLYING technique, not just the slice.
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 8, 2026
Refresh PR cache, reclassify, publish frontier verdicts on
data-touching vs data-free compression (PR openai#672/openai#1482/openai#1477/openai#1471).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>