
sp4096 + 10L 3.5x MLP + GPTQ + TTT (1.1266 BPB)#1431

Open
Idan3011 wants to merge 1 commit into openai:main from Idan3011:submission-final

Conversation


@Idan3011 Idan3011 commented Apr 7, 2026

sp4096 Custom Tokenizer + 10L 3.5x MLP + GPTQ + Score-First TTT

val_bpb: 1.1266 (TTT) | 1.1277 (sliding) | 15.99 MB | 8xH100 SXM, 600s

Follow-up to my previous submission #996 (1.1478 sliding) — fully rebuilt around a custom tokenizer + score-first TTT.

| | #996 (previous) | this submission |
| --- | --- | --- |
| val_bpb | 1.1478 | 1.1266 |
| Tokenizer | sp1024 (default) | sp4096 (custom HF) |
| MLP mult | 3x | 3.5x |
| Batch | 524K | 786K |
| Quantization | int6+lzma | int5+brotli+byte-shuffle |
| Eval-time TTT | none | score-first SGD |
| Artifact | 14.94 MB | 15.99 MB |

Headline metrics

| Stage | val_bpb | val_loss |
| --- | --- | --- |
| Pre-quant (step 5952) | 1.1427 | 2.6289 |
| Post-quant (int6+brotli roundtrip) | 1.1439 | 2.6318 |
| Sliding window (stride=64) | 1.1277 | |
| Score-first TTT (final) | 1.1266 | |
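For readers unfamiliar with strided sliding-window evaluation: each window of context scores only its last `stride` tokens, so every token is scored with near-full context at the cost of many overlapping forward passes. A minimal index-arithmetic sketch (window size here is an assumption; only stride=64 comes from this PR):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans: each forward pass runs on
    tokens [start, end) but only tokens [score_from, end) are scored.
    The first window scores everything; later windows score `stride` tokens."""
    spans = [(0, min(window, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = new_end - window if new_end >= window else 0
        spans.append((start, new_end, end))
        end = new_end
    return spans
```

Every token is scored exactly once, so summing `end - score_from` over all spans recovers the total token count.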

Key contributions

Custom sp4096 SentencePiece tokenizer — own dataset hosted on HuggingFace (idan3011/parameter-golf-sp4096), ~26% fewer tokens/byte than sp1024. Auto-downloads on first run; no setup scripts.

Mixed quantization scheme

| Tensor | Bits | Why |
| --- | --- | --- |
| Attention weights | int5 per-row | Stable with GPTQ |
| MLP weights | int5 per-row | Stable with QAT |
| tok_emb (tied) | int8 per-row | int5 destroys tied embedding (input AND output projection) |
| Control tensors | fp32 passthrough | Small total size, stability-critical |

Quant gap: 0.0012 BPB (1.1427 → 1.1439).
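A minimal sketch of symmetric per-row intN quantization, assuming an int5 range of [-15, 15]; the submission's exact codebook and zero-point handling may differ:

```python
import numpy as np

def quantize_int5_per_row(w: np.ndarray):
    """Symmetric per-row quantization: one fp scale per row, codes in [-15, 15]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.rint(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Per-row scales keep the rounding error bounded by half a quantization step of that row, which is why outlier-heavy rows don't poison the rest of the tensor.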

brotli + byte-shuffle compression — byte-shuffle pre-filter groups int8 high/low bytes column-wise, exploiting brotli's context modeling. Saves ~280KB vs LZMA. Final artifact: 15,989,376 bytes (10KB under cap).
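The byte-shuffle pre-filter is invertible and costs nothing at decode time. A sketch of the column-wise regrouping idea (illustrative only, not the submission's exact layout; compression with brotli would be applied to the shuffled buffer afterwards):

```python
import numpy as np

def byte_shuffle(buf: np.ndarray, stride: int) -> np.ndarray:
    """Regroup a flat byte buffer so byte i of every `stride`-byte group
    becomes contiguous, putting statistically similar bytes next to each
    other for the entropy coder."""
    assert buf.size % stride == 0
    return buf.reshape(-1, stride).T.copy().reshape(-1)

def byte_unshuffle(buf: np.ndarray, stride: int) -> np.ndarray:
    """Exact inverse of byte_shuffle."""
    assert buf.size % stride == 0
    return buf.reshape(stride, -1).T.copy().reshape(-1)
```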

GPTQ with AR self-generated calibration — model generates its own 16×512 calibration sequences via autoregressive sampling, perfectly matched to its own activation distribution.
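The shape of the self-calibration loop, sketched with a hypothetical `sample_next` standing in for the real model's sampling step (names and BOS handling are assumptions; only the 16×512 geometry comes from the PR):

```python
import random

def self_generate_calibration(sample_next, n_seqs=16, seq_len=512, bos=0):
    """Build a GPTQ calibration set by letting the model sample its own
    sequences autoregressively, so calibration activations match the
    model's own output distribution."""
    seqs = []
    for _ in range(n_seqs):
        toks = [bos]
        while len(toks) < seq_len:
            toks.append(sample_next(toks))  # model's next-token sample
        seqs.append(toks)
    return seqs
```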

Score-First TTT (legal under #402 / #1017)

348 chunks of 131K tokens, 20 SGD epochs each, lr=0.003 cosine decay across chunks, grad clip 1.0, all 10 blocks trainable. Distributed scoring + training across 8 GPUs with dist.all_reduce(grad, AVG). Final TTT lift: -0.0011 BPB over sliding baseline.
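One plausible reading of "lr=0.003 cosine decay across chunks" (the exact schedule isn't spelled out above): each chunk's SGD learning rate follows a cosine from the peak down to zero over the 348 chunks.

```python
import math

def ttt_chunk_lr(chunk_idx: int, n_chunks: int = 348, lr_max: float = 0.003) -> float:
    """Cosine-decayed TTT learning rate for a given chunk index (sketch)."""
    t = chunk_idx / max(n_chunks - 1, 1)
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * t))
```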

Architecture

10L, 512d, 8H/4KV GQA, MLP 3.5x (1792 hidden), tied embeddings, U-Net skip connections, LeakyReLU(0.5)² activation, logit softcap 30, XSA on last 4 layers, 28.3M params.
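The stated 28.3M parameter count can be sanity-checked from the architecture numbers alone. The arithmetic below assumes head_dim = 512/8 = 64, tied embeddings counted once, and ignores small control tensors (norms, softcap, skip-connection scalars):

```python
d, n_layers, vocab, mlp_mult = 512, 10, 4096, 3.5
kv_heads, head_dim = 4, 64                        # 8H/4KV GQA, head_dim assumed d // 8

attn = d * d + 2 * (kv_heads * head_dim * d) + d * d   # Wq, Wk+Wv (GQA), Wo
mlp = 2 * d * int(mlp_mult * d)                        # up + down projections (1792 hidden)
emb = vocab * d                                        # tied input/output embedding

total = n_layers * (attn + mlp) + emb
print(f"{total / 1e6:.1f}M")                           # matches the stated 28.3M
```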

Training

786K batch (vs 524K — smoother warmdown), Muon for matrix params, Adam for embeddings, EMA(0.997) + SWA blend in last 50%, wallclock-fraction warmdown (35%), QAT on MLP layers. Pre-quant: 1.1427 at step 5952/20000 (wallclock cap).
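For reference, the EMA(0.997) update is the standard exponential moving average over parameters; the SWA blend in the last 50% is a separate (uniform) averaging step not shown here. A minimal sketch over a dict of parameters:

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997) -> None:
    """In-place EMA update: ema <- decay * ema + (1 - decay) * params."""
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p
```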

What didn't work

  • forward_logits() not compiled by torch.compile — only __call__/forward() get compiled. CastedLinear's float32→bfloat16 cast in the uncompiled path lost ~0.45 BPB. Fix: route TTT scoring through model(x) (compiled) by making forward() return logits when target_ids is None.
  • AdamW weight_decay in TTT — pushed weights toward zero, destroyed pre-trained representations. SGD momentum 0.9 worked.
  • Per-layer LR groups (the #481 recipe: cosine TTT scheduling with per-layer lr, mean val_bpb=1.0970 over 3 seeds) — AdamW with 3x LR on mlp.proj destabilized this model. Got 1.1725 (worse than sliding).
  • Scale-only TTT (only 30K control params) — gave only -0.0007 BPB. Full block training with low LR worked better.
  • Last-2-blocks TTT — 1.1272, worse than full-block training.
  • Smaller TTT chunks (32K) — 4x more steps, 4x more overhead, no quality gain over 131K.
  • Higher LR (0.005, 0.002) — diminishing returns past lr=0.003.
  • EMA in TTT (decay 0.998) — dampens adaptation to nothing. Removed.
  • requires_grad_(False) on frozen blocks — corrupts torch.compile guards mid-eval. Removed all requires_grad changes.
  • CROWN-Q — improved training BPB but destroyed compressibility.
  • More layers (11L/12L/13L) — exceeds 16MB cap even with int5.
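The `torch.compile` pitfall in the first bullet reduces to a routing question: only `forward()`/`__call__` get compiled, so scoring must go through them. A toy stand-in for the fix (the real model computes actual logits and a loss; this just shows the control flow):

```python
class TinyModel:
    """Sketch: forward() returns logits when target_ids is None, so TTT
    scoring runs through model(x) -- the path torch.compile compiles --
    rather than an uncompiled forward_logits() helper."""

    def __call__(self, x, target_ids=None):
        return self.forward(x, target_ids)

    def forward(self, x, target_ids=None):
        logits = [[float(t)] for t in x]          # stand-in for real compute
        if target_ids is None:
            return logits                          # scoring path
        # stand-in squared-error loss against target_ids
        return sum((l[0] - t) ** 2 for l, t in zip(logits, target_ids)) / len(x)
```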

Reproduce

```shell
pip install -r requirements.txt
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Every hyperparameter is baked into the script. Data and tokenizer auto-download from HuggingFace on first run. No env vars, no shell scripts.

Credits

