
sp4096 + 10L 3.5x MLP + GPTQ + TTT (1.1266 BPB)#1431

Open
Idan3011 wants to merge 1 commit into openai:main from Idan3011:submission-final

Conversation


@Idan3011 Idan3011 commented Apr 7, 2026

sp4096 Custom Tokenizer + 10L 3.5x MLP + GPTQ + Score-First TTT

val_bpb: 1.1266 (TTT) | 1.1277 (sliding) | 15.99 MB | 8xH100 SXM, 600s

Follow-up to my previous submission #996 (1.1478 sliding) — fully rebuilt around a custom tokenizer + score-first TTT.

| | #996 (previous) | this submission |
| --- | --- | --- |
| val_bpb | 1.1478 | 1.1266 |
| Tokenizer | sp1024 (default) | sp4096 (custom HF) |
| MLP mult | 3x | 3.5x |
| Batch | 524K | 786K |
| Quantization | int6+lzma | int5+brotli+byte-shuffle |
| Eval-time TTT | none | score-first SGD |
| Artifact | 14.94 MB | 15.99 MB |

Headline metrics

| Stage | val_bpb | val_loss |
| --- | --- | --- |
| Pre-quant (step 5952) | 1.1427 | 2.6289 |
| Post-quant (int6+brotli roundtrip) | 1.1439 | 2.6318 |
| Sliding window (stride=64) | 1.1277 | |
| Score-first TTT (final) | 1.1266 | |
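For readers unfamiliar with strided sliding-window evaluation: each window of context scores only its last `stride` tokens, so every token is scored with near-full context at the cost of many overlapping forward passes. A minimal index-arithmetic sketch (window size here is an assumption; only stride=64 comes from this PR):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans: each forward pass runs on
    tokens [start, end) but only tokens [score_from, end) are scored.
    The first window scores everything; later windows score `stride` tokens."""
    spans = [(0, min(window, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = new_end - window if new_end >= window else 0
        spans.append((start, new_end, end))
        end = new_end
    return spans
```

Every token is scored exactly once, so summing `end - score_from` over all spans recovers the total token count.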

Key contributions

Custom sp4096 SentencePiece tokenizer — own dataset hosted on HuggingFace (idan3011/parameter-golf-sp4096), ~26% fewer tokens/byte than sp1024. Auto-downloads on first run; no setup scripts.

Mixed quantization scheme

| Tensor | Bits | Why |
| --- | --- | --- |
| Attention weights | int5 per-row | Stable with GPTQ |
| MLP weights | int5 per-row | Stable with QAT |
| tok_emb (tied) | int8 per-row | int5 destroys tied embedding (input AND output projection) |
| Control tensors | fp32 passthrough | Small total size, stability-critical |

Quant gap: 0.0012 BPB (1.1427 → 1.1439).
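A minimal sketch of symmetric per-row intN quantization, assuming an int5 range of [-15, 15]; the submission's exact codebook and zero-point handling may differ:

```python
import numpy as np

def quantize_int5_per_row(w: np.ndarray):
    """Symmetric per-row quantization: one fp scale per row, codes in [-15, 15]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.rint(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Per-row scales keep the rounding error bounded by half a quantization step of that row, which is why outlier-heavy rows don't poison the rest of the tensor.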

brotli + byte-shuffle compression — byte-shuffle pre-filter groups int8 high/low bytes column-wise, exploiting brotli's context modeling. Saves ~280KB vs LZMA. Final artifact: 15,989,376 bytes (10KB under cap).
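The byte-shuffle pre-filter is invertible and costs nothing at decode time. A sketch of the column-wise regrouping idea (illustrative only, not the submission's exact layout; compression with brotli would be applied to the shuffled buffer afterwards):

```python
import numpy as np

def byte_shuffle(buf: np.ndarray, stride: int) -> np.ndarray:
    """Regroup a flat byte buffer so byte i of every `stride`-byte group
    becomes contiguous, putting statistically similar bytes next to each
    other for the entropy coder."""
    assert buf.size % stride == 0
    return buf.reshape(-1, stride).T.copy().reshape(-1)

def byte_unshuffle(buf: np.ndarray, stride: int) -> np.ndarray:
    """Exact inverse of byte_shuffle."""
    assert buf.size % stride == 0
    return buf.reshape(stride, -1).T.copy().reshape(-1)
```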

GPTQ with AR self-generated calibration — model generates its own 16×512 calibration sequences via autoregressive sampling, perfectly matched to its own activation distribution.
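The shape of the self-calibration loop, sketched with a hypothetical `sample_next` standing in for the real model's sampling step (names and BOS handling are assumptions; only the 16×512 geometry comes from the PR):

```python
import random

def self_generate_calibration(sample_next, n_seqs=16, seq_len=512, bos=0):
    """Build a GPTQ calibration set by letting the model sample its own
    sequences autoregressively, so calibration activations match the
    model's own output distribution."""
    seqs = []
    for _ in range(n_seqs):
        toks = [bos]
        while len(toks) < seq_len:
            toks.append(sample_next(toks))  # model's next-token sample
        seqs.append(toks)
    return seqs
```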

Score-First TTT (legal under #402 / #1017)

348 chunks of 131K tokens, 20 SGD epochs each, lr=0.003 cosine decay across chunks, grad clip 1.0, all 10 blocks trainable. Distributed scoring + training across 8 GPUs with dist.all_reduce(grad, AVG). Final TTT lift: -0.0011 BPB over sliding baseline.
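One plausible reading of "lr=0.003 cosine decay across chunks" (the exact schedule isn't spelled out above): each chunk's SGD learning rate follows a cosine from the peak down to zero over the 348 chunks.

```python
import math

def ttt_chunk_lr(chunk_idx: int, n_chunks: int = 348, lr_max: float = 0.003) -> float:
    """Cosine-decayed TTT learning rate for a given chunk index (sketch)."""
    t = chunk_idx / max(n_chunks - 1, 1)
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * t))
```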

Architecture

10L, 512d, 8H/4KV GQA, MLP 3.5x (1792 hidden), tied embeddings, U-Net skip connections, LeakyReLU(0.5)² activation, logit softcap 30, XSA on last 4 layers, 28.3M params.
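The stated 28.3M parameter count can be sanity-checked from the architecture numbers alone. The arithmetic below assumes head_dim = 512/8 = 64, tied embeddings counted once, and ignores small control tensors (norms, softcap, skip-connection scalars):

```python
d, n_layers, vocab, mlp_mult = 512, 10, 4096, 3.5
kv_heads, head_dim = 4, 64                        # 8H/4KV GQA, head_dim assumed d // 8

attn = d * d + 2 * (kv_heads * head_dim * d) + d * d   # Wq, Wk+Wv (GQA), Wo
mlp = 2 * d * int(mlp_mult * d)                        # up + down projections (1792 hidden)
emb = vocab * d                                        # tied input/output embedding

total = n_layers * (attn + mlp) + emb
print(f"{total / 1e6:.1f}M")                           # matches the stated 28.3M
```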

Training

786K batch (vs 524K — smoother warmdown), Muon for matrix params, Adam for embeddings, EMA(0.997) + SWA blend in last 50%, wallclock-fraction warmdown (35%), QAT on MLP layers. Pre-quant: 1.1427 at step 5952/20000 (wallclock cap).
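For reference, the EMA(0.997) update is the standard exponential moving average over parameters; the SWA blend in the last 50% is a separate (uniform) averaging step not shown here. A minimal sketch over a dict of parameters:

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997) -> None:
    """In-place EMA update: ema <- decay * ema + (1 - decay) * params."""
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p
```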

What didn't work

  • forward_logits() not compiled by torch.compile — only __call__/forward() get compiled. CastedLinear's float32→bfloat16 cast in the uncompiled path lost ~0.45 BPB. Fix: route TTT scoring through model(x) (compiled) by making forward() return logits when target_ids is None.
  • AdamW weight_decay in TTT — pushed weights toward zero, destroyed pre-trained representations. SGD momentum 0.9 worked.
  • Per-layer LR groups (the #481 recipe: cosine TTT scheduling with per-layer lr, mean val_bpb=1.0970 over 3 seeds) — AdamW with 3x LR on mlp.proj destabilized this model. Got 1.1725 (worse than sliding).
  • Scale-only TTT (only 30K control params) — gave only -0.0007 BPB. Full block training with low LR worked better.
  • Last-2-blocks TTT — 1.1272, worse than full-block training.
  • Smaller TTT chunks (32K) — 4x more steps, 4x more overhead, no quality gain over 131K.
  • Higher LR (0.005, 0.002) — diminishing returns past lr=0.003.
  • EMA in TTT (decay 0.998) — dampens adaptation to nothing. Removed.
  • requires_grad_(False) on frozen blocks — corrupts torch.compile guards mid-eval. Removed all requires_grad changes.
  • CROWN-Q — improved training BPB but destroyed compressibility.
  • More layers (11L/12L/13L) — exceeds 16MB cap even with int5.
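The `torch.compile` pitfall in the first bullet reduces to a routing question: only `forward()`/`__call__` get compiled, so scoring must go through them. A toy stand-in for the fix (the real model computes actual logits and a loss; this just shows the control flow):

```python
class TinyModel:
    """Sketch: forward() returns logits when target_ids is None, so TTT
    scoring runs through model(x) -- the path torch.compile compiles --
    rather than an uncompiled forward_logits() helper."""

    def __call__(self, x, target_ids=None):
        return self.forward(x, target_ids)

    def forward(self, x, target_ids=None):
        logits = [[float(t)] for t in x]          # stand-in for real compute
        if target_ids is None:
            return logits                          # scoring path
        # stand-in squared-error loss against target_ids
        return sum((l[0] - t) ** 2 for l, t in zip(logits, target_ids)) / len(x)
```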

Reproduce

```shell
pip install -r requirements.txt
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Every hyperparameter is baked into the script. Data and tokenizer auto-download from HuggingFace on first run. No env vars, no shell scripts.

Credits

