sp4096 + 10L 3.5x MLP + GPTQ + TTT (1.1266 BPB)#1431
Open
Idan3011 wants to merge 1 commit into openai:main from
sp4096 Custom Tokenizer + 10L 3.5x MLP + GPTQ + Score-First TTT
Headline metrics
val_bpb: 1.1266 (TTT) | 1.1277 (sliding) | 15.99 MB | 8xH100 SXM, 600s
Follow-up to my previous submission #996 (1.1478 sliding) — fully rebuilt around a custom tokenizer + score-first TTT.
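For reference, bits-per-byte normalizes validation loss by raw input bytes rather than tokens, which is what makes a sp4096 model directly comparable to a sp1024 one. A minimal sketch of the usual definition (function name hypothetical, not from the PR's code):

```python
import math

def val_bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-per-byte: total validation NLL converted from nats to bits,
    normalized by raw byte count (not token count), so models with
    different tokenizers are directly comparable."""
    return total_nll_nats / math.log(2) / total_bytes
```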
Key contributions
Custom sp4096 SentencePiece tokenizer — own dataset hosted on HuggingFace (idan3011/parameter-golf-sp4096), ~26% fewer tokens/byte than sp1024. Auto-downloads on first run; no setup scripts.
Mixed quantization scheme — quant gap: 0.0012 BPB (1.1427 → 1.1439).
brotli + byte-shuffle compression — byte-shuffle pre-filter groups int8 high/low bytes column-wise, exploiting brotli's context modeling. Saves ~280KB vs LZMA. Final artifact: 15,989,376 bytes (10KB under cap).
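A minimal sketch of the byte-shuffle idea, with stdlib `lzma` standing in for brotli (the compressor the PR actually uses) and an illustrative int16 payload — function names are mine, not the PR's:

```python
import lzma
import struct

def byte_shuffle(data: bytes, itemsize: int = 2) -> bytes:
    """Regroup the i-th byte of every item into a contiguous plane
    (column-wise), so similar bytes sit next to each other for the
    compressor's context modeling."""
    return b"".join(data[i::itemsize] for i in range(itemsize))

def byte_unshuffle(data: bytes, itemsize: int = 2) -> bytes:
    """Invert byte_shuffle by re-interleaving the byte planes."""
    n = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n:(i + 1) * n]
    return bytes(out)

# Toy payload: little-endian int16 values whose high bytes vary slowly --
# exactly the cross-item redundancy a byte-shuffle pre-filter exposes.
payload = b"".join(struct.pack("<h", v) for v in range(-1000, 1000))
shuffled = byte_shuffle(payload)

plain_size = len(lzma.compress(payload))
filtered_size = len(lzma.compress(shuffled))
```

The filter is a pure permutation (lossless and size-preserving); any gain comes entirely from the compressor seeing more homogeneous byte columns.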
GPTQ with AR self-generated calibration — model generates its own 16×512 calibration sequences via autoregressive sampling, perfectly matched to its own activation distribution.
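A toy sketch of the self-calibration idea — a stand-in logits function replaces the real model forward pass, and all names are hypothetical, but the autoregressive sampling loop mirrors the described 16×512 setup:

```python
import math
import random

def toy_logits(context, vocab_size):
    # Stand-in for the model's forward pass; in the PR, the model being
    # quantized produces these logits itself.
    rng = random.Random(context[-1])
    return [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]

def self_calibration(num_seqs=16, seq_len=512, vocab_size=4096, seed=0):
    """Autoregressively sample calibration sequences from the model itself,
    so GPTQ's error estimates see activations matched to the model's own
    output distribution rather than to held-out text."""
    rng = random.Random(seed)
    seqs = []
    for s in range(num_seqs):
        seq = [s % vocab_size]  # arbitrary start token per sequence
        for _ in range(seq_len - 1):
            logits = toy_logits(seq, vocab_size)
            weights = [math.exp(l) for l in logits]  # softmax numerators
            seq.append(rng.choices(range(vocab_size), weights=weights)[0])
        seqs.append(seq)
    return seqs

# Small demo sizes; the PR uses num_seqs=16, seq_len=512.
calib = self_calibration(num_seqs=2, seq_len=32, vocab_size=64)
```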
Score-First TTT (legal under #402 / #1017)
348 chunks of 131K tokens, 20 SGD epochs each, lr=0.003 cosine decay across chunks, grad clip 1.0, all 10 blocks trainable. Distributed scoring + training across 8 GPUs with dist.all_reduce(grad, AVG). Final TTT lift: -0.0011 BPB over sliding baseline.
Architecture
10L, 512d, 8H/4KV GQA, MLP 3.5x (1792 hidden), tied embeddings, U-Net skip connections, LeakyReLU(0.5)² activation, logit softcap 30, XSA on last 4 layers, 28.3M params.
Training
786K batch (vs 524K — smoother warmdown), Muon for matrix params, Adam for embeddings, EMA(0.997) + SWA blend in last 50%, wallclock-fraction warmdown (35%), QAT on MLP layers. Pre-quant: 1.1427 at step 5952/20000 (wallclock cap).
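A minimal sketch of two of these pieces as I read them — an EMA(0.997) parameter tracker and a warmdown multiplier keyed to wallclock fraction rather than step count (helper names hypothetical):

```python
def ema_update(ema, current, decay=0.997):
    """One EMA step per parameter: ema <- decay*ema + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]

def warmdown_mult(frac_elapsed, warmdown_frac=0.35):
    """LR multiplier driven by elapsed wallclock fraction: hold at 1.0, then
    decay linearly to 0 over the final 35% of the time budget, so the
    schedule completes cleanly even when the step count is cut short
    (here: step 5952 of a planned 20000)."""
    if frac_elapsed <= 1.0 - warmdown_frac:
        return 1.0
    return max(0.0, min(1.0, (1.0 - frac_elapsed) / warmdown_frac))
```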
What didn't work
forward_logits() not compiled by torch.compile — only __call__/forward() get compiled. CastedLinear's float32→bfloat16 cast in the uncompiled path lost ~0.45 BPB. Fix: route TTT scoring through model(x) (compiled) by making forward() return logits when target_ids is None.
mlp.proj destabilized this model — got 1.1725 (worse than sliding).
requires_grad_(False) on frozen blocks — corrupts torch.compile guards mid-eval. Removed all requires_grad changes.
Reproduce
Every hyperparameter is baked into the script. Data and tokenizer auto-download from HuggingFace on first run. No env vars, no shell scripts.
Credits