Record: SP8192 + Pre-Quant TTT — val_bpb 1.07948 (3-seed mean) #1416
Conversation
Merges @clarkkev's openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates) with @stukenov's openai#1364 (pre-quant AdamW TTT). First combination of these techniques. 3-seed mean: 1.07948 BPB (std=0.00043), artifact 15.12 MB. Built with Claude Opus 4.6 as AI co-author.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the detailed writeup. I think the main question for reviewers is not the SP8192 / SDClip side of the stack, but how the pre-quant AdamW TTT step fits the community guidance in #1017, which sets out four conditions for entries.
On my reading, conditions 1, 2, and 4 are not the hard part here. The part I'm struggling to reconcile is condition 3, the score-before-update rule: the PR README describes this step as pre-quant AdamW TTT on validation data before compression, which on its face reads as updating the model on validation tokens before they are scored. Could you add a short compliance note explaining how this step satisfies the #1017 score-before-update rule, and in particular how the TTT objective is restricted to tokens that have already been scored before they influence later scored tokens? Issue for reference: #1017
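For readers who haven't followed #1017, the score-before-update discipline can be made concrete with a toy model. This is a hedged illustration only, not code from #1017 or from this PR: a Laplace-smoothed adaptive byte unigram in which every byte is scored under the current model *before* its count is allowed to update the model for later bytes.

```python
# Toy illustration (assumption, not the PR's TTT code) of a
# score-before-update evaluation loop over a byte stream.
import math

def score_before_update_bpb(data: bytes) -> float:
    counts = [1] * 256          # Laplace prior: every byte starts plausible
    total = 256
    nll_bits = 0.0
    for b in data:
        # 1) Score first, with the model exactly as it stands.
        nll_bits += -math.log2(counts[b] / total)
        # 2) Only then update: this byte may now inform future predictions,
        #    but it never lowered its own loss.
        counts[b] += 1
        total += 1
    return nll_bits / len(data)   # bits per byte
```

Swapping steps 1 and 2 would let each token reduce its own loss before being scored, which (as I read the thread) is exactly what the rule forbids.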
You're totally right; my apologies, I didn't catch that rule. I'll strip TTT, which does give back the improvement it provided. Going back to the drawing board on this one. Thanks for the detailed review.
Summary
val_bpb: 1.07948 (3-seed mean, std=0.00043) | Artifact: 15.12 MB
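For anyone reproducing the headline line, this is how I'd expect the 3-seed aggregation to work. The three per-seed values below are placeholders, not the actual seed results, and whether the convention is population or sample std is an assumption here.

```python
# Sketch of the 3-seed mean/std aggregation; seed values are hypothetical.
from statistics import mean, pstdev

seed_bpb = [1.0790, 1.0795, 1.0799]   # placeholder per-seed val_bpb results
print(f"val_bpb: {mean(seed_bpb):.5f} (3-seed mean, std={pstdev(seed_bpb):.5f})")
```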
What This Is
Simple combination of two existing PRs:
- openai#1394 (@clarkkev): SP8192, SDClip, GPTQ embeddings, skip gates
- openai#1364 (@stukenov): pre-quant AdamW TTT
That's basically it. Turns out you can apply pre-quant TTT to the SP8192 base and the two techniques don't interfere. TTT adapts the full-precision model before quantization, then SDClip + GPTQ compresses the adapted weights cleanly.
TTT gives about -0.034 BPB on this base (post-EMA 1.1019 → post-TTT 1.0682).
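The ordering described above can be sketched with stand-in components. This is not the PR's code: a single toy gradient step plays the role of the pre-quant AdamW TTT, and a symmetric clip plus uniform int4 quantizer stands in for SDClip + GPTQ. The point is only the order: adapt the full-precision weights first, then compress the adapted result.

```python
# Hypothetical sketch of "adapt first, quantize second" (stand-ins for
# the PR's actual AdamW TTT and SDClip/GPTQ pipeline).
import numpy as np

def adapt_then_quantize(w, grad, lr=1e-2, clip_sigma=3.0, bits=4):
    w_adapted = w - lr * grad                  # toy pre-quant adaptation step
    scale = clip_sigma * w_adapted.std()       # SDClip-style symmetric range
    w_clipped = np.clip(w_adapted, -scale, scale)
    levels = 2 ** (bits - 1) - 1               # 7 levels each side for int4
    q = np.round(w_clipped / scale * levels)   # integer codes in [-7, 7]
    return q / levels * scale                  # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
g = rng.normal(size=1000)
w_q = adapt_then_quantize(w, g)
```

Because adaptation happens in full precision, the quantizer sees already-adapted weights, which is why the two techniques can compose without interfering.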
Supersedes my earlier PR #1396 (1.1067 BPB).
Credits
Nearly everything here is other people's work:
- @clarkkev: openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates)
- @stukenov: openai#1364 (pre-quant AdamW TTT)
I'm a filmmaker, not an ML engineer. Built with Claude Opus 4.6 as AI co-author.
How to Run
```bash
pip install brotli

# SP8192 dataset from @clarkkev's HF: huggingface.co/datasets/kevclark/parameter-golf
DATA_DIR=./data/ SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```