Non-record: VR + GA + Late QAT + Full GPTQ — 1.1418 BPB, 15.7 MB by anantdgoel · Pull Request #601 · openai/parameter-golf

anantdgoel · 2026-03-24T06:32:12Z

val_bpb: 1.1418 | 15.7 MB | 1x NVIDIA RTX A6000, ~14 hours

Summary

11-layer GPT combining the community meta-stack with novel techniques Value Residual (VR) and Gated Attention (GA), plus Late QAT during training and a Full GPTQ + Int5 MLP post-training quantization pipeline. Achieves 1.1418 BPB (stride=128) in a 15.7 MB artifact that fits under the 16 MB limit.

Update pending: BH10240 (bigram hash 10240 buckets) variant currently evaluating — expect improved results soon.

Novel Contributions

Value Residual (VR): Layer-0 V vector shortcut for deep attention signal flow (−0.015 BPB). Inspired by arXiv:2410.17897.
Gated Attention (GA): Per-head learned sigmoid gate after SDPA (−0.003 BPB). Inspired by arXiv:2505.06708.
Late QAT: LR-threshold-based fake-quantize during final ~5% of training.
Full GPTQ + Int5 MLP post-training: Hessian-aware quantization + int5 MLP re-quantization (−0.028 BPB, −3.6 MB).
Finding: TTT hurts on GPTQ-quantized models (+0.030 BPB). Quantized weight space is incompatible with gradient-based test-time adaptation.

Ablation Results (stride=128)

Configuration	BPB	Delta
Base int6+zstd (no post-training)	1.1696	—
+ Full GPTQ + Int5 + GPTQ-lite	1.1418	−0.028
+ VR_V0_FP16 (asymmetric V0 quant)	1.1418	+0.000
+ SGD TTT (legal, cosine, per-layer)	1.1721	+0.030

Credits

Built on top of the excellent community meta-stack. Key techniques originated from:

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414 — EMA + GPTQ-lite baseline
Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean) #473 — Legal TTT protocol (score-first backward-looking)
Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461 — XSA (cross-sequence attention)
Record: Cosine TTT scheduling with per-layer lr — mean val_bpb=1.0970 (3 seeds) #481 — Int5 MLP quantization
Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds) #180 — BigramHash embeddings
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315 — SmearGate
Record: 11L LeakyReLU² + Full GPTQ + QAT Alignment (val_bpb: 1.1204) #535 — Full GPTQ post-training quantization

Files

train_gpt.py — Full training + eval script with all techniques
submission.json — Metadata
README.md — Detailed writeup with ablations and reproducibility commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Evidence from 4 independent configurations (PR openai#461, PR openai#601, PR openai#1326, and my own experiments) showing GPTQ's compensatory weight structure is destroyed by SGD-based test-time training. Key finding: SGD TTT gives -0.0165 BPB on simple int6 but provides negligible to negative improvement on GPTQ-quantized models (-0.0001 to +0.030 BPB). Includes complete SGD TTT implementation (sgd_ttt_eval.py) following PR openai#461 protocol, and LoRA TTT implementation (clark_ttt_eval.py). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

himanshudongre · 2026-04-04T12:15:12Z

I replicated this finding with LoRA TTT (rank-8 on Q,V projections) on a GPTQ int6 Clark 11L model: -0.0013 BPB, effectively zero improvement. PR #1326 (aryanbhosale) also independently confirmed with score-first SGD TTT on GPTQ: -0.0001 BPB.

I've written a systematic analysis aggregating all 4 known TTT+GPTQ configurations in PR #1341, including root cause analysis (GPTQ's column-wise error compensation is destroyed by SGD) and proposed fix directions.

Non-record: VR + GA + Late QAT + Full GPTQ, 1.1418 BPB at 15.7 MB

4357cbc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

notapplica mentioned this pull request Mar 24, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

himanshudongre mentioned this pull request Apr 4, 2026

Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation #1341

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: VR + GA + Late QAT + Full GPTQ — 1.1418 BPB, 15.7 MB#601

Non-record: VR + GA + Late QAT + Full GPTQ — 1.1418 BPB, 15.7 MB#601
anantdgoel wants to merge 1 commit intoopenai:mainfrom
anantdgoel:submission-lateqat-vr-ga-gptq

anantdgoel commented Mar 24, 2026

Uh oh!

himanshudongre commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anantdgoel commented Mar 24, 2026

Summary

Novel Contributions

Ablation Results (stride=128)

Credits

Files

Uh oh!

himanshudongre commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants