Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT — val_bpb 1.0896 (3-seed mean) #1326
SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0 + legal score-first TTT + full GPTQ int6 + brotli. 3-seed mean: 1.0896 BPB, delta -0.0251 vs merged SOTA (PR openai#1019).
Evidence from 4 independent configurations (PR openai#461, PR openai#601, PR openai#1326, and my own experiments) showing that GPTQ's compensatory weight structure is destroyed by SGD-based test-time training. Key finding: SGD TTT gives -0.0165 BPB on simple int6 but only negligible-to-negative changes on GPTQ-quantized models (-0.0001 to +0.030 BPB). Includes a complete SGD TTT implementation (sgd_ttt_eval.py) following the PR openai#461 protocol, and a LoRA TTT implementation (clark_ttt_eval.py). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
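For context, the score-first TTT loop referenced above (score each chunk with the current weights, then adapt on it, so no chunk is ever scored by weights that have seen it) can be sketched on a toy linear model. This is an illustrative stand-in, not the actual sgd_ttt_eval.py: the model, loss, chunking, and learning rate are all assumptions.

```python
import numpy as np

def sgd_ttt_step(W, x, target, lr=1e-4):
    """One SGD test-time-training update on a toy linear 'layer'.

    Gradient of L = 0.5 * ||W x - target||^2 with respect to W.
    """
    pred = W @ x
    grad = np.outer(pred - target, x)
    return W - lr * grad

def ttt_eval(W, chunks, lr=1e-4):
    """Score-first TTT: score each chunk BEFORE adapting on it,
    so every loss is measured with weights that never saw the chunk."""
    losses = []
    for x, target in chunks:
        pred = W @ x
        losses.append(0.5 * float(np.sum((pred - target) ** 2)))  # score first
        W = sgd_ttt_step(W, x, target, lr)                        # then train
    return float(np.mean(losses)), W
```

On stationary data the per-chunk loss falls as the loop adapts; the GPTQ finding above is that on quantized-with-compensation weights these same updates can raise BPB instead.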
I independently confirmed that TTT provides negligible improvement on GPTQ-quantized models. My LoRA TTT (rank-8 on Q,V) on a GPTQ int6 Clark 11L model gave -0.0013 BPB — consistent with your finding of -0.0001 BPB here. I wrote up a systematic analysis of why this happens: GPTQ's compensatory weight structure is destroyed by gradient-based updates. See PR #1341 for the full evidence table (4 configurations from 3 independent sources) and root cause analysis.
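For reference, a rank-8 adapter on a frozen (quantized) projection can be sketched in plain NumPy. The shapes, init scale, and function names below are illustrative assumptions, not the clark_ttt_eval.py code; the key points are that the base weights never receive gradients and that zero-initializing B makes the adapter an exact no-op before any TTT steps.

```python
import numpy as np

def make_lora(d_out, d_in, rank=8, seed=0):
    """LoRA factors for one projection (e.g. Q or V).

    A: small random down-projection; B: zero-initialized up-projection,
    so W + B @ A equals W exactly until the first update.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 0.01, size=(rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B

def lora_forward(W_frozen, A, B, x):
    """Frozen quantized base path plus trainable low-rank path."""
    return W_frozen @ x + B @ (A @ x)
```

Only A and B are updated during TTT; the GPTQ weights W_frozen stay byte-identical, which is what makes LoRA TTT a cleaner probe than SGD on the quantized weights themselves — yet the -0.0013 BPB result shows even the indirect low-rank path barely helps.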
@himanshudongre Thanks for the independent confirmation. My experience matches exactly — post-quant TTT on GPTQ models gives negligible or even negative returns because SGD disrupts the carefully calibrated quantization structure. I observed the same pattern across multiple attempts.
The conclusion is clear: GPTQ's error compensation creates a fragile weight structure that gradient updates destroy. TTT and GPTQ are fundamentally at odds. Will check out your PR #1341 for the full analysis. This is a useful negative result for the community.
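The fragility argument can be made concrete with a two-weight toy model of GPTQ-style column-wise error compensation (the step size, weights, and calibration input below are invented for illustration): the second quantized weight is deliberately a worse approximation of its own float value, but jointly cancels the first weight's rounding error on calibration data — so any update that moves weights independently can undo the pairing.

```python
import numpy as np

def round_q(w, step=0.25):
    """Uniform round-to-nearest quantizer with a fixed step size."""
    return np.round(w / step) * step

def gptq_toy(w, x, step=0.25):
    """Toy column-by-column GPTQ: quantize w[0], then fold its output
    error on the calibration input x into w[1] before quantizing it,
    so the later weight compensates the earlier one's rounding error."""
    q = np.empty_like(w)
    q[0] = round_q(w[0], step)
    err = (w[0] - q[0]) * x[0]       # output error from column 0
    w1_comp = w[1] + err / x[1]      # shift column 1 to cancel it
    q[1] = round_q(w1_comp, step)
    return q
```

On a calibration input the compensated pair tracks the float output more closely than plain round-to-nearest, even though the compensated second weight sits farther from its own float value. That coupling is exactly what a per-weight SGD step, which knows nothing about the pairing, can break.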
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT
val_bpb = 1.0896 (3-seed mean, std 0.0008) | ~15.99 MB | 8×H100 SXM
3-Seed Results
Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0251 BPB.
Key Techniques
Compliance
Reproduction
Credits
PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee, PR #461 @Christopher-Lee-McClendon