Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study#756
…PTQ stack: 6 experiments on the current SOTA stack (1.1142 BPB), all negative:

- Qronos iterative Hessian (3 iters): +0.0007 worse
- CDQuant coordinate descent (3 passes): +0.0005 worse
- Full TTT (all params): +0.0001 worse
- MLP-down-only TTT: +0.0001 neutral
- MLP-all TTT: +0.0001 neutral

Key finding: At int6, the GPTQ algorithm is near-optimal. The remaining quantization headroom is in the grid (what values to quantize to), not in the algorithm (how to assign weights to those values). TTT is dead on this stack: 25 total failed attempts across two stacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache.

Analysis of PR #756 — "Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study"

Files changed: only one file modified.

- Check 1: N-gram family bug (CLOSE if found) — No BigramHash, XSA, …
- Check 2: Pre-Quant TTT on val_tokens (CLOSE if found) — The …
- Check 3: Legal score-first TTT (CLEAN if found) — The PR body describes score-first TTT experiments (score under …
- Check 4: Scored-region SLOT (HOLD if found) — …

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Summary
Negative results on the 1.1142 BPB stack (GPTQ + XSA-all + BigramHash 3072×112): quantization algorithms, TTT, architecture experiments, and a self-generated GPTQ calibration study.
Self-Generated GPTQ Calibration Study
GPTQ calibration estimates H = X^T X (activation covariance) per layer to guide int6 rounding decisions. We tested whether the model can calibrate itself without any external data.
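The estimator above can be sketched in a few lines. This is a minimal numpy illustration, not the PR's implementation; the `accumulate_hessian` name and the batch interface are assumptions for illustration.

```python
import numpy as np

def accumulate_hessian(activation_batches):
    """Estimate the GPTQ calibration Hessian H = X^T X for one layer.

    activation_batches: iterable of (n_tokens, d_in) arrays of the
    layer's inputs on calibration data. Returns the (d_in, d_in)
    uncentered second-moment matrix of the activations.
    """
    H = None
    for X in activation_batches:
        if H is None:
            H = np.zeros((X.shape[1], X.shape[1]))
        H += X.T @ X  # accumulate activation covariance across batches
    return H

# Toy check: with whitened random inputs, H / n_tokens approaches identity.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((4096, 8)) for _ in range(4)]
H = accumulate_hessian(batches)
```

The streaming accumulation matters in practice: calibration activations for a real layer are far too large to hold in memory at once, while `H` itself is only `d_in × d_in`.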
Trained once (seed 314, 6,942 steps), saved checkpoint, ran GPTQ with different calibration sources on the same frozen weights:
Row 2: the model generates 64 coherent sequences of 2048 tokens autoregressively from its own learned distribution (temperature=0.8, batch_size=8). No external data accessed. Confirmed on a separate checkpoint (BigramHash 2048×128, 8×H100); the relative gaps are consistent across stacks.
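The generation procedure in Row 2 amounts to standard temperature sampling from the frozen checkpoint. Below is a hedged numpy sketch; `logits_fn` is a hypothetical stand-in for a forward pass of the trained model, and the toy call uses tiny sizes (the real run is 64 sequences of 2048 tokens).

```python
import numpy as np

def sample_calibration_tokens(logits_fn, vocab_size, n_seqs=64, seq_len=2048,
                              temperature=0.8, seed=0):
    """Generate calibration sequences autoregressively from the model itself.

    logits_fn(context) -> (vocab_size,) next-token logits; a stand-in
    for the frozen checkpoint's forward pass (hypothetical interface).
    No external data is touched at any point.
    """
    rng = np.random.default_rng(seed)
    seqs = np.zeros((n_seqs, seq_len), dtype=np.int64)
    for i in range(n_seqs):
        ctx = []
        for t in range(seq_len):
            logits = logits_fn(ctx) / temperature  # temperature=0.8 as in the PR
            p = np.exp(logits - logits.max())      # stable softmax
            p /= p.sum()
            ctx.append(rng.choice(vocab_size, p=p))
        seqs[i] = ctx
    return seqs

# Toy demo: a uniform-logits "model" with a 16-token vocabulary.
toy = sample_calibration_tokens(lambda ctx: np.zeros(16), vocab_size=16,
                                n_seqs=2, seq_len=32)
```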
Findings:
Autoregressive self-generation closes 84% of the gap. The val-vs-random gap is 0.00204 BPB. Autoregressive generation recovers 0.00173 of that, leaving only 0.00031 BPB. The gap is predominantly natural language vs random noise — coherent text from the model's own distribution produces Hessians nearly identical to val data.
The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the training data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real FineWeb text. It is negligible.
Gibbs refinement does not help (1.11663 vs 1.11650 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.
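The structural difference is easy to see in code. Below is a minimal sketch of a single-site Gibbs sweep, assuming the refinement resamples one position at a time conditioned on all others (the PR does not spell out its exact procedure); `cond_logits_fn` is a hypothetical interface.

```python
import numpy as np

def gibbs_sweep(tokens, cond_logits_fn, rng, temperature=0.8):
    """One in-place Gibbs pass over a token sequence.

    cond_logits_fn(tokens, i) -> logits for position i given every other
    position (hypothetical interface). Unlike left-to-right generation,
    each resampled token is conditioned on neighbors that are themselves
    still mostly random, so the sequence never becomes globally coherent.
    """
    for i in range(len(tokens)):
        logits = cond_logits_fn(tokens, i) / temperature
        p = np.exp(logits - logits.max())
        p /= p.sum()
        tokens[i] = rng.choice(len(p), p=p)  # replace token i in place
    return tokens

# Toy demo: start from pure random tokens, do one sweep with uniform logits.
rng = np.random.default_rng(0)
tokens = gibbs_sweep(rng.integers(0, 16, size=64), lambda t, i: np.zeros(16), rng)
```

Contrast with the autoregressive case: there, position t is conditioned only on positions 0..t-1, all of which were already sampled coherently, which is why it produces natural-language-like activations and Gibbs does not.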
More random tokens do not help. 131K and 25M tokens give identical BPB (1.11650). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.
Self-generated (random-token) calibration at 1.1165 beats the previous SOTA (our PR #549, 1.1194) by 0.003 BPB with zero legality risk. Autoregressive self-generation at 1.1148 comes within 0.0003 of val-calibrated performance.
Why random tokens work at int6: The Hessian diagonal and off-diagonal structure are dominated by the model's learned weights — embedding geometry, attention patterns, MLP scales. At 63 grid levels, the rounding decisions are coarse enough that the Hessian quality threshold is low.
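For concreteness, here is one common convention for a 63-level symmetric int6 grid (integers -31..31 times a per-tensor scale); the PR does not spell out its exact grid, so treat this as an assumed illustration.

```python
import numpy as np

def quantize_int6_symmetric(w):
    """Round a weight vector to a symmetric 63-level int6 grid.

    Levels are the integers -31..31 scaled so the largest-magnitude
    weight maps to +/-31 (one common convention, assumed here).
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31)  # integer codes in [-31, 31]
    return q * scale, q.astype(np.int8)

w = np.linspace(-1.0, 1.0, 101)
wq, q = quantize_int6_symmetric(w)
```

At this resolution the worst-case rounding error is half a grid step (scale/2), which is why the Hessian only needs to get coarse relative importance right rather than fine-grained input statistics.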
Quantization Algorithm Experiments
Quant gap: +0.0036 BPB (pre-quant 1.1341 → roundtrip 1.1377). At int6, GPTQ is near-optimal.
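For reference, the algorithm being called near-optimal here is GPTQ's column-by-column rounding with error propagation through the inverse Hessian. Below is a minimal sketch under simplifying assumptions (per-tensor scale, plain matrix inverse rather than the usual Cholesky formulation, fixed dampening); it is not the PR's implementation.

```python
import numpy as np

def gptq_round(W, H, nbits=6, damp=0.01):
    """Minimal GPTQ: quantize columns of W in order, pushing each column's
    rounding error onto the not-yet-quantized columns via the inverse Hessian.

    W: (d_out, d_in) layer weight; H: (d_in, d_in) calibration Hessian.
    """
    d_in = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(d_in)  # dampening for stability
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (nbits - 1) - 1                        # 31 for int6
    scale = np.abs(W).max() / qmax                     # per-tensor scale (assumed)
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(d_in):
        Q[:, j] = np.clip(np.round(W[:, j] / scale), -qmax, qmax) * scale
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # propagate rounding error
    return Q

# Toy demo: random layer, Hessian built from random calibration activations.
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 16))
W = rng.standard_normal((8, 16))
Q = gptq_round(W, X.T @ X)
```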
TTT Experiments (Score-First, Legal)
Same protocol as our merged PR #549. 25 total TTT experiments have failed across two stacks.
SGD lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR, grad_clip=1.0. Baselines differ per row because each TTT variant freezes different layers, changing the eval-time forward pass.
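The optimizer side of that protocol can be sketched in pure numpy on a stand-in objective. The toy quadratic below replaces the model's LM loss on 32K chunks; only the hyperparameters (SGD lr=0.002, momentum=0.9, cosine LR, grad_clip=1.0) come from the PR, everything else is illustrative.

```python
import numpy as np

def cosine_lr(step, total_steps, base_lr=0.002):
    """Cosine decay from base_lr down to 0 over the TTT run."""
    return 0.5 * base_lr * (1 + np.cos(np.pi * step / total_steps))

def ttt_sgd(grad_fn, theta, steps, base_lr=0.002, momentum=0.9, grad_clip=1.0):
    """SGD with momentum, cosine LR, and global-norm gradient clipping,
    mirroring the PR's TTT hyperparameters on a stand-in objective."""
    v = np.zeros_like(theta)
    for t in range(steps):
        g = grad_fn(theta)
        norm = np.linalg.norm(g)
        if norm > grad_clip:
            g = g * (grad_clip / norm)          # clip global grad norm to 1.0
        v = momentum * v + g
        theta = theta - cosine_lr(t, steps, base_lr) * v
    return theta

# Toy quadratic stand-in for the LM loss, minimum at the origin.
theta = ttt_sgd(lambda th: 2.0 * th, np.array([1.0, -2.0, 0.5]), steps=300)
```

Note the hedged point this makes concrete: the protocol itself is unremarkable, so the TTT failures reported here are attributable to the stack, not to an exotic optimizer.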
Why TTT fails on this stack but worked on our PR #549 (−0.0025 BPB):
Architecture and Eval-Time Experiments
What's Exhausted
Untested: Non-uniform quantization grid, rate-distortion quantization (CERWU), QK-Norm, Peri-LN.
🤖 Generated with Claude Code