Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study by abaybektursun · Pull Request #756 · openai/parameter-golf

abaybektursun · 2026-03-25T18:57:51Z

Summary

Negative results on the 1.1142 BPB stack (GPTQ + XSA-all + BigramHash 3072×112): quantization algorithms, TTT, architecture experiments, and a self-generated GPTQ calibration study.

Self-Generated GPTQ Calibration Study

GPTQ calibration estimates H = X^T X (activation covariance) per layer to guide int6 rounding decisions. We tested whether the model can calibrate itself without any external data.

Trained once (seed 314, 6,942 steps), saved checkpoint, ran GPTQ with different calibration sources on the same frozen weights:

#	Calibration Source	Tokens	Time	Sliding BPB	vs Val-calib
1	Val data	~50M	~5s	1.11446	—
2	Autoregressive self-generation	131K	186s	1.11477	+0.00031
3	Random tokens (64 batches)	131K	3.4s	1.11650	+0.00204
4	Random tokens (256×48 batches)	25M	35s	1.11650	+0.00204
5	Gibbs-refined (3 rounds, 64×48)	6.3M	24.4s	1.11663	+0.00217

Row 2: the model generates 64 coherent sequences of 2048 tokens autoregressively from its own learned distribution (temperature=0.8, batch_size=8). No external data accessed. Confirmed on a separate checkpoint (BigramHash 2048×128, 8×H100); the relative gaps are consistent across stacks.

Findings:

Autoregressive self-generation closes 84% of the gap. The val-vs-random gap is 0.00204 BPB. Autoregressive generation recovers 0.00173 of that, leaving only 0.00031 BPB. The gap is predominantly natural language vs random noise — coherent text from the model's own distribution produces Hessians nearly identical to val data.
The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the training data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real FineWeb text. It is negligible.
Gibbs refinement does not help (1.11663 vs 1.11650 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.
More random tokens do not help. 131K and 25M tokens give identical BPB (1.11650). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.
Self-generated calibration at 1.1165 beats SOTA (from our PR #549, 1.1194) by 0.003 BPB with zero legality risk. Autoregressive self-generation at 1.1148 comes within 0.0003 of val-calibrated performance.

Why random tokens work at int6: The Hessian diagonal and off-diagonal structure are dominated by the model's learned weights — embedding geometry, attention patterns, MLP scales. At 63 grid levels, the rounding decisions are coarse enough that the Hessian quality threshold is low.

Quantization Algorithm Experiments

Quant gap: +0.0036 BPB (pre-quant 1.1341 → roundtrip 1.1377). At int6, GPTQ is near-optimal.

Technique	Paper	Result	Mechanism
Qronos (ICLR 2026)	arXiv:2505.11695	+0.0007 ❌	Re-collects Hessians from quantized activations. At int6, activation mismatch <0.1% — updated Hessians nearly identical.
CDQuant	arXiv:2406.17542	+0.0005 ❌	Coordinate descent re-visiting columns. At ~0.06 scale-unit spacing, most weights already at optimal grid point.

TTT Experiments (Score-First, Legal)

Same protocol as our merged PR #549. 25 total TTT experiments have failed across two stacks.

Approach	Params Unfrozen	TTT BPB	Baseline	Delta
Full TTT	27.1M (100%)	1.1146	1.1139	+0.0007 ❌
MLP-down	8.7M (32%)	1.1145	1.1144	+0.0001
MLP-all	17.3M (64%)	1.1144	1.1144	+0.0000

SGD lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR, grad_clip=1.0. Baselines differ per row because each TTT variant freezes different layers, changing the eval-time forward pass.

Why TTT fails on this stack but worked on our PR #549 (−0.0025 BPB):

XSA-all already captures the inter-document context patterns that TTT was adapting to on the previous stack
At 27M params, score-first TTT cannot overcome the forgetting/adaptation tradeoff — early chunks get no benefit, and the model is too small for late-chunk gains to compensate

Architecture and Eval-Time Experiments

Technique	Result	Mechanism
Spectral Init (λ=10 on QKV, arXiv:2603.07162)	1.52 BPB, 650ms/step ❌	λ=10 is 200× Xavier init magnitude at 27M params. 924 steps in 600s vs 6,950 baseline. Paper tested ~100M models.
SLOT bias (512-dim additive, 3 AdamW steps/chunk)	+0.0013 ❌	Global shift cannot capture per-document patterns. Final-norm → logit pipeline already calibrated.

What's Exhausted

Quant algorithm: Qronos, CDQuant both negative
Eval-time adaptation: 3× TTT + SLOT all negative
Architecture: Spectral Init catastrophic; Gated Attention (+0.0011, PR #609), DiffTransformer (1.5× slower, PR #418), Attention Residuals (54% slower, PR #362) all dead

Untested: Non-uniform quantization grid, rate-distortion quantization (CERWU), QK-Norm, Peri-LN.

🤖 Generated with Claude Code

…PTQ stack 6 experiments on the current SOTA stack (1.1142 BPB), all negative: - Qronos iterative Hessian (3 iters): +0.0007 worse - CDQuant coordinate descent (3 passes): +0.0005 worse - Full TTT (all params): +0.0001 worse - MLP-down-only TTT: +0.0001 neutral - MLP-all TTT: +0.0001 neutral Key finding: At int6, GPTQ algorithm is near-optimal. Remaining quant headroom is in the grid (what values to quantize to), not the algorithm (how to assign). TTT is dead on this stack — 25 total failed attempts across two stacks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-04-12T13:49:57Z

Community Review — Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Analysis

PR #756 — "Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study"
Head SHA: e16f7aac737d8a8e960f740c0828c61b3bdc524b
Repo: abaybektursun/parameter-golf

Files changed

Only one file modified:

records/track_10min_16mb/2026-03-25_Negative_Results_Quant_Algorithm_TTT/README.md (added, +53 lines)

No train_gpt.py changes in this PR. The fork's train_gpt.py was read from head SHA to check for any eval-path modifications.

Check 1: N-gram family bug (CLOSE if found)

No BigramHash, XSA, ctx_hash, primes, or full_key patterns found anywhere in train_gpt.py. The grep for Bigram|bigram|BigramHash|XSA|hash|primes|target.*prime|full_key|ctx_hash returned zero matches on the model/eval code. The PR description references BigramHash 3072×112 as part of a separate merged stack (PR #549) — it is not present in this submission's code. Not found.

Check 2: Pre-Quant TTT on val_tokens (CLOSE if found)

The eval_val() function (lines 226–278) uses torch.inference_mode() throughout with no optimizer steps. The PR description documents TTT experiments were run, but the README explicitly states "Legal score-first TTT: score each chunk under inference_mode(), THEN train on it." The train_gpt.py eval path contains no AdamW/SGD on val_tokens at all — it is pure forward-pass inference. Not found.

Check 3: Legal score-first TTT (CLEAN if found)

The PR body describes score-first TTT experiments (score under inference_mode() before adapting), referencing the same protocol as PR #549. However, these experiments are documented as external runs on separate scripts, not wired into the submitted train_gpt.py. The submitted eval path (lines 250–266) is a plain inference_mode() loop with no TTT mechanism. All 3 TTT variants reported neutral/negative results (+0.0001 BPB or 0.0000 BPB) and were not merged into the submission code.

Check 4: Scored-region SLOT (HOLD if found)

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

notapplica mentioned this pull request Mar 25, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

abaybektursun mentioned this pull request Mar 26, 2026

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #728

Closed

abaybektursun changed the title ~~Non-record: Negative results — quantization algorithms & TTT on val-GPTQ stack~~ Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study Mar 26, 2026

This was referenced Mar 26, 2026

Non-record: Technique Taxonomy — Tier List, Interaction Effects, and BPB Verification Tools #891

Closed

Non-record: Technique Taxonomy — Tier List, Interaction Effects, and BPB Verification Tools #892

Open

abaybektursun mentioned this pull request Mar 28, 2026

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019

Merged

resouer mentioned this pull request Apr 3, 2026

Record: Causal SLOT + Pre-quant TTT — val_bpb 1.0846 (3-seed mean) #1306

Closed

stukenov mentioned this pull request Apr 4, 2026

Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean) #1364

Open

Its-Just-Crump mentioned this pull request Apr 5, 2026

Add record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli (1.1020 BPB) #1392

Open

abaybektursun mentioned this pull request Apr 6, 2026

Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08309 (5-seed mean) #1420

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study#756

Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study#756
abaybektursun wants to merge 1 commit intoopenai:mainfrom
abaybektursun:negative-results-quant-algo-ttt

abaybektursun commented Mar 25, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

abaybektursun commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Self-Generated GPTQ Calibration Study

Quantization Algorithm Experiments

TTT Experiments (Score-First, Legal)

Architecture and Eval-Time Experiments

What's Exhausted

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study

Analysis

Files changed

Check 1: N-gram family bug (CLOSE if found)

Check 2: Pre-Quant TTT on val_tokens (CLOSE if found)

Check 3: Legal score-first TTT (CLEAN if found)

Check 4: Scored-region SLOT (HOLD if found)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

abaybektursun commented Mar 25, 2026 •

edited

Loading