Record: TurboQuant + Full-Rescore N-gram (val_bpb=0.1653) #918
haikosys wants to merge 14 commits into openai:main
Conversation
37.6M params via rotation-based Lloyd-Max codebook quantization (2/3/4-bit mixed) replacing int6, freeing 39% more params in 16MB budget. Full two-pass n-gram rescore from PR openai#870 for eval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename folder to YYYY-MM-DD_DescriptiveName convention - Update submission.json with required fields (author, github_id, val_bpb, blurb) - Expand README with full details matching accepted PRs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
torch.Generator can't be traced by dynamo. Disable compilation for _turbo_get_rotation, _turbo_get_codebook, _turbo_cached_cb — they return cached tensors that dynamo handles fine as opaque values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move TurboQuant STE, rotation lookup, and codebook lookup into a single @torch.compiler.disable function _turbo_qat_forward(). This ensures dynamo NEVER traces any TurboQuant code — the compiled CastedLinear just calls an opaque function that returns the quantized weight. Eliminates all possible dynamo crash vectors: - torch.Generator (was fixed) - _TurboQuantSTE.apply() custom autograd - Global dict lookups (_turbo_rotation_cache, _turbo_cb_cache) - Runtime-dependent control flow (cache miss paths) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fullgraph=True forces dynamo to trace the ENTIRE forward as one graph with zero breaks. @torch.compiler.disable functions need graph breaks. These are incompatible. fullgraph=False lets dynamo break around the TurboQuant helper functions while still compiling everything else. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- weights_only=False in turbo_decompress_model (meta dict has nested dicts) - Explicitly disable _turbo_qat_enabled before eval phase - Both from TeamCreate audit findings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NUM_LAYERS default 11->13 (44.2M params, fits in 15.4MB) - Suppress torch._dynamo recompile warnings (noisy but harmless) - weights_only=False for turbo meta dict compatibility - Disable QAT before eval phase Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 13L/576d/3.5x, 44.2M params - val_bpb: 0.1648 (n-gram rescore), artifact: 15.35 MB - Pre-quant: 1.1330, post-quant: 1.4625 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same 13L/576d/3.5x TurboQuant base as turbogrannie, with enhanced eval: - Two-pass phrase cache (lengths 16-128, 8M buckets) - N-gram orders 2-14 (was 2-12), 32M buckets (was 16M) - Joint blend: neural + n-gram + phrase in single mixture - Extended primes array for higher orders Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
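The bucketed n-gram counting described above can be sketched roughly as follows. Bucket count is shrunk for illustration (the commit uses 32M), the primes are an arbitrary per-order multiplier list, and `ngram_bucket` / `count_ngrams` are hypothetical names, not the submission's API:

```python
NUM_BUCKETS = 1 << 16  # illustration only; the submission uses 32M buckets
# One hash multiplier per n-gram order; the commit extends this list so
# orders 2-14 each get their own prime.
PRIMES = [10007, 10009, 10037, 10039, 10061, 10067, 10069,
          10079, 10091, 10093, 10099, 10103, 10111]

def ngram_bucket(context: list, order: int) -> int:
    """Hash the last (order-1) context tokens into a fixed bucket index."""
    h = 0
    for tok in context[-(order - 1):]:
        h = (h * PRIMES[order - 2] + tok) % NUM_BUCKETS
    return h

def count_ngrams(tokens: list, orders=range(2, 15)):
    """counts[order][bucket][next_token] accumulated over a token stream."""
    counts = {o: {} for o in orders}
    for i in range(1, len(tokens)):
        for o in orders:
            if i >= o - 1:  # need (order-1) tokens of context
                b = ngram_bucket(tokens[:i], o)
                nxt = counts[o].setdefault(b, {})
                nxt[tokens[i]] = nxt.get(tokens[i], 0) + 1
    return counts
```

At eval time the per-bucket next-token counts would be normalized and blended with the neural and phrase-cache distributions in a single mixture, per the commit description.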
3-seed mean val_bpb: 0.1653 (std 0.0010) seed 1337: 0.1648 seed 42: 0.1646 seed 2024: 0.1665 Full submission package: - README.md with detailed results table and methodology - submission.json with 3-seed mean BPB and metadata - train_gpt.py (self-contained, 135KB) - train_seed1337.log, train_seed42.log, train_seed2024.log Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Google claims "zero accuracy loss" at 3-4 bit. Our stress test shows 0.33 BPB quant penalty at 2/3/4-bit weight quantization — 41x worse than int6. The technique works for KV cache on large models, not for weight compression on small models at extreme bit widths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Record: TurboQuant + Full-Rescore N-gram (val_bpb=0.1653)
BPB: 0.1653 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)
What I found in the code (head SHA …): the TTT path at line 1368 implements the score-first-per-chunk pattern: each chunk is scored under the pre-update weights. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 2.10s, dim=576, layers=13, vocab=1024, code=135521 B, SMOKE_TEST_PASS.
Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.
Auto-classification caveat: this review was drafted by the deterministic AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.
Reviewed by @MatoTeziTanka — The Agora.
Record: TurboQuant + Full-Rescore N-gram Cache (13L/576d/3.5x)
val_bpb: 0.1653 (3-seed mean, std 0.0010) | 15.35 MB artifact | 8xH100 SXM, 600s
Summary
TurboQuant rotation-based Lloyd-Max codebook quantization replaces int6, enabling 64% more parameters (44.2M vs 27.0M) in the same 16MB budget. Combined with PR #870's two-pass full-rescore n-gram cache for eval.
Results (8xH100 80GB SXM)
Architecture
Quantization: TurboQuant
Eval: Two-Pass Full-Rescore N-gram Cache (from PR #870)
Training
Reproduction
Lineage
On TurboQuant: Claims vs Reality
Google's TurboQuant blog post claims "zero accuracy loss" at 3-4 bit quantization via PolarQuant rotation + QJL error correction, tested on KV cache compression for inference. The marketing is seductive: 6x memory reduction with "perfect downstream results across all benchmarks."
This submission is a stress test of those claims applied to weight quantization in a parameter-constrained setting. The results are sobering:
The 64% more parameters do not compensate for the 41x worse quantization penalty. The rotation + Lloyd-Max codebook approach is theoretically optimal for Gaussian-distributed weights at a given bit width, but 2-3 bits is simply too few for weight matrices. Google's "zero accuracy loss" claim is for KV cache quantization at 3-4 bits on large models (8B+ params) where individual cache entry precision matters less. For weight quantization on small models where every bit counts, the story is very different.
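As a concrete illustration of the Lloyd-Max stage (the rotation step, which makes the weight distribution approximately Gaussian, is omitted), here is a minimal scalar Lloyd-Max codebook fit; function names are illustrative and this is not the submission's implementation:

```python
import numpy as np

def lloyd_max_codebook(w: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Fit a 2**bits-level scalar codebook to the weights (Lloyd's algorithm)."""
    levels = 2 ** bits
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    codebook = np.quantile(w, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Assign each weight to its nearest centroid...
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        # ...then move each centroid to the mean of its assigned weights.
        for j in range(levels):
            if (idx == j).any():
                codebook[j] = w[idx == j].mean()
    return codebook

def quantize(w: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Snap each weight to its nearest codebook entry."""
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx]
```

Running this on Gaussian data makes the core tension visible: the codebook is MSE-optimal for the given bit width, but at 2 bits there are only 4 representable values per rotated coordinate, and no amount of centroid placement recovers the lost precision.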
Key findings:
Bottom line: TurboQuant is a real technique with real advantages at moderate compression ratios (4-6 bit). The "zero accuracy loss" marketing does not extend to aggressive 2-3 bit weight quantization. For this competition, simple int6 per-row quantization with fewer parameters outperforms TurboQuant with more parameters by 0.07 BPB.
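For contrast, the int6 per-row baseline amounts to something like the following sketch (symmetric absmax scaling to 63 signed levels; names are illustrative, not the baseline's exact code):

```python
import torch

def int6_quantize_per_row(w: torch.Tensor):
    """Symmetric per-row absmax quantization to 6-bit signed levels in [-31, 31]."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)
    return q, scale

def int6_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale
```

Per row, the reconstruction error is bounded by half the row's scale, which at 6 bits is small enough that the simpler scheme with fewer parameters wins overall.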