A from-scratch PyTorch reproduction of TurboQuant (ICLR 2026, Google Research) for LLM KV cache compression.
**TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate**
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni (Google Research)
Published at ICLR 2026 | arXiv
TurboQuant is a data-oblivious vector quantization method that achieves near-optimal distortion rates for compressing LLM KV caches. The key insight: randomly rotating vectors induces a concentrated Beta distribution on coordinates, enabling optimal per-coordinate scalar quantization.
- Normalize — record vector norm, project to unit sphere
- Rotate — apply Haar random orthogonal matrix (via QR decomposition)
- Quantize — apply Lloyd-Max optimal scalar quantizer per coordinate
- Dequantize — centroid lookup → inverse rotation → rescale by norm
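The four steps above can be sketched end-to-end in a few lines of PyTorch. This is a minimal illustration, not the repo's implementation: the uniform codebook is a stand-in for the actual Lloyd-Max centroids, and the helper names are hypothetical.

```python
import torch

def haar_rotation(d, generator=None):
    # Haar-distributed orthogonal matrix via QR decomposition with sign correction
    g = torch.randn(d, d, generator=generator)
    q, r = torch.linalg.qr(g)
    return q * torch.sign(torch.diagonal(r))  # flip column signs by sign(diag(R))

def quantize(x, codebook, rot):
    norm = x.norm()                  # 1. record the vector norm
    z = rot @ (x / norm)             # 2. rotate the unit-sphere projection
    # 3. nearest-centroid scalar quantization, independently per coordinate
    idx = torch.argmin((z[:, None] - codebook[None, :]).abs(), dim=1)
    return norm, idx

def dequantize(norm, idx, codebook, rot):
    z_hat = codebook[idx]            # 4. centroid lookup ...
    return norm * (rot.T @ z_hat)    #    ... inverse rotation, rescale by norm

d, b = 64, 4
gen = torch.Generator().manual_seed(0)
rot = haar_rotation(d, gen)
# stand-in codebook: uniform levels spanning ~3 standard deviations of a
# rotated unit vector's coordinates (each has scale ~1/sqrt(d))
codebook = torch.linspace(-3 / d**0.5, 3 / d**0.5, 2**b)
x = torch.randn(d, generator=gen)
norm, idx = quantize(x, codebook, rot)
x_hat = dequantize(norm, idx, codebook, rot)
rel_err = (x - x_hat).norm() / x.norm()
```

Storage per vector is `d` codebook indices (`b` bits each) plus one scalar norm; the random rotation is shared across all vectors, so it is generated once and never stored.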
Two-stage approach for unbiased inner product preservation:
- Apply TurboQuant_MSE with (b-1) bits
- Apply QJL (Quantized Johnson-Lindenstrauss) on residual with 1 bit
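The 1-bit QJL stage stores only the sign bits of a Gaussian projection of the residual, plus the residual's norm, and recovers inner products without bias. A toy sketch (function names are hypothetical, and `m` is set large so the estimate visibly concentrates):

```python
import math
import torch

def qjl_encode(r, S):
    # 1 bit per projection (the sign pattern) plus one scalar (the norm)
    return torch.sign(S @ r), r.norm()

def qjl_inner(signs, norm, S, q):
    # unbiased inner-product estimator: for a Gaussian row s,
    # E[sign(s @ r) * (s @ q)] = sqrt(2/pi) * <r, q> / ||r||
    m = S.shape[0]
    return norm * math.sqrt(math.pi / 2) / m * (signs * (S @ q)).sum()

gen = torch.Generator().manual_seed(0)
d, m = 32, 4096
S = torch.randn(m, d, generator=gen)   # shared Gaussian sketch matrix
r = torch.randn(d, generator=gen)      # stand-in for the quantization residual
q = torch.randn(d, generator=gen)      # a query vector
est = qjl_inner(*qjl_encode(r, S), S, q)
```

Because the estimator is unbiased, adding it to the biased-but-low-MSE first stage yields unbiased inner products overall, which is what attention score computation needs.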
Theoretical guarantee: D_mse ≤ (√3π/2) · (1/4^b), within 2.7× of the information-theoretic lower bound.
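Plugging concrete bit widths into the bound shows how quickly it decays (a quick numeric check; the constant is √3·π/2 ≈ 2.72):

```python
import math

def mse_upper_bound(b: int) -> float:
    # D_mse <= (sqrt(3) * pi / 2) * 4**(-b), the guarantee stated above
    return math.sqrt(3) * math.pi / 2 * 4 ** (-b)

for b in (2, 3, 4):
    print(f"b={b}: D_mse <= {mse_upper_bound(b):.4f}")
# each extra bit shrinks the bound by 4x; b=4 gives roughly 0.0106
```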
WikiText-2 perplexity:

| Method | Bits/channel | PPL | ΔPPL |
|---|---|---|---|
| FP16 (baseline) | 16 | 5.36 | — |
| TurboQuant K4V4 | 4 | 5.39 | +0.03 |
| TurboQuant K3V2 | ~2.5 | 5.48 | +0.12 |
| TurboQuant K2V2 | 2 | 5.79 | +0.43 |
Key findings matching the paper:
- 4-bit (K4V4): virtually lossless (+0.03 PPL), consistent with paper's "absolute quality neutrality at 3.5 bits"
- ~2.5-bit (K3V2): marginal degradation (+0.12 PPL), consistent with paper's "marginal quality degradation at 2.5 bits"
- 2-bit (K2V2): noticeable but still small degradation (+0.43 PPL)
Comparison with prior KV-cache quantization methods:

| Method | KV Size (bits) | Average Score |
|---|---|---|
| Full Cache | 16 | 50.06 |
| TurboQuant | 3.5 | 50.06 |
| TurboQuant | 2.5 | 49.44 |
| KIVI | 3 | 48.50 |
| KIVI | 5 | 50.16 |
| PolarQuant | 3.9 | 49.78 |
Needle-in-a-Haystack retrieval recall:

| Method | Recall |
|---|---|
| Full-Precision | 0.997 |
| TurboQuant | 0.997 |
| PolarQuant | 0.995 |
| KIVI | 0.981 |
| SnapKV | 0.858 |
Prompt: "Explain what a transformer architecture is in 2 sentences."

```
[FP16]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, by weighing the importance
of different input eleme...

[K4V4]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, by weighing the importance
of different input eleme... (IDENTICAL to FP16)

[K3V2]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, allowing it to weigh the
importance of different in... (minor rewording)

[K2V2]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, in a parallel and efficient
manner... (coherent, different phrasing)
```
```
turboquant/
├── lloyd_max.py           # Lloyd-Max optimal scalar quantizer for the Beta distribution
├── codebook.py            # Codebook pre-computation and caching
├── rotation.py            # Haar random rotation matrix generation
├── quantizer_mse.py       # TurboQuant_MSE (Algorithm 1)
├── quantizer_prod.py      # TurboQuant_prod (Algorithm 2, MSE + QJL)
├── kv_cache.py            # HuggingFace transformers Cache integration
└── utils.py               # Bit-packing utilities
tests/                     # 38 unit tests covering all components
eval/
├── perplexity.py          # WikiText-2 PPL evaluation
└── needle_in_haystack.py  # Needle-in-a-Haystack evaluation
```
```bash
pip install torch transformers datasets scipy numpy tqdm
```
```bash
# Run all tests (38 tests)
python -m pytest tests/ -v

# Pre-compute codebooks
PYTHONPATH=. python scripts/precompute_codebooks.py

# Evaluate perplexity
PYTHONPATH=. python -m eval.perplexity --model meta-llama/Llama-3.1-8B-Instruct --max-tokens 2048
```
```python
# Use in your code
from turboquant.kv_cache import TurboQuantCache

cache = TurboQuantCache(model.config, nbits_key=4, nbits_value=4, residual_length=128)
output = model.generate(input_ids, past_key_values=cache, max_new_tokens=100)
```

- Codebook: Lloyd-Max algorithm with `scipy.integrate.quad` for numerical integration over the Beta distribution
- Rotation: Haar-distributed via QR decomposition with sign correction
- Cache integration: chunk-based storage (no re-quantization) to avoid error accumulation; the first update returns the original tensors, so prefill is exact
- Multi-GPU: compatible with `device_map='auto'`; quantizers lazily initialize on the correct device
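The Lloyd-Max codebook construction can be sketched as follows. Assumptions worth flagging: one coordinate of a uniformly random unit vector in R^d has density proportional to (1 − t²)^((d−3)/2) on [−1, 1] (a scaled Beta), and the iteration alternates midpoint boundaries with conditional-mean centroids; the function names are illustrative, not the repo's API.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln

def coord_pdf(t, d):
    # density of one coordinate of a uniform point on the unit sphere S^(d-1):
    # p(t) = (1 - t^2)^((d-3)/2) / B(1/2, (d-1)/2)
    log_c = -betaln(0.5, (d - 1) / 2)
    return np.exp(log_c) * (1.0 - t * t) ** ((d - 3) / 2)

def lloyd_max(d, b, iters=50):
    # Lloyd-Max fixed point: cell boundaries are midpoints between adjacent
    # centroids; each centroid is the conditional mean of its cell
    k = 2 ** b
    centroids = np.linspace(-2 / np.sqrt(d), 2 / np.sqrt(d), k)  # initial guess
    for _ in range(iters):
        bounds = np.concatenate(([-1.0], (centroids[:-1] + centroids[1:]) / 2, [1.0]))
        for j in range(k):
            mass, _ = quad(lambda t: coord_pdf(t, d), bounds[j], bounds[j + 1])
            mean, _ = quad(lambda t: t * coord_pdf(t, d), bounds[j], bounds[j + 1])
            if mass > 0:
                centroids[j] = mean / mass
    return centroids

cb = lloyd_max(d=64, b=2)  # 4 symmetric levels for 2-bit quantization
```

Since the density is fixed once `d` is known, these centroids are data-independent and can be computed offline and cached, which is what the `codebook.py` pre-computation step does.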
- PyTorch >= 2.0.0
- transformers >= 4.40.0
- scipy, numpy, datasets, tqdm