A from-scratch PyTorch reproduction of TurboQuant (ICLR 2026, Google Research) for LLM KV cache compression.
**TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate**
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni (Google Research)
Published at ICLR 2026 | arXiv
TurboQuant is a data-oblivious vector quantization method that achieves near-optimal distortion rates for compressing LLM KV caches. The key insight: randomly rotating vectors induces a concentrated Beta distribution on coordinates, enabling optimal per-coordinate scalar quantization.
- Normalize — record vector norm, project to unit sphere
- Rotate — apply Haar random orthogonal matrix (via QR decomposition)
- Quantize — apply Lloyd-Max optimal scalar quantizer per coordinate
- Dequantize — centroid lookup → inverse rotation → rescale by norm
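The four steps above can be sketched end-to-end in a few lines of PyTorch. This is a minimal illustration, not the repo's implementation: the uniform codebook is a stand-in for the actual Lloyd-Max centroids, and the helper names are hypothetical.

```python
import torch

def haar_rotation(d, generator=None):
    # Haar-distributed orthogonal matrix via QR decomposition with sign correction
    g = torch.randn(d, d, generator=generator)
    q, r = torch.linalg.qr(g)
    return q * torch.sign(torch.diagonal(r))  # flip column signs by sign(diag(R))

def quantize(x, codebook, rot):
    norm = x.norm()                  # 1. record the vector norm
    z = rot @ (x / norm)             # 2. rotate the unit-sphere projection
    # 3. nearest-centroid scalar quantization, independently per coordinate
    idx = torch.argmin((z[:, None] - codebook[None, :]).abs(), dim=1)
    return norm, idx

def dequantize(norm, idx, codebook, rot):
    z_hat = codebook[idx]            # 4. centroid lookup ...
    return norm * (rot.T @ z_hat)    #    ... inverse rotation, rescale by norm

d, b = 64, 4
gen = torch.Generator().manual_seed(0)
rot = haar_rotation(d, gen)
# stand-in codebook: uniform levels spanning ~3 standard deviations of a
# rotated unit vector's coordinates (each has scale ~1/sqrt(d))
codebook = torch.linspace(-3 / d**0.5, 3 / d**0.5, 2**b)
x = torch.randn(d, generator=gen)
norm, idx = quantize(x, codebook, rot)
x_hat = dequantize(norm, idx, codebook, rot)
rel_err = (x - x_hat).norm() / x.norm()
```

Storage per vector is `d` codebook indices (`b` bits each) plus one scalar norm; the random rotation is shared across all vectors, so it is generated once and never stored.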
Two-stage approach for unbiased inner product preservation:
- Apply TurboQuant_MSE with (b-1) bits
- Apply QJL (Quantized Johnson-Lindenstrauss) on residual with 1 bit
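The 1-bit QJL stage stores only the sign bits of a Gaussian projection of the residual, plus the residual's norm, and recovers inner products without bias. A toy sketch (function names are hypothetical, and `m` is set large so the estimate visibly concentrates):

```python
import math
import torch

def qjl_encode(r, S):
    # 1 bit per projection (the sign pattern) plus one scalar (the norm)
    return torch.sign(S @ r), r.norm()

def qjl_inner(signs, norm, S, q):
    # unbiased inner-product estimator: for a Gaussian row s,
    # E[sign(s @ r) * (s @ q)] = sqrt(2/pi) * <r, q> / ||r||
    m = S.shape[0]
    return norm * math.sqrt(math.pi / 2) / m * (signs * (S @ q)).sum()

gen = torch.Generator().manual_seed(0)
d, m = 32, 4096
S = torch.randn(m, d, generator=gen)   # shared Gaussian sketch matrix
r = torch.randn(d, generator=gen)      # stand-in for the quantization residual
q = torch.randn(d, generator=gen)      # a query vector
est = qjl_inner(*qjl_encode(r, S), S, q)
```

Because the estimator is unbiased, adding it to the biased-but-low-MSE first stage yields unbiased inner products overall, which is what attention score computation needs.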
Theoretical guarantee: D_mse ≤ (√3π/2) · (1/4^b), within 2.7× of the information-theoretic lower bound.
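Plugging concrete bit widths into the bound shows how quickly it decays (a quick numeric check; the constant is √3·π/2 ≈ 2.72):

```python
import math

def mse_upper_bound(b: int) -> float:
    # D_mse <= (sqrt(3) * pi / 2) * 4**(-b), the guarantee stated above
    return math.sqrt(3) * math.pi / 2 * 4 ** (-b)

for b in (2, 3, 4):
    print(f"b={b}: D_mse <= {mse_upper_bound(b):.4f}")
# each extra bit shrinks the bound by 4x; b=4 gives roughly 0.0106
```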
WikiText-2 perplexity:

| Method | Bits/channel | PPL | ΔPPL |
|---|---|---|---|
| FP16 (baseline) | 16 | 5.36 | — |
| TurboQuant K4V4 | 4 | 5.39 | +0.03 |
| TurboQuant K3V2 | ~2.5 | 5.48 | +0.12 |
| TurboQuant K2V2 | 2 | 5.79 | +0.43 |
Key findings matching the paper:
- 4-bit (K4V4): virtually lossless (+0.03 PPL), consistent with paper's "absolute quality neutrality at 3.5 bits"
- ~2.5-bit (K3V2): marginal degradation (+0.12 PPL), consistent with paper's "marginal quality degradation at 2.5 bits"
- 2-bit (K2V2): noticeable but still small degradation (+0.43 PPL)
Comparison with prior KV-cache quantization methods:

| Method | KV Size (bits) | Average Score |
|---|---|---|
| Full Cache | 16 | 50.06 |
| TurboQuant | 3.5 | 50.06 |
| TurboQuant | 2.5 | 49.44 |
| KIVI | 3 | 48.50 |
| KIVI | 5 | 50.16 |
| PolarQuant | 3.9 | 49.78 |
Needle-in-a-Haystack retrieval recall:

| Method | Recall |
|---|---|
| Full-Precision | 0.997 |
| TurboQuant | 0.997 |
| PolarQuant | 0.995 |
| KIVI | 0.981 |
| SnapKV | 0.858 |
Prompt: "Explain what a transformer architecture is in 2 sentences."

```
[FP16]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, by weighing the importance
of different input eleme...

[K4V4]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, by weighing the importance
of different input eleme... (IDENTICAL to FP16)

[K3V2]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, allowing it to weigh the
importance of different in... (minor rewording)

[K2V2]: A transformer architecture is a type of neural network model that uses self-attention
mechanisms to process sequential data, such as text or speech, in a parallel and efficient
manner... (coherent, different phrasing)
```
```
turboquant/
├── lloyd_max.py           # Lloyd-Max optimal scalar quantizer for the Beta distribution
├── codebook.py            # Codebook pre-computation and caching
├── rotation.py            # Haar random rotation matrix generation
├── quantizer_mse.py       # TurboQuant_MSE (Algorithm 1)
├── quantizer_prod.py      # TurboQuant_prod (Algorithm 2, MSE + QJL)
├── kv_cache.py            # HuggingFace transformers Cache integration
└── utils.py               # Bit-packing utilities
tests/                     # 38 unit tests covering all components
eval/
├── perplexity.py          # WikiText-2 PPL evaluation
└── needle_in_haystack.py  # Needle-in-a-Haystack evaluation
```
```bash
pip install torch transformers datasets scipy numpy tqdm
```
```bash
# Run all tests (38 tests)
python -m pytest tests/ -v

# Pre-compute codebooks
PYTHONPATH=. python scripts/precompute_codebooks.py

# Evaluate perplexity
PYTHONPATH=. python -m eval.perplexity --model meta-llama/Llama-3.1-8B-Instruct --max-tokens 2048
```
```python
# Use in your code
from turboquant.kv_cache import TurboQuantCache

cache = TurboQuantCache(model.config, nbits_key=4, nbits_value=4, residual_length=128)
output = model.generate(input_ids, past_key_values=cache, max_new_tokens=100)
```

- Codebook: Lloyd-Max algorithm with `scipy.integrate.quad` for numerical integration over the Beta distribution
- Rotation: Haar-distributed via QR decomposition with sign correction
- Cache integration: chunk-based storage (no re-quantization) to avoid error accumulation; the first update returns the original tensors, so prefill is exact
- Multi-GPU: compatible with `device_map='auto'`; quantizers lazily initialize on the correct device
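The Lloyd-Max codebook construction can be sketched as follows. Assumptions worth flagging: one coordinate of a uniformly random unit vector in R^d has density proportional to (1 − t²)^((d−3)/2) on [−1, 1] (a scaled Beta), and the iteration alternates midpoint boundaries with conditional-mean centroids; the function names are illustrative, not the repo's API.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln

def coord_pdf(t, d):
    # density of one coordinate of a uniform point on the unit sphere S^(d-1):
    # p(t) = (1 - t^2)^((d-3)/2) / B(1/2, (d-1)/2)
    log_c = -betaln(0.5, (d - 1) / 2)
    return np.exp(log_c) * (1.0 - t * t) ** ((d - 3) / 2)

def lloyd_max(d, b, iters=50):
    # Lloyd-Max fixed point: cell boundaries are midpoints between adjacent
    # centroids; each centroid is the conditional mean of its cell
    k = 2 ** b
    centroids = np.linspace(-2 / np.sqrt(d), 2 / np.sqrt(d), k)  # initial guess
    for _ in range(iters):
        bounds = np.concatenate(([-1.0], (centroids[:-1] + centroids[1:]) / 2, [1.0]))
        for j in range(k):
            mass, _ = quad(lambda t: coord_pdf(t, d), bounds[j], bounds[j + 1])
            mean, _ = quad(lambda t: t * coord_pdf(t, d), bounds[j], bounds[j + 1])
            if mass > 0:
                centroids[j] = mean / mass
    return centroids

cb = lloyd_max(d=64, b=2)  # 4 symmetric levels for 2-bit quantization
```

Since the density is fixed once `d` is known, these centroids are data-independent and can be computed offline and cached, which is what the `codebook.py` pre-computation step does.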
- PyTorch >= 2.0.0
- transformers >= 4.40.0
- scipy, numpy, datasets, tqdm