xiehuanyi/TurboQuant
# TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

A from-scratch PyTorch reproduction of TurboQuant (ICLR 2026, Google Research) for LLM KV cache compression.

## Paper

**TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate**  
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni (Google Research)  
Published at ICLR 2026 | arXiv

## Algorithm Overview

TurboQuant is a data-oblivious vector quantization method that achieves near-optimal distortion rates for compressing LLM KV caches. The key insight: randomly rotating vectors induces a concentrated Beta distribution on coordinates, enabling optimal per-coordinate scalar quantization.
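
The concentration claim is easy to check empirically. Below is a minimal sketch (not the repo's code) that builds a Haar-random orthogonal matrix via QR with sign correction and inspects the coordinate spread of a rotated unit vector:

```python
import torch

torch.manual_seed(0)
d = 256

# Haar-random orthogonal matrix: QR of a Gaussian matrix,
# with columns sign-corrected by the diagonal of R
g = torch.randn(d, d)
q, r = torch.linalg.qr(g)
q = q * torch.sign(torch.diagonal(r))

x = torch.randn(d)
y = q @ (x / x.norm())  # rotated unit vector

# each coordinate concentrates around 0 with spread ~ 1/sqrt(d)
print(f"coordinate std: {y.std().item():.4f}  vs  1/sqrt(d): {d ** -0.5:.4f}")
```

Because the coordinates are tightly concentrated, a single scalar codebook matched to their distribution works well for every coordinate.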

### TurboQuant_MSE (Algorithm 1)

  1. Normalize — record vector norm, project to unit sphere
  2. Rotate — apply Haar random orthogonal matrix (via QR decomposition)
  3. Quantize — apply Lloyd-Max optimal scalar quantizer per coordinate
  4. Dequantize — centroid lookup → inverse rotation → rescale by norm
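
The four steps above can be sketched in a few lines of PyTorch. This is an illustrative re-derivation, not the repo's `quantizer_mse.py`; the uniform placeholder codebook stands in for the real Lloyd-Max centroids:

```python
import torch

torch.manual_seed(0)

def haar_rotation(d):
    # QR of a Gaussian matrix; multiplying columns by sign(diag(R))
    # makes Q exactly Haar-distributed
    g = torch.randn(d, d)
    q, r = torch.linalg.qr(g)
    return q * torch.sign(torch.diagonal(r))

def quantize_mse(x, centroids, rot):
    norm = x.norm()                   # 1. record the norm
    y = rot @ (x / norm)              # 2. unit sphere + random rotation
    # 3. nearest centroid per coordinate (scalar codebook lookup)
    codes = torch.argmin((y.unsqueeze(-1) - centroids).abs(), dim=-1)
    return codes, norm

def dequantize_mse(codes, norm, centroids, rot):
    y_hat = centroids[codes]          # 4. centroid lookup
    return norm * (rot.T @ y_hat)     #    inverse rotation, rescale

d = 128
rot = haar_rotation(d)
# placeholder 4-bit codebook; the real one is Lloyd-Max-optimal for the
# Beta-distributed coordinates
centroids = torch.linspace(-0.3, 0.3, 16)

x = torch.randn(d)
codes, norm = quantize_mse(x, centroids, rot)
x_hat = dequantize_mse(codes, norm, centroids, rot)
print((x_hat - x).norm().item() / x.norm().item())  # small relative error
```

Only `codes` (4 bits per coordinate) and the scalar `norm` need to be stored; the rotation is reproducible from a seed.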

### TurboQuant_prod (Algorithm 2)

Two-stage approach for unbiased inner product preservation:

  1. Apply TurboQuant_MSE with (b-1) bits
  2. Apply QJL (Quantized Johnson-Lindenstrauss) on residual with 1 bit
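
A rough sketch of the two-stage idea follows. The second stage here is a simplified stand-in for QJL: it keeps only the signs of the rotated residual plus the residual norm, with an illustrative reconstruction scaling (the paper's QJL estimator uses its own scaling to make inner products unbiased):

```python
import torch

torch.manual_seed(0)

def haar_rotation(d):
    g = torch.randn(d, d)
    q, r = torch.linalg.qr(g)
    return q * torch.sign(torch.diagonal(r))

d = 128
rot1, rot2 = haar_rotation(d), haar_rotation(d)
centroids = torch.linspace(-0.3, 0.3, 8)   # stage 1: (b-1) = 3 bits

x = torch.randn(d)
norm = x.norm()

# Stage 1: TurboQuant_MSE with (b-1) bits
y = rot1 @ (x / norm)
codes = torch.argmin((y.unsqueeze(-1) - centroids).abs(), dim=-1)
y_hat = centroids[codes]

# Stage 2: 1-bit code on the residual (QJL-like): rotate again,
# keep only the signs and the residual norm
res = y - y_hat
signs = torch.sign(rot2 @ res)

# Reconstruction: stage-1 centroids plus a sign vector rescaled to the
# residual norm (illustrative choice, not the paper's exact estimator)
res_hat = rot2.T @ (signs * res.norm() / d ** 0.5)
x_hat = norm * (rot1.T @ (y_hat + res_hat))

err1 = (norm * (rot1.T @ y_hat) - x).norm()   # stage 1 only
err2 = (x_hat - x).norm()                     # both stages
print(err1.item(), err2.item())
```

Even this crude 1-bit residual pass visibly shrinks the reconstruction error relative to stage 1 alone.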

Theoretical guarantee: D_mse ≤ (√3π/2) · (1/4^b), within 2.7× of the information-theoretic lower bound.
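
Evaluating the bound directly shows the 4x distortion drop per extra bit, and the constant √3·π/2 ≈ 2.72 is exactly the claimed gap to the 4^(−b) lower bound:

```python
import math

def mse_upper_bound(b):
    # D_mse <= (sqrt(3) * pi / 2) * 4**(-b), per the guarantee above
    return (math.sqrt(3) * math.pi / 2) * 4.0 ** (-b)

for b in (2, 3, 4):
    print(f"b={b}: D_mse <= {mse_upper_bound(b):.5f}")
```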

## Reproduction Results

### WikiText-2 Perplexity (Llama-3.1-8B-Instruct, 2048 tokens, auto-regressive)

| Method | Bits/channel | PPL | Delta |
|---|---|---|---|
| FP16 (baseline) | 16 | 5.36 | |
| TurboQuant K4V4 | 4 | 5.39 | +0.03 |
| TurboQuant K3V2 | ~2.5 | 5.48 | +0.12 |
| TurboQuant K2V2 | 2 | 5.79 | +0.43 |

Key findings matching the paper:

  • 4-bit (K4V4): virtually lossless (+0.03 PPL), consistent with the paper's "absolute quality neutrality at 3.5 bits"
  • ~2.5-bit (K3V2): marginal degradation (+0.12 PPL), consistent with the paper's "marginal quality degradation at 2.5 bits"
  • 2-bit (K2V2): noticeable but still small degradation (+0.43 PPL)

### Paper's LongBench-E Results (Llama-3.1-8B-Instruct)

| Method | KV Size (bits) | Average Score |
|---|---|---|
| Full Cache | 16 | 50.06 |
| TurboQuant | 3.5 | 50.06 |
| TurboQuant | 2.5 | 49.44 |
| KIVI | 3 | 48.50 |
| KIVI | 5 | 50.16 |
| PolarQuant | 3.9 | 49.78 |

### Paper's Needle-in-a-Haystack Results (Llama-3.1-8B-Instruct, 4K-104K)

| Method | Recall |
|---|---|
| Full-Precision | 0.997 |
| TurboQuant | 0.997 |
| PolarQuant | 0.995 |
| KIVI | 0.981 |
| SnapKV | 0.858 |

### Generation Quality (Llama-3.1-8B-Instruct)

Prompt: "Explain what a transformer architecture is in 2 sentences."

```text
[FP16]: A transformer architecture is a type of neural network model that uses self-attention
        mechanisms to process sequential data, such as text or speech, by weighing the importance
        of different input eleme...

[K4V4]: A transformer architecture is a type of neural network model that uses self-attention
        mechanisms to process sequential data, such as text or speech, by weighing the importance
        of different input eleme...  (IDENTICAL to FP16)

[K3V2]: A transformer architecture is a type of neural network model that uses self-attention
        mechanisms to process sequential data, such as text or speech, allowing it to weigh the
        importance of different in...  (minor rewording)

[K2V2]: A transformer architecture is a type of neural network model that uses self-attention
        mechanisms to process sequential data, such as text or speech, in a parallel and efficient
        manner...  (coherent, different phrasing)
```

## Project Structure

```text
turboquant/
├── lloyd_max.py          # Lloyd-Max optimal scalar quantizer for Beta distribution
├── codebook.py           # Codebook pre-computation and caching
├── rotation.py           # Haar random rotation matrix generation
├── quantizer_mse.py      # TurboQuant_MSE (Algorithm 1)
├── quantizer_prod.py     # TurboQuant_prod (Algorithm 2, MSE + QJL)
├── kv_cache.py           # HuggingFace transformers Cache integration
└── utils.py              # Bit-packing utilities

tests/                    # 38 unit tests covering all components
eval/
├── perplexity.py         # WikiText-2 PPL evaluation
└── needle_in_haystack.py # Needle-in-a-Haystack evaluation
```

## Quick Start

```bash
pip install torch transformers datasets scipy numpy tqdm

# Run all tests (38 tests)
python -m pytest tests/ -v

# Pre-compute codebooks
PYTHONPATH=. python scripts/precompute_codebooks.py

# Evaluate perplexity
PYTHONPATH=. python -m eval.perplexity --model meta-llama/Llama-3.1-8B-Instruct --max-tokens 2048
```

Use it in your own code:

```python
from turboquant.kv_cache import TurboQuantCache

cache = TurboQuantCache(model.config, nbits_key=4, nbits_value=4, residual_length=128)
output = model.generate(input_ids, past_key_values=cache, max_new_tokens=100)
```

## Key Implementation Details

  • Codebook: Lloyd-Max algorithm with `scipy.integrate.quad` for numerical integration on the Beta distribution
  • Rotation: Haar-distributed via QR decomposition with sign correction
  • Cache integration: chunk-based storage (no re-quantization) to avoid error accumulation; the first update returns the original tensors, so prefill is exact
  • Multi-GPU: compatible with `device_map='auto'` — quantizers lazily initialize on the correct device
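
The Lloyd-Max codebook step can be sketched as follows. This is an illustrative stand-in for `lloyd_max.py`, assuming the coordinate density of a randomly rotated unit vector in R^d, p(t) ∝ (1 − t²)^((d−3)/2) on [−1, 1] (the Beta-type law referred to above):

```python
import numpy as np
from scipy.integrate import quad

def lloyd_max(d, levels, iters=50):
    """Lloyd-Max scalar quantizer for the coordinate density
    p(t) ∝ (1 - t^2)^((d-3)/2) on [-1, 1]. Sketch, not the repo's code."""
    pdf = lambda t: (1.0 - t * t) ** ((d - 3) / 2)
    # initial centroids spread over the bulk of the distribution (std ~ 1/sqrt(d))
    c = np.linspace(-3 / np.sqrt(d), 3 / np.sqrt(d), levels)
    for _ in range(iters):
        # decision boundaries are midpoints between adjacent centroids
        b = np.concatenate(([-1.0], (c[:-1] + c[1:]) / 2, [1.0]))
        # each centroid moves to the conditional mean of its cell
        for i in range(levels):
            mass, _ = quad(pdf, b[i], b[i + 1])
            mean, _ = quad(lambda t: t * pdf(t), b[i], b[i + 1])
            if mass > 0:
                c[i] = mean / mass
    return c

centroids = lloyd_max(d=128, levels=16)   # 4-bit codebook for d=128
print(centroids)
```

Because the density is symmetric, the resulting codebook is symmetric about zero; in the real implementation the codebooks are pre-computed once per (d, bits) pair and cached.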

## Dependencies

  • PyTorch >= 2.0.0
  • transformers >= 4.40.0
  • scipy, numpy, datasets, tqdm
