Implementation of TurboQuant (ICLR 2026) for KV cache quantization on NVIDIA DGX Spark GB10.
Quantizes transformer key-value caches to 2–4 bits using PolarQuant (rotation followed by Lloyd-Max quantization) and QJL (1-bit Johnson-Lindenstrauss error correction). Achieves 3.88x compression at 4-bit with 0.995 cosine similarity between original and dequantized vectors.
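Lloyd-Max quantization, the scalar quantizer inside PolarQuant, alternates between nearest-codeword assignment and centroid updates until the codebook converges. A minimal NumPy sketch of the classical algorithm (independent of the `codebook.py` implementation, whose API may differ):

```python
import numpy as np

def lloyd_max(samples, n_levels=16, n_iters=50):
    """1-D Lloyd-Max quantizer: alternate nearest-codeword assignment
    and centroid updates (classical Lloyd iteration)."""
    # Initialize codewords from quantiles of the empirical distribution.
    codebook = np.quantile(samples, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(n_iters):
        # Decision boundaries sit midway between adjacent codewords.
        edges = (codebook[:-1] + codebook[1:]) / 2
        idx = np.searchsorted(edges, samples)
        # Move each codeword to the mean of the samples assigned to it.
        for j in range(n_levels):
            sel = samples[idx == j]
            if sel.size:
                codebook[j] = sel.mean()
    return codebook

def quantize(samples, codebook):
    """Map each sample to its nearest codeword."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return codebook[np.searchsorted(edges, samples)]
```

With 16 levels (4 bits) on Gaussian-distributed data this reaches a mean squared error near the theoretical optimum of about 0.01 per unit variance, which is why rotating vectors toward a known distribution before quantizing (the "Polar" step) pays off.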
- 1,000 experiments across Qwen3 (0.6B, 4B, 14B, 32B) and Mistral-7B
- 3.88x KV cache compression at 4-bit
- 8.4x Triton kernel speedup via autonomous optimization (51 iterations)
- Model-agnostic: validated on both Qwen3 and Mistral architectures
- Qwen3-32B runs at 1M token context on 128GB GPU (impossible at FP16)
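The 1M-token claim follows from back-of-envelope arithmetic. The sketch below assumes a Qwen3-32B-like GQA layout (64 layers, 8 KV heads, head dim 128 — these are assumptions, check the model's actual `config.json`); only the 3.88x figure comes from the results above:

```python
# KV cache sizing at 1M tokens (layer/head counts are assumptions).
layers, kv_heads, head_dim = 64, 8, 128     # assumed Qwen3-32B GQA layout
seq_len = 1_000_000

elems = 2 * layers * kv_heads * head_dim * seq_len  # keys + values
fp16_gb = elems * 2 / 1e9                   # 2 bytes per FP16 element
quant_gb = fp16_gb / 3.88                   # reported 3.88x compression

print(f"FP16: {fp16_gb:.0f} GB, quantized: {quant_gb:.0f} GB")
```

Under these assumptions the FP16 cache alone would need roughly 260 GB, well past a 128 GB GPU, while the 4-bit quantized cache fits with room for weights and activations.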
```
turboquant/          # Core library
  codebook.py        # Lloyd-Max quantization
  polarquant.py      # PolarQuant (rotation + quantization)
  qjl.py             # Johnson-Lindenstrauss error correction
  packing.py         # Bit-packing (4-bit, 2-bit)
  cache.py           # TurboQuantCache (HF-compatible)
kernels/             # Triton fused attention kernel
tests/               # 140 tests (unit + adversarial + E2E)
autoresearch/        # Autonomous experiment runner
benchmarks/          # Context scaling benchmarks
server/              # FastAPI inference server
scripts/             # Utilities (model download, server start)
docs/                # Audit reports, research summary
```
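The idea behind `qjl.py` is to store only the sign bits of a Gaussian random projection of each key, plus the key's norm; attention scores are then recovered from the identity E[(s·q)·sign(s·k)] = sqrt(2/π)·⟨q, k/‖k‖⟩ for s ~ N(0, I). A self-contained sketch (the sketch dimension `m` here is an arbitrary illustration choice, not the repo's default):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4096                       # head dim, sketch dim (illustrative)
S = rng.standard_normal((m, d))        # shared Gaussian projection matrix

def qjl_encode(k):
    """Keep only 1 bit per projected coordinate, plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, sign_bits, k_norm):
    """Estimate <q, k> from the sign sketch: for Gaussian s,
    E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k/||k||>."""
    return k_norm * np.sqrt(np.pi / 2) * np.mean((S @ q) * sign_bits)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
bits, norm = qjl_encode(k)
print(qjl_score(q, bits, norm), q @ k)   # estimate vs. exact inner product
```

Because only signs are stored, this is where the "1-bit error correction" on top of the coarse PolarQuant codes comes from; `packing.py` then packs those bits (and the 2-/4-bit codes) densely into bytes.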
```bash
# Install dependencies
pip install -r requirements.txt

# Run tests
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas PYTHONPATH=. pytest tests/ -v

# Start inference server
./scripts/start-server.sh 8000

# Run autoresearch parameter sweep
python autoresearch/batch_run.py --budget 100
```

TurboQuant integrates with Dendrite, a Rust inference engine, for production deployment without dequantization overhead. See the blog post for details.
Developed and tested on NVIDIA DGX Spark GB10 (128GB VRAM, Blackwell architecture).
MIT
If you use this code, please cite the original TurboQuant paper:
```bibtex
@inproceedings{turboquant2026,
  title={TurboQuant: Efficient KV Cache Quantization via PolarQuant and QJL},
  author={Google Research},
  booktitle={ICLR},
  year={2026}
}
```