
TurboQuant-DGX

Implementation of TurboQuant (ICLR 2026) for KV cache quantization on NVIDIA DGX Spark GB10.

What This Does

Quantizes transformer key-value caches to 2–4 bits using PolarQuant (rotation followed by Lloyd-Max quantization) and QJL (Johnson-Lindenstrauss 1-bit error correction). Achieves 3.88x compression at 4-bit with 0.995 cosine similarity between original and quantized vectors.
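For intuition, here is a minimal NumPy sketch of the Lloyd-Max idea, alternating nearest-codeword assignment with centroid updates. This is illustrative only, not the implementation in `codebook.py`:

```python
import numpy as np

def lloyd_max(samples, n_levels=16, iters=50):
    """Fit a scalar Lloyd-Max quantizer by alternating between
    nearest-codeword assignment and centroid updates."""
    # initialize the codebook from quantiles of the data
    codebook = np.quantile(samples, np.linspace(0, 1, n_levels))
    for _ in range(iters):
        # assign each sample to its nearest codeword
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        # move each codeword to the mean of its assigned samples
        for k in range(n_levels):
            mask = idx == k
            if mask.any():
                codebook[k] = samples[mask].mean()
    return np.sort(codebook)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
cb = lloyd_max(x, n_levels=16)  # 16 levels = a 4-bit codebook
xq = cb[np.abs(x[:, None] - cb[None, :]).argmin(axis=1)]
```

For Gaussian data, a 16-level (4-bit) Lloyd-Max quantizer reconstructs with a mean squared error around 0.01, which is why a rotation that makes coordinates approximately Gaussian pairs well with it.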

Key Results

  • 1,000 experiments across Qwen3 (0.6B, 4B, 14B, 32B) and Mistral-7B
  • 3.88x KV cache compression at 4-bit
  • 8.4x Triton kernel speedup via autonomous optimization (51 iterations)
  • Model-agnostic: validated on both Qwen3 and Mistral architectures
  • Qwen3-32B runs at a 1M-token context on a 128GB GPU (not feasible at FP16)
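As a back-of-envelope check (our arithmetic, not the repo's exact accounting), the 3.88x figure implies an effective cost slightly above 4 bits per cached value, the extra fraction being metadata such as scales or codebooks beyond the raw 4-bit codes:

```python
# What effective bit-width does 3.88x compression of an FP16 cache imply?
fp16_bits = 16.0
ratio = 3.88
effective_bits = fp16_bits / ratio       # bits actually spent per cached value
overhead_bits = effective_bits - 4.0     # metadata beyond the raw 4-bit codes
print(f"{effective_bits:.2f} bits/value, {overhead_bits:.2f} bits overhead")
```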

Structure

turboquant/           # Core library
  codebook.py         # Lloyd-Max quantization
  polarquant.py       # PolarQuant (rotation + quantization)
  qjl.py              # Johnson-Lindenstrauss error correction
  packing.py          # Bit-packing (4-bit, 2-bit)
  cache.py            # TurboQuantCache (HF-compatible)
  kernels/            # Triton fused attention kernel
tests/                # 140 tests (unit + adversarial + E2E)
autoresearch/         # Autonomous experiment runner
benchmarks/           # Context scaling benchmarks
server/               # FastAPI inference server
scripts/              # Utilities (model download, server start)
docs/                 # Audit reports, research summary
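To illustrate what the 4-bit case in `packing.py` involves, here is a sketch with an assumed convention (high nibble first; the repo's actual layout may differ):

```python
import numpy as np

def pack4(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) into single bytes."""
    codes = codes.astype(np.uint8)
    assert codes.size % 2 == 0, "pad to an even length before packing"
    # first code of each pair goes in the high nibble (assumed convention)
    return (codes[0::2] << 4) | codes[1::2]

def unpack4(packed: np.ndarray) -> np.ndarray:
    """Invert pack4: recover the original 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.array([3, 15, 0, 7], dtype=np.uint8)
assert np.array_equal(unpack4(pack4(codes)), codes)
```

Packing halves the byte count relative to storing one 4-bit code per `uint8`, which is where the storage savings over naive byte-per-code layouts come from.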

Quick Start

pip install -r requirements.txt

# Run tests
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas PYTHONPATH=. pytest tests/ -v

# Start inference server
./scripts/start-server.sh 8000

# Run autoresearch parameter sweep
python autoresearch/batch_run.py --budget 100

Integration with Dendrite

TurboQuant integrates with Dendrite, a Rust inference engine, for production deployment without dequantization overhead. See the blog post for details.
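To illustrate what avoiding dequantization overhead can mean, here is a toy NumPy sketch of a lookup-table dot product: because 4-bit codes take only 16 values, the query entries mapping to each code can be accumulated first, leaving a single 16-element dot with the codebook instead of materializing a dequantized key. This is illustrative only, not Dendrite's actual (Rust) kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = np.sort(rng.normal(size=16)).astype(np.float32)  # 16 = 4-bit levels
codes = rng.integers(0, 16, size=128).astype(np.uint8)      # one quantized key
q = rng.normal(size=128).astype(np.float32)

# LUT trick: sum the query entries assigned to each code,
# then take one 16-element dot with the codebook
partial = np.zeros(16, dtype=np.float32)
np.add.at(partial, codes, q)
score_lut = float(partial @ codebook)

# reference path: explicitly dequantize the key, then dot with the query
score_ref = float(q @ codebook[codes])
```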

Hardware

Developed and tested on NVIDIA DGX Spark GB10 (128GB VRAM, Blackwell architecture).

License

MIT

Citation

If you use this code, please cite the original TurboQuant paper:

@inproceedings{turboquant2026,
  title={TurboQuant: Efficient KV Cache Quantization via PolarQuant and QJL},
  author={Google Research},
  booktitle={ICLR},
  year={2026}
}
