Implementation of TurboQuant (ICLR 2026) for KV cache quantization on NVIDIA DGX Spark GB10.
Quantizes transformer key-value caches to 2–4 bits using PolarQuant (rotation followed by Lloyd-Max quantization) and QJL (1-bit Johnson-Lindenstrauss error correction). Achieves 3.88x compression at 4-bit with 0.995 cosine similarity between original and dequantized vectors.
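Lloyd-Max quantization, the scalar quantizer inside PolarQuant, alternates between nearest-codeword assignment and centroid updates until the codebook converges. A minimal NumPy sketch of the classical algorithm (independent of the `codebook.py` implementation, whose API may differ):

```python
import numpy as np

def lloyd_max(samples, n_levels=16, n_iters=50):
    """1-D Lloyd-Max quantizer: alternate nearest-codeword assignment
    and centroid updates (classical Lloyd iteration)."""
    # Initialize codewords from quantiles of the empirical distribution.
    codebook = np.quantile(samples, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(n_iters):
        # Decision boundaries sit midway between adjacent codewords.
        edges = (codebook[:-1] + codebook[1:]) / 2
        idx = np.searchsorted(edges, samples)
        # Move each codeword to the mean of the samples assigned to it.
        for j in range(n_levels):
            sel = samples[idx == j]
            if sel.size:
                codebook[j] = sel.mean()
    return codebook

def quantize(samples, codebook):
    """Map each sample to its nearest codeword."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return codebook[np.searchsorted(edges, samples)]
```

With 16 levels (4 bits) on Gaussian-distributed data this reaches a mean squared error near the theoretical optimum of about 0.01 per unit variance, which is why rotating vectors toward a known distribution before quantizing (the "Polar" step) pays off.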
- 1,000 experiments across Qwen3 (0.6B, 4B, 14B, 32B) and Mistral-7B
- 3.88x KV cache compression at 4-bit
- 8.4x Triton kernel speedup via autonomous optimization (51 iterations)
- Model-agnostic: validated on both Qwen3 and Mistral architectures
- Qwen3-32B runs at 1M token context on 128GB GPU (impossible at FP16)
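The 1M-token claim follows from back-of-envelope arithmetic. The sketch below assumes a Qwen3-32B-like GQA layout (64 layers, 8 KV heads, head dim 128 — these are assumptions, check the model's actual `config.json`); only the 3.88x figure comes from the results above:

```python
# KV cache sizing at 1M tokens (layer/head counts are assumptions).
layers, kv_heads, head_dim = 64, 8, 128     # assumed Qwen3-32B GQA layout
seq_len = 1_000_000

elems = 2 * layers * kv_heads * head_dim * seq_len  # keys + values
fp16_gb = elems * 2 / 1e9                   # 2 bytes per FP16 element
quant_gb = fp16_gb / 3.88                   # reported 3.88x compression

print(f"FP16: {fp16_gb:.0f} GB, quantized: {quant_gb:.0f} GB")
```

Under these assumptions the FP16 cache alone would need roughly 260 GB, well past a 128 GB GPU, while the 4-bit quantized cache fits with room for weights and activations.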
```
turboquant/          # Core library
  codebook.py        # Lloyd-Max quantization
  polarquant.py      # PolarQuant (rotation + quantization)
  qjl.py             # Johnson-Lindenstrauss error correction
  packing.py         # Bit-packing (4-bit, 2-bit)
  cache.py           # TurboQuantCache (HF-compatible)
kernels/             # Triton fused attention kernel
tests/               # 140 tests (unit + adversarial + E2E)
autoresearch/        # Autonomous experiment runner
benchmarks/          # Context scaling benchmarks
server/              # FastAPI inference server
scripts/             # Utilities (model download, server start)
docs/                # Audit reports, research summary
```
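The idea behind `qjl.py` is to store only the sign bits of a Gaussian random projection of each key, plus the key's norm; attention scores are then recovered from the identity E[(s·q)·sign(s·k)] = sqrt(2/π)·⟨q, k/‖k‖⟩ for s ~ N(0, I). A self-contained sketch (the sketch dimension `m` here is an arbitrary illustration choice, not the repo's default):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4096                       # head dim, sketch dim (illustrative)
S = rng.standard_normal((m, d))        # shared Gaussian projection matrix

def qjl_encode(k):
    """Keep only 1 bit per projected coordinate, plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, sign_bits, k_norm):
    """Estimate <q, k> from the sign sketch: for Gaussian s,
    E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k/||k||>."""
    return k_norm * np.sqrt(np.pi / 2) * np.mean((S @ q) * sign_bits)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
bits, norm = qjl_encode(k)
print(qjl_score(q, bits, norm), q @ k)   # estimate vs. exact inner product
```

Because only signs are stored, this is where the "1-bit error correction" on top of the coarse PolarQuant codes comes from; `packing.py` then packs those bits (and the 2-/4-bit codes) densely into bytes.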
```bash
# Install dependencies
pip install -r requirements.txt

# Run tests
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas PYTHONPATH=. pytest tests/ -v

# Start inference server
./scripts/start-server.sh 8000

# Run autoresearch parameter sweep
python autoresearch/batch_run.py --budget 100
```

TurboQuant integrates with Dendrite, a Rust inference engine, for production deployment without dequantization overhead. See the blog post for details.
Developed and tested on NVIDIA DGX Spark GB10 (128GB VRAM, Blackwell architecture).
MIT
If you use this code, please cite the original TurboQuant paper:
```bibtex
@inproceedings{turboquant2026,
  title={TurboQuant: Efficient KV Cache Quantization via PolarQuant and QJL},
  author={Google Research},
  booktitle={ICLR},
  year={2026}
}
```