Updated April 2026 — after Google's TurboQuant publication at ICLR 2026.
quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research. We are not competing with Google. We are not competing with llama.cpp. We are filling a gap nobody else can fill: running modern KV-quantized inference anywhere a C compiler runs.
In March–April 2026 the KV cache quantization landscape transformed:
- Google published TurboQuant at ICLR 2026 (Zandieh, Daliri, Hadian, Mirrokni). arXiv:2504.19874
- PolarQuant appeared at AISTATS 2026. arXiv:2502.02617
- Multiple competing OSS implementations sprang up in weeks: Rust, PyTorch, several llama.cpp forks
- llama.cpp Discussion #20969 has 6+ independent fork implementations, none merged, no convergence
The "TurboQuant" name is now a Google research brand. Our project must carefully position around it.
quant.cpp predates the Google publication. We were independently exploring the same algorithmic ideas (PolarQuant rotation, QJL sketch). When the official paper appeared, our codebase already had working implementations of the building blocks. We are now repositioning as the canonical embedded/portable C implementation, not as a competitor to the algorithm authors.
The single-header C implementation of Google TurboQuant — for iPhone, Android, browser, microcontrollers, game engines, and every place a Rust crate or Python package can't go.
| | |
|---|---|
| Engine class | Single-header reference C implementation of published KV quantization research |
| Audience | App developers, mobile, embedded, browser, game engine, research |
| Core artifact | quant.h — 628KB single header, 15.7K LOC, libc + libm only |
| License | Apache 2.0 |
| Algorithms shipped | TurboQuant (Polar+QJL), PolarQuant, QJL, Uniform 4b/2b, TurboKV 1b/3b/4b |
| Inference scope | End-to-end: GGUF loader → tokenizer → forward pass → sampling → text |
| Architectures | Llama, Llama 3, Qwen, Qwen3.5 hybrid, Gemma 3, Gemma 4 MoE, SmolLM, DeltaNet |
| Backends | CPU (NEON, AVX2, generic), Metal (partial), CUDA (compiles), WASM, MSVC |
| What proves the moat | embed_minimal links only against libSystem: no library, no framework, no runtime |
| Why we don't compete | |
|---|---|
| ❌ The fastest GPU inference engine | llama.cpp owns this with full Metal/CUDA tensor graphs |
| ❌ The highest-throughput batch server | vLLM owns this |
| ❌ The original TurboQuant authors | Google Research owns the algorithm |
| ❌ The most features | We deliberately stay minimal |
| ❌ A training framework | Use PyTorch/JAX |
| ❌ Production-grade for 100+ models | We verify 7 architectures end-to-end |
| Implementation | Lang | Size | Mobile | WASM | Embedded | End-to-end |
|---|---|---|---|---|---|---|
| quant.cpp | C11 | 628KB single header | ✅ | ✅ 192KB | ✅ | ✅ |
| RecursiveIntell/turbo-quant | Rust | Cargo crate | ❌ | ❌ | ❌ | kernel only |
| tonbistudio/turboquant-pytorch | Python | pip + Torch | ❌ | ❌ | ❌ | kernel only |
| OnlyTerp/turboquant | Python | pip | ❌ | ❌ | ❌ | kernel only |
| scos-lab/turboquant | Python | research | ❌ | ❌ | ❌ | kernel only |
| llama.cpp forks (#20969) | C++ | ggml fork | partial | ❌ | ❌ | depends on llama.cpp |
| Engine | KV quant | Size | Read-in-an-afternoon | Embeddable | Best for |
|---|---|---|---|---|---|
| quant.cpp | TurboQuant + 6 schemes | 72K LOC | ✅ | ✅ single header | Embedded / mobile / WASM / education |
| llama.cpp | Q8_0/Q5_0 (~2x) | 250K+ LOC | ❌ | library | Workstation speed |
| vLLM | none | 100K+ LOC | ❌ | framework | Batch serving |
| MLX | none | 50K+ LOC | ❌ | framework | Apple Silicon |
| ONNX RT | none | 500K+ LOC | ❌ | framework | Multi-platform serving |
- Implement Google TurboQuant precisely per the ICLR 2026 paper
- Verify our numbers reproduce the paper's published results within ±1%
- Cite the paper authors prominently in every README and docs page
- Submit to llama.cpp Discussion #20969 with a clean ggml type registration
- iOS demo app (Xcode project)
- Android NDK build guide
- WASM npm package
- Unity C# binding
- Unreal C++ integration
- Microcontroller (Cortex-M4 with FlexRAM) feasibility study
- Hard cap: forward pass in one file (tq_transformer.c)
- Hard cap: KV quantization plugin via 3 functions
- Hard cap: zero new dependencies in core
- Every PR that adds a feature must also add a unit test
- Always disclose: model, dataset, baseline, methodology
- Never claim "lossless" without PPL Δ on a specific dataset
- Always link to a reproducible script
- Match Google's published benchmarks (LongBench, NIH, ZeroSCROLLS, RULER, L-Eval) where feasible
| Term | What it means | Where to use |
|---|---|---|
| TurboQuant | Google's algorithm (Zandieh et al., ICLR 2026) | Always cite + link to arXiv |
| PolarQuant | The rotation + polar quantization step | Cite arXiv:2502.02617 |
| QJL | Quantized Johnson-Lindenstrauss residual sketch | Cite arXiv:2406.03482 |
| quant.cpp | This project — a C implementation | Project / repo name |
| TQ_TURBO_* | Our internal type identifiers (they predate the Google publication) | Code only; docs must clarify lineage |
| Goal | Metric | Owner |
|---|---|---|
| Repository stars | 1000+ | community |
| GitHub citations | 5+ academic | community |
| llama.cpp PR merged or formally reviewed | 1 | core |
| iOS demo app on App Store / TestFlight | shipped | core |
| npm @quantcpp/wasm package | published | core |
| arXiv tech report | submitted | core |
| Reproduce TurboQuant paper benchmarks | within ±1% | core |
In 6 months, when someone googles "TurboQuant llama.cpp" or "TurboQuant iOS" or "KV cache compression embedded", quant.cpp is the first or second result. The Google paper is the theoretical reference; quant.cpp is the practical implementation everyone reaches for when they need to actually ship something.