Production: uniform_4b KV cache (4–7x compression at +6% PPL on Llama 3.2 3B).
Research: building blocks for TurboQuant, PolarQuant, QJL — 7 KV quantization types in one engine.
72K LOC pure C, zero dependencies. Ships as quant.h — drop one file into any project.
Runs everywhere a C compiler does: iOS · Android · WASM · MSVC · microcontrollers.
LLM memory is dominated by the KV cache, not model weights. At 32K context, an 8B model's KV cache consumes 4GB — more than the model itself. Most engines store the KV cache in FP16 by default. We compress it.
+------------+-------------------------------+
| | KV Cache (FP16) |
| Model(4GB) | ██████████████ 8K <-- OOM |
+------------+-------------------------------+
| | KV (4-bit) |
| Model(4GB) | ██ -------------> 350K ctx |
| | 6.9x smaller |
+------------+-------------------------------+
Same hardware. 4–7x longer context. PPL measured and disclosed.
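For a back-of-the-envelope check of these numbers: KV cache size is 2 (K and V) × layers × KV heads × head_dim × context length × bytes per element. A minimal sketch, assuming a Llama-3-style 8B shape (32 layers, 8 KV heads, head_dim 128); the helper name is ours, not part of quant.h:

```c
// Back-of-the-envelope KV cache sizing (illustrative shape, not a quant.h API).
#include <stdio.h>

static double kv_cache_gib(int layers, int kv_heads, int head_dim,
                           long ctx_tokens, double bits_per_elem) {
    // K and V each store layers * kv_heads * head_dim elements per token.
    double elems = 2.0 * layers * kv_heads * head_dim * (double)ctx_tokens;
    return elems * (bits_per_elem / 8.0) / (1024.0 * 1024.0 * 1024.0);
}

int main(void) {
    printf("FP16  KV @ 32K ctx: %.1f GiB\n", kv_cache_gib(32, 8, 128, 32768, 16.0));
    printf("4-bit KV @ 32K ctx: %.1f GiB\n", kv_cache_gib(32, 8, 128, 32768, 4.0));
    return 0;
}
```

With that shape, FP16 works out to roughly 4 GiB at 32K context (the diagram above), and 4-bit drops it to about 1 GiB.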
| KV Config | Bits/elem | PPL | Δ vs FP32 | Notes |
|---|---|---|---|---|
| FP32 (baseline) | 32 | 13.56 | — | reference |
| turbo_kv_4b ⭐ | 4 | 14.28 | +5.3% | RHT + 4-bit codebook, beats uniform_4b |
| uniform_4b | 4 | 14.41 | +6.3% | per-block min-max |
| turbo_kv_3b | 3 | 15.39 | +13.5% | RHT + 3-bit codebook |
| llama.cpp q4_0 KV | 4 | ~14.99 | +10.6% | for comparison |
turbo_kv_4b is currently the best 4-bit KV cache quantization in the project — it beats both our previous production baseline (uniform_4b) and llama.cpp's q4_0 KV at the same bit budget. The Karpathy-loop history that produced it is in bench/results/turboquant_reproduction.md.
| Hardware | Model | FP16 KV ctx | quant.cpp ctx | KV Gain |
|---|---|---|---|---|
| 16GB Mac | Llama 3.2 3B | 50K tokens | 350K tokens | 6.9x |
| 16GB Mac | Gemma 4 26B MoE | 4K tokens | 14K tokens | 3.5x |
| 8GB Laptop | Llama 8B (Q4) | 16K tokens | 61K tokens | 3.8x |
| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | 559K tokens | 3.8x |
LLM memory is dominated by the KV cache. quant.cpp is a minimal C engine that ships KV cache quantization that actually works, in a form factor nobody else offers: one single header, zero dependencies, runs on iOS/Android/WASM/MSVC/microcontrollers.
Two reasons to use it:
- You need to embed LLM inference inside something. An app, a game, a web page, a device. quant.cpp is one file (`quant.h`, 628KB) plus libc. Everywhere a C compiler runs, this runs.
- You want to study KV cache compression. quant.cpp implements 7 KV quantization schemes side by side: `uniform_4b/2b/3b`, `polar_3b/4b`, `qjl_1b`, `turbo_kv_*`. You can read each one in a single C file and add a new one in 3 functions.
Honest disclosure: In April 2026 Google published TurboQuant (ICLR 2026). quant.cpp's turbo_kv_* types started as a port of that algorithmic structure (Random Hadamard Transform → Lloyd-Max codebook → 1-bit QJL residual). Through a Karpathy-loop ablation we discovered the QJL residual stage was contributing literally zero to scores, dropped it, and reinvested the freed bytes into a larger codebook. The result (turbo_kv_4b at 14.28 PPL on Llama 3.2 3B) beats our previous production champion uniform_4b and llama.cpp's q4_0 KV at the same 4-bit budget. The full optimization history is in bench/results/turboquant_reproduction.md.
Need the exact paper numbers in a paper? Use Google's reference. Need a small, readable C engine with KV compression that ships on a phone, browser, microcontroller, or game engine? Use quant.cpp.
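For readers who want the shape of the turbo_kv_* pipeline before opening quant.h, here is a deliberately simplified rotate-then-quantize sketch: a random sign flip plus a fast Walsh-Hadamard transform spreads energy across dimensions, then each coefficient is snapped to its nearest codebook entry. Function names, sign handling, and codebook handling are illustrative assumptions, not the engine's actual kernels:

```c
// Simplified rotate-then-quantize sketch (illustrative only; not quant.h's API).
// rht(): random sign flip followed by an in-place fast Walsh-Hadamard transform.
// quantize(): nearest-neighbor lookup into a small Lloyd-Max-style codebook.
#include <math.h>
#include <stdint.h>

static void rht(float *x, int n, const int8_t *signs) {    // n: power of two
    for (int i = 0; i < n; i++) x[i] *= signs[i];           // random ±1 diagonal
    for (int len = 1; len < n; len <<= 1)                    // Hadamard butterflies
        for (int i = 0; i < n; i += 2 * len)
            for (int j = i; j < i + len; j++) {
                float a = x[j], b = x[j + len];
                x[j] = a + b;
                x[j + len] = a - b;
            }
    float inv = 1.0f / sqrtf((float)n);                      // keep the rotation orthonormal
    for (int i = 0; i < n; i++) x[i] *= inv;
}

static uint8_t quantize(float v, const float *codebook, int k) {
    uint8_t best = 0;
    float best_d = fabsf(v - codebook[0]);
    for (int c = 1; c < k; c++) {
        float d = fabsf(v - codebook[c]);
        if (d < best_d) { best_d = d; best = (uint8_t)c; }
    }
    return best;                                              // 4-bit index when k == 16
}
```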
# 1. Build
git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
# 2. Download a model (135MB starter)
pip install huggingface_hub
hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf --local-dir models/
# 3. Run
./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -j 4
# 4. With KV compression (7x longer context)
./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -k uniform_4b -v q4

Full API docs · WASM demo · Add your own KV type · Python:
pip install quantcpp
Load an entire novel into context and ask questions about it. llama.cpp runs out of memory. quant.cpp remembers the whole book.
# Load Alice in Wonderland (~27K tokens) with KV compression
bash bench/demo/book_chat.sh models/Llama-3.2-3B-Instruct-Q8_0.gguf
# Q: "What riddle did the Mad Hatter ask Alice?"
# A: "Why is a raven like a writing-desk?" — from Chapter 7, A Mad Tea-Party...

On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). quant.cpp compresses KV 6.9x → 350K tokens — enough for 12 novels.
KV Quantization Quality (SmolLM2 1.7B, WikiText-2)
llama.cpp Q4_0 KV │██████████████████████████████████████ PPL +10.6%
│
llama.cpp Q8 K+Q5 V │▎ PPL ~+1% ← recommended (1.6x compression)
│
quant.cpp 4-bit │▏ PPL +0.0% ← lossless (3.8x compression)
│
quant.cpp 3-bit │█ PPL +1.3% ← delta compression (4.3x)
└────────────────────────────────────────────────
0% +12%
Perplexity Degradation →
Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression — not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the 4-7x range where the difference matters.
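To make "per-block min-max" concrete, here is a minimal sketch of that style of quantizer with a 128-element block: each block stores its own min and scale, and every element becomes a packed 4-bit index. The struct and function names are invented for illustration and do not match quant.h's internal layout:

```c
// Illustrative per-block min-max 4-bit quantization (not quant.h's actual layout).
#include <stdint.h>

#define BLOCK 128                      // larger blocks amortize the min/scale overhead

typedef struct {
    float min, scale;                  // per-block range encoding
    uint8_t q[BLOCK / 2];              // two 4-bit indices packed per byte
} block_q4;

static void quantize_block(const float *x, block_q4 *out) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    out->min = lo;
    out->scale = (hi - lo) / 15.0f;    // 16 levels for 4 bits
    float inv = out->scale > 0 ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < BLOCK; i += 2) {
        uint8_t a = (uint8_t)((x[i]     - lo) * inv + 0.5f);
        uint8_t b = (uint8_t)((x[i + 1] - lo) * inv + 0.5f);
        out->q[i / 2] = (uint8_t)(a | (b << 4));
    }
}

static float dequant(const block_q4 *b, int i) {
    uint8_t packed = b->q[i / 2];
    uint8_t idx = (i & 1) ? (packed >> 4) : (packed & 0x0F);
    return b->min + b->scale * (float)idx;
}
```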
| | quant.cpp | turbo-quant (Rust) | turboquant-pytorch | scos-lab/turboquant |
|---|---|---|---|---|
| Language | Pure C11 | Rust | Python | Python |
| Single-header | ✅ quant.h (628KB) | ❌ Cargo crate | ❌ pip install | ❌ |
| Dependencies | libc + libm | Rust toolchain | PyTorch + CUDA | PyTorch |
| iOS / Android | ✅ | ❌ | ❌ | ❌ |
| WASM (browser) | ✅ 192KB | ❌ | ❌ | ❌ |
| MCU / embedded | ✅ | ❌ | ❌ | ❌ |
| Windows MSVC | ✅ | ✅ | (Python) | (Python) |
| GGUF model loading | ✅ 7 architectures | ❌ | ❌ | research only |
| End-to-end inference | ✅ | kernel only | kernel only | kernel only |
| | quant.cpp | llama.cpp | vLLM | MLX |
|---|---|---|---|---|
| KV quantization | TurboQuant + 6 schemes | Q8_0/Q5_0 (2x) | -- | -- |
| Code size | 72K LOC | 250K+ | 100K+ | 50K+ |
| Embeddable | single header | library | library | framework |
| Read in an afternoon | ✅ | ❌ | ❌ | ❌ |
| GPU throughput | basic | full | best | Metal |
Use llama.cpp for speed on a workstation. Use vLLM for batch serving. Use quant.cpp when you need to ship LLM inference inside something — an app, a game, a website, a device.
| Model | Params | Architecture | Speed (M1 Pro, 8T) | KV Compression |
|---|---|---|---|---|
| SmolLM2 135M | 135M | Llama | 103 tok/s | 2.4x |
| Llama 3.2 3B Instruct | 3B | Llama 3 (GQA) | 10 tok/s | 6.9x |
| Gemma 4 26B-A4B-it | 26B (4B active) | MoE 128 experts | 3.9 tok/s | 3.5x |
| Qwen3.5 0.8B | 752M | DeltaNet hybrid | 80 tok/s | 3.8x |
| Qwen3.5 4B | 4B | DeltaNet hybrid | 20 tok/s | 3.8x |
| SmolLM2 1.7B | 1.7B | Llama | 25 tok/s | 3.8x |
| Gemma 3 270M | 270M | Gemma 3 | 176 tok/s | 3.8x |
GGUF format. Load any llama.cpp-compatible model.
Gemma 4 26B-A4B architecture details
Full support for Gemma 4's hybrid MoE architecture:
- Dual-FFN: parallel Dense MLP + 128-expert MoE per layer
- Hybrid attention: 25 sliding (head_dim=256) + 5 full (head_dim=512) layers
- QK-norm aware KV compression: auto FP32 keys + Q4 values (3.5x savings)
- Learned RoPE with per-layer frequency factors
- IQ3_XXS/IQ4_NL fused dot with NEON optimization for MoE experts
- GeGLU activation (NEON-accelerated fast tanh approximation)
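As a point of reference for the GeGLU item above, this is a scalar sketch of GeGLU using the standard tanh-based GELU approximation; the engine's NEON fast-tanh kernel differs in detail, and the function names here are ours:

```c
// Illustrative scalar GeGLU: out[i] = gelu(gate[i]) * up[i], with the common
// tanh approximation of GELU. The NEON path in quant.h vectorizes this.
#include <math.h>

static float gelu_tanh(float x) {
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    const float c = 0.7978845608f;     // sqrt(2/pi)
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

static void geglu(const float *gate, const float *up, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = gelu_tanh(gate[i]) * up[i];
}
```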
./build/quant gemma-4-26B-A4B-it-UD-Q3_K_M.gguf \
-p "<start_of_turn>user\nWhat is the capital of France?\n<end_of_turn>\n<start_of_turn>model\n" \
-n 50 -j 8 -T 0.0 -k uniform_4b -v q4
# Output: "The capital of France is **Paris**."

Standard: Store every key as-is → 16 bits/element → FP16
quant.cpp: Quantize keys to 4-bit → 4 bits/element → 3.8x
+ quantize values to Q4 → 4 bits/element → 6.9x
+ delta encode adjacent keys → 3 bits/element → 8.5x
Like video compression: I-frames (FP32) every 64 tokens, P-frames (3-bit delta) between.
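A hedged sketch of that I-frame/P-frame idea: every 64th key vector is stored at full precision as an anchor, and the keys in between are stored as 3-bit quantized deltas against the running reconstruction, so encoder and decoder stay in sync. Names, scale handling, and clamping below are illustrative, not the actual format used by quant.h:

```c
// Illustrative delta ("P-frame") key encoding: full-precision anchors every
// ANCHOR_EVERY tokens, quantized deltas in between. Not the real quant.h format.
#include <math.h>
#include <stdint.h>

#define ANCHOR_EVERY 64

// Quantize one delta to a signed 3-bit level in [-4, 3] given a scale.
static int8_t q3_delta(float d, float scale) {
    int lvl = (int)lroundf(d / scale);
    if (lvl < -4) lvl = -4;
    if (lvl >  3) lvl =  3;
    return (int8_t)lvl;
}

// Encode key t against the running reconstruction of key t-1 ("prev"),
// updating prev in place so later deltas chain off what the decoder will see.
static void encode_key(int t, const float *key, float *prev, int dim,
                       float scale, int8_t *out_deltas, float *out_anchor) {
    if (t % ANCHOR_EVERY == 0) {               // I-frame: store as-is
        for (int i = 0; i < dim; i++) { out_anchor[i] = key[i]; prev[i] = key[i]; }
        return;
    }
    for (int i = 0; i < dim; i++) {            // P-frame: quantized residual
        int8_t q = q3_delta(key[i] - prev[i], scale);
        out_deltas[i] = q;
        prev[i] += scale * (float)q;           // decoder-side reconstruction
    }
}
```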
WikiText-2 PPL (SmolLM2 1.7B)
FP32 baseline 14.63 │ ●
4b K + FP16 V 14.63 │ ● identical
4b K + Q4 V 14.57 │ ● slightly better (!)
delta 3b K + Q4 V 14.82 │ ● +1.3%
llama.cpp Q8K+Q5V ~14.8 │ ● ~+1% (1.6x compression)
llama.cpp Q4_0 KV 16.18 │ ● +10.6% (3.8x compression)
3b K (no delta) —— │ ● +62%
└──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──
14 15 16 17 18 19 20 21+
| Config | Compression | PPL vs FP32 | Best for |
|---|---|---|---|
| delta + 3b K + Q4 V | ~8.5x | +1.3% | Maximum context |
| delta + 4b K + Q4 V | ~6.9x | ~0% | Quality + compression |
| uniform_4b K + Q4 V | 6.9x | ~0% | Simple, no delta overhead |
| uniform_4b K + FP16 V | 1.6x | +0.0% | Lossless baseline |
Models with QK-norm normalize keys to the unit sphere, creating extremely sparse distributions. quant.cpp auto-detects this and stores keys in FP32 while quantizing only values — preserving perfect precision with 3.5x V memory reduction.
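One plausible way such auto-detection can work (the actual heuristic lives in quant.h; the threshold and function below are hypothetical): sample a few key vectors, check whether their L2 norms sit tightly around 1 (the signature of QK-norm), and if so keep keys in FP32 and quantize only the values:

```c
// Hypothetical QK-norm detector: keys normalized to the unit sphere have
// L2 norms tightly clustered around 1. Threshold and name are illustrative.
#include <math.h>
#include <stdbool.h>

static bool keys_look_qk_normed(const float *keys, int n_keys, int dim) {
    double max_dev = 0.0;
    for (int k = 0; k < n_keys; k++) {
        double ss = 0.0;
        for (int i = 0; i < dim; i++) {
            double v = keys[k * dim + i];
            ss += v * v;
        }
        double dev = fabs(sqrt(ss) - 1.0);
        if (dev > max_dev) max_dev = dev;
    }
    return max_dev < 0.05;                     // all sampled keys ≈ unit norm
}
// If true: keep K in FP32 and quantize only V.
```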
# Delta compression (maximum context, 8.5x)
./build/quant model.gguf --chat -p "hello" -k uniform_3b -v q4 --delta
# Perplexity benchmark
./build/quant model.gguf --ppl input.txt -k uniform_4b -v q4
# Model info
./build/quant model.gguf --info
# Performance profiling
./build/quant model.gguf --chat -p "hello" -n 50 --profile

Copy one file. Add LLM to any C project.
#include <stdio.h>    // printf
#include <stdlib.h>   // free

#define QUANT_IMPLEMENTATION
#include "quant.h"

int main() {
    quant_model* m = quant_load("model.gguf");
    quant_ctx* c = quant_new(m, NULL);

    // Streaming (print_token is your token callback; see the API table below)
    quant_generate(c, "Tell me a joke", print_token, NULL);

    // Or one-shot
    char* answer = quant_ask(c, "What is 2+2?");
    printf("%s\n", answer);
    free(answer);

    quant_free_ctx(c);
    quant_free_model(m);
}

cc app.c -o app -lm -lpthread   # that's it — no cmake, no framework

15.7K LOC, 643KB, ~2s compile time. Full API:
| Function | Description |
|---|---|
| `quant_load(path)` | Load a GGUF model |
| `quant_new(model, config)` | Create inference context |
| `quant_generate(ctx, prompt, cb, ud)` | Stream tokens via callback |
| `quant_ask(ctx, prompt)` | Generate and return string |
| `quant_free_ctx(ctx)` | Free context |
| `quant_free_model(model)` | Free model |
192KB. The entire inference engine compiles to a WASM binary smaller than most JPEGs.
cd wasm && bash build.sh # Requires: emscripten
python3 -m http.server 8080 # Serve locally
# Open http://localhost:8080, drag & drop any GGUF model

Everything runs client-side. Nothing is uploaded. KV compression active by default.
Docker (zero-dependency, ~10MB image):
docker build -t quant.cpp .
docker run -v ./models:/models quant.cpp /models/model.gguf -p "hello" -k uniform_4b -v q4

OpenAI-compatible server (/v1/chat/completions):
cmake -B build -DTQ_BUILD_SERVER=ON && cmake --build build
./build/quant-server model.gguf -p 8080 -k uniform_4b
# Works with the OpenAI Python SDK
curl http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'

Build with -DTQ_BUILD_SERVER=ON. Streaming SSE supported. KV compression configurable per request.
cd bindings/python && pip install .

from quantcpp import Model
with Model("model.gguf", kv_compress=1) as m:
    print(m.ask("What is the capital of France?"))

    # Streaming
    for token in m.generate("Once upon a time"):
        print(token, end="", flush=True)

Zero build dependencies beyond a C compiler. Compiles quant.h at install time.
| Backend | Platform | Status | Notes |
|---|---|---|---|
| NEON | ARM (Apple Silicon) | Production | 5.8x SIMD speedup |
| AVX2 | x86 | Production | |
| Metal | Apple GPU | Verified | Batch matmul dispatch |
| CUDA | NVIDIA GPU | Compiles | |
| Vulkan | Cross-platform | Compiles | |
| WASM | Browser | NEW | 192KB binary |
| MSVC | Windows | NEW | VS 2019/2022 |
Performance breakdown (Gemma 4 26B on M1 Pro)
| Component | ms/token | Share |
|---|---|---|
| Attention matmul (Q8_0 NEON) | 168 | 65% |
| MoE experts (IQ3_XXS/IQ4_NL NEON) | 72 | 28% |
| Attention scores | 3 | 1% |
| Other | 14 | 6% |
| Total | 257 | 3.9 tok/s |
How is this different from llama.cpp?
llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (72K LOC) you can read, modify, and embed. Different tools for different problems: llama.cpp optimizes speed, quant.cpp optimizes memory (KV compression) and embeddability (single header).
llama.cpp already has KV quantization. How is yours different?
llama.cpp supports KV cache quantization (Q8_0 K + Q5_0 V is the recommended config, ~1.6x compression with minimal quality loss). quant.cpp targets higher compression: 4-bit K + Q4 V gives 3.8x at +0.0% PPL, and delta compression pushes to 4.3x at +1.3% PPL. The quality advantage comes from 128-element min-max blocks (vs 32-element), independent K/V quantization methods, and delta encoding of adjacent keys — a technique llama.cpp doesn't have. Use llama.cpp's KV quant if 1.6x is enough; use quant.cpp if you need 4-7x.
How does this compare to Karpathy's llm.c?
Similar philosophy: minimal C, educational. Key differences: quant.cpp supports quantized weights (Q4_K_M, Q8_0, IQ2), multiple architectures (Llama, Qwen, Gemma, MoE), GGUF loading, and KV cache compression. Think of llm.c as the textbook and quant.cpp as the production-ready version.
Can I embed this in my app?
Yes. Two options:
- Single-header: Copy
quant.h,#define QUANT_IMPLEMENTATIONin one .c file. Done. - Full library: Link against
libturboquant.a.
Works on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.
Why is it slower than llama.cpp?
Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) llama.cpp offloads the full forward pass to Metal/CUDA GPU, (3) 250K+ LOC vs 72K LOC means more micro-optimizations. quant.cpp optimized for memory and embeddability first. Speed improvements (full Metal GPU offload, more SIMD kernels) are actively in progress — see v1.3 plan.
No GPU — is this useless?
If you need 100+ tok/s, use llama.cpp with Metal/CUDA. If you need to embed inference in an iOS app, WASM module, game engine, or IoT device — quant.cpp works. CPU on Apple Silicon: 25 tok/s (1.7B), 11.6 tok/s (3B), 3.9 tok/s (26B MoE).
Can it run in the browser?
Yes. cd wasm && bash build.sh. The WASM binary is 192KB. Drop a GGUF model and chat. Everything runs client-side.
What about sub-3-bit quantization?
Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acceptable quality. Per-step cosine 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.
| Document | Description |
|---|---|
| API Reference | Full C API for quant.h and libturboquant (730 lines) |
| Custom Quantization | Add your own KV type in 3 functions |
| H2H Benchmark | Reproducible quant.cpp vs llama.cpp comparison |
| KV Compression Landscape | Eviction vs Architecture vs Compression guide |
| ROADMAP | Project direction and planned features |
| CHANGELOG | Version history and release notes |
| Tech Report | Architecture and benchmarks (Arxiv draft) |
| WASM Demo | Try it in your browser — no install needed |
quant.cpp is an independent implementation of published research. Please cite the original papers:
- TurboQuant — Zandieh, Daliri, Hadian, Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
- PolarQuant — Quantizing KV Caches with Polar Transformation. AISTATS 2026. arXiv:2502.02617
- QJL — Quantized Johnson-Lindenstrauss Transform for KV Cache Compression. AAAI 2025. arXiv:2406.03482
- Google Research blog post on TurboQuant
If you use quant.cpp in academic work, please cite both the underlying paper(s) and this repository.
