TurboQuant paper reproduction gap: turbo_kv_* doesn't match arXiv:2504.19874 quality #14
Description
Summary
Our `turbo_kv_3b` and `turbo_kv_4b` types implement the algorithmic structure of Google TurboQuant (Zandieh et al., ICLR 2026) — Random Hadamard Transform → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖ scalar — but do not yet reproduce the paper's reported quality.
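For reference in the discussion below, here is a minimal NumPy sketch of the three-stage pipeline as we understand it. Names, shapes, and the codebook are illustrative only, not our actual C block layout:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.astype(float).copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def turbo_kv_sketch(v, codebook, rng):
    """Toy version of the pipeline: RHT -> codebook lookup -> 1-bit QJL residual."""
    # 1) randomized Hadamard rotation: random signs, then the transform
    signs = rng.choice([-1.0, 1.0], size=len(v))
    rot = fwht(signs * v)
    # 2) codebook stage: nearest centroid per coordinate (Lloyd-Max in the real code)
    idx = np.abs(rot[:, None] - codebook[None, :]).argmin(axis=1)
    # 3) 1-bit QJL sketch of the residual, plus its norm as a scalar
    r = rot - codebook[idx]
    proj = rng.choice([-1.0, 1.0], size=(len(v), len(v)))  # Rademacher rows, as in tkv_qjl_random_entry
    return idx, np.sign(proj @ r), np.linalg.norm(r), signs
```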
Full reproduction report: bench/results/turboquant_reproduction.md
Measured numbers
Llama 3.2 3B Instruct, FP32 baseline = PPL 13.56
| Config | Bits | PPL | Δ vs FP32 |
|---|---|---|---|
| `uniform_4b` (recommended) | 4 | 14.41 | +6.3% ✅ |
| `turbo_kv_4b` (paper algo) | 4 | 16.03 | +18.2% |
| `turbo_kv_3b` (paper algo) | 3 | 25.84 | +90.6% ❌ |
The paper reports near-zero degradation at 3.5-bit on Llama 3.1 8B. Our 3-bit shows +90.6%.
Ablation finding
Disabling the QJL correction stage gives byte-identical PPL. The QJL term is contributing ~0 regardless of input. So:
- The MSE codebook stage is broken — Lloyd-Max-Gaussian centroids are strictly worse than `uniform_4b`'s simple per-block min-max at the same bit budget. Real keys after a single Hadamard rotation still have heavy tails that the codebook clips.
- The QJL stage is silently dead — likely a wrong constant. Our `qjl_scale = √(π/2) / m` assumes Gaussian projection rows, but our `tkv_qjl_random_entry` returns Rademacher (±1).
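The constant mismatch is easy to check numerically. A Monte Carlo sketch, assuming the estimator form ⟨q,k⟩ ≈ (√(π/2)/m)·‖k‖·⟨Sq, sign(Sk)⟩ from the QJL paper: the √(π/2) constant is unbiased for Gaussian rows but overestimates by up to √(π/2) ≈ 1.25× for Rademacher rows on a one-hot key:

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 4, 200_000
k = np.zeros(d); k[0] = 1.0           # one-hot key: worst case for Rademacher rows
q = np.array([0.8, -0.3, 0.5, 0.1])
true_ip = float(q @ k)                # 0.8

def qjl_estimate(S):
    """Estimate <q, k> from a 1-bit sketch of k, using the Gaussian-row constant."""
    bits = np.sign(S @ k)             # the stored 1-bit sketch
    return np.sqrt(np.pi / 2) / m * np.linalg.norm(k) * float((S @ q) @ bits)

gauss = qjl_estimate(rng.standard_normal((m, d)))     # ~0.80: unbiased
rade = qjl_estimate(rng.choice([-1.0, 1.0], size=(m, d)))  # ~1.00: biased high by sqrt(pi/2)
```

For Rademacher rows with k = e₁, sign(Sk) is just the first column of S, so the raw correlation already equals q₁ and multiplying by √(π/2) inflates it by 25%. That is exactly the kind of systematic error that would make the correction term useless or harmful.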
Hypotheses (priority order)
- ❓ Per-channel outlier handling missing. The paper allocates 32 outlier channels at higher bit width ((32×3+96×2)/128 = 2.5 effective bits). We do uniform per-channel allocation. This alone could account for most of the gap.
- ❓ QJL constant wrong for Rademacher rows. The original QJL paper (arXiv:2406.03482) derives the unbiased constant for Gaussian rows. For Rademacher rows the constant differs.
- ❓ Block normalization vs vector normalization. Paper operates on full d-dim vector. We block at TQ_BK=128. For head_dim≤128 this is moot, but per-block stats vs per-vector stats may differ.
- ❓ Sketch dimension too small. The QJL paper recommends `m ≥ 2d`. We use `m = d = head_dim`, limited by `block.qjl_signs[16]` = 128 bits.
- ❓ Real key distribution after 1 RHT round. A single Hadamard rotation may not fully decorrelate channels for certain layer/head combinations. Multi-stage rotations or per-head rotation seeds may be needed.
Action items
- Add per-channel outlier extraction (start with the paper's 32-channel split)
- Verify QJL constant for Rademacher rows; compare with Gaussian-row variant
- Try `inv_std` from the empirical per-block std rather than the theoretical √d (requires storing it)
- Increase `sketch_dim` to `2*d` (requires a bigger block layout)
- Per-head rotation seeds rather than global TKV_DEFAULT_SEED
- Reach out to paper authors for reference implementation pointers
- Add a regression test that fails if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5
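The regression gate can be wired as a simple assertion; `measure_ppl` below is a hypothetical stand-in for whatever callable the bench harness ends up exposing:

```python
PPL_GATE = 14.5  # uniform_4b already hits 14.41, so turbo_kv_4b must at least
                 # reach this ballpark before leaving experimental status

def check_turbo_kv_4b_ppl(measure_ppl):
    """Fail if turbo_kv_4b regresses past the gate.

    measure_ppl is a hypothetical callable: (model, kv_type) -> perplexity."""
    ppl = measure_ppl(model="llama-3.2-3b-instruct", kv_type="turbo_kv_4b")
    assert ppl <= PPL_GATE, f"turbo_kv_4b PPL {ppl:.2f} exceeds gate {PPL_GATE}"
```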
Current recommendation for users
Use `uniform_4b` as the production KV quantization. It is competitive with llama.cpp's q4_0 KV (+6.3% vs +10.6% PPL on comparable benchmarks) at the same bit budget. `turbo_kv_*` should be considered experimental/research until this issue is closed.
Why this is public
We're being transparent because (a) misleading claims destroy reputation in OSS ML, (b) we believe in being honest about what works and what doesn't, and (c) other contributors may have insights into the gap. The README and project docs have been updated to reflect this status.
Related discussion: llama.cpp #20969 — multiple TurboQuant implementations are seeing convergence challenges, and the QJL usefulness is contested (Arclabs001's research suggests MSE-only > MSE+QJL on some configs).