TurboQuant paper reproduction gap: turbo_kv_* doesn't match arXiv:2504.19874 quality #14
Description
Summary
Our `turbo_kv_3b` and `turbo_kv_4b` types implement the algorithmic structure of Google TurboQuant (Zandieh et al., ICLR 2026) — Random Hadamard Transform → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖ scalar — but do not yet reproduce the paper's reported quality.
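For reference in the discussion below, here is a minimal NumPy sketch of the three-stage pipeline as we understand it. Names, shapes, and the codebook are illustrative only, not our actual C block layout:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.astype(float).copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def turbo_kv_sketch(v, codebook, rng):
    """Toy version of the pipeline: RHT -> codebook lookup -> 1-bit QJL residual."""
    # 1) randomized Hadamard rotation: random signs, then the transform
    signs = rng.choice([-1.0, 1.0], size=len(v))
    rot = fwht(signs * v)
    # 2) codebook stage: nearest centroid per coordinate (Lloyd-Max in the real code)
    idx = np.abs(rot[:, None] - codebook[None, :]).argmin(axis=1)
    # 3) 1-bit QJL sketch of the residual, plus its norm as a scalar
    r = rot - codebook[idx]
    proj = rng.choice([-1.0, 1.0], size=(len(v), len(v)))  # Rademacher rows, as in tkv_qjl_random_entry
    return idx, np.sign(proj @ r), np.linalg.norm(r), signs
```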
Full reproduction report: bench/results/turboquant_reproduction.md
Measured numbers
Llama 3.2 3B Instruct, FP32 baseline = PPL 13.56
| Config | Bits | PPL | Δ vs FP32 |
|---|---|---|---|
| `uniform_4b` (recommended) | 4 | 14.41 | +6.3% ✅ |
| `turbo_kv_4b` (paper algo) | 4 | 16.03 | +18.2% |
| `turbo_kv_3b` (paper algo) | 3 | 25.84 | +90.6% ❌ |
The paper reports near-zero degradation at 3.5-bit on Llama 3.1 8B. Our 3-bit shows +90.6%.
Ablation finding
Disabling the QJL correction stage gives byte-identical PPL. The QJL term is contributing ~0 regardless of input. So:
- The MSE codebook stage is broken — Lloyd-Max-Gaussian centroids are strictly worse than `uniform_4b`'s simple per-block min-max at the same bit budget. Real keys after a single Hadamard rotation still have heavy tails that the codebook clips.
- The QJL stage is silently dead — likely a wrong constant. Our `qjl_scale = √(π/2) / m` assumes Gaussian projection rows, but our `tkv_qjl_random_entry` returns Rademacher (±1).
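The constant mismatch is easy to check numerically. A Monte Carlo sketch, assuming the estimator form ⟨q,k⟩ ≈ (√(π/2)/m)·‖k‖·⟨Sq, sign(Sk)⟩ from the QJL paper: the √(π/2) constant is unbiased for Gaussian rows but overestimates by up to √(π/2) ≈ 1.25× for Rademacher rows on a one-hot key:

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 4, 200_000
k = np.zeros(d); k[0] = 1.0           # one-hot key: worst case for Rademacher rows
q = np.array([0.8, -0.3, 0.5, 0.1])
true_ip = float(q @ k)                # 0.8

def qjl_estimate(S):
    """Estimate <q, k> from a 1-bit sketch of k, using the Gaussian-row constant."""
    bits = np.sign(S @ k)             # the stored 1-bit sketch
    return np.sqrt(np.pi / 2) / m * np.linalg.norm(k) * float((S @ q) @ bits)

gauss = qjl_estimate(rng.standard_normal((m, d)))     # ~0.80: unbiased
rade = qjl_estimate(rng.choice([-1.0, 1.0], size=(m, d)))  # ~1.00: biased high by sqrt(pi/2)
```

For Rademacher rows with k = e₁, sign(Sk) is just the first column of S, so the raw correlation already equals q₁ and multiplying by √(π/2) inflates it by 25%. That is exactly the kind of systematic error that would make the correction term useless or harmful.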
Hypotheses (priority order)
- ❓ Per-channel outlier handling missing. The paper allocates 32 outlier channels at higher bit width ((32×3+96×2)/128 = 2.5 effective bits). We do uniform per-channel allocation. This alone could account for most of the gap.
- ❓ QJL constant wrong for Rademacher rows. The original QJL paper (arXiv:2406.03482) derives the unbiased constant for Gaussian rows. For Rademacher rows the constant differs.
- ❓ Block normalization vs vector normalization. Paper operates on full d-dim vector. We block at TQ_BK=128. For head_dim≤128 this is moot, but per-block stats vs per-vector stats may differ.
- ❓ Sketch dimension too small. The QJL paper recommends `m ≥ 2d`. We use `m = d = head_dim`, limited by `block.qjl_signs[16]` = 128 bits.
- ❓ Real key distribution after 1 RHT round. A single Hadamard rotation may not fully decorrelate channels for certain layer/head combinations. Multi-stage rotations or per-head rotation seeds may be needed.
Action items
- Add per-channel outlier extraction (start with the paper's 32-channel split)
- Verify QJL constant for Rademacher rows; compare with Gaussian-row variant
- Try `inv_std` from the empirical per-block std rather than the theoretical √d (requires storing it)
- Increase `sketch_dim` to `2*d` (requires a bigger block layout)
- Per-head rotation seeds rather than global TKV_DEFAULT_SEED
- Reach out to paper authors for reference implementation pointers
- Add a regression test that fails if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5
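The regression gate can be wired as a simple assertion; `measure_ppl` below is a hypothetical stand-in for whatever callable the bench harness ends up exposing:

```python
PPL_GATE = 14.5  # uniform_4b already hits 14.41, so turbo_kv_4b must at least
                 # reach this ballpark before leaving experimental status

def check_turbo_kv_4b_ppl(measure_ppl):
    """Fail if turbo_kv_4b regresses past the gate.

    measure_ppl is a hypothetical callable: (model, kv_type) -> perplexity."""
    ppl = measure_ppl(model="llama-3.2-3b-instruct", kv_type="turbo_kv_4b")
    assert ppl <= PPL_GATE, f"turbo_kv_4b PPL {ppl:.2f} exceeds gate {PPL_GATE}"
```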
Current recommendation for users
Use `uniform_4b` as the production KV quantization. It is competitive with llama.cpp's q4_0 KV (+6.3% vs +10.6% PPL on comparable benchmarks) at the same bit budget. `turbo_kv_*` should be considered experimental/research until this issue is closed.
Why this is public
We're being transparent because (a) misleading claims destroy reputation in OSS ML, (b) we believe in being honest about what works and what doesn't, and (c) other contributors may have insights into the gap. The README and project docs have been updated to reflect this status.
Related discussion: llama.cpp #20969 — multiple TurboQuant implementations are seeing convergence challenges, and the QJL usefulness is contested (Arclabs001's research suggests MSE-only > MSE+QJL on some configs).