
TurboQuant paper reproduction gap: turbo_kv_* doesn't match arXiv:2504.19874 quality #14

@unamedkr

Description


Summary

Our turbo_kv_3b and turbo_kv_4b types implement the algorithmic structure of Google TurboQuant (Zandieh et al., ICLR 2026) — Random Hadamard Transform → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖ scalar — but do not yet reproduce the paper's reported quality.
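For orientation, the three stages can be sketched in a few lines of NumPy. This is an illustrative toy (dense Sylvester Hadamard, a uniform 16-level grid standing in for the Lloyd-Max centroids, Gaussian sketch rows), not our actual tkv_* code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # head_dim, power of two
k = rng.standard_normal(d)                # toy "key" vector

# Stage 1: Random Hadamard Transform = random sign flips + Hadamard rotation.
H = np.array([[1.0]])
while H.shape[0] < d:                     # Sylvester construction of H_d
    H = np.block([[H, H], [H, -H]])
signs = rng.choice([-1.0, 1.0], size=d)
kr = (H @ (k * signs)) / np.sqrt(d)       # orthonormal, so norms are preserved

# Stage 2: scalar-quantize each rotated coordinate against a 16-level
# (4-bit) codebook; a uniform grid stands in for Lloyd-Max-Gaussian centroids.
codebook = np.linspace(-4.0, 4.0, 16) * kr.std()
q = codebook[np.argmin(np.abs(kr[:, None] - codebook[None, :]), axis=1)]

# Stage 3: keep a 1-bit QJL sketch of the residual plus its norm.
r = kr - q
r_norm = np.linalg.norm(r)
S = rng.standard_normal((d, d))           # projection rows (Gaussian here)
r_sketch = np.sign(S @ r)                 # 1 bit per row
```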

Full reproduction report: bench/results/turboquant_reproduction.md

Measured numbers

Llama 3.2 3B Instruct, FP32 baseline = PPL 13.56

| Config | Bits | PPL | Δ vs FP32 |
|---|---|---|---|
| uniform_4b (recommended) | 4 | 14.41 | +6.3% |
| turbo_kv_4b (paper algo) | 4 | 16.03 | +18.2% ⚠️ |
| turbo_kv_3b (paper algo) | 3 | 25.84 | +90.6% ❌ |

The paper reports near-zero degradation at 3.5-bit on Llama 3.1 8B. Our 3-bit shows +90.6%.

Ablation finding

Disabling the QJL correction stage gives byte-identical PPL, i.e. the QJL term contributes exactly nothing regardless of input. Two things follow:

  1. The MSE codebook stage is underperforming — Lloyd-Max-Gaussian centroids come out strictly worse than uniform_4b's simple per-block min-max at the same bit budget. Real keys after a single Hadamard rotation still have heavy tails that the fixed Gaussian codebook clips.

  2. The QJL stage is silently dead. A likely contributor is a wrong constant: our qjl_scale = √(π/2) / m assumes Gaussian projection rows, but tkv_qjl_random_entry returns Rademacher (±1) entries. (A mis-scaled constant alone would bias the correction rather than zero it byte-for-byte, so there may also be a separate wiring bug.)
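The Gaussian-vs-Rademacher mismatch in item 2 can be checked numerically. For a Gaussian row s, E[⟨s,q⟩·sign(⟨s,k⟩)] = √(2/π)·⟨q,k⟩/‖k‖, which is exactly where the √(π/2)/m scale comes from — and the identity fails for Rademacher rows, most visibly for sparse, axis-aligned k. A Monte Carlo sketch under an assumed estimator form (not our exact kernel):

```python
import numpy as np

def qjl_dot_estimate(q, k, S):
    """Estimate <q, k> from the 1-bit key sketch sign(S @ k) using the
    Gaussian-row constant sqrt(pi/2)/m (the qjl_scale under suspicion)."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) / m * np.linalg.norm(k) * ((S @ q) @ np.sign(S @ k))

rng = np.random.default_rng(1)
d, m = 64, 100_000
q = rng.standard_normal(d)
q[0] = 3.0                       # give q a solid component along k
q /= np.linalg.norm(q)
k = np.zeros(d)                  # axis-aligned key: worst case for Rademacher
k[0] = 1.0
true_dot = q @ k

est_gauss = qjl_dot_estimate(q, k, rng.standard_normal((m, d)))
est_rad = qjl_dot_estimate(q, k, rng.choice([-1.0, 1.0], size=(m, d)))

# With Gaussian rows the estimate converges to true_dot; with Rademacher
# rows and this k it converges to sqrt(pi/2) * true_dot, about 25% high.
```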

Hypotheses (priority order)

  1. Per-channel outlier handling missing. The paper allocates 32 outlier channels at higher bit width ((32×4 + 96×2)/128 = 2.5 effective bits), while we use uniform per-channel allocation. This alone could account for most of the gap.
  2. QJL constant wrong for Rademacher rows. The original QJL paper (arXiv:2406.03482) derives the unbiased constant for Gaussian rows. For Rademacher rows the constant differs.
  3. Block normalization vs vector normalization. Paper operates on full d-dim vector. We block at TQ_BK=128. For head_dim≤128 this is moot, but per-block stats vs per-vector stats may differ.
  4. Sketch dimension too small. QJL paper recommends m ≥ 2d. We use m = d = head_dim, limited by block.qjl_signs[16] = 128 bits.
  5. Real key distribution after 1 RHT round. A single Hadamard rotation may not fully decorrelate channels for certain layer/head combinations. Multi-stage rotations or per-head rotation seeds may be needed.
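Hypotheses 3–5 all touch the rotation stage, which is cheap to prototype offline: the RHT is per-vector sign flips followed by a fast Walsh-Hadamard transform, so per-head seeds (hypothesis 5) only change how the sign vector is derived. A sketch with a hypothetical (layer, head) seed mix — not how TKV_DEFAULT_SEED is actually consumed:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    h = np.asarray(x, dtype=float).copy()
    n = len(h)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            a = h[i:i + step].copy()
            h[i:i + step] = a + h[i + step:i + 2 * step]
            h[i + step:i + 2 * step] = a - h[i + step:i + 2 * step]
        step *= 2
    return h / np.sqrt(n)

def rht_per_head(x, layer, head, base_seed=0x54B0):
    """One RHT round whose sign vector depends on (layer, head), so each
    head gets its own rotation while staying deterministic."""
    rng = np.random.default_rng([base_seed, layer, head])
    signs = rng.choice([-1.0, 1.0], size=len(x))
    return fwht(x * signs)
```

A second round with a distinct seed gives the multi-stage variant; since each round is orthonormal, chaining rounds preserves norms and inversion stays exact.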

Action items

  • Add per-channel outlier extraction (start with the paper's 32-channel split)
  • Verify QJL constant for Rademacher rows; compare with Gaussian-row variant
  • Try inv_std from empirical per-block std rather than theoretical √d (requires storing it)
  • Increase sketch_dim to 2*d (requires bigger block layout)
  • Per-head rotation seeds rather than global TKV_DEFAULT_SEED
  • Reach out to paper authors for reference implementation pointers
  • Add a regression test that fails if turbo_kv_4b PPL on Llama 3.2 3B exceeds 14.5
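For the outlier-split item, a tiny hypothetical helper makes it easy to sanity-check candidate mixed-precision layouts before wiring anything in:

```python
def effective_bits(splits):
    """Average bits per channel for a list of (n_channels, bits) groups."""
    total_channels = sum(n for n, _ in splits)
    total_bits = sum(n * b for n, b in splits)
    return total_bits / total_channels

# A paper-style split: 32 outlier channels at a higher width plus
# 96 regular channels at 2 bits over a 128-channel head.
print(effective_bits([(32, 4), (96, 2)]))   # 2.5 effective bits
print(effective_bits([(128, 3)]))           # 3.0 = current uniform allocation
```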

Current recommendation for users

Use uniform_4b as the production KV quantization. It is competitive with llama.cpp's q4_0 KV cache (+6.3% vs +10.6% PPL on comparable benchmarks) at the same bit budget. Treat turbo_kv_* as experimental/research-only until this issue is closed.

Why this is public

We're being transparent because (a) misleading claims destroy reputation in OSS ML, (b) we believe in being honest about what works and what doesn't, and (c) other contributors may have insights into the gap. The README and project docs have been updated to reflect this status.

Related discussion: llama.cpp #20969 — multiple TurboQuant implementations are seeing convergence challenges, and the QJL usefulness is contested (Arclabs001's research suggests MSE-only > MSE+QJL on some configs).
