Skip to content

v0.7.2 — turbo_kv_5b_fast: near-lossless at fp32 parity speed (1-byte layout)

Latest

Choose a tag to compare

@unamedkr unamedkr released this 08 Apr 16:00
· 1 commit to main since this release

New Pareto point: near-lossless quality + parity speed

`turbo_kv_5b_fast` is a new variant that uses the same Variant F algorithm as `turbo_kv_5b` (RHT + 32-level Lloyd-Max codebook) but stores each 5-bit index as a full byte instead of bit-packed. This wastes 3 bits per index but eliminates the scalar bit-extraction overhead that kept `turbo_kv_5b` at -8.8% vs fp32 in v0.7.1.

Type Bytes/block Compression PPL Δ vs FP32 tok/s vs FP32 speed
FP32 KV 13.56 17.93 baseline
`turbo_kv_4b` ⭐ default 72 7.1× 14.08 +3.8% 18.13 +1.1%
`turbo_kv_5b` 🏆 quality 88 5.8× 13.65 +0.7% 16.93 -5.6%
`turbo_kv_5b_fast` 🆕 136 3.76× 13.65 +0.7% 17.53 -2.2% near-parity
`turbo_kv_3b` 56 9.1× 15.36 +13.3% 16.57 -10.1%

Why it works

The 5b inner loop bottleneck after Round 11 was the scalar bit-extraction needed to unpack 16 indices from 10 bytes (5 bits each, irregular byte boundaries). The SIMD lookup itself is 1 instruction (`vqtbl2q_s8`) but the unpack costs ~16 scalar shift+mask ops per iteration.

By storing 1 byte per index, the unpack becomes a single `vld1q_u8` (16 bytes load = 16 indices) and the inner loop becomes pure SIMD:

```c
for (d = 0; d + 15 < dim; d += 16) {
uint8x16_t indices = vld1q_u8(mi + d); // direct byte load
int8x16_t vals = vqtbl2q_s8(cb_vec, indices); // 32-entry table lookup
// ... int8 → fp32 → scale → fma
}
```

The cleanest implementation in the codebase. Same PPL as `turbo_kv_5b` (verified — 13.65 exactly, both share the same 32-level codebook).

Trade-off

`turbo_kv_5b_fast` is 1.55× larger per block than `turbo_kv_5b` (136 vs 88 bytes), so compression drops from 5.8× to 3.76×. In exchange you get fp32-parity speed for near-lossless quality.

Use case: "I want the +0.7% PPL of 5b but I can't accept the -5.6% speed gap, and I have memory to spare for the lower compression ratio."

Recommended Pareto choices:

Goal Use
Best speed × compression × decent quality `turbo_kv_4b` (default)
Near-lossless quality, max compression `turbo_kv_5b`
Near-lossless quality, parity speed `turbo_kv_5b_fast` (this release)
Max compression, OK with quality cost `turbo_kv_3b` (≥ 3B models only)

Tests

35/35 unit tests pass. Same regression test thresholds (cosine ≥ 0.999) as `turbo_kv_5b`.

What's not in v0.7.2

  • 3b/4bo/3bo NEON optimization — separate Pareto choices, lower priority
  • AVX2 / WASM SIMD ports of the NEON tbl pattern
  • Llama 3.1 8B paper baseline reproduction (memory-constrained)

Try it

```bash
./build/quant model.gguf -k turbo_kv_5b_fast -v fp16
```