TurboQuant vector compression for embedding index #371

@CalebisGross

Description

Summary

Implement Google's TurboQuant algorithm (QJL + PolarQuant) in pure Go to compress embedding vectors and accelerate similarity search. This reduces storage by 8-31x and speeds up cosine similarity by 5-20x.

Motivation

Current state:

  • Embeddings stored as full float32 BLOBs (4 bytes per dimension)
  • Cosine similarity: O(n) brute-force scan over all vectors
  • 34K memories × 128-dim BoW vectors = ~17MB index (manageable)
  • But with 384-dim ONNX embeddings: ~50MB, and growing

TurboQuant (ICLR 2026) provides:

  • 4-bit: 384-dim vector from 1536 bytes → 192 bytes (8x compression)
  • 1-bit: 384-dim → 48 bytes + 2-byte norm = 50 bytes (31x compression)
  • Similarity via XNOR + popcount: ~10-50 nanoseconds per comparison
  • No training data or calibration needed — works on any vector

Implementation Plan

1. internal/embedding/turboquant.go — Core algorithm

  • QJL projection: Random matrix (seeded PRNG for reproducibility) projects d-dim → m-dim
  • Sign quantization: 1-bit per projected dimension, packed into []uint64
  • PolarQuant: Optional multi-bit (3-4 bit) for higher fidelity
  • Norm storage: float16 or float32 scalar per vector
  • Similarity: XNOR(bits_a, bits_b), popcount via math/bits.OnesCount64, then scale by the stored norms
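
The sign-quantization and popcount steps above can be sketched as follows. This is a minimal illustration, not the final API: the names `quantize1Bit`/`bitSim` and the on-the-fly row-by-row matrix generation are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"math/rand"
)

// quantize1Bit sketches the QJL 1-bit path: project the input through
// a seeded Gaussian matrix (entries drawn from the PRNG on the fly, so
// the same seed reproduces the same matrix), keep only the sign of
// each projection, and pack 64 signs per uint64. The vector norm is
// returned separately so scores can be rescaled later.
func quantize1Bit(v []float64, m int, seed int64) (packed []uint64, norm float64) {
	rng := rand.New(rand.NewSource(seed)) // same seed => same projection matrix
	packed = make([]uint64, (m+63)/64)
	for i := 0; i < m; i++ {
		var dot float64
		for _, x := range v {
			dot += rng.NormFloat64() * x // one fresh matrix entry per (row, col)
		}
		if dot >= 0 {
			packed[i/64] |= 1 << uint(i%64)
		}
	}
	for _, x := range v {
		norm += x * x
	}
	return packed, math.Sqrt(norm)
}

// bitSim estimates cosine similarity from the Hamming distance h
// between sign codes: m-h bits agree (the XNOR count), mapped to [-1, 1].
func bitSim(a, b []uint64, m int) float64 {
	h := 0
	for i := range a {
		h += bits.OnesCount64(a[i] ^ b[i])
	}
	return 1 - 2*float64(h)/float64(m)
}

func main() {
	a := []float64{0.3, -1.2, 2.5, 0.7}
	b := []float64{0.31, -1.1, 2.4, 0.8} // close to a, so sim should be high
	pa, _ := quantize1Bit(a, 256, 7)
	pb, _ := quantize1Bit(b, 256, 7)
	fmt.Printf("sim(a,a)=%.2f sim(a,b)=%.2f\n", bitSim(pa, pa, 256), bitSim(pa, pb, 256))
}
```

Identical codes give exactly 1.0 (XOR with itself is zero), which is the round-trip property the acceptance criteria test for.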

2. internal/store/sqlite/embindex_quantized.go — Quantized index

  • Parallel to existing embindex.go (in-memory brute-force)
  • Store quantized vectors instead of float32
  • Search operates directly on compressed representations
  • Fallback to float32 for vectors that haven't been quantized yet
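The mixed search path could look like the sketch below. The `entry` record and `search` signature are assumptions for illustration; the real index lives in SQLite-backed structures.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// entry is a hypothetical index record: quantized sign bits when
// available, otherwise the full-precision float32 fallback.
type entry struct {
	bits []uint64  // 1-bit codes; nil when not yet quantized
	f32  []float32 // float32 fallback representation
}

func hammingSim(a, b []uint64, m int) float64 {
	h := 0
	for i := range a {
		h += bits.OnesCount64(a[i] ^ b[i])
	}
	return 1 - 2*float64(h)/float64(m)
}

func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// search scores the query against every entry, taking the cheap
// popcount path when both sides are quantized and falling back to
// float32 cosine otherwise.
func search(q entry, index []entry, m int) (best int, bestScore float64) {
	best, bestScore = -1, math.Inf(-1)
	for i, e := range index {
		var s float64
		if q.bits != nil && e.bits != nil {
			s = hammingSim(q.bits, e.bits, m)
		} else {
			s = cosine(q.f32, e.f32)
		}
		if s > bestScore {
			best, bestScore = i, s
		}
	}
	return best, bestScore
}

func main() {
	idx := []entry{
		{bits: []uint64{0b1010}}, // already quantized
		{f32: []float32{1, 0, 0}}, // not yet quantized: float32 path
	}
	q := entry{bits: []uint64{0b1010}, f32: []float32{1, 0, 0.1}}
	i, s := search(q, idx, 4)
	fmt.Println(i, s) // entry 0 wins on the exact bit match
}
```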

3. Storage schema

  • New column or replace existing embedding BLOB with quantized format
  • Header byte indicates format: 0x00 = float32, 0x01 = QJL-1bit, 0x04 = QJL-4bit
  • Migration: quantize existing embeddings on first load (one-time cost)
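A sketch of the header-byte dispatch, assuming little-endian float32 payloads after the tag (the actual BLOB layout is up to the implementation; only the float32 branch is fleshed out here):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Format tags from the proposed schema.
const (
	fmtFloat32 byte = 0x00 // raw float32 vector
	fmtQJL1    byte = 0x01 // QJL 1-bit codes
	fmtQJL4    byte = 0x04 // QJL 4-bit codes
)

// encodeFloat32 prefixes the header byte to little-endian float32s.
func encodeFloat32(v []float32) []byte {
	buf := make([]byte, 1+4*len(v))
	buf[0] = fmtFloat32
	for i, x := range v {
		binary.LittleEndian.PutUint32(buf[1+4*i:], math.Float32bits(x))
	}
	return buf
}

// decode dispatches on the header byte; quantized branches are stubs.
func decode(blob []byte) ([]float32, error) {
	switch blob[0] {
	case fmtFloat32:
		n := (len(blob) - 1) / 4
		out := make([]float32, n)
		for i := range out {
			out[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[1+4*i:]))
		}
		return out, nil
	case fmtQJL1, fmtQJL4:
		return nil, fmt.Errorf("quantized decode not shown in this sketch")
	default:
		return nil, fmt.Errorf("unknown embedding format 0x%02x", blob[0])
	}
}

func main() {
	blob := encodeFloat32([]float32{0.5, -1.25})
	v, err := decode(blob)
	fmt.Println(v, err) // [0.5 -1.25] <nil>
}
```

The unknown-tag error keeps the one-time migration safe: a blob written by a newer format version fails loudly instead of being misread as float32.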

4. Config

embedding:
  quantization: "4bit"  # "none", "1bit", "4bit" (default: "none")

Complexity Assessment

The core algorithm is ~200-400 lines of Go:

  • Random projection matrix: gonum/mat or hand-rolled (seeded math/rand)
  • Sign quantization + bit packing: trivial
  • Popcount similarity: math/bits.OnesCount64 (hardware-accelerated)
  • Norm storage: single float per vector

The projection matrix (384×384 float32 entries, ~590KB) is generated once from a fixed seed and held in memory; it never needs to be persisted.
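
A quick check of that claim, using the hand-rolled seeded math/rand option (the `projMatrix` helper is illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
)

// projMatrix builds the m×d Gaussian projection matrix from a fixed
// seed. Because math/rand is deterministic for a given seed, any
// process regenerates identical entries, so the matrix never needs to
// be written to disk.
func projMatrix(m, d int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float32, m*d)
	for i := range out {
		out[i] = float32(rng.NormFloat64())
	}
	return out
}

func main() {
	a := projMatrix(384, 384, 42)
	b := projMatrix(384, 384, 42)
	same := true
	for i := range a {
		if a[i] != b[i] {
			same = false
			break
		}
	}
	fmt.Println(len(a)*4, same) // 589824 true: ~590KB, byte-identical across runs
}
```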

Acceptance Criteria

  • Quantize/dequantize round-trip preserves ranking quality (test with benchmark-quality)
  • 1-bit search is >5x faster than float32 cosine on 10K vectors
  • Storage reduction measured and documented
  • Existing float32 path unaffected when quantization disabled
  • go test ./internal/embedding/... passes

Metadata

Labels: component:store (SQLite store layer), enhancement (New feature or request), v1.0 (Required for v1.0 public release)
