TurboQuant vector compression for embedding index #371
Open
Labels: component:store (SQLite store layer), enhancement (New feature or request), v1.0 (Required for v1.0 public release)
Description
Summary
Implement Google's TurboQuant algorithm (QJL + PolarQuant) in pure Go to compress embedding vectors and accelerate similarity search. This reduces storage by 8-31x and speeds up cosine similarity by 5-20x.
Motivation
Current state:
- Embeddings stored as full float32 BLOBs (4 bytes per dimension)
- Cosine similarity: O(n) brute-force scan over all vectors
- 34K memories × 128-dim bag-of-words vectors = ~17MB index (manageable)
- But with 384-dim ONNX embeddings: ~50MB, and growing
TurboQuant (ICLR 2026) provides:
- 4-bit: 384-dim vector from 1536 bytes → 192 bytes (8x compression)
- 1-bit: 384-dim → 48 bytes + 2-byte norm = 50 bytes (31x compression)
- Similarity via XNOR + popcount: ~10-50 nanoseconds per comparison
- No training data or calibration needed — works on any vector
Implementation Plan
1. internal/embedding/turboquant.go — Core algorithm
- QJL projection: Random matrix (seeded PRNG for reproducibility) projects d-dim → m-dim
- Sign quantization: 1-bit per projected dimension, packed into `[]uint64`
- PolarQuant: optional multi-bit (3-4 bit) quantization for higher fidelity
- Norm storage: float16 or float32 scalar per vector
- Similarity: `XNOR(bits_a, bits_b)` → `math/bits.OnesCount64` → scale by norms
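The bullets above can be sketched in a few lines of Go. This is a hypothetical illustration (function names `SignQuantize` and `BitSim` are ours, not from TurboQuant); `BitSim` uses the simple linear sign-agreement mapping as a rough dot-product estimate, scaled by the stored norms as described above.

```go
package main

import "math/bits"

// SignQuantize packs the signs of a projected vector into uint64 words:
// bit i is set when v[i] >= 0.
func SignQuantize(v []float32) []uint64 {
	words := make([]uint64, (len(v)+63)/64)
	for i, x := range v {
		if x >= 0 {
			words[i/64] |= 1 << uint(i%64)
		}
	}
	return words
}

// BitSim estimates the dot product of two sign-quantized vectors.
// Matching bits are counted via XNOR + popcount, the match fraction is
// mapped linearly to [-1, 1], and the result is scaled by the norms.
func BitSim(a, b []uint64, dim int, normA, normB float32) float32 {
	matches := 0
	for i := range a {
		matches += bits.OnesCount64(^(a[i] ^ b[i]))
	}
	// ^(a^b) also sets the padding bits beyond dim; subtract them.
	matches -= len(a)*64 - dim
	agree := 2*float32(matches)/float32(dim) - 1
	return agree * normA * normB
}
```

For ranking purposes the agreement fraction alone is often enough; the norm scaling only matters when comparing scores across vectors of different magnitudes.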
2. internal/store/sqlite/embindex_quantized.go — Quantized index
- Parallel to the existing `embindex.go` (in-memory brute-force)
- Store quantized vectors instead of float32
- Search operates directly on the compressed representations
- Fall back to float32 for vectors that haven't been quantized yet
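The mixed quantized/float32 scan could look like the following sketch. All types and names here are illustrative, and it assumes embeddings are L2-normalized so both scoring paths approximate cosine on a comparable scale:

```go
package main

import (
	"math/bits"
	"sort"
)

// entry is a hypothetical index record: packed bits when the vector has
// been quantized, otherwise the original float32 vector.
type entry struct {
	id   int64
	bitv []uint64 // nil => not yet quantized
	f32  []float32
}

// search scores the query against every entry, preferring the compressed
// path and falling back to float32 dot product (cosine, for unit-norm
// vectors). qBits is the sign-quantized query, q the raw query.
// Returns ids sorted by descending score.
func search(entries []entry, qBits []uint64, q []float32, dim int) []int64 {
	type scored struct {
		id    int64
		score float32
	}
	out := make([]scored, 0, len(entries))
	for _, e := range entries {
		var s float32
		if e.bitv != nil {
			matches := 0
			for i := range e.bitv {
				matches += bits.OnesCount64(^(e.bitv[i] ^ qBits[i]))
			}
			matches -= len(e.bitv)*64 - dim // drop padding bits
			s = 2*float32(matches)/float32(dim) - 1
		} else {
			for i, x := range e.f32 {
				s += x * q[i]
			}
		}
		out = append(out, scored{e.id, s})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].score > out[j].score })
	ids := make([]int64, len(out))
	for i, sc := range out {
		ids[i] = sc.id
	}
	return ids
}
```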
3. Storage schema
- New column, or replace the existing `embedding` BLOB with the quantized format
- Header byte indicates format: `0x00` = float32, `0x01` = QJL-1bit, `0x04` = QJL-4bit
- Migration: quantize existing embeddings on first load (one-time cost)
4. Config
```yaml
embedding:
  quantization: "4bit"  # "none", "1bit", "4bit" (default: "none")
```
Complexity Assessment
The core algorithm is ~200-400 lines of Go:
- Random projection matrix: `gonum/mat` or hand-rolled (seeded `math/rand`)
- Sign quantization + bit packing: trivial
- Popcount similarity: `math/bits.OnesCount64` (hardware-accelerated)
- Norm storage: single float per vector
The projection matrix (384×384 = ~590KB) is stored once and generated from a fixed seed.
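Generating the matrix from a fixed seed might look like this hand-rolled Gaussian sketch (names are illustrative, and TurboQuant's exact projection distribution may differ; Gaussian entries are the standard choice for a Johnson-Lindenstrauss transform):

```go
package main

import "math/rand"

// ProjectionMatrix generates an m×d Gaussian projection from a fixed
// seed, so every process reproduces the same matrix without persisting
// it. For m = d = 384 the materialized float32 matrix is ~590KB.
func ProjectionMatrix(seed int64, m, d int) [][]float32 {
	rng := rand.New(rand.NewSource(seed))
	mat := make([][]float32, m)
	for i := range mat {
		row := make([]float32, d)
		for j := range row {
			row[j] = float32(rng.NormFloat64())
		}
		mat[i] = row
	}
	return mat
}

// Project computes y = Av for a d-dim input v, yielding the m-dim
// vector that is then sign-quantized.
func Project(mat [][]float32, v []float32) []float32 {
	out := make([]float32, len(mat))
	for i, row := range mat {
		var s float32
		for j, a := range row {
			s += a * v[j]
		}
		out[i] = s
	}
	return out
}
```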
Acceptance Criteria
- Quantize/dequantize round-trip preserves ranking quality (test with benchmark-quality)
- 1-bit search is >5x faster than float32 cosine on 10K vectors
- Storage reduction measured and documented
- Existing float32 path unaffected when quantization disabled
- `go test ./internal/embedding/...` passes
References
- TurboQuant paper: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- QJL: Quantized Johnson-Lindenstrauss transform
- PolarQuant: Polar coordinate quantization (AISTATS 2026)
- Parent: Remove LLM dependency from cognitive pipeline — heuristic-first architecture #369