TurboQuant vector compression for embedding index #371
Open
Labels: component:store (SQLite store layer), enhancement (New feature or request), v1.0 (Required for v1.0 public release)
Description
Summary
Implement Google's TurboQuant algorithm (QJL + PolarQuant) in pure Go to compress embedding vectors and accelerate similarity search. This reduces storage by 8-31x and speeds up cosine similarity by 5-20x.
Motivation
Current state:
- Embeddings stored as full float32 BLOBs (4 bytes per dimension)
- Cosine similarity: O(n) brute-force scan over all vectors
- 34K memories × 128-dim bag-of-words vectors = ~17MB index (manageable)
- But with 384-dim ONNX embeddings: ~50MB, and growing
TurboQuant (ICLR 2026) provides:
- 4-bit: 384-dim vector from 1536 bytes → 192 bytes (8x compression)
- 1-bit: 384-dim → 48 bytes + 2-byte norm = 50 bytes (31x compression)
- Similarity via XNOR + popcount: ~10-50 nanoseconds per comparison
- No training data or calibration needed — works on any vector
Implementation Plan
1. internal/embedding/turboquant.go — Core algorithm
- QJL projection: Random matrix (seeded PRNG for reproducibility) projects d-dim → m-dim
- Sign quantization: 1-bit per projected dimension, packed into `[]uint64`
- PolarQuant: optional multi-bit (3-4 bit) quantization for higher fidelity
- Norm storage: float16 or float32 scalar per vector
- Similarity: `XNOR(bits_a, bits_b)` → `math/bits.OnesCount64` → scale by norms
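The bullets above can be sketched in a few lines of Go. This is a hypothetical illustration (function names `SignQuantize` and `BitSim` are ours, not from TurboQuant); `BitSim` uses the simple linear sign-agreement mapping as a rough dot-product estimate, scaled by the stored norms as described above.

```go
package main

import "math/bits"

// SignQuantize packs the signs of a projected vector into uint64 words:
// bit i is set when v[i] >= 0.
func SignQuantize(v []float32) []uint64 {
	words := make([]uint64, (len(v)+63)/64)
	for i, x := range v {
		if x >= 0 {
			words[i/64] |= 1 << uint(i%64)
		}
	}
	return words
}

// BitSim estimates the dot product of two sign-quantized vectors.
// Matching bits are counted via XNOR + popcount, the match fraction is
// mapped linearly to [-1, 1], and the result is scaled by the norms.
func BitSim(a, b []uint64, dim int, normA, normB float32) float32 {
	matches := 0
	for i := range a {
		matches += bits.OnesCount64(^(a[i] ^ b[i]))
	}
	// ^(a^b) also sets the padding bits beyond dim; subtract them.
	matches -= len(a)*64 - dim
	agree := 2*float32(matches)/float32(dim) - 1
	return agree * normA * normB
}
```

For ranking purposes the agreement fraction alone is often enough; the norm scaling only matters when comparing scores across vectors of different magnitudes.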
2. internal/store/sqlite/embindex_quantized.go — Quantized index
- Parallel to the existing `embindex.go` (in-memory brute-force)
- Store quantized vectors instead of float32
- Search operates directly on the compressed representations
- Fall back to float32 for vectors that haven't been quantized yet
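The mixed quantized/float32 scan could look like the following sketch. All types and names here are illustrative, and it assumes embeddings are L2-normalized so both scoring paths approximate cosine on a comparable scale:

```go
package main

import (
	"math/bits"
	"sort"
)

// entry is a hypothetical index record: packed bits when the vector has
// been quantized, otherwise the original float32 vector.
type entry struct {
	id   int64
	bitv []uint64 // nil => not yet quantized
	f32  []float32
}

// search scores the query against every entry, preferring the compressed
// path and falling back to float32 dot product (cosine, for unit-norm
// vectors). qBits is the sign-quantized query, q the raw query.
// Returns ids sorted by descending score.
func search(entries []entry, qBits []uint64, q []float32, dim int) []int64 {
	type scored struct {
		id    int64
		score float32
	}
	out := make([]scored, 0, len(entries))
	for _, e := range entries {
		var s float32
		if e.bitv != nil {
			matches := 0
			for i := range e.bitv {
				matches += bits.OnesCount64(^(e.bitv[i] ^ qBits[i]))
			}
			matches -= len(e.bitv)*64 - dim // drop padding bits
			s = 2*float32(matches)/float32(dim) - 1
		} else {
			for i, x := range e.f32 {
				s += x * q[i]
			}
		}
		out = append(out, scored{e.id, s})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].score > out[j].score })
	ids := make([]int64, len(out))
	for i, sc := range out {
		ids[i] = sc.id
	}
	return ids
}
```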
3. Storage schema
- New column, or replace the existing `embedding` BLOB with the quantized format
- Header byte indicates format: `0x00` = float32, `0x01` = QJL-1bit, `0x04` = QJL-4bit
- Migration: quantize existing embeddings on first load (one-time cost)
4. Config
```yaml
embedding:
  quantization: "4bit"  # "none", "1bit", "4bit" (default: "none")
```
Complexity Assessment
The core algorithm is ~200-400 lines of Go:
- Random projection matrix: `gonum/mat` or hand-rolled (seeded `math/rand`)
- Sign quantization + bit packing: trivial
- Popcount similarity: `math/bits.OnesCount64` (hardware-accelerated)
- Norm storage: single float per vector
The projection matrix (384×384 = ~590KB) is stored once and generated from a fixed seed.
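Generating the matrix from a fixed seed might look like this hand-rolled Gaussian sketch (names are illustrative, and TurboQuant's exact projection distribution may differ; Gaussian entries are the standard choice for a Johnson-Lindenstrauss transform):

```go
package main

import "math/rand"

// ProjectionMatrix generates an m×d Gaussian projection from a fixed
// seed, so every process reproduces the same matrix without persisting
// it. For m = d = 384 the materialized float32 matrix is ~590KB.
func ProjectionMatrix(seed int64, m, d int) [][]float32 {
	rng := rand.New(rand.NewSource(seed))
	mat := make([][]float32, m)
	for i := range mat {
		row := make([]float32, d)
		for j := range row {
			row[j] = float32(rng.NormFloat64())
		}
		mat[i] = row
	}
	return mat
}

// Project computes y = Av for a d-dim input v, yielding the m-dim
// vector that is then sign-quantized.
func Project(mat [][]float32, v []float32) []float32 {
	out := make([]float32, len(mat))
	for i, row := range mat {
		var s float32
		for j, a := range row {
			s += a * v[j]
		}
		out[i] = s
	}
	return out
}
```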
Acceptance Criteria
- Quantize/dequantize round-trip preserves ranking quality (test with benchmark-quality)
- 1-bit search is >5x faster than float32 cosine on 10K vectors
- Storage reduction measured and documented
- Existing float32 path unaffected when quantization disabled
- `go test ./internal/embedding/...` passes
References
- TurboQuant paper: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- QJL: Quantized Johnson-Lindenstrauss transform
- PolarQuant: Polar coordinate quantization (AISTATS 2026)
- Parent: Remove LLM dependency from cognitive pipeline — heuristic-first architecture #369