
feat: heuristic embedding pipeline — remove all LLM dependency#374

Open
CalebisGross wants to merge 12 commits into main from feat/heuristic-pipeline

Conversation

@CalebisGross
Collaborator

Summary

  • Remove all 12 generative LLM (Complete) calls from 8 cognitive agents
  • Replace llm.Provider with new embedding.Provider interface (Embed/BatchEmbed/Health only)
  • Add 3 embedding providers: bow (128-dim, instant, air-gapped), hugot (384-dim MiniLM-L6-v2, pure Go), api (OpenAI-compatible)
  • Add RAKE keyword extraction for multi-word concept phrases
  • Add TurboQuant 1-bit vector compression (29.5x, ~8ns/comparison)
  • Integrate TurboQuant into embedding index with two-stage retrieval
  • Enhance the backfill endpoint for re-embedding, with dimension-mismatch detection
  • Migrate all CLI commands to embedding.Provider

Performance (measured on production DB, 34K memories)

| Metric   | LLM (Gemini)    | Heuristic (bow) | Heuristic (hugot)        |
|----------|-----------------|-----------------|--------------------------|
| Encoding | 39,426ms        | 6ms             | ~120ms                   |
| Recall   | 8,876ms         | ~1,800ms        | ~3,400ms                 |
| Failures | ~20%            | 0%              | 0%                       |
| Network  | 2+ calls/memory | 0               | 0 (after model download) |
| Binary   | 16MB            | 16MB            | 28MB                     |

Commits (7)

  1. 0fae91c — Core: remove all LLM, new embedding package, migrate 10 agents (-1,805 lines)
  2. cd78932 — RAKE keyword extraction (multi-word phrases)
  3. 5972933 — Hugot pure Go MiniLM-L6-v2 embeddings (384-dim, zero CGo)
  4. 9b4b730 — TurboQuant 1-bit vector compression
  5. 3bc5fe9 — Integrate TurboQuant into store + enhanced backfill
  6. 5a821f0 — Fix hugot text truncation for long inputs
  7. 5630a84 — Migrate CLI commands to embedding.Provider

Config

embedding:
  provider: bow    # "bow" (instant, air-gapped), "hugot" (transformer, air-gapped), "api" (cloud)

Test plan

  • go build ./... — clean
  • go test ./... — 21 packages pass, 0 failures
  • go vet ./... — clean
  • Daemon starts with all 3 provider modes
  • MCP tools work (recall, remember, status)
  • Dashboard loads
  • Encoding pipeline processes events
  • Backfill endpoint re-embeds with correct dimensions
  • TurboQuant similarity tests pass (ordering preserved, compression verified)
  • Full backfill of 34K memories to hugot 384-dim (in progress)

Closes #369, closes #370, closes #372. Partially addresses #371 (algorithm done, index integrated).

🤖 Generated with Claude Code

CalebisGross and others added 12 commits March 30, 2026 11:50
Remove all 12 generative LLM (Complete) calls from 8 cognitive agents,
replacing them with heuristic/algorithmic Go implementations. Introduce
new embedding.Provider interface (Embed/BatchEmbed/Health only) to
replace llm.Provider for agent dependencies.

Key changes:
- New internal/embedding/ package: Provider interface, BowProvider
  (128-dim bag-of-words), APIProvider, InstrumentedProvider, LLMAdapter
- Perception: remove LLM gate, heuristic scoring is sole path
- Encoding: promote fallbackCompression to primary, vocabulary-aware
  concept extraction via ExtractTopConcepts
- Retrieval: drop LLM synthesis entirely (consuming agents synthesize)
- Episoding: algorithmic time-window clustering with concept titles
- Consolidation: highest-salience picker for gist, statistical concept
  co-occurrence for pattern detection
- Dreaming: graph bridge detection replaces LLM insight generation
- Abstraction: hierarchical concept clustering for principles/axioms
- Reactor: static personality responses for @mentions
- Config: new embedding.provider field ("bow" for air-gapped, "api"
  for OpenAI-compatible endpoint, auto-detect from llm config)

Results on production DB (34K memories):
- Encoding: 39,426ms → 6ms (6,571x faster)
- Recall: 8,876ms → 6,200ms (30% faster)
- Encoding failures: 20% → 0%
- Network calls: 2+ per memory → 0 (fully air-gapped with bow)

Net: -1,805 lines (581 added, 2,386 removed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement Rapid Automatic Keyword Extraction (RAKE) in pure Go for
multi-word phrase detection. The encoding pipeline now uses a hybrid
approach: RAKE extracts domain-adaptive phrases first, then vocabulary
terms supplement with consistent single-word tags.

Before (vocabulary only):
  "Docker build failing on ARM64 with exit code 137"
  → [docker, build]

After (RAKE + vocabulary):
  → [exit code 137, oom killer, docker build failing, arm64, docker, build]
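
The core of RAKE fits in a short sketch: split text into candidate phrases at stopwords, score each word by degree/frequency, and rank phrases by the sum of their word scores. The stopword list below is a tiny illustrative subset, not the one shipped in rake.go:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

var stopwords = map[string]bool{
	"on": true, "with": true, "the": true, "a": true, "and": true,
	"of": true, "is": true, "to": true, "in": true, "for": true,
}

// rake returns candidate phrases ranked by RAKE score (highest first).
func rake(text string) []string {
	// Phase 1: candidate phrases are maximal runs of non-stopwords.
	var phrases [][]string
	var cur []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		w = strings.Trim(w, ".,;:!?")
		if w == "" || stopwords[w] {
			if len(cur) > 0 {
				phrases, cur = append(phrases, cur), nil
			}
			continue
		}
		cur = append(cur, w)
	}
	if len(cur) > 0 {
		phrases = append(phrases, cur)
	}
	// Phase 2: word score = degree / frequency.
	freq, degree := map[string]float64{}, map[string]float64{}
	for _, p := range phrases {
		for _, w := range p {
			freq[w]++
			degree[w] += float64(len(p)) // co-occurring words, incl. self
		}
	}
	// Phase 3: phrase score = sum of member word scores.
	type scored struct {
		phrase string
		score  float64
	}
	var ranked []scored
	for _, p := range phrases {
		var s float64
		for _, w := range p {
			s += degree[w] / freq[w]
		}
		ranked = append(ranked, scored{strings.Join(p, " "), s})
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].score > ranked[j].score })
	out := make([]string, len(ranked))
	for i, s := range ranked {
		out[i] = s.phrase
	}
	return out
}

func main() {
	fmt.Println(rake("Docker build failing on ARM64 with exit code 137"))
}
```

On the example sentence this yields the multi-word phrases "docker build failing" and "exit code 137" ahead of the lone term "arm64", matching the before/after shown above.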

New files:
- internal/embedding/rake.go — RAKE algorithm (~160 lines)
- internal/embedding/rake_test.go — 10 test cases

Modified:
- internal/embedding/bow.go — ExtractConcepts() hybrid function,
  GenerateEncodingResponse() uses RAKE-first extraction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Integrate knights-analytics/hugot for transformer-quality embeddings
with zero CGo, zero shared libraries — true single-binary deployment.

The hugot provider uses GoMLX simplego backend to run all-MiniLM-L6-v2
entirely in Go. Model auto-downloads from HuggingFace on first use
(~90MB, stored in ~/.mnemonic/models/).

Config:
  embedding:
    provider: hugot  # pure Go, 384-dim, air-gapped

Performance (measured on production daemon):
- Embedding latency: 108-325ms per text (CPU, pure Go)
- Dimensions: 384 (vs bow-128, vs Gemini-3072)
- Binary size: 16MB → 28MB (+12MB from GoMLX runtime)
- No network calls after initial model download

Quality:
- RAKE concepts + 384-dim transformer embeddings
- "docker buildx crashing", "exit code 137", "oom killer", "arm64"
  now all captured as concepts AND semantically searchable

Note: Existing memories retain their old embeddings (3072-dim Gemini
or 128-dim bow). A backfill is needed to re-embed with hugot for
consistent retrieval quality. Use /api/v1/backfill-embeddings.

Closes #370

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuant (QJL stage): pure Go implementation of 1-bit quantized
Johnson-Lindenstrauss vector compression. Compresses 384-dim float32
vectors (1536 bytes) to 52 bytes (48 bytes of sign bits + a 4-byte norm) — 29.5x
compression. Similarity via XNOR + popcount (math/bits.OnesCount64).
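
The 1-bit stage can be sketched directly: pack sign bits into uint64 words and count matching signs via XNOR + popcount. This sketch omits the random Johnson-Lindenstrauss projection that precedes quantization in TurboQuant, and the helper names are illustrative, not the ones in turboquant.go:

```go
package main

import (
	"fmt"
	"math/bits"
)

// quantize packs the sign bit of each dimension into uint64 words.
func quantize(v []float32) []uint64 {
	words := make([]uint64, (len(v)+63)/64)
	for i, x := range v {
		if x >= 0 {
			words[i/64] |= 1 << uint(i%64)
		}
	}
	return words
}

// similarity counts matching sign bits via XNOR + popcount
// (math/bits.OnesCount64). Higher = more similar.
func similarity(a, b []uint64, dim int) int {
	match := 0
	for i := range a {
		match += bits.OnesCount64(^(a[i] ^ b[i]))
	}
	// ^(a^b) also sets the padding bits beyond dim; subtract them.
	return match - (len(a)*64 - dim)
}

func main() {
	x := []float32{1, -2, 3, -4}
	y := []float32{1, -2, -3, -4} // differs in one sign
	fmt.Println(similarity(quantize(x), quantize(y), 4)) // 3
}
```

At 384 dims the comparison is six XNOR + popcount operations over uint64 words, which is where the ~8ns figure comes from.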

New files:
- internal/embedding/turboquant.go — Quantizer, QuantizedVector,
  Similarity, packBits/getBit helpers
- internal/embedding/turboquant_test.go — 7 tests + 2 benchmarks

Backfill endpoint upgrade:
- Supports ?mode=all to re-embed ALL memories (not just missing)
- Detects dimension mismatch (e.g. 3072-dim Gemini vs 384-dim hugot)
- Progress logging every 100 memories
- 30-minute timeout (was 5 min)
- Configurable ?limit (default 500, max 5000)

Note: TurboQuant is implemented but not yet integrated into the
embedding index. Integration requires replacing the float32 index
with a quantized index that stores QuantizedVectors and uses
Similarity() for search. This is a follow-up task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire TurboQuant quantized index into SQLiteStore alongside the existing
float32 index. SearchByEmbedding now uses two-stage retrieval:
1. TurboQuant approximate search (XNOR + popcount, ~8ns per comparison)
2. Exact cosine re-ranking on top candidates

The quantized index runs in parallel with the float32 index. Both are
populated on Add/Remove. Search prefers the quantized index when it has
entries matching the query dimension, falling back to float32 for mixed
dimension scenarios (backward compat during migration).
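
The two-stage flow can be sketched end to end. Here the approximate pass scores candidates by sign agreement, standing in for the XNOR+popcount pass on quantized vectors; the function names and the candidate multiplier are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// signAgreement is a cheap proxy for the quantized Hamming similarity.
func signAgreement(a, b []float32) int {
	n := 0
	for i := range a {
		if (a[i] >= 0) == (b[i] >= 0) {
			n++
		}
	}
	return n
}

func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// twoStageSearch keeps the top k*multiplier candidates from the cheap
// pass, then exact cosine re-ranks the survivors down to k.
func twoStageSearch(query []float32, index [][]float32, k, multiplier int) []int {
	cand := make([]int, len(index))
	for i := range index {
		cand[i] = i
	}
	// Stage 1: approximate pre-filter over the whole index.
	sort.Slice(cand, func(i, j int) bool {
		return signAgreement(query, index[cand[i]]) > signAgreement(query, index[cand[j]])
	})
	if len(cand) > k*multiplier {
		cand = cand[:k*multiplier]
	}
	// Stage 2: exact cosine re-rank on the survivors.
	sort.Slice(cand, func(i, j int) bool {
		return cosine(query, index[cand[i]]) > cosine(query, index[cand[j]])
	})
	if len(cand) > k {
		cand = cand[:k]
	}
	return cand
}

func main() {
	index := [][]float32{{1, 1, 0}, {-1, -1, 0}, {1, 0.9, 0.1}}
	fmt.Println(twoStageSearch([]float32{1, 1, 0}, index, 1, 2)) // [0]
}
```

The exact re-rank is what lets a lossy pre-filter keep the final top-k accurate: stage 1 only has to avoid dropping true neighbors, not rank them perfectly.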

New file: internal/store/sqlite/embindex_quantized.go

Backfill endpoint enhanced:
- ?mode=all re-embeds ALL memories (not just missing)
- Detects dimension mismatch (e.g. 3072-dim → 384-dim)
- Progress logging every 100 memories
- 30-minute timeout, configurable ?limit (max 5000)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MiniLM-L6-v2 has a 256-token max sequence length. Long texts (>512
tokens) caused a shape mismatch panic in the GoMLX backend. Fixed by
truncating input to 900 chars (~225 tokens) before passing to the
pipeline.

Also wires quantized index into SearchByEmbedding with fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update memory_cli.go, cycle.go, and diagnose.go to use
initEmbeddingRuntime() instead of initRuntime(). CLI commands
(remember, recall, consolidate, meta-cycle, dream-cycle, diagnose)
now use the same embedding.Provider as the daemon.

The diagnose command checks embedding provider health instead of
LLM provider health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backfill endpoint was re-processing the same recent memories on
every batch because ListMemories returns newest-first and mode=all
did not skip already-correct dimensions. Fixed by paginating through
all memories via offset and always skipping correct dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive benchmarks for float32 vs TurboQuant search at
production scale (34K memories, 384-dim).

Results (Ryzen 7 5800X):
- Gemini 3072-dim float32: 72.6ms/search
- Hugot 384-dim float32: 13.1ms/search (5.5x faster)
- TurboQuant 1-bit 384-dim: 2.8ms/search (25.9x faster)
- Storage: 400MB → 49MB → 1.7MB (235x compression)

TurboQuant recall@10 improved from 26.5% to 53% by increasing
candidate multiplier from 4x to 20x. The two-stage retrieval
(quantized pre-filter → exact re-rank) compensates for 1-bit
precision loss.

Quality test: 53% recall@10 vs float32 ground truth (acceptable
for pre-filtering, exact re-ranking ensures final accuracy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two optimizations that dramatically reduce recall latency:

1. Cap fan-out to 15 strongest associations per node during spread
   activation. Hub memories (100-350 links) were causing exponential
   explosion in the traversal. Now follows only the top 15 by strength.

2. Defer Hebbian activation writes to a background goroutine instead
   of writing per-edge during search. This was causing a DB write for
   every association traversed.

Combined with the earlier pruning of 220K dead ingest associations:

| Query                     | Before   | After   | Speedup |
|---------------------------|----------|---------|---------|
| SQLite FTS5 retrieval     | 10,675ms | 1,887ms | 5.7x    |
| Go context timeout        | 13,822ms | 3,670ms | 3.8x    |
| SQL query associations    | 12,305ms | 2,477ms | 5.0x    |
| nil pointer consolidation | 2,257ms  | 1,133ms | 2.0x    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace PolarQuant (projected 4-bit) with simpler scalar int8
quantization that operates directly on embedding dimensions without
a projection matrix. 3.9x compression, no projection overhead.
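
The scalar scheme is simple enough to sketch: one scale per vector derived from its max absolute value, int8 codes per dimension, and an int32-accumulated dot product. Function names are illustrative, not the committed API:

```go
package main

import "fmt"

// int8Quantize maps each dimension to [-127, 127] with a single
// per-vector scale. No projection matrix is involved.
func int8Quantize(v []float32) (q []int8, scale float32) {
	var maxAbs float32
	for _, x := range v {
		if x < 0 {
			x = -x
		}
		if x > maxAbs {
			maxAbs = x
		}
	}
	q = make([]int8, len(v))
	if maxAbs == 0 {
		return q, 0
	}
	scale = maxAbs / 127
	for i, x := range v {
		q[i] = int8(x / scale)
	}
	return q, scale
}

// dot accumulates in int32 and rescales: 4 bytes of float32 per dim
// become 1 byte of int8 (the ~3.9x figure includes the stored scale).
func dot(a, b []int8, sa, sb float32) float32 {
	var acc int32
	for i := range a {
		acc += int32(a[i]) * int32(b[i])
	}
	return float32(acc) * sa * sb
}

func main() {
	q, s := int8Quantize([]float32{0.5, -1.0, 0.25})
	fmt.Println(q, s)
	fmt.Println(dot(q, q, s, s)) // ≈ 0.5² + 1.0² + 0.25²
}
```

Operating directly on the embedding dimensions is what removes the projection overhead relative to the dropped PolarQuant variant.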

More importantly: switch SearchByEmbedding to use float32 as the
primary index (100% recall, 13ms at 34K) instead of quantized
(53% recall). The quantized index is maintained in parallel for
future use at larger scales (100K+).

At 34K memories, the float32 brute-force search is 13ms — not the
bottleneck. The spread activation fan-out optimization (previous
commit) had far more impact on latency than any quantization scheme.

Quantization options now available:
- QJL 1-bit: 29.5x compression, 53% recall, 2.8ms search
- Scalar int8: 3.9x compression, 42% recall, ~5ms search
- Float32: 1x (baseline), 100% recall, 13ms search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
