feat: heuristic embedding pipeline — remove all LLM dependency #374
Open
CalebisGross wants to merge 12 commits into main from
Conversation
Remove all 12 generative LLM (Complete) calls from 8 cognitive agents,
replacing them with heuristic/algorithmic Go implementations. Introduce
new embedding.Provider interface (Embed/BatchEmbed/Health only) to
replace llm.Provider for agent dependencies.
Key changes:
- New internal/embedding/ package: Provider interface, BowProvider
(128-dim bag-of-words), APIProvider, InstrumentedProvider, LLMAdapter
- Perception: remove LLM gate, heuristic scoring is sole path
- Encoding: promote fallbackCompression to primary, vocabulary-aware
concept extraction via ExtractTopConcepts
- Retrieval: drop LLM synthesis entirely (consuming agents synthesize)
- Episoding: algorithmic time-window clustering with concept titles
- Consolidation: highest-salience picker for gist, statistical concept
co-occurrence for pattern detection
- Dreaming: graph bridge detection replaces LLM insight generation
- Abstraction: hierarchical concept clustering for principles/axioms
- Reactor: static personality responses for @mentions
- Config: new embedding.provider field ("bow" for air-gapped, "api"
for OpenAI-compatible endpoint, auto-detect from llm config)
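The new embedding.provider field might look like this; the "bow"/"api" values and the auto-detect behavior are from the commit message, while the exact YAML layout is an assumption modeled on the hugot config example elsewhere in this PR:

```yaml
embedding:
  provider: bow   # 128-dim bag-of-words, fully air-gapped
  # provider: api # OpenAI-compatible endpoint; auto-detected from llm config if unset
```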
Results on production DB (34K memories):
- Encoding: 39,426ms → 6ms (6,571x faster)
- Recall: 8,876ms → 6,200ms (30% faster)
- Encoding failures: 20% → 0%
- Network calls: 2+ per memory → 0 (fully air-gapped with bow)
Net: -1,805 lines (581 added, 2,386 removed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement Rapid Automatic Keyword Extraction (RAKE) in pure Go for
multi-word phrase detection. The encoding pipeline now uses a hybrid
approach: RAKE extracts domain-adaptive phrases first, then vocabulary
terms supplement with consistent single-word tags.

Before (vocabulary only):
  "Docker build failing on ARM64 with exit code 137" → [docker, build]
After (RAKE + vocabulary):
  → [exit code 137, oom killer, docker build failing, arm64, docker, build]

New files:
- internal/embedding/rake.go — RAKE algorithm (~160 lines)
- internal/embedding/rake_test.go — 10 test cases

Modified:
- internal/embedding/bow.go — ExtractConcepts() hybrid function,
  GenerateEncodingResponse() uses RAKE-first extraction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
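A minimal RAKE sketch in Go, to show the shape of the algorithm rather than the PR's ~160-line implementation: candidate phrases are maximal runs of non-stopwords, each word is scored degree/frequency, and a phrase scores the sum of its word scores. The stopword list here is a tiny illustrative subset:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// stop is a deliberately tiny stopword set for illustration;
// real RAKE uses a full stopword list.
var stop = map[string]bool{"on": true, "with": true, "the": true, "a": true, "is": true}

func rake(text string) []string {
	words := strings.Fields(strings.ToLower(text))
	var phrases [][]string
	var cur []string
	for _, w := range words {
		w = strings.Trim(w, ".,;:!?")
		if w == "" || stop[w] {
			if len(cur) > 0 { // stopword ends the current candidate phrase
				phrases = append(phrases, cur)
				cur = nil
			}
			continue
		}
		cur = append(cur, w)
	}
	if len(cur) > 0 {
		phrases = append(phrases, cur)
	}
	// Word degree counts co-occurring words within phrases (incl. itself);
	// frequency counts appearances. Phrase score = sum of degree/freq.
	freq, degree := map[string]float64{}, map[string]float64{}
	for _, p := range phrases {
		for _, w := range p {
			freq[w]++
			degree[w] += float64(len(p))
		}
	}
	score := func(p []string) (s float64) {
		for _, w := range p {
			s += degree[w] / freq[w]
		}
		return
	}
	sort.SliceStable(phrases, func(i, j int) bool { return score(phrases[i]) > score(phrases[j]) })
	out := make([]string, len(phrases))
	for i, p := range phrases {
		out[i] = strings.Join(p, " ")
	}
	return out
}

func main() {
	fmt.Println(rake("Docker build failing on ARM64 with exit code 137"))
	// → [docker build failing exit code 137 arm64]
}
```

Note how the degree/frequency scoring naturally favors multi-word phrases ("docker build failing", "exit code 137") over lone terms like "arm64" — exactly the behavior the Before/After example above relies on.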
Integrate knights-analytics/hugot for transformer-quality embeddings
with zero CGo, zero shared libraries — true single-binary deployment.
The hugot provider uses GoMLX simplego backend to run all-MiniLM-L6-v2
entirely in Go. Model auto-downloads from HuggingFace on first use
(~90MB, stored in ~/.mnemonic/models/).
Config:
embedding:
provider: hugot # pure Go, 384-dim, air-gapped
Performance (measured on production daemon):
- Embedding latency: 108-325ms per text (CPU, pure Go)
- Dimensions: 384 (vs bow-128, vs Gemini-3072)
- Binary size: 16MB → 28MB (+12MB from GoMLX runtime)
- No network calls after initial model download
Quality:
- RAKE concepts + 384-dim transformer embeddings
- "docker buildx crashing", "exit code 137", "oom killer", "arm64"
now all captured as concepts AND semantically searchable
Note: Existing memories retain their old embeddings (3072-dim Gemini
or 128-dim bow). A backfill is needed to re-embed with hugot for
consistent retrieval quality. Use /api/v1/backfill-embeddings.
Closes #370
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuant (QJL stage): pure Go implementation of 1-bit quantized
Johnson-Lindenstrauss vector compression. Compresses 384-dim float32
vectors (1536 bytes) to 52 bytes (48 bytes of sign bits + 4-byte
norm) — 29.5x compression. Similarity via XNOR + popcount
(math/bits.OnesCount64).

New files:
- internal/embedding/turboquant.go — Quantizer, QuantizedVector,
  Similarity, packBits/getBit helpers
- internal/embedding/turboquant_test.go — 7 tests + 2 benchmarks

Backfill endpoint upgrade:
- Supports ?mode=all to re-embed ALL memories (not just missing)
- Detects dimension mismatch (e.g. 3072-dim Gemini vs 384-dim hugot)
- Progress logging every 100 memories
- 30-minute timeout (was 5 min)
- Configurable ?limit (default 500, max 5000)

Note: TurboQuant is implemented but not yet integrated into the
embedding index. Integration requires replacing the float32 index with
a quantized index that stores QuantizedVectors and uses Similarity()
for search. This is a follow-up task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
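The 1-bit pack-and-popcount core can be sketched as below. This is an illustrative simplification, not the PR's turboquant.go: it keeps only the sign of each raw dimension, whereas the real QJL stage first applies a Johnson-Lindenstrauss projection before taking signs:

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// QuantizedVector stores one sign bit per dimension plus the vector norm,
// matching the 48-bytes-of-bits + 4-byte-norm layout described above.
type QuantizedVector struct {
	Bits []uint64
	Norm float32
}

// Quantize packs sign bits into uint64 words (sketch: no JL projection).
func Quantize(v []float32) QuantizedVector {
	q := QuantizedVector{Bits: make([]uint64, (len(v)+63)/64)}
	var sum float64
	for i, x := range v {
		sum += float64(x) * float64(x)
		if x >= 0 {
			q.Bits[i/64] |= 1 << uint(i%64)
		}
	}
	q.Norm = float32(math.Sqrt(sum))
	return q
}

// Similarity counts agreeing bits via XNOR + popcount and maps the
// agreement fraction into [-1, 1] as a crude cosine stand-in.
func Similarity(a, b QuantizedVector, dim int) float32 {
	match := 0
	for i := range a.Bits {
		match += bits.OnesCount64(^(a.Bits[i] ^ b.Bits[i]))
	}
	match -= 64*len(a.Bits) - dim // discard matches on padding bits beyond dim
	return 2*float32(match)/float32(dim) - 1
}

func main() {
	a := Quantize([]float32{1, -2, 3, -4})
	b := Quantize([]float32{2, -1, 4, -3})
	fmt.Println(Similarity(a, b, 4)) // all signs agree → 1
}
```

The ~8ns/comparison figure quoted later in this PR comes from this kind of inner loop: a 384-dim comparison is six XNOR + OnesCount64 operations instead of 384 float multiplies.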
Wire TurboQuant quantized index into SQLiteStore alongside the existing
float32 index. SearchByEmbedding now uses two-stage retrieval:
1. TurboQuant approximate search (XNOR + popcount, ~8ns per comparison)
2. Exact cosine re-ranking on top candidates

The quantized index runs in parallel with the float32 index. Both are
populated on Add/Remove. Search prefers the quantized index when it has
entries matching the query dimension, falling back to float32 for
mixed-dimension scenarios (backward compat during migration).

New file: internal/store/sqlite/embindex_quantized.go

Backfill endpoint enhanced:
- ?mode=all re-embeds ALL memories (not just missing)
- Detects dimension mismatch (e.g. 3072-dim → 384-dim)
- Progress logging every 100 memories
- 30-minute timeout, configurable ?limit (max 5000)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
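The two-stage shape can be sketched as follows. This is structure only, not the PR's SQLiteStore code: a toy sign-bit Hamming pass (dims ≤ 64) stands in for the TurboQuant pre-filter, and the candidate-pool size is a plain parameter:

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"sort"
)

// signBits is a toy stand-in for a quantized code (supports dim <= 64).
// Note zeros count as non-negative and set their bit.
func signBits(v []float32) (b uint64) {
	for i, x := range v {
		if x >= 0 {
			b |= 1 << uint(i)
		}
	}
	return
}

func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// search runs the two stages: cheap Hamming pre-filter to a candidate
// pool, then exact cosine re-ranking of the survivors.
func search(query []float32, index [][]float32, k, pool int) []int {
	qb := signBits(query)
	cand := make([]int, len(index))
	for i := range index {
		cand[i] = i
	}
	// Stage 1: approximate — fewest differing sign bits first.
	sort.Slice(cand, func(i, j int) bool {
		return bits.OnesCount64(qb^signBits(index[cand[i]])) <
			bits.OnesCount64(qb^signBits(index[cand[j]]))
	})
	if len(cand) > pool {
		cand = cand[:pool]
	}
	// Stage 2: exact — re-rank the pool by true cosine similarity.
	sort.Slice(cand, func(i, j int) bool {
		return cosine(query, index[cand[i]]) > cosine(query, index[cand[j]])
	})
	if len(cand) > k {
		cand = cand[:k]
	}
	return cand
}

func main() {
	index := [][]float32{{1, 0, 0}, {0, 1, 0}, {0.9, 0.1, 0}, {-1, 0, 0}}
	fmt.Println(search([]float32{1, 0, 0}, index, 2, 3)) // → [0 2]
}
```

The design point is that stage 1 only has to avoid dropping true neighbors from the pool; stage 2's exact cosine pass fixes any misordering the 1-bit codes introduce.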
MiniLM-L6-v2 has a 256-token max sequence length. Long texts
(>512 tokens) caused a shape-mismatch panic in the GoMLX backend. Fixed
by truncating input to 900 chars (~225 tokens) before passing to the
pipeline. Also wires the quantized index into SearchByEmbedding with
fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
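A guard of this shape would implement the truncation; the function name is hypothetical, and whether the PR truncates by bytes or by runes is not stated — this sketch truncates by runes so a multi-byte UTF-8 character is never split:

```go
package main

import "fmt"

// truncateForEmbedding (hypothetical name) caps input before the embedding
// pipeline so the ~4-chars-per-token estimate stays under MiniLM's
// 256-token limit. Rune-based slicing avoids cutting a UTF-8 sequence.
func truncateForEmbedding(s string, maxRunes int) string {
	r := []rune(s)
	if len(r) <= maxRunes {
		return s
	}
	return string(r[:maxRunes])
}

func main() {
	long := make([]rune, 1200)
	for i := range long {
		long[i] = 'é' // 2-byte rune, to show byte-safe truncation
	}
	out := truncateForEmbedding(string(long), 900)
	fmt.Println(len([]rune(out))) // 900
}
```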
Update memory_cli.go, cycle.go, and diagnose.go to use initEmbeddingRuntime() instead of initRuntime(). CLI commands (remember, recall, consolidate, meta-cycle, dream-cycle, diagnose) now use the same embedding.Provider as the daemon. The diagnose command checks embedding provider health instead of LLM provider health. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backfill endpoint was re-processing the same recent memories on every batch because ListMemories returns newest-first and mode=all did not skip already-correct dimensions. Fixed by paginating through all memories via offset and always skipping correct dimensions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
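The corrected loop amounts to offset pagination plus a dimension check; Memory and backfill here are illustrative stand-ins, not the PR's store types:

```go
package main

import "fmt"

// Memory is a minimal stand-in for a stored memory with its embedding size.
type Memory struct {
	ID  int
	Dim int
}

// backfill walks the whole store by offset (instead of re-reading the
// newest-first page every batch) and skips memories whose embedding
// already has the target dimension.
func backfill(all []Memory, pageSize, targetDim int) (reembedded []int) {
	for offset := 0; offset < len(all); offset += pageSize {
		end := offset + pageSize
		if end > len(all) {
			end = len(all)
		}
		for _, m := range all[offset:end] {
			if m.Dim == targetDim {
				continue // already correct — never re-embed
			}
			reembedded = append(reembedded, m.ID)
		}
	}
	return
}

func main() {
	all := []Memory{{1, 3072}, {2, 384}, {3, 128}, {4, 3072}, {5, 384}}
	fmt.Println(backfill(all, 2, 384)) // → [1 3 4]
}
```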
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive benchmarks for float32 vs TurboQuant search at production scale (34K memories, 384-dim). Results (Ryzen 7 5800X): - Gemini 3072-dim float32: 72.6ms/search - Hugot 384-dim float32: 13.1ms/search (5.5x faster) - TurboQuant 1-bit 384-dim: 2.8ms/search (25.9x faster) - Storage: 400MB → 49MB → 1.7MB (235x compression) TurboQuant recall@10 improved from 26.5% to 53% by increasing candidate multiplier from 4x to 20x. The two-stage retrieval (quantized pre-filter → exact re-rank) compensates for 1-bit precision loss. Quality test: 53% recall@10 vs float32 ground truth (acceptable for pre-filtering, exact re-ranking ensures final accuracy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two optimizations that dramatically reduce recall latency:

1. Cap fan-out to the 15 strongest associations per node during spread
   activation. Hub memories (100-350 links) were causing exponential
   explosion in the traversal. Now follows only the top 15 by strength.
2. Defer Hebbian activation writes to a background goroutine instead of
   writing per-edge during search. This was causing a DB write for
   every association traversed.

Combined with the earlier pruning of 220K dead ingest associations:

Query                      Before     After     Speedup
SQLite FTS5 retrieval      10,675ms   1,887ms   5.7x
Go context timeout         13,822ms   3,670ms   3.8x
SQL query associations     12,305ms   2,477ms   5.0x
nil pointer consolidation   2,257ms   1,133ms   2.0x

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
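The fan-out cap is a top-K selection per node; Assoc and topK are illustrative stand-ins for the PR's edge type, not its actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// Assoc is a stand-in for one association edge in the memory graph.
type Assoc struct {
	Target   string
	Strength float64
}

// topK returns the k strongest associations, so a hub node with
// hundreds of links contributes at most k branches to the traversal.
func topK(edges []Assoc, k int) []Assoc {
	sorted := append([]Assoc(nil), edges...) // copy; leave caller's slice intact
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Strength > sorted[j].Strength })
	if len(sorted) > k {
		sorted = sorted[:k]
	}
	return sorted
}

func main() {
	hub := []Assoc{{"a", 0.2}, {"b", 0.9}, {"c", 0.5}, {"d", 0.7}}
	for _, e := range topK(hub, 2) {
		fmt.Println(e.Target) // prints b, then d
	}
	// The second optimization — deferred Hebbian writes — amounts to sending
	// activation updates into a buffered channel drained by a single
	// background goroutine, rather than issuing a DB write per edge visited.
}
```

With a cap of 15 (the PR's value), a traversal that previously branched 100-350 ways at each hub branches at most 15 ways, turning the worst-case growth from hub-degree exponential into 15^depth.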
Replace PolarQuant (projected 4-bit) with simpler scalar int8
quantization that operates directly on embedding dimensions without a
projection matrix. 3.9x compression, no projection overhead.

More importantly: switch SearchByEmbedding to use float32 as the
primary index (100% recall, 13ms at 34K) instead of quantized (53%
recall). The quantized index is maintained in parallel for future use
at larger scales (100K+).

At 34K memories, the float32 brute-force search is 13ms — not the
bottleneck. The spread-activation fan-out optimization (previous
commit) had far more impact on latency than any quantization scheme.

Quantization options now available:
- QJL 1-bit: 29.5x compression, 53% recall, 2.8ms search
- Scalar int8: 3.9x compression, 42% recall, ~5ms search
- Float32: 1x (baseline), 100% recall, 13ms search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Remove all 12 generative LLM (Complete) calls from 8 cognitive agents
- Replace llm.Provider with the new embedding.Provider interface (Embed/BatchEmbed/Health only)
- Performance measured on production DB, 34K memories
Commits (7)
- 0fae91c — Core: remove all LLM, new embedding package, migrate 10 agents (-1,805 lines)
- cd78932 — RAKE keyword extraction (multi-word phrases)
- 5972933 — Hugot pure Go MiniLM-L6-v2 embeddings (384-dim, zero CGo)
- 9b4b730 — TurboQuant 1-bit vector compression
- 3bc5fe9 — Integrate TurboQuant into store + enhanced backfill
- 5a821f0 — Fix hugot text truncation for long inputs
- 5630a84 — Migrate CLI commands to embedding.Provider
Test plan
- go build ./... — clean
- go test ./... — 21 packages pass, 0 failures
- go vet ./... — clean

Closes #369, closes #370, closes #372. Partially addresses #371 (algorithm done, index integrated).
🤖 Generated with Claude Code