
feat: heuristic embedding pipeline — remove all LLM dependency#374

Open
CalebisGross wants to merge 12 commits into main from feat/heuristic-pipeline

Conversation

@CalebisGross
Collaborator

Summary

  • Remove all 12 generative LLM (Complete) calls from 8 cognitive agents
  • Replace llm.Provider with new embedding.Provider interface (Embed/BatchEmbed/Health only)
  • Add 3 embedding providers: bow (128-dim, instant, air-gapped), hugot (384-dim MiniLM-L6-v2, pure Go), api (OpenAI-compatible)
  • Add RAKE keyword extraction for multi-word concept phrases
  • Add TurboQuant 1-bit vector compression (29.5x, ~8ns/comparison)
  • Integrate TurboQuant into embedding index with two-stage retrieval
  • Enhance the backfill endpoint for re-embedding, with dimension-mismatch detection
  • Migrate all CLI commands to embedding.Provider

Performance (measured on production DB, 34K memories)

| Metric   | LLM (Gemini)    | Heuristic (bow) | Heuristic (hugot)        |
|----------|-----------------|-----------------|--------------------------|
| Encoding | 39,426ms        | 6ms             | ~120ms                   |
| Recall   | 8,876ms         | ~1,800ms        | ~3,400ms                 |
| Failures | ~20%            | 0%              | 0%                       |
| Network  | 2+ calls/memory | 0               | 0 (after model download) |
| Binary   | 16MB            | 16MB            | 28MB                     |

Commits (7)

  1. 0fae91c — Core: remove all LLM, new embedding package, migrate 10 agents (-1,805 lines)
  2. cd78932 — RAKE keyword extraction (multi-word phrases)
  3. 5972933 — Hugot pure Go MiniLM-L6-v2 embeddings (384-dim, zero CGo)
  4. 9b4b730 — TurboQuant 1-bit vector compression
  5. 3bc5fe9 — Integrate TurboQuant into store + enhanced backfill
  6. 5a821f0 — Fix hugot text truncation for long inputs
  7. 5630a84 — Migrate CLI commands to embedding.Provider

Config

embedding:
  provider: bow    # "bow" (instant, air-gapped), "hugot" (transformer, air-gapped), "api" (cloud)

Test plan

  • go build ./... — clean
  • go test ./... — 21 packages pass, 0 failures
  • go vet ./... — clean
  • Daemon starts with all 3 provider modes
  • MCP tools work (recall, remember, status)
  • Dashboard loads
  • Encoding pipeline processes events
  • Backfill endpoint re-embeds with correct dimensions
  • TurboQuant similarity tests pass (ordering preserved, compression verified)
  • Full backfill of 34K memories to hugot 384-dim (in progress)

Closes #369, closes #370, closes #372. Partially addresses #371 (algorithm done, index integrated).

🤖 Generated with Claude Code

CalebisGross and others added 12 commits March 30, 2026 11:50
Remove all 12 generative LLM (Complete) calls from 8 cognitive agents,
replacing them with heuristic/algorithmic Go implementations. Introduce
new embedding.Provider interface (Embed/BatchEmbed/Health only) to
replace llm.Provider for agent dependencies.

Key changes:
- New internal/embedding/ package: Provider interface, BowProvider
  (128-dim bag-of-words), APIProvider, InstrumentedProvider, LLMAdapter
- Perception: remove LLM gate, heuristic scoring is sole path
- Encoding: promote fallbackCompression to primary, vocabulary-aware
  concept extraction via ExtractTopConcepts
- Retrieval: drop LLM synthesis entirely (consuming agents synthesize)
- Episoding: algorithmic time-window clustering with concept titles
- Consolidation: highest-salience picker for gist, statistical concept
  co-occurrence for pattern detection
- Dreaming: graph bridge detection replaces LLM insight generation
- Abstraction: hierarchical concept clustering for principles/axioms
- Reactor: static personality responses for @mentions
- Config: new embedding.provider field ("bow" for air-gapped, "api"
  for OpenAI-compatible endpoint, auto-detect from llm config)

Results on production DB (34K memories):
- Encoding: 39,426ms → 6ms (6,571x faster)
- Recall: 8,876ms → 6,200ms (30% faster)
- Encoding failures: 20% → 0%
- Network calls: 2+ per memory → 0 (fully air-gapped with bow)

Net: -1,805 lines (581 added, 2,386 removed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement Rapid Automatic Keyword Extraction (RAKE) in pure Go for
multi-word phrase detection. The encoding pipeline now uses a hybrid
approach: RAKE extracts domain-adaptive phrases first, then vocabulary
terms supplement with consistent single-word tags.

Before (vocabulary only):
  "Docker build failing on ARM64 with exit code 137"
  → [docker, build]

After (RAKE + vocabulary):
  → [exit code 137, oom killer, docker build failing, arm64, docker, build]
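
The core of RAKE fits in a short sketch: split text into candidate phrases at stopwords, score each word by degree/frequency, and rank phrases by the sum of their word scores. The stopword list below is a tiny illustrative subset, not the one shipped in rake.go:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

var stopwords = map[string]bool{
	"on": true, "with": true, "the": true, "a": true, "and": true,
	"of": true, "is": true, "to": true, "in": true, "for": true,
}

// rake returns candidate phrases ranked by RAKE score (highest first).
func rake(text string) []string {
	// Phase 1: candidate phrases are maximal runs of non-stopwords.
	var phrases [][]string
	var cur []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		w = strings.Trim(w, ".,;:!?")
		if w == "" || stopwords[w] {
			if len(cur) > 0 {
				phrases, cur = append(phrases, cur), nil
			}
			continue
		}
		cur = append(cur, w)
	}
	if len(cur) > 0 {
		phrases = append(phrases, cur)
	}
	// Phase 2: word score = degree / frequency.
	freq, degree := map[string]float64{}, map[string]float64{}
	for _, p := range phrases {
		for _, w := range p {
			freq[w]++
			degree[w] += float64(len(p)) // co-occurring words, incl. self
		}
	}
	// Phase 3: phrase score = sum of member word scores.
	type scored struct {
		phrase string
		score  float64
	}
	var ranked []scored
	for _, p := range phrases {
		var s float64
		for _, w := range p {
			s += degree[w] / freq[w]
		}
		ranked = append(ranked, scored{strings.Join(p, " "), s})
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].score > ranked[j].score })
	out := make([]string, len(ranked))
	for i, s := range ranked {
		out[i] = s.phrase
	}
	return out
}

func main() {
	fmt.Println(rake("Docker build failing on ARM64 with exit code 137"))
}
```

On the example sentence this yields the multi-word phrases "docker build failing" and "exit code 137" ahead of the lone term "arm64", matching the before/after shown above.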

New files:
- internal/embedding/rake.go — RAKE algorithm (~160 lines)
- internal/embedding/rake_test.go — 10 test cases

Modified:
- internal/embedding/bow.go — ExtractConcepts() hybrid function,
  GenerateEncodingResponse() uses RAKE-first extraction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Integrate knights-analytics/hugot for transformer-quality embeddings
with zero CGo, zero shared libraries — true single-binary deployment.

The hugot provider uses GoMLX simplego backend to run all-MiniLM-L6-v2
entirely in Go. Model auto-downloads from HuggingFace on first use
(~90MB, stored in ~/.mnemonic/models/).

Config:
  embedding:
    provider: hugot  # pure Go, 384-dim, air-gapped

Performance (measured on production daemon):
- Embedding latency: 108-325ms per text (CPU, pure Go)
- Dimensions: 384 (vs bow-128, vs Gemini-3072)
- Binary size: 16MB → 28MB (+12MB from GoMLX runtime)
- No network calls after initial model download

Quality:
- RAKE concepts + 384-dim transformer embeddings
- "docker buildx crashing", "exit code 137", "oom killer", "arm64"
  now all captured as concepts AND semantically searchable

Note: Existing memories retain their old embeddings (3072-dim Gemini
or 128-dim bow). A backfill is needed to re-embed with hugot for
consistent retrieval quality. Use /api/v1/backfill-embeddings.

Closes #370

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuant (QJL stage): pure Go implementation of 1-bit quantized
Johnson-Lindenstrauss vector compression. Compresses 384-dim float32
vectors (1536 bytes) to 52 bytes (48 bytes of sign bits + a 4-byte norm) — 29.5x
compression. Similarity via XNOR + popcount (math/bits.OnesCount64).
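
The 1-bit stage can be sketched directly: pack sign bits into uint64 words and count matching signs via XNOR + popcount. This sketch omits the random Johnson-Lindenstrauss projection that precedes quantization in TurboQuant, and the helper names are illustrative, not the ones in turboquant.go:

```go
package main

import (
	"fmt"
	"math/bits"
)

// quantize packs the sign bit of each dimension into uint64 words.
func quantize(v []float32) []uint64 {
	words := make([]uint64, (len(v)+63)/64)
	for i, x := range v {
		if x >= 0 {
			words[i/64] |= 1 << uint(i%64)
		}
	}
	return words
}

// similarity counts matching sign bits via XNOR + popcount
// (math/bits.OnesCount64). Higher = more similar.
func similarity(a, b []uint64, dim int) int {
	match := 0
	for i := range a {
		match += bits.OnesCount64(^(a[i] ^ b[i]))
	}
	// ^(a^b) also sets the padding bits beyond dim; subtract them.
	return match - (len(a)*64 - dim)
}

func main() {
	x := []float32{1, -2, 3, -4}
	y := []float32{1, -2, -3, -4} // differs in one sign
	fmt.Println(similarity(quantize(x), quantize(y), 4)) // 3
}
```

At 384 dims the comparison is six XNOR + popcount operations over uint64 words, which is where the ~8ns figure comes from.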

New files:
- internal/embedding/turboquant.go — Quantizer, QuantizedVector,
  Similarity, packBits/getBit helpers
- internal/embedding/turboquant_test.go — 7 tests + 2 benchmarks

Backfill endpoint upgrade:
- Supports ?mode=all to re-embed ALL memories (not just missing)
- Detects dimension mismatch (e.g. 3072-dim Gemini vs 384-dim hugot)
- Progress logging every 100 memories
- 30-minute timeout (was 5 min)
- Configurable ?limit (default 500, max 5000)

Note: TurboQuant is implemented but not yet integrated into the
embedding index. Integration requires replacing the float32 index
with a quantized index that stores QuantizedVectors and uses
Similarity() for search. This is a follow-up task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire TurboQuant quantized index into SQLiteStore alongside the existing
float32 index. SearchByEmbedding now uses two-stage retrieval:
1. TurboQuant approximate search (XNOR + popcount, ~8ns per comparison)
2. Exact cosine re-ranking on top candidates

The quantized index runs in parallel with the float32 index. Both are
populated on Add/Remove. Search prefers the quantized index when it has
entries matching the query dimension, falling back to float32 for mixed
dimension scenarios (backward compat during migration).
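
The two-stage flow can be sketched end to end. Here the approximate pass scores candidates by sign agreement, standing in for the XNOR+popcount pass on quantized vectors; the function names and the candidate multiplier are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// signAgreement is a cheap proxy for the quantized Hamming similarity.
func signAgreement(a, b []float32) int {
	n := 0
	for i := range a {
		if (a[i] >= 0) == (b[i] >= 0) {
			n++
		}
	}
	return n
}

func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// twoStageSearch keeps the top k*multiplier candidates from the cheap
// pass, then exact cosine re-ranks the survivors down to k.
func twoStageSearch(query []float32, index [][]float32, k, multiplier int) []int {
	cand := make([]int, len(index))
	for i := range index {
		cand[i] = i
	}
	// Stage 1: approximate pre-filter over the whole index.
	sort.Slice(cand, func(i, j int) bool {
		return signAgreement(query, index[cand[i]]) > signAgreement(query, index[cand[j]])
	})
	if len(cand) > k*multiplier {
		cand = cand[:k*multiplier]
	}
	// Stage 2: exact cosine re-rank on the survivors.
	sort.Slice(cand, func(i, j int) bool {
		return cosine(query, index[cand[i]]) > cosine(query, index[cand[j]])
	})
	if len(cand) > k {
		cand = cand[:k]
	}
	return cand
}

func main() {
	index := [][]float32{{1, 1, 0}, {-1, -1, 0}, {1, 0.9, 0.1}}
	fmt.Println(twoStageSearch([]float32{1, 1, 0}, index, 1, 2)) // [0]
}
```

The exact re-rank is what lets a lossy pre-filter keep the final top-k accurate: stage 1 only has to avoid dropping true neighbors, not rank them perfectly.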

New file: internal/store/sqlite/embindex_quantized.go

Backfill endpoint enhanced:
- ?mode=all re-embeds ALL memories (not just missing)
- Detects dimension mismatch (e.g. 3072-dim → 384-dim)
- Progress logging every 100 memories
- 30-minute timeout, configurable ?limit (max 5000)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MiniLM-L6-v2 has a 256-token max sequence length. Long texts (>512
tokens) caused a shape mismatch panic in the GoMLX backend. Fixed by
truncating input to 900 chars (~225 tokens) before passing to the
pipeline.

Also wires quantized index into SearchByEmbedding with fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update memory_cli.go, cycle.go, and diagnose.go to use
initEmbeddingRuntime() instead of initRuntime(). CLI commands
(remember, recall, consolidate, meta-cycle, dream-cycle, diagnose)
now use the same embedding.Provider as the daemon.

The diagnose command checks embedding provider health instead of
LLM provider health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backfill endpoint was re-processing the same recent memories on
every batch because ListMemories returns newest-first and mode=all
did not skip already-correct dimensions. Fixed by paginating through
all memories via offset and always skipping correct dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive benchmarks for float32 vs TurboQuant search at
production scale (34K memories, 384-dim).

Results (Ryzen 7 5800X):
- Gemini 3072-dim float32: 72.6ms/search
- Hugot 384-dim float32: 13.1ms/search (5.5x faster)
- TurboQuant 1-bit 384-dim: 2.8ms/search (25.9x faster)
- Storage: 400MB → 49MB → 1.7MB (235x compression)

TurboQuant recall@10 improved from 26.5% to 53% by increasing
candidate multiplier from 4x to 20x. The two-stage retrieval
(quantized pre-filter → exact re-rank) compensates for 1-bit
precision loss.

Quality test: 53% recall@10 vs float32 ground truth (acceptable
for pre-filtering, exact re-ranking ensures final accuracy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two optimizations that dramatically reduce recall latency:

1. Cap fan-out to 15 strongest associations per node during spread
   activation. Hub memories (100-350 links) were causing exponential
   explosion in the traversal. Now follows only the top 15 by strength.

2. Defer Hebbian activation writes to a background goroutine instead
   of writing per-edge during search. This was causing a DB write for
   every association traversed.

Combined with the earlier pruning of 220K dead ingest associations:

| Query                     | Before   | After   | Speedup |
|---------------------------|----------|---------|---------|
| SQLite FTS5 retrieval     | 10,675ms | 1,887ms | 5.7x    |
| Go context timeout        | 13,822ms | 3,670ms | 3.8x    |
| SQL query associations    | 12,305ms | 2,477ms | 5.0x    |
| nil pointer consolidation | 2,257ms  | 1,133ms | 2.0x    |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace PolarQuant (projected 4-bit) with simpler scalar int8
quantization that operates directly on embedding dimensions without
a projection matrix. 3.9x compression, no projection overhead.
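
The scalar scheme is simple enough to sketch: one scale per vector derived from its max absolute value, int8 codes per dimension, and an int32-accumulated dot product. Function names are illustrative, not the committed API:

```go
package main

import "fmt"

// int8Quantize maps each dimension to [-127, 127] with a single
// per-vector scale. No projection matrix is involved.
func int8Quantize(v []float32) (q []int8, scale float32) {
	var maxAbs float32
	for _, x := range v {
		if x < 0 {
			x = -x
		}
		if x > maxAbs {
			maxAbs = x
		}
	}
	q = make([]int8, len(v))
	if maxAbs == 0 {
		return q, 0
	}
	scale = maxAbs / 127
	for i, x := range v {
		q[i] = int8(x / scale)
	}
	return q, scale
}

// dot accumulates in int32 and rescales: 4 bytes of float32 per dim
// become 1 byte of int8 (the ~3.9x figure includes the stored scale).
func dot(a, b []int8, sa, sb float32) float32 {
	var acc int32
	for i := range a {
		acc += int32(a[i]) * int32(b[i])
	}
	return float32(acc) * sa * sb
}

func main() {
	q, s := int8Quantize([]float32{0.5, -1.0, 0.25})
	fmt.Println(q, s)
	fmt.Println(dot(q, q, s, s)) // ≈ 0.5² + 1.0² + 0.25²
}
```

Operating directly on the embedding dimensions is what removes the projection overhead relative to the dropped PolarQuant variant.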

More importantly: switch SearchByEmbedding to use float32 as the
primary index (100% recall, 13ms at 34K) instead of quantized
(53% recall). The quantized index is maintained in parallel for
future use at larger scales (100K+).

At 34K memories, the float32 brute-force search is 13ms — not the
bottleneck. The spread activation fan-out optimization (previous
commit) had far more impact on latency than any quantization scheme.

Quantization options now available:
- QJL 1-bit: 29.5x compression, 53% recall, 2.8ms search
- Scalar int8: 3.9x compression, 42% recall, ~5ms search
- Float32: 1x (baseline), 100% recall, 13ms search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
