Pure Go word2vec/hugot embedding upgrade path #373
Labels: component:llm (LLM provider layer), enhancement (New feature or request)
Description
Summary
Research and prototype higher-quality pure Go embedding options that require neither ONNX Runtime nor any shared library, preserving the single-binary, zero-dependency promise.
Motivation
The embedding quality ladder for mnemonic:
- bow-128 (current) — zero deps, fast, coarse
- ONNX MiniLM (Embedded ONNX embedding provider (MiniLM-L6-v2 INT8) #370) — great quality, requires shared library
- This issue — middle ground: better than bow, no shared library
Two promising approaches emerged from research:
Option A: Word2Vec (50K words, 100d)
- Library: `github.com/sajari/word2vec` (pure Go, loads binary word2vec models)
- Ship a pruned 50K-word, 100d model (~20MB, `go:embed` feasible)
- Sentence embedding: average word vectors
- Quality: ~58-65 STS Spearman (vs bow ~40, transformer ~84)
- Speed: 10K-50K embeddings/sec
Option B: hugot pure Go transformers
- Library: `github.com/knights-analytics/hugot`
- Pure Go ONNX runtime backend (no CGo, no shared library)
- Can run MiniLM-L6-v2 entirely in Go
- Quality: transformer-level (~84 STS)
- Speed: slower than C ONNX Runtime, but functional
- Maturity: newer, less battle-tested
Research Tasks
- Benchmark `sajari/word2vec` with a pruned GloVe/fastText model on mnemonic's retrieval benchmark
- Benchmark the `knights-analytics/hugot` pure Go backend with MiniLM-L6-v2
- Measure: latency, memory footprint, binary size impact, retrieval quality (nDCG@5)
- Compare both against bow-128 baseline and ONNX MiniLM
- Determine if either is production-ready for mnemonic
Decision Criteria
Pick the approach that best satisfies:
- Single binary (no shared libraries)
- <10ms embedding latency on CPU
- <50MB binary size increase
- Measurable retrieval quality improvement over bow-128
- Cross-platform (Linux, macOS ARM, Windows)
References
- sajari/word2vec: https://github.com/sajari/word2vec
- knights-analytics/hugot: https://github.com/knights-analytics/hugot
- GloVe pretrained: https://nlp.stanford.edu/projects/glove/
- Parent: Remove LLM dependency from cognitive pipeline — heuristic-first architecture #369