Pure Go word2vec/hugot embedding upgrade path #373

@CalebisGross

Summary

Research and prototype higher-quality pure Go embedding options that don't require ONNX Runtime or any shared library — keeping the single-binary, zero-dependency promise.

Motivation

The embedding quality ladder for mnemonic:

  1. bow-128 (current) — zero deps, fast, coarse
  2. ONNX MiniLM (#370, "Embedded ONNX embedding provider (MiniLM-L6-v2 INT8)"): great quality, requires a shared library
  3. This issue — middle ground: better than bow, no shared library

Two promising approaches emerged from research:

Option A: Word2Vec (50K words, 100d)

  • Library: github.com/sajari/word2vec (pure Go, loads binary Word2Vec models)
  • Ship a pruned 50K-word, 100d model (~20MB, go:embed feasible)
  • Sentence embedding: average word vectors
  • Quality: ~58-65 STS Spearman (vs bow ~40, transformer ~84)
  • Speed: 10K-50K embeddings/sec
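
The averaging step in Option A can be sketched in plain Go, independent of any loader library. This is a minimal illustration under stated assumptions, not the proposed implementation: the `Vectors` map, `EmbedSentence`, and `Cosine` are hypothetical names, tokenization is a naive whitespace split, and the toy 3d table stands in for the pruned model. For scale, 50,000 words × 100 dimensions × 4 bytes (float32) is 20,000,000 bytes, which matches the ~20MB estimate if vectors are stored uncompressed.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Vectors maps a lowercase word to its embedding. In the real design this
// would be backed by the pruned 50K-word, 100d model; here it is a toy table.
type Vectors map[string][]float32

// EmbedSentence averages the vectors of all in-vocabulary words and
// L2-normalizes the result. ok is false when no word is in the vocabulary.
func EmbedSentence(v Vectors, dim int, text string) (emb []float32, ok bool) {
	sum := make([]float64, dim)
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		vec, found := v[w]
		if !found || len(vec) != dim {
			continue // skip out-of-vocabulary words
		}
		for i, x := range vec {
			sum[i] += float64(x)
		}
		n++
	}
	if n == 0 {
		return nil, false
	}
	var norm float64
	for i := range sum {
		sum[i] /= float64(n) // mean of the word vectors
		norm += sum[i] * sum[i]
	}
	norm = math.Sqrt(norm)
	emb = make([]float32, dim)
	if norm == 0 {
		return emb, true // vectors cancelled out; return the zero vector
	}
	for i := range sum {
		emb[i] = float32(sum[i] / norm)
	}
	return emb, true
}

// Cosine returns the dot product of two embeddings of equal length.
// For L2-normalized inputs this equals cosine similarity.
func Cosine(a, b []float32) float64 {
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return dot
}

func main() {
	toy := Vectors{
		"go":     {1, 0, 0},
		"golang": {0.9, 0.1, 0},
		"binary": {0, 1, 0},
		"cat":    {0, 0, 1},
	}
	a, _ := EmbedSentence(toy, 3, "Go binary")
	b, _ := EmbedSentence(toy, 3, "golang binary")
	c, _ := EmbedSentence(toy, 3, "cat")
	fmt.Printf("sim(a,b)=%.3f sim(a,c)=%.3f\n", Cosine(a, b), Cosine(a, c))
}
```

Plain averaging ignores word order and weights every token equally, which is one likely reason this approach sits well below transformer quality in the estimates above.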

Option B: hugot pure Go transformers

  • Library: github.com/knights-analytics/hugot
  • Offers a pure Go inference backend for ONNX models (no CGo, no shared library)
  • Can run MiniLM-L6-v2 entirely in Go
  • Quality: transformer-level (~84 STS)
  • Speed: slower than the native ONNX Runtime, but functional
  • Maturity: newer, less battle-tested

Research Tasks

  • Benchmark sajari/word2vec with a pruned GloVe/fastText model (converted to binary Word2Vec format) on mnemonic's retrieval benchmark
  • Benchmark knights-analytics/hugot pure Go backend with MiniLM-L6-v2
  • Measure: latency, memory footprint, binary size impact, retrieval quality (nDCG@5)
  • Compare both against bow-128 baseline and ONNX MiniLM
  • Determine if either is production-ready for mnemonic
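
For the retrieval-quality measurement, nDCG@5 for a single query could be computed along these lines. This is a sketch only: it uses the linear-gain form DCG@k = Σ rel_i / log2(i+1), which is an assumption since the issue does not say which nDCG variant mnemonic's benchmark uses, and it derives the ideal ranking from the same graded list, so it assumes every judged document for the query appears in `ranked`. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dcg computes discounted cumulative gain over the first k relevance grades.
func dcg(rels []float64, k int) float64 {
	if k > len(rels) {
		k = len(rels)
	}
	var s float64
	for i := 0; i < k; i++ {
		// Index i is rank i+1, discounted by log2(rank+1).
		s += rels[i] / math.Log2(float64(i)+2)
	}
	return s
}

// NDCG returns nDCG@k for one query. ranked holds the relevance grade of
// each result in the order the system returned it. The ideal ordering is
// the same grades sorted descending.
func NDCG(ranked []float64, k int) float64 {
	ideal := append([]float64(nil), ranked...)
	sort.Sort(sort.Reverse(sort.Float64Slice(ideal)))
	idcg := dcg(ideal, k)
	if idcg == 0 {
		return 0 // no relevant documents for this query
	}
	return dcg(ranked, k) / idcg
}

func main() {
	// Hypothetical grades for one query: 2 = highly relevant, 1 = relevant.
	ranked := []float64{1, 0, 2, 0, 0}
	fmt.Printf("nDCG@5 = %.4f\n", NDCG(ranked, 5))
}
```

Averaging this value over all benchmark queries gives one number per provider, which makes the comparison against bow-128 and ONNX MiniLM direct.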

Decision Criteria

Pick the approach that best satisfies:

  1. Single binary (no shared libraries)
  2. <10ms embedding latency on CPU
  3. <50MB binary size increase
  4. Measurable retrieval quality improvement over bow-128
  5. Cross-platform (Linux, macOS ARM, Windows)
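
Criterion 2 could be checked with a small timing harness like the one below. It is a sketch under assumptions: the `Embedder` interface is hypothetical (mnemonic's actual provider interface may differ), `fakeEmbedder` is a stand-in for a real provider, and comparing the 95th percentile against the budget is a choice made here, since the issue states the 10ms target without naming a percentile.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Embedder is a hypothetical interface used only for this sketch.
type Embedder interface {
	Embed(text string) ([]float32, error)
}

// MeasureLatencyMs times one Embed call per text and returns the
// per-call latencies in milliseconds.
func MeasureLatencyMs(e Embedder, texts []string) ([]float64, error) {
	out := make([]float64, 0, len(texts))
	for _, t := range texts {
		start := time.Now()
		if _, err := e.Embed(t); err != nil {
			return nil, err
		}
		out = append(out, time.Since(start).Seconds()*1000)
	}
	return out, nil
}

// Percentile returns the p-th percentile (0 < p <= 100) of samples
// using the nearest-rank method. It does not modify samples.
func Percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := int(math.Ceil(p / 100 * float64(len(s))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(s) {
		rank = len(s)
	}
	return s[rank-1]
}

// MeetsBudget reports whether the 95th-percentile latency is under budgetMs.
func MeetsBudget(samples []float64, budgetMs float64) bool {
	return len(samples) > 0 && Percentile(samples, 95) < budgetMs
}

// fakeEmbedder stands in for a real provider so the harness runs on its own.
type fakeEmbedder struct{}

func (fakeEmbedder) Embed(string) ([]float32, error) {
	return make([]float32, 128), nil
}

func main() {
	texts := []string{"first memory", "second memory", "third memory"}
	lat, err := MeasureLatencyMs(fakeEmbedder{}, texts)
	if err != nil {
		fmt.Println("embed failed:", err)
		return
	}
	fmt.Printf("p50=%.4fms p95=%.4fms under 10ms: %v\n",
		Percentile(lat, 50), Percentile(lat, 95), MeetsBudget(lat, 10))
}
```

A real run would need warm-up iterations and a representative text corpus; Go's `testing.B` benchmarks are another reasonable way to collect the same numbers.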
