RAKE/YAKE keyword extraction for concept pipeline #372

@CalebisGross

Summary

Replace the basic word-frequency concept extraction with RAKE (Rapid Automatic Keyword Extraction) or YAKE (Yet Another Keyword Extractor) implemented in pure Go. This improves concept quality for encoding, pattern detection, and retrieval without requiring an LLM.

Motivation

Current concept extraction (embedding.ExtractTopConcepts) uses a fixed vocabulary with word counting:

  • Only recognizes ~130 predefined terms
  • Unknown words are ignored (or hashed for embeddings)
  • No phrase detection ("spread activation" extracted as two separate words)
  • No statistical weighting (term frequency, position, co-occurrence)

RAKE/YAKE would provide:

  • Multi-word phrases: "spread activation", "bag of words", "air gapped"
  • Statistical ranking: Terms weighted by frequency, position, and co-occurrence
  • Domain-adaptive: No fixed vocabulary needed — learns from content
  • Pure Go: No external dependencies, expected sub-millisecond latency per extraction

Implementation Plan

Option A: RAKE (simpler)

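  1. Split text on stop words and punctuation → candidate phrases
  2. Score each word as degree / frequency (degree = total length of the phrases the word appears in), then score each phrase as the sum of its word scores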
  3. Return top-N phrases ranked by score
  4. ~150 lines of Go + stop word list
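
To make the Option A steps concrete, here is a minimal sketch of RAKE in Go. It is only an illustration under assumptions: the stop word list is a tiny placeholder, the tokenizer (letters, digits, hyphens and apostrophes form words; any other punctuation ends a phrase) is my own choice, and ties are broken alphabetically so output is deterministic. The name ExtractKeywords matches the signature proposed under Integration, but nothing here is the project's actual code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// stopWords is a tiny placeholder list (assumption); a real list would be far longer.
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "as": true, "by": true, "for": true,
	"in": true, "is": true, "it": true, "of": true, "on": true, "or": true,
	"that": true, "the": true, "this": true, "to": true, "with": true,
}

// candidatePhrases returns runs of consecutive non-stop words.
// Both stop words and punctuation end the current phrase.
func candidatePhrases(text string) [][]string {
	var phrases [][]string
	var current []string
	var word strings.Builder
	flush := func() {
		if len(current) > 0 {
			phrases = append(phrases, current)
			current = nil
		}
	}
	endWord := func() {
		if word.Len() == 0 {
			return
		}
		w := word.String()
		word.Reset()
		if stopWords[w] {
			flush()
		} else {
			current = append(current, w)
		}
	}
	for _, r := range strings.ToLower(text) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '-' || r == '\'':
			word.WriteRune(r)
		case unicode.IsSpace(r):
			endWord()
		default: // punctuation closes the word and the phrase
			endWord()
			flush()
		}
	}
	endWord()
	flush()
	return phrases
}

// ExtractKeywords returns up to n phrases ranked by RAKE score.
func ExtractKeywords(text string, n int) []string {
	phrases := candidatePhrases(text)
	freq := map[string]float64{}
	degree := map[string]float64{}
	for _, p := range phrases {
		for _, w := range p {
			freq[w]++
			degree[w] += float64(len(p)) // co-occurrences in this phrase, counting the word itself
		}
	}
	scores := map[string]float64{}
	for _, p := range phrases {
		var s float64
		for _, w := range p {
			s += degree[w] / freq[w]
		}
		scores[strings.Join(p, " ")] = s // identical phrases get identical scores
	}
	ranked := make([]string, 0, len(scores))
	for phrase := range scores {
		ranked = append(ranked, phrase)
	}
	sort.Slice(ranked, func(i, j int) bool {
		if scores[ranked[i]] != scores[ranked[j]] {
			return scores[ranked[i]] > scores[ranked[j]]
		}
		return ranked[i] < ranked[j] // alphabetical tie-break (assumption)
	})
	if n < 0 {
		n = 0
	}
	if n < len(ranked) {
		ranked = ranked[:n]
	}
	return ranked
}

func main() {
	text := "Spread activation is used for retrieval of memories in the air-gapped deployment."
	fmt.Printf("%q\n", ExtractKeywords(text, 2))
	// prints ["air-gapped deployment" "spread activation"]
}
```

Because phrase scores are sums, plain RAKE favors longer phrases. If that turns out too aggressive on real memory content, a cap on phrase length would be one possible tweak.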

Option B: YAKE (better quality)

  1. Score candidate n-grams from per-document statistical features: casing, term frequency, position, sentence spread, and context (co-occurrence)
  2. No training needed — unsupervised
  3. Better at handling technical text
  4. ~300 lines of Go

Integration

  • New file: internal/embedding/keywords.go
  • ExtractKeywords(text string, n int) []string — returns ranked keyword phrases
  • Update GenerateEncodingResponse() to use RAKE/YAKE instead of vocabulary counting
  • Keep ExtractTopConcepts() as fallback for backward compat
  • Concept vocabulary still used for synonym grouping, not as the extraction source

Config

encoding:
  keyword_extractor: "rake"  # "vocabulary" (current), "rake", "yake"
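
One way the keyword_extractor value could be mapped to an implementation, sketched under assumptions: ResolveExtractor, KeywordExtractor and the stub extractors are hypothetical names for illustration only, and treating an empty value as "vocabulary" is my guess at what backward compatibility would require. In the real code the stubs would be replaced by the existing vocabulary path and the new RAKE/YAKE functions.

```go
package main

import (
	"fmt"
	"strings"
)

// KeywordExtractor is a hypothetical common signature, mirroring the
// proposed ExtractKeywords(text string, n int) []string.
type KeywordExtractor func(text string, n int) []string

// stub returns a placeholder extractor that only reports which mode was
// selected. It stands in for the real implementations (assumption).
func stub(label string) KeywordExtractor {
	return func(text string, n int) []string {
		return []string{label + ":stub"}
	}
}

// extractors maps the accepted config values to implementations.
var extractors = map[string]KeywordExtractor{
	"vocabulary": stub("vocabulary"), // would wrap the existing ExtractTopConcepts path
	"rake":       stub("rake"),
	"yake":       stub("yake"),
}

// ResolveExtractor looks up the extractor for a keyword_extractor value.
// An empty value falls back to "vocabulary" (assumed default).
func ResolveExtractor(name string) (KeywordExtractor, error) {
	name = strings.ToLower(strings.TrimSpace(name))
	if name == "" {
		name = "vocabulary"
	}
	ex, ok := extractors[name]
	if !ok {
		return nil, fmt.Errorf("unknown keyword_extractor %q (want vocabulary, rake, or yake)", name)
	}
	return ex, nil
}

func main() {
	for _, name := range []string{"rake", "", "bert"} {
		ex, err := ResolveExtractor(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ex("some memory content", 5))
	}
}
```

Rejecting unknown values at config load time, instead of silently falling back, would make a typo such as "rakee" visible immediately.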

Acceptance Criteria

  • Extracts multi-word phrases (not just single tokens)
  • Pure Go, no external dependencies
  • <1ms per extraction on typical memory content
  • Improves pattern detection quality (more specific concepts = fewer false positives)
  • go test ./internal/embedding/... passes
  • Backward compatible — vocabulary mode still works
