RAKE/YAKE keyword extraction for concept pipeline #372
Open
Summary
Replace the basic word-frequency concept extraction with RAKE (Rapid Automatic Keyword Extraction) or YAKE (Yet Another Keyword Extractor) implemented in pure Go. This improves concept quality for encoding, pattern detection, and retrieval without requiring an LLM.
Motivation
Current concept extraction (embedding.ExtractTopConcepts) uses a fixed vocabulary with word counting:
- Only recognizes ~130 predefined terms
- Unknown words are ignored (or hashed for embeddings)
- No phrase detection ("spread activation" extracted as two separate words)
- No statistical weighting (term frequency, position, co-occurrence)
RAKE/YAKE would provide:
- Multi-word phrases: "spread activation", "bag of words", "air gapped"
- Statistical ranking: Terms weighted by frequency, position, and co-occurrence
- Domain-adaptive: No fixed vocabulary needed — learns from content
- Pure Go: No external dependencies, microsecond latency
Implementation Plan
Option A: RAKE (simpler)
- Split text on stop words → candidate phrases
- Score each phrase by word frequency, word degree, and word score
- Return top-N phrases ranked by score
- ~150 lines of Go + stop word list
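The RAKE option above can be sketched roughly as follows. This is a minimal illustration of the classic algorithm (split on stop words, score words by degree/frequency, sum per phrase), not the proposed `internal/embedding/keywords.go`; the tiny stop word list and the `ExtractKeywords` signature here are placeholders matching the issue's description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Tiny illustrative stop word list; a real implementation would ship a full one.
var stopWords = map[string]bool{
	"a": true, "an": true, "the": true, "of": true, "for": true,
	"and": true, "or": true, "is": true, "in": true, "to": true,
	"with": true, "on": true, "as": true, "by": true,
}

// candidatePhrases splits lowercased text on stop words, returning contiguous
// runs of content words. (A full implementation would also break phrases at
// punctuation; here punctuation is simply stripped by the tokenizer.)
func candidatePhrases(text string) [][]string {
	var phrases [][]string
	var current []string
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !('a' <= r && r <= 'z' || '0' <= r && r <= '9' || r == '-')
	})
	for _, w := range words {
		if stopWords[w] {
			if len(current) > 0 {
				phrases = append(phrases, current)
				current = nil
			}
			continue
		}
		current = append(current, w)
	}
	if len(current) > 0 {
		phrases = append(phrases, current)
	}
	return phrases
}

// ExtractKeywords ranks candidate phrases by the classic RAKE score:
// for each word, degree(word)/frequency(word), summed over the phrase.
func ExtractKeywords(text string, n int) []string {
	phrases := candidatePhrases(text)
	freq := map[string]int{}
	degree := map[string]int{}
	for _, p := range phrases {
		for _, w := range p {
			freq[w]++
			degree[w] += len(p) // co-occurrences within the phrase, including itself
		}
	}
	type scored struct {
		phrase string
		score  float64
	}
	var ranked []scored
	seen := map[string]bool{}
	for _, p := range phrases {
		s := 0.0
		for _, w := range p {
			s += float64(degree[w]) / float64(freq[w])
		}
		key := strings.Join(p, " ")
		if !seen[key] {
			seen[key] = true
			ranked = append(ranked, scored{key, s})
		}
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].score > ranked[j].score })
	if n > len(ranked) {
		n = len(ranked)
	}
	out := make([]string, 0, n)
	for _, r := range ranked[:n] {
		out = append(out, r.phrase)
	}
	return out
}

func main() {
	text := "Spread activation improves retrieval. The spread activation model weights concepts by co-occurrence."
	fmt.Println(ExtractKeywords(text, 3)) // multi-word phrases ranked first
}
```

Note how "spread activation" survives as part of a phrase rather than being split into two tokens, which is exactly the gap in the current vocabulary counter.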
Option B: YAKE (better quality)
- Statistical features: term frequency, position, sentence context, co-occurrence
- No training needed — unsupervised
- Better at handling technical text
- ~300 lines of Go
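To make the YAKE option concrete, here is a heavily simplified sketch of its statistical-feature idea: score each term from its frequency and first position, where lower scores mean more important (as in YAKE). This is an assumption-laden illustration only; real YAKE also uses casing, sentence context, co-occurrence dispersion, stop word handling, and phrase composition.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// yakeishScores assigns each term a score from two of YAKE's features:
// first position (earlier is better) and term frequency (more is better).
// Lower score = more important. Simplified sketch, not the full algorithm.
func yakeishScores(text string) map[string]float64 {
	words := strings.Fields(strings.ToLower(text))
	freq := map[string]int{}
	firstPos := map[string]int{}
	for i, raw := range words {
		w := strings.Trim(raw, ".,!?;:")
		if w == "" {
			continue
		}
		if _, ok := firstPos[w]; !ok {
			firstPos[w] = i
		}
		freq[w]++
	}
	scores := map[string]float64{}
	for w, f := range freq {
		// Position feature with the ln(ln(3 + position)) shape used by YAKE:
		// grows slowly, so late first occurrences are penalized gently.
		pos := math.Log(math.Log(3 + float64(firstPos[w])))
		scores[w] = pos / float64(f)
	}
	return scores
}

func main() {
	s := yakeishScores("keyword extraction ranks keyword phrases by keyword statistics")
	fmt.Printf("keyword=%.3f statistics=%.3f\n", s["keyword"], s["statistics"])
}
```

A frequent, early term like "keyword" ends up with a much lower (better) score than a rare, late term, which is the behavior that makes YAKE robust on technical text without training.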
Integration
- New file: `internal/embedding/keywords.go` with `ExtractKeywords(text string, n int) []string` — returns ranked keyword phrases
- Update `GenerateEncodingResponse()` to use RAKE/YAKE instead of vocabulary counting
- Keep `ExtractTopConcepts()` as fallback for backward compatibility
- Concept vocabulary still used for synonym grouping, not as the extraction source
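The integration could be wired roughly as below. The `EncodingConfig` field name and the stand-in extractor functions are hypothetical, not the repository's actual API; they only show the config-driven dispatch with the vocabulary path as the backward-compatible default.

```go
package main

import "fmt"

// Hypothetical config mirroring the `keyword_extractor` setting in this issue.
type EncodingConfig struct {
	KeywordExtractor string // "vocabulary" (current default), "rake", "yake"
}

// Stand-ins for the real extractors; canned results keep the sketch self-contained.
func rakeKeywords(text string, n int) []string  { return []string{"spread activation"} }
func vocabularyTop(text string, n int) []string { return []string{"activation"} }

// extractConcepts selects the extractor from config, falling back to the
// existing vocabulary counter so older configs keep working unchanged.
func extractConcepts(cfg EncodingConfig, text string, n int) []string {
	switch cfg.KeywordExtractor {
	case "rake", "yake":
		return rakeKeywords(text, n)
	default: // "vocabulary" or unset
		return vocabularyTop(text, n)
	}
}

func main() {
	cfg := EncodingConfig{KeywordExtractor: "rake"}
	fmt.Println(extractConcepts(cfg, "spread activation in memory retrieval", 5))
}
```

Treating the unset case as "vocabulary" is what makes the change backward compatible per the acceptance criteria.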
Config
```yaml
encoding:
  keyword_extractor: "rake"  # "vocabulary" (current), "rake", "yake"
```
Acceptance Criteria
- Extracts multi-word phrases (not just single tokens)
- Pure Go, no external dependencies
- <1ms per extraction on typical memory content
- Improves pattern detection quality (more specific concepts = fewer false positives)
- `go test ./internal/embedding/...` passes
- Backward compatible — `vocabulary` mode still works
References
- RAKE paper: Rose et al., "Automatic Keyword Extraction from Individual Documents"
- YAKE paper: Campos et al., "YAKE! Keyword Extraction from Single Documents"
- Parent: Remove LLM dependency from cognitive pipeline — heuristic-first architecture #369