Directed acyclic context graph for LLM context management — tag-based retrieval replacing linear sliding windows.
Status: Phase 3 complete — native OpenClaw plugin live; shadow memory injection in validation (targeting MEMORY.md integration). Recent fixes: envelope stripping, IDF tag filtering, /quality health endpoint.
Standard LLM context management is temporal (flat sliding window). Compaction blends unrelated topics into noise, and old-but-relevant context gets lost while recent-but-irrelevant context takes up token budget. Users waste tokens re-establishing context that should already be available.
Every message/response pair is tagged with contextual labels. Context assembly pulls from two layers:
- Recency layer (25% of budget) — most recent messages regardless of tag
- Topic layer (75% of budget) — messages retrieved by inferred tags for the incoming message, deduplicated against the recency layer
The underlying structure is a DAG (directed acyclic graph): time-ordered, multi-tag membership, no cycles. The graph grows continuously and is never discarded.
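The two-layer assembly described above can be sketched roughly as follows. This is a minimal illustration, not the code in assembler.py: the `Message` class, the `assemble` function, and the greedy newest-first packing are all assumptions made for the example. Only the 25/75 split, the deduplication against the recency layer, and the oldest-first ordering come from the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical record type; ids are assumed to increase with time.
    id: int
    text: str
    tokens: int
    tags: frozenset = field(default_factory=frozenset)


def assemble(history, query_tags, budget=4000, recency_share=0.25):
    """Pick messages for the context window under a token budget."""
    recency_budget = int(budget * recency_share)
    topic_budget = budget - recency_budget

    # Recency layer: walk back from the newest message until the share is spent.
    recency, used = [], 0
    for msg in reversed(history):
        if used + msg.tokens > recency_budget:
            break
        recency.append(msg)
        used += msg.tokens

    # Topic layer: newest-first matches on any inferred tag,
    # deduplicated against what the recency layer already holds.
    seen = {m.id for m in recency}
    topic, used = [], 0
    for msg in reversed(history):
        if msg.id in seen or not (msg.tags & query_tags):
            continue
        if used + msg.tokens > topic_budget:
            continue  # too large for what is left; a smaller match may still fit
        topic.append(msg)
        used += msg.tokens

    # The assembled context is presented oldest-first.
    return sorted(recency + topic, key=lambda m: m.id)
```

Because the graph is never discarded, `history` here stands for the full stored corpus; the budget alone limits what reaches the model.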
Incoming message
       │
       ▼
FeatureExtractor ──► EnsembleTagger ──► inferred tags
                     ├── v0 baseline          │
                     └── GP-evolved           │
                                              ▼
                                      ContextAssembler
                                      ├── RecencyLayer (most recent N)
                                      └── TopicLayer (by tag, deduped)
                                              │
                                              ▼
                              Assembled context (oldest-first)
                                              │
                                              ▼
                                        QualityAgent
                                        ├── Context density scoring
                                        └── Reframing rate detection
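To make the EnsembleTagger step in the diagram more concrete, here is one plausible shape for a weighted vote over a family of taggers. Everything in it is hypothetical: the two toy taggers, their keyword rules, the weights, and the 0.5 threshold are invented for illustration and are not taken from tagger.py, gp_tagger.py, or ensemble.py.

```python
from typing import Callable, Dict, List, Set

Tagger = Callable[[str], Set[str]]


def v0_tagger(text: str) -> Set[str]:
    """Toy rule-based baseline: keyword lookup (rules are illustrative only)."""
    rules = {"deploy": "deployment", "firewall": "security", "port": "networking"}
    lowered = text.lower()
    return {tag for keyword, tag in rules.items() if keyword in lowered}


def evolved_tagger(text: str) -> Set[str]:
    """Stand-in for a GP-evolved tagger; a real one would be a learned expression."""
    lowered = text.lower()
    tags: Set[str] = set()
    if "port" in lowered or "gateway" in lowered:
        tags.add("networking")
    if "?" in text:
        tags.add("question")
    return tags


def ensemble_tags(
    text: str,
    taggers: Dict[str, Tagger],
    weights: Dict[str, float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep a tag when the weight share of the taggers proposing it reaches threshold."""
    total = sum(weights[name] for name in taggers)
    score: Dict[str, float] = {}
    for name, tagger in taggers.items():
        for tag in tagger(text):
            score[tag] = score.get(tag, 0.0) + weights[name]
    return sorted(tag for tag, s in score.items() if s / total >= threshold)
```

Raising the threshold makes the ensemble more conservative: a tag proposed by only one member survives only if that member carries enough weight on its own.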
Shadow mode evaluation across 812 interactions, 4000-token budget:
| | Context Graph | Linear Window |
|---|---|---|
| Messages/query | 23.6 | 22.0 |
| Tokens/query | 3,423 | 3,717 |
| Composition | 9.0 recency + 14.6 topic | 22.0 recency only |

| Metric | Value | Target | Status |
|---|---|---|---|
| Topic retrieval rate | 92.1% | — | — |
| Context density | 58.2% | > 60% | ❌ (see note) |
| Reframing rate | 1.5% | < 5% | ✅ |
| Composite quality score | 0.743 | — | — |
| Novel topic msgs/query | 14.6 | — | — |
| Token efficiency | -294/query vs. linear | — | ✅ |
- The graph delivers 14.6 topically retrieved messages per query that a linear window would never surface: older but on-topic exchanges that would have been compacted away or pushed out of the sliding window.
- More relevant context in fewer tokens. Graph assembly uses 294 fewer tokens per query while delivering more messages, because topic retrieval targets relevant material rather than blindly packing the most recent exchanges regardless of relevance.
- Reframing rate of 1.5% means users rarely need to re-establish context that was available in the graph. This is well under the 5% success target.
- Density at 58.2% is just under the 60% target. This is a structural artifact: the recency layer is fixed at 25% of token budget (~9 messages), so even perfect topic retrieval caps density around 62%. Adjustable by tuning the recency/topic budget split.
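The ~62% ceiling and the 294-token saving can be reproduced from the tables with a few lines of arithmetic. The density formula below is an assumption (topic-layer messages as a share of all assembled messages); quality.py may compute density differently, but this reading does land on the quoted figure.

```python
# Figures taken from the shadow-evaluation tables.
recency_msgs = 9.0     # messages/query contributed by the recency layer
topic_msgs = 14.6      # messages/query contributed by the topic layer
graph_tokens = 3423    # tokens/query, context graph
linear_tokens = 3717   # tokens/query, linear window

# Assumption: density = share of assembled messages that came from topic retrieval.
density_ceiling = topic_msgs / (recency_msgs + topic_msgs)
token_delta = graph_tokens - linear_tokens

print(f"density ceiling: {density_ceiling:.1%}")
print(f"token delta: {token_delta}/query")
```

Under this reading, shrinking the recency share is the only way to lift the ceiling, which matches the suggestion to tune the recency/topic split.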
When running shadow evaluation locally (not injecting into a live context window), the --budget flag is meaningless. Blow it open:

python3 scripts/shadow.py --report --budget 999999

With an uncapped budget, the linear baseline expands to the entire history (~583 messages in a mature corpus), while the graph still selects ~22 targeted messages. This is the clearest demonstration of what the graph actually does: semantic selection vs. a firehose.

With --budget 999999, the recency layer also expands and dilutes the ratio, so density will fail even when the graph is working correctly. The metrics that remain meaningful at any budget:
| Metric | Still valid? |
|---|---|
| Reframing rate | ✅ Always |
| Topic retrieval rate | ✅ Always |
| Novel msgs delivered | ✅ Always |
| Context density | ❌ Budget-dependent — ignore with large budgets |
Top-performing tags (fitness ≥ 0.90):
code, infrastructure, networking, question, shopping-list, llm,
openclaw, voice-pwa, research, ai, deployment, devops, security
Mid-range (0.70–0.90): planning, context-management, rl
Low-data tags (0.495): api, debugging, personal, yapCAD
| File | Purpose |
|---|---|
| store.py | SQLite MessageStore + tag index |
| features.py | Feature extraction (NLP + structural) |
| tagger.py | Rule-based baseline tagger (v0) |
| gp_tagger.py | Genetically-evolved tagger (DEAP) |
| ensemble.py | Weighted mixture model over tagger family |
| assembler.py | Context assembly (recency + topic layers) |
| quality.py | Quality agent (density + reframing scoring) |
| reframing.py | Reframing signal detection |
| logger.py | Interaction logging |
| cli.py | CLI for manual testing |
| scripts/harvester.py | Nightly interaction collection |
| scripts/evolve.py | GP tagger retraining |
| scripts/replay.py | Ensemble retagging of full corpus |
| scripts/shadow.py | Phase 2 shadow mode evaluation |
| utils/text.py | Shared text utilities: strip_envelope() strips channel metadata before indexing |
| scripts/update_memory_dynamic.py | Inject assembled context into MEMORY.md (shadow → live) |
pip install -r requirements.txt
python -m spacy download en_core_web_sm # optional but recommended

# Add a message/response pair
python3 cli.py add "user text" "assistant text" [--tags extra_tag]
# Assemble context for an incoming message
python3 cli.py query "how do I fix the gateway?"
# Inspect the tag index
python3 cli.py tags
# View recent messages
python3 cli.py recent [--n 10]
# Run Phase 2 shadow evaluation
python3 scripts/shadow.py --report --verbose

The Python API (api/server.py) must be running for the OpenClaw plugin to function.
It's managed as a launchd service (com.contextgraph.api) so it survives reboots
and restarts automatically on crash.
cd /path/to/tag-context
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Install the launchd service using the provided script (auto-detects your Python path):

./scripts/install-service.sh

The script reads service/com.contextgraph.api.plist.template, substitutes your local
paths, writes the rendered plist to ~/Library/LaunchAgents/, and loads it.
The rendered plist is .gitignore'd so local paths never end up in the repo.
To use a specific Python interpreter (e.g. pyenv shim):
./scripts/install-service.sh --python ~/.pyenv/shims/python3

# Status (PID present = running, just exit code = crashed)
launchctl list | grep tag-context
# Start / stop
launchctl start com.glados.tag-context
launchctl stop com.glados.tag-context
# Restart (e.g. after code changes — must unload+load to re-read plist)
launchctl unload ~/Library/LaunchAgents/com.glados.tag-context.plist
launchctl load ~/Library/LaunchAgents/com.glados.tag-context.plist
# Logs
tail -f /tmp/tag-context.log

# Service up?
curl http://localhost:8300/health
# → {"status":"ok","messages_in_store":..., "engine":"contextgraph"}
# Retrieval actually working?
curl http://localhost:8300/quality
# → {"zero_return_rate":0.04,"tag_entropy":3.6,"alert":false,...}

Note: /health tells you the service is running. /quality tells you whether retrieval is actually working. Always check both: a service can be healthy while silently returning empty context. See Retrieval Quality Monitoring.

Note: Never run the server manually (python3 api/server.py or uvicorn ...) while the launchd service is also active; port 8300 conflicts will cause both to crash-loop. Always use launchctl stop first, or launchctl unload to disable launchd management.
The plugin lives in plugin/index.ts. After making changes:
# Copy updated plugin to OpenClaw extension directory
cp plugin/index.ts ~/.openclaw/extensions/contextgraph/index.ts
# Graceful reload (keeps active sessions alive)
openclaw gateway reload
⚠️ Do not use openclaw gateway stop or gateway restart: these orphan the LaunchAgent and disconnect all active sessions (Telegram, Discord, Voice, etc.). Use gateway reload (SIGUSR1) instead. See Notes for Agents.
Toggle graph mode at runtime (in chat):
/graph on # enable context graph
/graph off # fall back to linear window
/graph # show current status + API health
The /quality endpoint provides retrieval health metrics that /health does not:
curl http://localhost:8300/quality | python3 -m json.tool

{
  "turns_evaluated": 50,
  "zero_return_turns": 2,
  "zero_return_rate": 0.04,
  "avg_topic_messages": 3.2,
  "tag_entropy": 3.65,
  "corpus_size": 1024,
  "top_tags": [...],
  "alert": false,
  "alert_reasons": []
}

Alert thresholds:
- zero_return_rate > 0.25: more than 25% of recent turns returned no graph context
- tag_entropy < 2.0: tags are over-generic, topic layer is near-useless
When alert: true, check alert_reasons for which threshold was breached.
Common causes of high zero_return_rate:
- Envelope pollution — channel metadata was being indexed as user text (fixed as of v1.1)
- Over-generic tags — all messages tagged the same; IDF filtering mitigates this automatically
- Empty corpus — not enough messages stored yet for topic retrieval to have anything to return
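The two alert thresholds can also be checked client-side from a parsed /quality payload, and the entropy threshold is easier to interpret with a small calculation. In the sketch below, `tag_entropy` assumes the metric is Shannon entropy in bits over tag usage counts, which the endpoint does not confirm; `retrieval_alerts` and its reason strings are hypothetical helpers, not part of the API.

```python
import math

ZERO_RETURN_MAX = 0.25   # documented alert threshold
TAG_ENTROPY_MIN = 2.0    # documented alert threshold


def tag_entropy(tag_counts):
    """Shannon entropy (bits) of a tag usage distribution (assumed definition)."""
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in tag_counts.values()
        if count > 0
    )


def retrieval_alerts(quality):
    """Return the reasons a parsed /quality payload breaches a threshold."""
    reasons = []
    if quality.get("zero_return_rate", 0.0) > ZERO_RETURN_MAX:
        reasons.append("zero_return_rate above 0.25")
    if quality.get("tag_entropy", float("inf")) < TAG_ENTROPY_MIN:
        reasons.append("tag_entropy below 2.0")
    return reasons
```

If the assumption holds, 16 equally used tags would score 4.0 bits and a corpus where every message carries the same single tag would score 0, so a value below 2.0 means retrieval is effectively choosing among fewer than four tags.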
With graph mode on, after each turn the plugin calls /compare and appends a JSON
record to ~/.tag-context/comparison-log.jsonl with:
- Graph vs. linear message/token counts
- Tags used for retrieval
- Sticky pin count (active tool chains)
- Whether the last turn had tool calls
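For offline analysis, records like those listed above can be aggregated with a short script. The field names `graph_tokens` and `linear_tokens` are placeholders chosen for this sketch; check an actual record for the real keys and pass them in.

```python
import json


def average_token_saving(lines, graph_key="graph_tokens", linear_key="linear_tokens"):
    """Mean (linear - graph) token count over JSONL records; None if no usable records.

    The default key names are hypothetical, not the plugin's documented schema.
    """
    deltas = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        if graph_key in record and linear_key in record:
            deltas.append(record[linear_key] - record[graph_key])
    return sum(deltas) / len(deltas) if deltas else None
```

To run it against the real log, open ~/.tag-context/comparison-log.jsonl and pass the file handle as `lines`; a positive result means the graph used fewer tokens on average.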
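# --json-lines (Python 3.8+) parses each line as its own JSON object; plain json.tool expects a single document
tail -f ~/.tag-context/comparison-log.jsonl | python3 -m json.tool --json-lines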
# or via API:
curl http://localhost:8300/comparison-log

Do NOT use openclaw gateway stop / gateway restart to reload the plugin.
These commands disconnect all active sessions and orphan the LaunchAgent.
Use instead:
openclaw gateway reload # SIGUSR1 graceful reload, keeps connections alive

/health returns {"status":"ok"} even when the graph is silently returning
empty context. Always check /quality when diagnosing retrieval problems:
curl http://localhost:8300/quality | python3 -c "import json,sys; q=json.load(sys.stdin); print('alert:', q['alert'], q.get('alert_reasons'))"

python3 -m pytest tests/ -v

- Phase 1: Passive Collection. Harvest interactions, build the graph, evolve taggers. Corpus: 812+ interactions, 16 active tags.
- Phase 2: Shadow Mode. Validate graph assembly against linear baseline. Result: graph delivers more relevant context in fewer tokens.
- Phase 3: Native Plugin (Plan of Record). OpenClaw context engine plugin live. /graph on|off toggles at runtime. Sticky threads auto-activate on tool chains. Comparison logging writes ~/.tag-context/comparison-log.jsonl every turn. See docs/PLAN_B_NATIVE_PLUGIN.md for the full implementation plan.
- Phase 3.5: Shadow Memory Injection. scripts/update_memory_dynamic.py queries /assemble nightly and writes a ## Dynamic Context section into a shadow memory file for validation. Once output quality is confirmed stable (target: ~1 day of shadow runs), the script will switch to writing directly to MEMORY.md via --live flag. Replace-section logic uses HTML comment markers so the curated long-term memory above is never touched.
- Phase 4: Graph-Primary. After validation, graph becomes the default context engine. Linear window available as fallback.
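The replace-section idea mentioned under Phase 3.5 can be sketched as follows. The marker strings and the `replace_section` helper are hypothetical; scripts/update_memory_dynamic.py may use different markers and different handling for a missing section.

```python
import re

# Hypothetical marker strings; the real script may use different ones.
START = "<!-- context-graph:dynamic:start -->"
END = "<!-- context-graph:dynamic:end -->"

_SECTION = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def replace_section(document: str, body: str) -> str:
    """Swap the text between the markers, or append a new marked section."""
    block = f"{START}\n## Dynamic Context\n{body.strip()}\n{END}"
    if _SECTION.search(document):
        # A function replacement keeps backslashes in body from being read as escapes.
        return _SECTION.sub(lambda _match: block, document, count=1)
    return document.rstrip("\n") + "\n\n" + block + "\n"
```

Because only the span between the two markers is ever rewritten, repeated nightly runs leave everything outside them byte-for-byte unchanged.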
- docs/MEMORY_INTEGRATION.md: How Context Graph works with the existing MEMORY.md / daily log paradigm. Start here if you're integrating Context Graph into an existing deployment without replacing the old memory system. Includes ghost mode validation checklist and Phase 3.5 upgrade path.
- docs/AGENT_SETUP.md: Operational guide for agents: full setup, service management, nightly scripts, diagnostics, and transition status. Start here if you're taking over maintenance.
- docs/CONTEXT_TRANSITION.md: Design doc: the problem with linear context, the DAG vision, transition phases.
- docs/PLAN_B_NATIVE_PLUGIN.md: Implementation plan for the native OpenClaw context engine plugin (Plan of Record).
MIT