Fair evaluation framework for agent memory
Agent memory systems are hard to compare fairly: they are typically evaluated with different LLMs, embedding models, and metrics. MemEval standardizes the setup: same LLM, same embeddings, same scoring pipeline, and end-to-end token cost tracking across ingestion, retrieval, and answer generation. Cost reporting matters because total LLM token usage often differs by an order of magnitude across architectures.
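That end-to-end tracking amounts to a per-phase token counter. A minimal sketch — the `CostTracker` name and API here are illustrative, not MemEval's actual code:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Accumulate LLM prompt/completion tokens per phase (illustrative sketch)."""
    # phase name -> [prompt_tokens, completion_tokens]
    phases: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def record(self, phase: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.phases[phase][0] += prompt_tokens
        self.phases[phase][1] += completion_tokens

    def total(self) -> int:
        # Corresponds to a "Tokens" column: prompt + completion across all phases.
        return sum(p + c for p, c in self.phases.values())


tracker = CostTracker()
tracker.record("ingestion", 1200, 300)
tracker.record("retrieval", 400, 50)
tracker.record("answering", 800, 120)
print(tracker.total())  # 2870
```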
Evaluation combines token F1 with LLM-as-judge scores, with per-category breakdowns to show where each system actually wins. The framework ships with 9 memory systems and 2 benchmarks (LoCoMo and LongMemEval), and adding new systems or datasets is straightforward.
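Token F1 is typically computed as SQuAD-style word overlap between the predicted and gold answers; a sketch under that assumption (MemEval's exact normalization rules, e.g. punctuation or article stripping, may differ):

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1: harmonic mean of word-overlap precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A terse but correct answer scores high precision, partial recall.
print(round(token_f1("quiet coffee shops", "John prefers quiet coffee shops"), 2))  # 0.75
```

This is why token F1 and LLM-judge scores can disagree: a paraphrased answer a judge accepts may share few exact tokens with the gold answer.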
We also introduce PropMem, which provides the strongest measured quality-to-cost tradeoff in our runs. It extracts atomic facts, tags them by entity, and filters retrieval by entity at query time. See PROPMEM.md for the design.
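The design can be pictured as an entity-indexed proposition store. This is a toy illustration, not PropMem's actual code — in the real system, fact extraction and answering are LLM calls (see PROPMEM.md):

```python
from collections import defaultdict


class PropositionStore:
    """Toy sketch: atomic facts tagged by entity, with entity-scoped retrieval."""

    def __init__(self):
        self.by_entity = defaultdict(list)  # entity -> list of propositions

    def add(self, proposition: str, entities: list[str]) -> None:
        for entity in entities:
            self.by_entity[entity.lower()].append(proposition)

    def retrieve(self, query_entities: list[str]) -> list[str]:
        # Filter by entity first; the real system would then rank the surviving
        # propositions (e.g. by embedding similarity) and pass them to the LLM
        # with chain-of-thought reasoning.
        seen, results = set(), []
        for entity in query_entities:
            for prop in self.by_entity[entity.lower()]:
                if prop not in seen:
                    seen.add(prop)
                    results.append(prop)
        return results


store = PropositionStore()
store.add("John prefers quiet coffee shops for work.", ["John"])
store.add("John's sister Mia lives in Austin.", ["John", "Mia"])
store.add("The assistant recommended a standing desk.", ["Assistant"])
print(store.retrieve(["Mia"]))  # ["John's sister Mia lives in Austin."]
```

At query time, PropMem would first extract the entities mentioned in the question, then scope retrieval to those entities' propositions — which is what keeps its token cost low relative to chunk-and-search systems.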
LoCoMo results:

| Rank | System | F1 | Judge | Tokens |
|---|---|---|---|---|
| 1 | PropMem | 0.605 | 0.823 | 5.9M |
| 2 | OpenClaw | 0.557 | 0.725 | 16.4M |
| 3 | Full Context | 0.542 | 0.709 | 37.5M |
| 4 | Hindsight | 0.489 | 0.676 | 24.2M |
| 5 | Graphiti | 0.416 | 0.573 | 5.1M |
| 6 | Memory-R1 | 0.389 | 0.569 | 3.4M |
| 7 | SimpleMem | 0.358 | 0.478 | 11.4M |
| 8 | Mem0 | 0.344 | 0.497 | 3.0M |
| 9 | MemU | 0.299 | 0.399 | 6.7M |
`Tokens` = total system LLM prompt + completion tokens across ingestion, retrieval, and answering; excludes embedding and judge calls.
Per-category F1:
| System | Factual | Temporal | Multi-hop | Inferential | Adversarial |
|---|---|---|---|---|---|
| PropMem | 0.431 | 0.615 | 0.599 | 0.289 | 0.794 |
| OpenClaw | 0.464 | 0.482 | 0.670 | 0.213 | 0.528 |
| Full Context | 0.517 | 0.369 | 0.674 | 0.197 | 0.509 |
| Hindsight | 0.431 | 0.306 | 0.526 | 0.206 | 0.647 |
| Graphiti | 0.296 | 0.151 | 0.349 | 0.120 | 0.873 |
| Memory-R1 | 0.370 | 0.116 | 0.460 | 0.193 | 0.504 |
| SimpleMem | 0.245 | 0.320 | 0.237 | 0.136 | 0.734 |
| Mem0 | 0.267 | 0.104 | 0.330 | 0.174 | 0.629 |
| MemU | 0.190 | 0.068 | 0.233 | 0.076 | 0.704 |
10 conversations, 1,986 QA pairs. LLM: gpt-4.1-mini. Embeddings: text-embedding-3-small. Judge: gpt-5.2 (avg of relevance, completeness, accuracy).
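Assuming the judge returns its three dimension scores as JSON (an assumption — the actual judge prompt and response format may differ), the averaging step is simply:

```python
import json


def judge_score(judge_response: str) -> float:
    """Average the three judge dimensions into a single score in [0, 1].

    The JSON shape here is illustrative, not MemEval's actual judge format.
    """
    scores = json.loads(judge_response)
    return (scores["relevance"] + scores["completeness"] + scores["accuracy"]) / 3


raw = '{"relevance": 1.0, "completeness": 0.6, "accuracy": 0.8}'
print(round(judge_score(raw), 3))  # 0.8
```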
LongMemEval results:

| Rank | System | F1 | Judge | Tokens |
|---|---|---|---|---|
| 1 | PropMem | 0.550 | 0.716 | 23.1M |
| 2 | SimpleMem | 0.480 | 0.667 | 20.8M |
| 3 | OpenClaw | 0.244 | 0.598 | 0.7M |
| 4 | Full Context | 0.222 | 0.520 | 10.6M |
Per-category scores:
| System | SS-U | SS-A | SS-P | MS | Temp | K-Update |
|---|---|---|---|---|---|---|
| PropMem | 0.851 | 0.767 | 0.147 | 0.582 | 0.424 | 0.528 |
| SimpleMem | 0.752 | 0.566 | 0.126 | 0.382 | 0.578 | 0.475 |
| OpenClaw | 0.401 | 0.432 | 0.127 | 0.082 | 0.185 | 0.234 |
| Full Context | 0.265 | 0.415 | 0.177 | 0.062 | 0.212 | 0.202 |
SS-U = Single-Session User, SS-A = Single-Session Assistant, SS-P = Single-Session Preference, MS = Multi-Session, Temp = Temporal, K-Update = Knowledge Update.
Stratified sample of 102 questions (17 per category), conversations up to 500 turns. LLM: gpt-4.1. Embeddings: text-embedding-3-small. Judge: gpt-4o (LongMemEval native binary accuracy, matches the paper's evaluation protocol).
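A stratified sample like this one can be drawn by grouping on the question-type field and sampling equally per group. A sketch — the `question_type` key is an assumption, and the repo's scripts/stratified_sample.py is the authoritative version:

```python
import random
from collections import defaultdict


def stratified_sample(questions: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    """Take an equal number of questions from each category (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_cat = defaultdict(list)
    for q in questions:
        by_cat[q["question_type"]].append(q)
    sample = []
    for cat in sorted(by_cat):  # sorted for deterministic category order
        sample.extend(rng.sample(by_cat[cat], per_category))
    return sample


pool = [
    {"question_type": t, "id": i}
    for t in ("single-session-user", "multi-session", "temporal")
    for i in range(50)
]
print(len(stratified_sample(pool, 17)))  # 51  (17 per category x 3 categories)
```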
MemEval ships with 9 memory systems spanning different architectural approaches:
| System | Architecture | Retrieval |
|---|---|---|
| PropMem | Entity-filtered propositions | Entity-scoped proposition search + CoT reasoning |
| OpenClaw | Chunk-and-search | Hybrid BM25 + vector search, top-K chunks to LLM |
| Full Context | Brute force | Entire conversation in the prompt |
| Hindsight | Structured memory networks | 4-network architecture (world, bank, opinion, observation) with retain-recall-reflect |
| Graphiti | Temporal knowledge graph | Graph search over entity nodes and relationship edges |
| SimpleMem | Structured compression | 3-stage pipeline: semantic compression, online synthesis, intent-aware retrieval |
| Mem0 | Fact extraction + search | Vector search over extracted facts |
| Memory-R1 | Two-agent RL | SFT+GRPO fine-tuned Qwen-2.5-7B (Memory Manager + Answer Agent) |
| MemU | Hierarchical memory | Memory-as-filesystem with auto-categorization and proactive context loading |
Requirements: Python >=3.11 and OPENAI_API_KEY set in your environment (or .env).
For full parity with all 9 systems (including MemU and Memory-R1), use Python >=3.13.
Install dependencies:

```shell
uv sync --all-extras
```

Reproduce LoCoMo results:

```shell
uv run python scripts/run_full_benchmark.py --systems all --num-samples 10 --llm-model gpt-4.1-mini
```

Run a single system with a specific LLM (no judge):

```shell
uv run python scripts/run_full_benchmark.py --systems propmem --num-samples 1 --llm-model gpt-4.1-mini --skip-judge
```

Run a single system with the judge enabled:

```shell
uv run python scripts/run_full_benchmark.py --systems propmem --num-samples 1 --llm-model gpt-4.1-mini
```

Reproduce LongMemEval results:

```shell
uv run python scripts/run_full_benchmark.py --benchmark longmemeval --data-file data/longmemeval_s_stratified_102.json --systems propmem,simplemem,openclaw,fullcontext --num-samples 102 --llm-model gpt-4.1
```

Generate the LongMemEval stratified sample used in this README:

```shell
uv run python scripts/stratified_sample.py --split s --total 102 --output data/longmemeval_s_stratified_102.json
```

Results are saved to `data/` as JSON files.
You can use PropMem directly as an app memory layer (no benchmark runner required):
```python
from agents_memory import PropMemMemory

memory = PropMemMemory(
    user_name="John",
    assistant_name="Assistant",
    llm_model="gpt-4.1-mini",
)

memory.add_session(
    [
        {"speaker": "John", "text": "I prefer quiet coffee shops for work."},
        {"speaker": "Assistant", "text": "Noted. You prefer quiet coffee shops."},
    ],
    session_date="2026-03-01 10:30:00",
)

answer = memory.ask("Where does John prefer to work?")
print(answer)
```

For multiple conversations, call `add_session(...)` for each new session, then query with `ask(...)` or `ask_with_details(...)`.
Write an adapter function and register it in the SYSTEMS dict in scripts/run_full_benchmark.py:
```python
def run_mysystem(conv: dict, llm_model: str, run_judge: bool) -> list[dict]:
    """Your system: describe what it does."""
    your_system = MyMemorySystem(model=llm_model)
    your_system.ingest(conv)
    return _qa_results(conv, lambda q: your_system.answer(q), run_judge)


SYSTEMS: dict[str, dict] = {
    # ... existing systems ...
    "mysystem": {
        "fn": run_mysystem,
        "architecture": "your architecture description",
        "infrastructure": "your dependencies",
    },
}
```

Then run:

```shell
# Quick test: your system vs PropMem on 1 conversation
uv run python scripts/run_full_benchmark.py --systems mysystem,propmem --num-samples 1 --skip-judge

# Full benchmark with judge (10 conversations, 1,986 QA pairs)
uv run python scripts/run_full_benchmark.py --systems mysystem --num-samples 10
```

Any QA dataset works. Register a loader in scripts/run_full_benchmark.py and run with `--benchmark mybench`. See CONTRIBUTING.md for the data format and details.
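A loader might look like the following hypothetical sketch. The field names (`sessions`, `qa_pairs`, `q`, `a`, `type`) are placeholders, not the actual schema — CONTRIBUTING.md defines the real format:

```python
import json


def load_mybench(data_file: str) -> list[dict]:
    """Hypothetical loader: normalize raw records into conversation dicts.

    All field names here are illustrative; see CONTRIBUTING.md for the
    format the benchmark runner actually consumes.
    """
    with open(data_file) as f:
        raw = json.load(f)
    conversations = []
    for record in raw:
        conversations.append(
            {
                "sessions": record["sessions"],  # list of turn lists
                "qa": [
                    {
                        "question": qa["q"],
                        "answer": qa["a"],
                        "category": qa.get("type", "factual"),
                    }
                    for qa in record["qa_pairs"]
                ],
            }
        )
    return conversations
```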
- Graphiti: uses the open-source `graphiti-core` library with Kuzu (embedded graph DB), not the commercial Zep platform, which uses Neo4j + BGE-m3 reranking. Zep's published numbers (75-80% accuracy) use a different metric (LLM-judge accuracy, not token F1) and their commercial infrastructure. The Mem0 paper independently measured Zep's platform at token F1 ~0.35-0.50 per category; our 0.416 with the open-source library is in the same range.
- Mem0: at evaluation time, there was a reported timestamp-handling issue on the Mem0 platform (mem0ai/mem0#3944) that may affect temporal reasoning. Our Mem0 temporal F1 (0.104) is materially lower than the paper's reported value (0.489), which may depress Mem0's overall score in this benchmark.
- MemU: claims "92% accuracy" on LoCoMo, but that figure uses LLM-judge binary accuracy, a fundamentally different metric from token F1, so it is not directly comparable.
- Hindsight: builds both summaries and chunks, which explains its high token count (24.2M).
- Memory-R1: the only system using a fine-tuned local model (Qwen-2.5-7B) rather than API-based LLMs. Results here use a model trained for 100 GRPO steps (undertrained vs. the paper's schedule). Token usage is 3.4M total (1,986 questions; ~1,705 prompt / ~5.3 completion tokens per question), placing it between Mem0 (3.0M) and Graphiti (5.1M) in efficiency.
Apache License 2.0. See LICENSE. Third-party attribution and notices: NOTICE.
MemEval is provided "as is," without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.
- LoCoMo (benchmark)
- LongMemEval (benchmark)
- Memory in the Age of AI Agents (survey)
- PropMem Design (this repo)
- Mem0
- SimpleMem
- Graphiti
- Hindsight
- OpenClaw Memory
- MemU
- Memory-R1


