The first unified, open benchmark for evaluating AGI components — memory, reasoning, and metacognition — across any AI model.
The AGI research field is flying blind.
- Team A claims their memory system is best → tested on their own benchmark
- Team B claims their reasoning is best → tested on a different benchmark
- Team C claims metacognition works → no standard test exists at all
Nobody can compare systems. Nobody knows what's actually better. Every paper defines "memory" and "reasoning" differently. The field has no shared language, no shared ruler.
AGI-Bench is that standard ruler.
One unified evaluation framework that tests any AI model or agent system on the three core components believed to be the final bottlenecks before AGI:
┌──────────────────────────────────────────────────────────────┐
│ AGI-BENCH │
│ │
│ Module 1: MEMORY Module 2: REASONING │
│ ├── Retention ├── Causal │
│ ├── Interference ├── Counterfactual │
│ ├── Retrieval Precision ├── Multi-hop │
│ └── Cross-session └── Consistency │
│ │
│ Module 3: METACOGNITION │
│ ├── Calibration │
│ ├── Abstention │
│ ├── Self-correction │
│ └── Boundary Detection │
│ │
│ Output → AGI-Score (0–100) + Live Multi-Model Leaderboard │
└──────────────────────────────────────────────────────────────┘
- Standardized Test Cases — 300+ carefully curated challenges with rigorous scoring
- Model-Agnostic Setup — Plug in Claude, GPT-4, Gemini, Groq (Llama, Mixtral), or local models via Ollama
- Reproducible Validation — Same tests, same deterministic programmatic scoring, every time
- Beautiful Visual Dashboard — Generates an interactive HTML dashboard with capability radar charts and ECE calibration curves
Run a full AGI diagnostic on an AI model in under 5 minutes.
git clone https://github.com/ParthivPandya/agi-bench
cd agi-bench
python -m venv venv
# Windows
.\venv\Scripts\Activate.ps1
# Mac/Linux
source venv/bin/activate
pip install -r requirements.txt

The project uses a standard .env file for configuration. Copy the example file and add your keys:
cp .env.example .env

Inside your .env file, configure the providers you wish to use:
# Anthropic (Claude Models)
ANTHROPIC_API_KEY="sk-ant-..."
# OpenAI (GPT-4, etc.)
OPENAI_API_KEY="sk-..."
# Google (Gemini Models)
GOOGLE_API_KEY="AIza..."
# Groq (Llama3, Mixtral, etc.)
GROQ_API_KEY="gsk_..."

(Note: Ollama running locally requires no API key!)
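The runner reads these keys from the environment at startup. As a purely illustrative sketch of how such a lookup can work (the provider names and variable mapping below are assumptions for illustration, not the project's actual code):

```python
import os

# Hypothetical mapping from --model flag values to their .env variables
ENV_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
    "ollama": None,  # local models need no key
}

def get_provider_key(provider):
    """Return the API key for a provider, or None for keyless local models."""
    var = ENV_VARS.get(provider)
    return os.environ.get(var) if var else None
```
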
# Test GPT-4o natively
python run_benchmark.py --model openai --model-name gpt-4o
# Test Claude 3.5 Sonnet
python run_benchmark.py --model claude --output results/claude_sonnet.json
# Test Groq (Ultra-fast Inference)
python run_benchmark.py --model groq --model-name llama-3.3-70b-versatile
# Test a Local Model via Ollama (No API Key Required!)
python run_benchmark.py --model ollama --model-name llama3

You will see a live evaluation stream in your terminal:
============================================================
✅ Retention Score: 84.1%
✅ Interference Score: 71.3%
✅ Retrieval Score: 79.4%
...
📊 AGI-SCORE: 78.2 / 100
============================================================

The runner also writes a JSON artifact to the results/ folder.
Simply double-click dashboard/index.html (or drag it into your browser) to explore module breakdowns, capability radars, and calibration charts.
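The result JSON can also be inspected without the dashboard. A minimal sketch, with the caveat that the field names used here ("model", "agi_score", "modules") are hypothetical; check your own artifact for the actual schema:

```python
import json

# Hypothetical schema: {"model": str, "agi_score": float, "modules": {name: score}}
def summarize(path):
    """Render a one-screen summary of a result JSON artifact."""
    with open(path) as f:
        result = json.load(f)
    lines = [f"{result['model']}: AGI-Score {result['agi_score']:.1f}/100"]
    for module, score in result["modules"].items():
        lines.append(f"  {module}: {score:.1f}%")
    return "\n".join(lines)
```
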
| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Retention | Recalling facts far back in a huge context | Forgetting facts pushed past 10k tokens |
| Interference | Resistance to catastrophic forgetting | Newer memories overwriting old ones |
| Retrieval | Extracting exact fragments amid noise | Hallucinated mix-ups on similar facts |
| Cross-Session | Persistent agentic memory states | Dropping session state entirely |
| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Causal | True causation vs. spurious correlation | Simple pattern-matching LLM tendencies |
| Counterfactual | "What if..." variable manipulations | Failing to alter dependent nodes |
| Multi-Hop | Chaining 3 to 10+ logical leaps | Hallucinated gap fills, breaking chains |
| Consistency | Maintaining logic across diverse reframes | Contradictory statements |
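To make the multi-hop idea concrete, here is a hypothetical test item; the structure, field names, and scoring helper below are illustrative only and not the benchmark's actual format:

```python
# A hypothetical multi-hop reasoning item: the answer requires chaining every fact.
multi_hop_item = {
    "id": "reasoning-multihop-001",
    "hops": 4,
    "facts": [
        "Ada mentors Ben.",
        "Ben leads the Apollo team.",
        "The Apollo team owns service X.",
        "Service X stores the billing data.",
    ],
    "question": "Who mentors the leader of the team whose service stores the billing data?",
    "answer": "Ada",
}

def hop_count(item):
    """Sanity-check that the chain length matches the item's metadata."""
    return len(item["facts"])
```
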
| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Calibration | Expected Calibration Error (ECE) mapping | Being 90% confident but 60% accurate |
| Abstention | Capacity to say "I don't know" | Sycophancy / Hallucinating to please prompt |
| Self-Correction | Refactoring answers without human clues | Doubling down on errors or overcorrecting |
| Boundary | Awareness of knowledge cutoffs | Making up facts on highly obscure topics |
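The Calibration subtest relies on Expected Calibration Error, which is a standard binned gap between stated confidence and actual accuracy. A sketch of the textbook formulation (the benchmark's exact binning may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: mean |accuracy - confidence| per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == lo)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 90% confidence but is right only 60% of the time (the failure mode in the table above) scores an ECE near 0.30; a perfectly calibrated model scores 0.0.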
The benchmark derives a weighted, normalized index score (0–100) balancing all metrics:
Memory = (Retention + Interference + Retrieval + CrossSession) / 4
Reasoning = (Causal + Counterfactual + MultiHop + Consistency) / 4
Metacognition = (Calibration + Abstention + SelfCorrection + Boundary) / 4

To test a new proprietary or local model, you just implement the unified BaseAdapter class. That's it. It takes fewer than 30 lines of code.
# adapters/my_model_adapter.py
from adapters.base_adapter import BaseAdapter

class MyModelAdapter(BaseAdapter):
    def query(self, prompt: str, system: str = "") -> str:
        # 1. Send the prompt to your model
        # 2. Return the response string
        return my_api.generate(prompt)

    def query_with_confidence(self, prompt: str) -> dict:
        # Return {"response": str, "confidence": float in [0.0, 1.0]}
        return {"response": text, "confidence": 0.8}

    def reset_session(self):
        # Clear context memory
        self.history = []

Then run python run_benchmark.py --model mymodel. Want your score public? Benchmark your system and open a PR with the result JSON file in results/ to be automatically added to the community leaderboard.
If you use this benchmark in research or publications:
@software{agibench2026,
author = {Parthiv Pandya},
title = {AGI-Bench: A Unified Benchmark for Evaluating Memory, Reasoning, and Metacognition in AI Systems},
year = {2026},
url = {https://github.com/ParthivPandya/agi-bench},
version = {1.0.0}
}

Built by Parthiv Pandya
Licensed under MIT