The first unified, open benchmark for evaluating AGI components — memory, reasoning, and metacognition — across any AI model.
The AGI research field is flying blind.
- Team A claims their memory system is best → tested on their own benchmark
- Team B claims their reasoning is best → tested on a different benchmark
- Team C claims metacognition works → no standard test exists at all
Nobody can compare systems. Nobody knows what's actually better. Every paper defines "memory" and "reasoning" differently. The field has no shared language, no shared ruler.
AGI-Bench is that standard ruler.
One unified evaluation framework that tests any AI model or agent system on the three core components believed to be the final bottlenecks before AGI:
┌──────────────────────────────────────────────────────────────┐
│ AGI-BENCH │
│ │
│ Module 1: MEMORY Module 2: REASONING │
│ ├── Retention ├── Causal │
│ ├── Interference ├── Counterfactual │
│ ├── Retrieval Precision ├── Multi-hop │
│ └── Cross-session └── Consistency │
│ │
│ Module 3: METACOGNITION │
│ ├── Calibration │
│ ├── Abstention │
│ ├── Self-correction │
│ └── Boundary Detection │
│ │
│ Output → AGI-Score (0–100) + Live Multi-Model Leaderboard │
└──────────────────────────────────────────────────────────────┘
- Standardized Test Cases — 300+ carefully curated challenges with rigorous scoring
- Model-Agnostic Setup — Plug in Claude, GPT-4, Gemini, Groq (Llama, Mixtral), or local models via Ollama
- Reproducible Validation — Same tests, same deterministic programmatic scoring, every time
- Beautiful Visual Dashboard — Generates an interactive HTML dashboard with capability radar charts and ECE calibration curves
Run a full AGI diagnostic on an AI model in under 5 minutes.
git clone https://github.com/ParthivPandya/agi-bench
cd agi-bench
python -m venv venv
# Windows
.\venv\Scripts\Activate.ps1
# Mac/Linux
source venv/bin/activate
pip install -r requirements.txt

The project uses a standard .env file for configuration. Copy the example file and add your keys:
cp .env.example .env

Inside your .env file, configure the providers you wish to use:
# Anthropic (Claude Models)
ANTHROPIC_API_KEY="sk-ant-..."
# OpenAI (GPT-4, etc.)
OPENAI_API_KEY="sk-..."
# Google (Gemini Models)
GOOGLE_API_KEY="AIza..."
# Groq (Llama3, Mixtral, etc.)
GROQ_API_KEY="gsk_..."

(Note: Ollama running locally requires no API key!)
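The runner reads these keys from the environment at startup. As a purely illustrative sketch of how such a lookup can work (the provider names and variable mapping below are assumptions for illustration, not the project's actual code):

```python
import os

# Hypothetical mapping from --model flag values to their .env variables
ENV_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
    "ollama": None,  # local models need no key
}

def get_provider_key(provider):
    """Return the API key for a provider, or None for keyless local models."""
    var = ENV_VARS.get(provider)
    return os.environ.get(var) if var else None
```
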
# Test GPT-4o natively
python run_benchmark.py --model openai --model-name gpt-4o
# Test Claude 3.5 Sonnet
python run_benchmark.py --model claude --output results/claude_sonnet.json
# Test Groq (Ultra-fast Inference)
python run_benchmark.py --model groq --model-name llama-3.3-70b-versatile
# Test a Local Model via Ollama (No API Key Required!)
python run_benchmark.py --model ollama --model-name llama3

You will see a live evaluation stream in your terminal:
============================================================
✅ Retention Score: 84.1%
✅ Interference Score: 71.3%
✅ Retrieval Score: 79.4%
...
📊 AGI-SCORE: 78.2 / 100
============================================================

The runner also writes a JSON artifact to the results/ folder.
Simply double-click dashboard/index.html (or drag it into your browser) to explore module breakdowns, capability radars, and calibration charts.
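The result JSON can also be inspected without the dashboard. A minimal sketch, with the caveat that the field names used here ("model", "agi_score", "modules") are hypothetical; check your own artifact for the actual schema:

```python
import json

# Hypothetical schema: {"model": str, "agi_score": float, "modules": {name: score}}
def summarize(path):
    """Render a one-screen summary of a result JSON artifact."""
    with open(path) as f:
        result = json.load(f)
    lines = [f"{result['model']}: AGI-Score {result['agi_score']:.1f}/100"]
    for module, score in result["modules"].items():
        lines.append(f"  {module}: {score:.1f}%")
    return "\n".join(lines)
```
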
| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Retention | Recalling facts far back in a huge context | Forgetting facts pushed past 10k tokens |
| Interference | Resistance to catastrophic forgetting | Newer memories overwriting old ones |
| Retrieval | Extracting exact fragments amid noise | Hallucinated mix-ups on similar facts |
| Cross-Session | Persistent agentic memory states | Dropping session state entirely |
| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Causal | True causation vs. spurious correlation | Simple pattern-matching LLM tendencies |
| Counterfactual | "What if..." variable manipulations | Failing to alter dependent nodes |
| Multi-Hop | Chaining 3 to 10+ logical leaps | Hallucinated gap fills, breaking chains |
| Consistency | Maintaining logic across diverse reframes | Contradictory statements |
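To make the multi-hop idea concrete, here is a hypothetical test item; the structure, field names, and scoring helper below are illustrative only and not the benchmark's actual format:

```python
# A hypothetical multi-hop reasoning item: the answer requires chaining every fact.
multi_hop_item = {
    "id": "reasoning-multihop-001",
    "hops": 4,
    "facts": [
        "Ada mentors Ben.",
        "Ben leads the Apollo team.",
        "The Apollo team owns service X.",
        "Service X stores the billing data.",
    ],
    "question": "Who mentors the leader of the team whose service stores the billing data?",
    "answer": "Ada",
}

def hop_count(item):
    """Sanity-check that the chain length matches the item's metadata."""
    return len(item["facts"])
```
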
| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Calibration | Expected Calibration Error (ECE) mapping | Being 90% confident but 60% accurate |
| Abstention | Capacity to say "I don't know" | Sycophancy / Hallucinating to please prompt |
| Self-Correction | Refactoring answers without human clues | Doubling down on errors or overcorrecting |
| Boundary | Awareness of knowledge cutoffs | Making up facts on highly obscure topics |
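The Calibration subtest relies on Expected Calibration Error, which is a standard binned gap between stated confidence and actual accuracy. A sketch of the textbook formulation (the benchmark's exact binning may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: mean |accuracy - confidence| per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == lo)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 90% confidence but is right only 60% of the time (the failure mode in the table above) scores an ECE near 0.30; a perfectly calibrated model scores 0.0.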
The benchmark derives a weighted, normalized index score (0–100) balancing all metrics:
Memory = (Retention + Interference + Retrieval + CrossSession) / 4
Reasoning = (Causal + Counterfactual + MultiHop + Consistency) / 4
Metacognition = (Calibration + Abstention + SelfCorrection + Boundary) / 4

To test a new proprietary or local model, you just implement the unified BaseAdapter class. That's it. It takes fewer than 30 lines of code.
# adapters/my_model_adapter.py
from adapters.base_adapter import BaseAdapter

class MyModelAdapter(BaseAdapter):
    def query(self, prompt: str, system: str = "") -> str:
        # 1. Send the prompt to your model
        # 2. Return the response string
        return my_api.generate(prompt)

    def query_with_confidence(self, prompt: str) -> dict:
        # Return {"response": str, "confidence": float in [0.0, 1.0]}
        return {"response": text, "confidence": 0.8}

    def reset_session(self):
        # Clear context memory
        self.history = []

Then run python run_benchmark.py --model mymodel. Want your score public? Benchmark your system and open a PR with the result JSON file in results/ to be automatically added to the community leaderboard.
If you use this benchmark in research or publications:
@software{agibench2026,
author = {Parthiv Pandya},
title = {AGI-Bench: A Unified Benchmark for Evaluating Memory, Reasoning, and Metacognition in AI Systems},
year = {2026},
url = {https://github.com/ParthivPandya/agi-bench},
version = {1.0.0}
}

Built by Parthiv Pandya
Licensed under MIT