🧠 AGI-Bench

The first unified, open benchmark for evaluating AGI components — memory, reasoning, and metacognition — across any AI model.

License: MIT · Python 3.10+ · arXiv · Leaderboard · PRs Welcome


AGI-Bench Dashboard Overview

AGI-Bench Dashboard Charts

AGI-Bench Dashboard Leaderboard




🛑 The Problem

The AGI research field is flying blind.

  • Team A claims their memory system is best → tested on their own benchmark
  • Team B claims their reasoning is best → tested on a different benchmark
  • Team C claims metacognition works → no standard test exists at all

Nobody can compare systems. Nobody knows what's actually better. Every paper defines "memory" and "reasoning" differently. The field has no shared language, no shared ruler.

AGI-Bench is that standard ruler.


🎯 What AGI-Bench Does

One unified evaluation framework that tests any AI model or agent system on the three core components believed to be the final bottlenecks before AGI:

┌──────────────────────────────────────────────────────────────┐
│                        AGI-BENCH                             │
│                                                              │
│   Module 1: MEMORY          Module 2: REASONING              │
│   ├── Retention             ├── Causal                       │
│   ├── Interference          ├── Counterfactual               │
│   ├── Retrieval Precision   ├── Multi-hop                    │
│   └── Cross-session         └── Consistency                  │
│                                                              │
│   Module 3: METACOGNITION                                    │
│   ├── Calibration                                            │
│   ├── Abstention                                             │
│   ├── Self-correction                                        │
│   └── Boundary Detection                                     │
│                                                              │
│   Output → AGI-Score (0–100) + Live Multi-Model Leaderboard  │
└──────────────────────────────────────────────────────────────┘

✨ Key Features

  1. Standardized Test Cases — 300+ carefully curated challenges with rigorous scoring
  2. Model-Agnostic Setup — Plug in Claude, GPT-4, Gemini, Groq (Llama, Mixtral), or local models via Ollama
  3. Reproducible Validation — Same tests, same programmatic scoring, every time
  4. Beautiful Visual Dashboard — Generates an interactive HTML dashboard with capability radar charts and ECE calibration curves

🚀 Quickstart

Run a full AGI diagnostic on an AI model in under 5 minutes.

1. Installation

git clone https://github.com/ParthivPandya/agi-bench
cd agi-bench
python -m venv venv

# Windows
.\venv\Scripts\Activate.ps1
# Mac/Linux
source venv/bin/activate

pip install -r requirements.txt

2. Configure Your API Keys

The project uses a standard .env file for configuration. Copy the example file and add your keys:

cp .env.example .env

Inside your .env file, configure the providers you wish to use:

# Anthropic (Claude Models)
ANTHROPIC_API_KEY="sk-ant-..."

# OpenAI (GPT-4, etc.)
OPENAI_API_KEY="sk-..."

# Google (Gemini Models)
GOOGLE_API_KEY="AIza..."

# Groq (Llama3, Mixtral, etc.)
GROQ_API_KEY="gsk_..."

(Note: Ollama running locally requires no API key!)
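Before running the benchmark, you can sanity-check which providers are configured. A minimal stdlib-only sketch, assuming the keys from .env have been exported into the process environment (e.g. via python-dotenv or your shell); `configured_providers` is an illustrative helper, not part of the repo:

```python
import os

# Map provider names to the environment variables from .env.example above.
PROVIDERS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "Groq": "GROQ_API_KEY",
}

def configured_providers(env=os.environ):
    # A provider counts as configured if its key is present and non-empty.
    return [name for name, var in PROVIDERS.items() if env.get(var)]

print("Configured:", configured_providers() or "none (Ollama still works locally)")
```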

3. Run the Benchmark!

# Test GPT-4o natively
python run_benchmark.py --model openai --model-name gpt-4o

# Test Claude 3.5 Sonnet
python run_benchmark.py --model claude --output results/claude_sonnet.json

# Test Groq (Ultra-fast Inference)
python run_benchmark.py --model groq --model-name llama-3.3-70b-versatile

# Test a Local Model via Ollama (No API Key Required!)
python run_benchmark.py --model ollama --model-name llama3

You will see a live evaluation stream in your terminal:

============================================================
  ✅ Retention     Score: 84.1% 
  ✅ Interference  Score: 71.3% 
  ✅ Retrieval     Score: 79.4%
  ...
  📊 AGI-SCORE: 78.2 / 100
============================================================

4. Open the Interactive Dashboard

Each run also writes a JSON artifact to the results/ folder. Open dashboard/index.html in your browser (double-click it or drag it in) to explore module breakdowns, capability radar charts, and calibration curves.


🧩 Benchmark Module Details

🗃️ Module 1: Memory

| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Retention | Recalling facts far back in a huge context | Forgetting facts pushed past 10k tokens |
| Interference | Resistance to catastrophic forgetting | Newer memories overwriting old ones |
| Retrieval | Extracting exact fragments amid noise | Hallucinated mix-ups on similar facts |
| Cross-Session | Persistent agentic memory states | Dropping session state entirely |

🔗 Module 2: Reasoning

| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Causal | True causation vs. spurious correlation | Simple pattern-matching LLM tendencies |
| Counterfactual | "What if..." variable manipulations | Failing to alter dependent nodes |
| Multi-Hop | Chaining 3 to 10+ logical leaps | Hallucinated gap fills, breaking chains |
| Consistency | Maintaining logic across diverse reframes | Contradictory statements |

🪞 Module 3: Metacognition

| Subtest | What It Measures | Target Failure Mode |
|---|---|---|
| Calibration | Expected Calibration Error (ECE) mapping | Being 90% confident but 60% accurate |
| Abstention | Capacity to say "I don't know" | Sycophancy / hallucinating to please the prompt |
| Self-Correction | Refactoring answers without human clues | Doubling down on errors or overcorrecting |
| Boundary | Awareness of knowledge cutoffs | Making up facts on highly obscure topics |
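The Calibration subtest reports Expected Calibration Error (ECE). For reference, here is a sketch of the standard binned formulation: group predictions by stated confidence, then compare average confidence to average accuracy in each bin. This is the usual textbook definition, not necessarily AGI-Bench's exact implementation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted |avg confidence - accuracy| per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; confidence 0.0 falls into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# The failure mode above — 90% confident but only 60% accurate — yields high ECE:
confs = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 6/10 correct
print(round(expected_calibration_error(confs, hits), 2))  # → 0.3
```

A perfectly calibrated model (confidence always matching accuracy) scores an ECE of 0.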

📐 The AGI-Score Formula

The benchmark derives a weighted, normalized index score (0-100) balancing all metrics:

$$ \text{AGI-Score} = \left( \overline{\text{Memory}} \times 0.33 + \overline{\text{Reasoning}} \times 0.33 + \overline{\text{Metacog}} \times 0.34 \right) \times 100 $$


View Sub-Score Calculations
Memory = (Retention + Interference + Retrieval + CrossSession) / 4
Reasoning = (Causal + Counterfactual + MultiHop + Consistency) / 4
Metacognition = (Calibration + Abstention + SelfCorrection + Boundary) / 4
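Plugging illustrative numbers into the formula above — a worked example with made-up sub-scores in [0, 1], not real benchmark results:

```python
def agi_score(memory, reasoning, metacog):
    """AGI-Score: weighted average of the three module means, scaled to 0-100."""
    mem = sum(memory) / len(memory)
    rea = sum(reasoning) / len(reasoning)
    met = sum(metacog) / len(metacog)
    return (mem * 0.33 + rea * 0.33 + met * 0.34) * 100

score = agi_score(
    memory=[0.841, 0.713, 0.794, 0.80],    # Retention, Interference, Retrieval, Cross-Session
    reasoning=[0.75, 0.70, 0.72, 0.80],    # Causal, Counterfactual, Multi-Hop, Consistency
    metacog=[0.85, 0.78, 0.70, 0.76],      # Calibration, Abstention, Self-Correction, Boundary
)
print(round(score, 1))  # → 76.7
```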

🤝 How to Add Your System (Leaderboard Submission)

To test a new proprietary or local model, implement the unified BaseAdapter class. That's it: typically under 30 lines of code.

# adapters/my_model_adapter.py
from adapters.base_adapter import BaseAdapter

class MyModelAdapter(BaseAdapter):
    def query(self, prompt: str, system: str = "") -> str:
        # 1. Send the prompt (and optional system message) to your model
        # 2. Return the response as a plain string
        return my_api.generate(prompt)

    def query_with_confidence(self, prompt: str) -> dict:
        # Return {"response": str, "confidence": float in [0.0, 1.0]}
        response = my_api.generate(prompt)
        return {"response": response, "confidence": 0.8}

    def reset_session(self):
        # Clear any per-session context/memory
        self.history = []

Then run python run_benchmark.py --model mymodel. Want your score public? Open a PR with your result JSON file in results/ to be added to the community leaderboard.
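Before wiring up a real model, you can smoke-test the adapter contract with a stand-in. EchoAdapter below is hypothetical and duck-typed (no BaseAdapter subclassing) so the snippet runs standalone:

```python
class EchoAdapter:
    """Hypothetical stand-in that satisfies the adapter interface with no model."""

    def __init__(self):
        self.history = []

    def query(self, prompt: str, system: str = "") -> str:
        self.history.append(prompt)
        return f"echo: {prompt}"

    def query_with_confidence(self, prompt: str) -> dict:
        # Always returns a fixed, mid-range confidence.
        return {"response": self.query(prompt), "confidence": 0.5}

    def reset_session(self):
        self.history = []

adapter = EchoAdapter()
print(adapter.query("2 + 2?"))                                     # → echo: 2 + 2?
print(adapter.query_with_confidence("sky color?")["confidence"])   # → 0.5
```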


📜 Citation

If you use this benchmark in research or publications:

@software{agibench2026,
  author    = {Parthiv Pandya},
  title     = {AGI-Bench: A Unified Benchmark for Evaluating Memory, Reasoning, and Metacognition in AI Systems},
  year      = {2026},
  url       = {https://github.com/ParthivPandya/agi-bench},
  version   = {1.0.0}
}

"You can't improve what you can't measure. AGI-Bench is the measure."

Built by Parthiv Pandya
Licensed under MIT
