VibeBench

VibeBench is an automated, extensible Python framework for the holistic evaluation of LLM-generated code. It goes beyond functional correctness by integrating static quality heuristics with sandboxed dynamic execution to measure the true production-readiness of AI-generated software.

Why VibeBench?

Existing benchmarks such as HumanEval and MBPP score only functional correctness: whether the generated code passes its tests. VibeBench also measures whether the code is maintainable, secure, and efficient, the qualities that matter in real-world software engineering.

Metric	HumanEval	MBPP	VibeBench
Functional correctness	✅	✅	✅
Halstead complexity	❌	❌	✅
Cyclomatic complexity	❌	❌	✅
Docstring coverage	❌	❌	✅
Hardcoded credential detection	❌	❌	✅
Ghost comment detection	❌	❌	✅
Sandboxed execution with resource limits	❌	❌	✅
Operational parity vs human baseline	❌	❌	✅
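To make two of the static metrics above concrete, here is a minimal sketch of how cyclomatic complexity and docstring coverage can be estimated with Python's standard `ast` module. The function names and the set of node types counted as decision points are assumptions made for illustration; VibeBench's own analyzer may count differently.

```python
import ast

# Node types treated as decision points in this sketch (an assumption,
# not necessarily the set VibeBench uses).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' contributes two extra paths.
            complexity += len(node.values) - 1
    return complexity


def docstring_coverage(source: str) -> float:
    """Return the percentage of functions and classes that have a docstring."""
    tree = ast.parse(source)
    targets = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not targets:
        return 0.0
    documented = sum(1 for node in targets if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(targets)


sample = '''
def sign(x):
    """Return -1, 0, or 1."""
    if x < 0:
        return -1
    elif x > 0:
        return 1
    return 0
'''
print(cyclomatic_complexity(sample))  # 3: base path + if + elif
print(docstring_coverage(sample))     # 100.0
```

An `elif` is parsed as an `If` nested in the `orelse` branch, which is why the sample counts two decision points.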

Installation

Requirements: Python 3.8+, Unix-based OS (Linux/macOS) for sandboxed execution.

# Clone the repository
git clone https://github.com/umayer16/VIBEBENCH.git
cd VIBEBENCH

# Install dependencies
pip install -r requirements.txt

Quick Start

Analyze a single code snippet

from core.analyzer import CodeAnalyzer

code = """
def add(a, b):
    return a + b
"""

analyzer = CodeAnalyzer(code)

print(analyzer.calculate_halstead_metrics())
# {'vocabulary': 4, 'volume': 8.0}

print(analyzer.get_docstring_coverage())
# 0.0

print(analyzer.detect_bad_practices())
# []
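For readers unfamiliar with the Halstead numbers, the textbook definition of volume is V = N * log2(n), where N is the total count of operators and operands and n is the number of distinct ones. The sketch below implements only that formula; `halstead_volume` is a hypothetical helper, and how VibeBench tokenizes code into operators and operands is not described here.

```python
import math


def halstead_volume(length: int, vocabulary: int) -> float:
    """Textbook Halstead volume: V = N * log2(n).

    length     -- N, total number of operators and operands
    vocabulary -- n, number of distinct operators and operands
    """
    if length <= 0 or vocabulary <= 0:
        return 0.0
    return length * math.log2(vocabulary)


print(halstead_volume(4, 4))  # 8.0
```

Under this formula, a vocabulary of 4 and a volume of 8.0 correspond to a program length of 4.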

Run the full benchmark

python vibebench.py

Results are saved as a timestamped JSON file (e.g. vibebench_multimodel_20260224_1912.json) and a leaderboard is generated at VibeBench_Leaderboard.md.

Output Format

VibeBench produces a JSON results file containing records with the following structure, one per evaluated model and task:

{
  "model": "gpt-4o",
  "task": "fibonacci",
  "halstead_volume": 42.5,
  "cyclomatic_complexity": 3,
  "docstring_coverage": 100.0,
  "bad_practices": [],
  "execution_success": true,
  "execution_time_ms": 12.4,
  "operational_parity": 0.95
}
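A short sketch of how such records could be post-processed, for example to rank models by mean operational parity. It assumes the results file is a JSON array of records shaped like the one above; that layout, the helper name `mean_parity_by_model`, and the placeholder values in `raw` are illustrative assumptions, not actual benchmark results.

```python
import json
from collections import defaultdict


def mean_parity_by_model(records):
    """Return (model, mean operational_parity) pairs, best first."""
    scores = defaultdict(list)
    for record in records:
        scores[record["model"]].append(record["operational_parity"])
    return sorted(
        ((model, sum(vals) / len(vals)) for model, vals in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Placeholder data for illustration only (not real results).
raw = """
[
  {"model": "model-a", "task": "fibonacci", "operational_parity": 1.0},
  {"model": "model-a", "task": "sorting",   "operational_parity": 0.5},
  {"model": "model-b", "task": "fibonacci", "operational_parity": 0.5}
]
"""
ranking = mean_parity_by_model(json.loads(raw))
for model, parity in ranking:
    print(f"{model}: {parity:.2f}")
```

To use it on a real run, replace `json.loads(raw)` with `json.load(open(path))` for the timestamped results file, after checking that the file's top-level structure matches the assumption above.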

Leaderboard

Current benchmark results across evaluated models are listed in VibeBench_Leaderboard.md.

Project Structure

VIBEBENCH/
├── core/
│   ├── analyzer.py      # Static analysis engine (AST-based)
│   ├── executor.py      # Sandboxed dynamic execution
│   └── reporter.py      # Leaderboard and visualization
├── datasets/            # Benchmark task definitions
├── figures/             # Architecture and leaderboard figures
├── tests/               # pytest test suite
├── vibebench.py         # Main entry point
├── paper.md             # JOSS paper
└── requirements.txt
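As an illustration of what sandboxed dynamic execution with resource limits can look like on a Unix system, the sketch below runs a snippet in a separate interpreter with a CPU-time cap and a wall-clock timeout. It uses only the standard library; `run_sandboxed` and its parameters are hypothetical and are not taken from core/executor.py. Resource limits alone do not isolate the filesystem or network, so this is not a complete security sandbox.

```python
import resource
import subprocess
import sys


def _limit_cpu(cpu_seconds: int):
    """Build a hook that caps CPU time in the child process (Unix only)."""
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    return apply_limits


def run_sandboxed(code: str, cpu_seconds: int = 2, wall_seconds: float = 5.0) -> dict:
    """Run code in a separate interpreter with CPU and wall-clock limits."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=wall_seconds,
            preexec_fn=_limit_cpu(cpu_seconds),
        )
    except subprocess.TimeoutExpired:
        return {"execution_success": False, "stdout": "", "stderr": "timeout"}
    return {
        "execution_success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


result = run_sandboxed("print(1 + 1)")
print(result["execution_success"], result["stdout"].strip())
```

The `resource` module and `preexec_fn` are available only on Unix-like systems, which matches the platform requirement stated under Installation.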

Running Tests

pip install pytest
pytest tests/

Reproducing Benchmark Results

To reproduce the findings from our v1.2.0 release:

Ensure your API keys are set in a .env file (see .env.example).

Run the full suite:

python vibebench.py benchmark --tasks datasets/prompts.json --verbose

Citation

If you use VibeBench in your research, please cite:

@software{arif2026vibebench,
  author = {Arif, Muktadir},
  title  = {VibeBench: An Automated Framework for the Holistic Evaluation of LLM-Generated Code},
  year   = {2026},
  doi    = {10.5281/zenodo.18758578},
  url    = {https://github.com/umayer16/VIBEBENCH}
}

Contributing

Contributions are welcome! Please read CONTRIBUTING.md before opening a pull request.

License

This project is licensed under the MIT License — see LICENSE for details.
