VibeBench is an automated, extensible Python framework for the holistic evaluation of LLM-generated code. It goes beyond functional correctness by integrating static quality heuristics with sandboxed dynamic execution to measure the true production-readiness of AI-generated software.
Existing benchmarks like HumanEval and MBPP check only whether code passes functional tests. VibeBench additionally checks whether code is maintainable, secure, and efficient: the qualities that matter in real-world software engineering.
| Metric | HumanEval | MBPP | VibeBench |
|---|---|---|---|
| Functional correctness | ✅ | ✅ | ✅ |
| Halstead complexity | ❌ | ❌ | ✅ |
| Cyclomatic complexity | ❌ | ❌ | ✅ |
| Docstring coverage | ❌ | ❌ | ✅ |
| Hardcoded credential detection | ❌ | ❌ | ✅ |
| Ghost comment detection | ❌ | ❌ | ✅ |
| Sandboxed execution with resource limits | ❌ | ❌ | ✅ |
| Operational parity vs human baseline | ❌ | ❌ | ✅ |
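Several of these metrics fall out directly of Python's AST. As an illustrative sketch of the docstring-coverage idea (this is not VibeBench's implementation; the `docstring_coverage` helper below is hypothetical):

```python
import ast

def docstring_coverage(source: str) -> float:
    """Return the fraction of functions/classes that carry a docstring."""
    tree = ast.parse(source)
    nodes = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 0.0
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return documented / len(nodes)

code = (
    'def add(a, b):\n'
    '    """Add two numbers."""\n'
    '    return a + b\n'
    '\n'
    'def sub(a, b):\n'
    '    return a - b\n'
)
print(docstring_coverage(code))  # 0.5 (one of two functions documented)
```

Working on the AST rather than raw text means comments and string literals inside function bodies cannot be mistaken for docstrings.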
Requirements: Python 3.8+ and a Unix-based OS (Linux/macOS) for sandboxed execution.

```bash
# Clone the repository
git clone https://github.com/umayer16/VIBEBENCH.git
cd VIBEBENCH

# Install dependencies
pip install -r requirements.txt
```

Quick start with the static analyzer:

```python
from core.analyzer import CodeAnalyzer

code = """
def add(a, b):
    return a + b
"""

analyzer = CodeAnalyzer(code)
print(analyzer.calculate_halstead_metrics())
# {'vocabulary': 4, 'volume': 8.0}
print(analyzer.get_docstring_coverage())
# 0.0
print(analyzer.detect_bad_practices())
# []
```

Run the full benchmark:

```bash
python vibebench.py
```

Results are saved as a timestamped JSON file (e.g. `vibebench_multimodel_20260224_1912.json`), and a leaderboard is generated at `VibeBench_Leaderboard.md`.
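To give a flavor of what a bad-practice check like the hardcoded-credential detection might look like, here is a rough AST-based sketch. It is purely illustrative; VibeBench's actual rules in `detect_bad_practices()` may differ, and both `find_hardcoded_credentials` and `SUSPICIOUS_NAMES` are invented for this example:

```python
import ast

# Variable names that commonly hold secrets (illustrative list).
SUSPICIOUS_NAMES = {"password", "passwd", "secret", "api_key", "token"}

def find_hardcoded_credentials(source: str) -> list:
    """Flag assignments of string literals to suspicious variable names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id.lower() in SUSPICIOUS_NAMES:
                    findings.append((node.lineno, target.id))
    return findings

print(find_hardcoded_credentials('api_key = "sk-123"\nx = 1\n'))
# [(1, 'api_key')]
```

Matching on variable names keeps the check cheap, at the cost of missing secrets stored under innocuous names; real scanners typically add entropy or pattern checks on the literal itself.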
VibeBench produces a JSON results file with the following structure per model:

```json
{
  "model": "gpt-4o",
  "task": "fibonacci",
  "halstead_volume": 42.5,
  "cyclomatic_complexity": 3,
  "docstring_coverage": 100.0,
  "bad_practices": [],
  "execution_success": true,
  "execution_time_ms": 12.4,
  "operational_parity": 0.95
}
```

Current benchmark results across evaluated models:
See VibeBench_Leaderboard.md for full results.
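Because the results file is plain JSON in the schema shown above, it is easy to post-process. Below is a hypothetical example that ranks models with an invented composite score; the weights are for illustration only and are not the scoring VibeBench itself uses:

```python
import json

# Two records following the per-model schema (abridged, invented values).
records = json.loads("""[
  {"model": "gpt-4o", "task": "fibonacci", "execution_success": true,
   "docstring_coverage": 100.0, "operational_parity": 0.95},
  {"model": "model-b", "task": "fibonacci", "execution_success": false,
   "docstring_coverage": 40.0, "operational_parity": 0.60}
]""")

def score(rec):
    # Invented weighting: correctness counts double, then parity and docs.
    return (2.0 * rec["execution_success"]
            + rec["operational_parity"]
            + rec["docstring_coverage"] / 100.0)

ranked = sorted(records, key=score, reverse=True)
print([r["model"] for r in ranked])  # ['gpt-4o', 'model-b']
```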
```
VIBEBENCH/
├── core/
│   ├── analyzer.py        # Static analysis engine (AST-based)
│   ├── executor.py        # Sandboxed dynamic execution
│   └── reporter.py        # Leaderboard and visualization
├── datasets/              # Benchmark task definitions
├── figures/               # Architecture and leaderboard figures
├── tests/                 # pytest test suite
├── vibebench.py           # Main entry point
├── paper.md               # JOSS paper
└── requirements.txt
```
Run the test suite with:

```bash
pip install pytest
pytest tests/
```

To reproduce the findings from our v1.2.0 release:

- Ensure your API keys are set in a `.env` file (see `.env.example`).
- Run the full suite:

```bash
python vibebench.py benchmark --tasks datasets/prompts.json --verbose
```
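The Unix-only requirement stems from sandboxed execution: the standard-library `resource` module, which can cap CPU time and memory for a child process, is only available on Linux/macOS. As a minimal sketch of the idea (not VibeBench's `executor.py`):

```python
import resource
import subprocess
import sys

def limit_resources():
    # Runs in the child process before exec: cap CPU time at 2 seconds
    # and address space at 1 GiB (limits here are illustrative).
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))
    resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))

# Execute a snippet of (here, trivial) generated code under those limits.
proc = subprocess.run(
    [sys.executable, "-c", "print(sum(range(10)))"],
    preexec_fn=limit_resources,
    capture_output=True, text=True, timeout=5,
)
print(proc.stdout.strip())  # 45
```

A runaway snippet that loops forever or allocates unboundedly is killed by the kernel rather than hanging the benchmark.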
If you use VibeBench in your research, please cite:
```bibtex
@software{arif2026vibebench,
  author = {Arif, Muktadir},
  title  = {VibeBench: An Automated Framework for the Holistic Evaluation of LLM-Generated Code},
  year   = {2026},
  doi    = {10.5281/zenodo.18758578},
  url    = {https://github.com/umayer16/VIBEBENCH}
}
```

Contributions are welcome! Please read CONTRIBUTING.md before opening a pull request.
This project is licensed under the MIT License — see LICENSE for details.