ComputeBench is a benchmark for long, step-by-step arithmetic. It generates datasets, runs models via OpenRouter, and scores how well outputs follow the required step format.
- Exact match rate: the output matches the expected text exactly, from Step 1 through the final Answer line.
- Answer accuracy: the final `Answer: N` line matches, even if intermediate steps differ.
- Format OK rate: every line from Step 1 onward is a valid Step or Answer line.
- Avg prefix match: the fraction of expected steps matched in order until the first mismatch.
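The four metrics can be sketched for a single example as below. This is an illustrative sketch only, not the repo's actual scoring code: the `score` function, the line regexes, and the assumption that the output begins at Step 1 are all mine.

```python
# Hedged sketch of the four per-example metrics. Assumes expected/got are
# lists of lines like "Step 1: ..." ending in "Answer: N" (names illustrative).
import re

STEP_RE = re.compile(r"^Step \d+: .+$")
ANSWER_RE = re.compile(r"^Answer: -?\d+$")

def score(expected: list[str], got: list[str]) -> dict:
    # Exact match: every line identical, Step 1 through Answer.
    exact = expected == got
    # Answer accuracy: final Answer line matches even if steps differ.
    answer_ok = bool(got and expected and got[-1] == expected[-1]
                     and ANSWER_RE.match(got[-1]))
    # Format OK: every output line is a valid Step or Answer line.
    format_ok = all(STEP_RE.match(l) or ANSWER_RE.match(l) for l in got)
    # Prefix match: count matching lines in order until the first mismatch.
    matched = 0
    for e, g in zip(expected, got):
        if e != g:
            break
        matched += 1
    prefix = matched / len(expected) if expected else 1.0
    return {"exact": exact, "answer_ok": answer_ok,
            "format_ok": format_ok, "prefix": prefix}
```

For example, an output whose second step is wrong but whose final Answer line is right would score exact=False, answer_ok=True, and a prefix of 1/3 on a three-line expected transcript.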
Build the site from existing runs:

```bash
bash scripts/build_site.sh
```

Run a full bench (generate data, call OpenRouter, and summarize):

```bash
OPENROUTER_API_KEY=... bash scripts/run_bench.sh --model openai/gpt-4o-mini
```

Score runs manually (graphs require matplotlib):

```bash
python3 scripts/score_runs.py runs/run-*.jsonl --out-csv build/summary.csv --plot build/summary
```

Diagnose failures in a run:

```bash
python3 scripts/compare_runs.py --data runs/data-<name>.jsonl runs/run-<name>.jsonl
```

- Rust toolchain (cargo) for dataset generation and the OpenRouter runner.
- Python 3.10+ for the scoring/build scripts (CI uses 3.11).
- Optional plots: `python3 -m pip install -r requirements.txt` (matplotlib).
- OpenRouter API key (`OPENROUTER_API_KEY`) to run models.
- `prompts/`: system prompt template.
- `runs/`: generated datasets and model outputs (JSONL).
- `scripts/`: scoring, site building, and diagnostics.
- `site/index.template.html`: HTML template for the report page.
- `build/`: generated site output (gitignored).
The CI workflow builds the site in `build/` and deploys it to GitHub Pages, installing the Python plotting dependency first so the graphs render.
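Such a Pages deploy might look roughly like the following. This is a hedged sketch, not the repo's actual workflow: the job layout and trigger branch are assumptions, and only `requirements.txt` and `scripts/build_site.sh` come from this README.

```yaml
# Illustrative sketch only: build the site with plotting deps, deploy to Pages.
name: site
on:
  push:
    branches: [main]
permissions:
  pages: write
  id-token: write
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python3 -m pip install -r requirements.txt  # matplotlib for plots
      - run: bash scripts/build_site.sh                  # writes build/
      - uses: actions/upload-pages-artifact@v3
        with:
          path: build
      - uses: actions/deploy-pages@v4
```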