ComputeBench is a benchmark for long, step-by-step arithmetic. It generates datasets, runs models via OpenRouter, and scores how well outputs follow the required step format.
- Exact match rate: the output matches the expected text exactly, from Step 1 through the final Answer line.
- Answer accuracy: the final `Answer: N` line matches, even if intermediate steps differ.
- Format OK rate: every line from Step 1 onward is a valid Step or Answer line.
- Avg prefix match: the fraction of expected steps matched in order until the first mismatch.
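The four metrics can be sketched for a single example as below. This is an illustrative sketch only, not the repo's actual scoring code: the `score` function, the line regexes, and the assumption that the output begins at Step 1 are all mine.

```python
# Hedged sketch of the four per-example metrics. Assumes expected/got are
# lists of lines like "Step 1: ..." ending in "Answer: N" (names illustrative).
import re

STEP_RE = re.compile(r"^Step \d+: .+$")
ANSWER_RE = re.compile(r"^Answer: -?\d+$")

def score(expected: list[str], got: list[str]) -> dict:
    # Exact match: every line identical, Step 1 through Answer.
    exact = expected == got
    # Answer accuracy: final Answer line matches even if steps differ.
    answer_ok = bool(got and expected and got[-1] == expected[-1]
                     and ANSWER_RE.match(got[-1]))
    # Format OK: every output line is a valid Step or Answer line.
    format_ok = all(STEP_RE.match(l) or ANSWER_RE.match(l) for l in got)
    # Prefix match: count matching lines in order until the first mismatch.
    matched = 0
    for e, g in zip(expected, got):
        if e != g:
            break
        matched += 1
    prefix = matched / len(expected) if expected else 1.0
    return {"exact": exact, "answer_ok": answer_ok,
            "format_ok": format_ok, "prefix": prefix}
```

For example, an output whose second step is wrong but whose final Answer line is right would score exact=False, answer_ok=True, and a prefix of 1/3 on a three-line expected transcript.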
Build the site from existing runs:

```bash
bash scripts/build_site.sh
```

Run a full bench (generate data, call OpenRouter, and summarize):

```bash
OPENROUTER_API_KEY=... bash scripts/run_bench.sh --model openai/gpt-4o-mini
```

Score runs manually (graphs require matplotlib):

```bash
python3 scripts/score_runs.py runs/run-*.jsonl --out-csv build/summary.csv --plot build/summary
```

Diagnose failures in a run:

```bash
python3 scripts/compare_runs.py --data runs/data-<name>.jsonl runs/run-<name>.jsonl
```

- Rust toolchain (cargo) for dataset generation and the OpenRouter runner.
- Python 3.10+ for the scoring/build scripts (CI uses 3.11).
- Optional plots: `python3 -m pip install -r requirements.txt` (matplotlib).
- OpenRouter API key (`OPENROUTER_API_KEY`) to run models.
- `prompts/`: system prompt template.
- `runs/`: generated datasets and model outputs (JSONL).
- `scripts/`: scoring, site building, and diagnostics.
- `site/index.template.html`: HTML template for the report page.
- `build/`: generated site output (gitignored).
The CI workflow builds the site in `build/` and deploys it to GitHub Pages, installing the Python plotting dependency first so the graphs render.
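Such a Pages deploy might look roughly like the following. This is a hedged sketch, not the repo's actual workflow: the job layout and trigger branch are assumptions, and only `requirements.txt` and `scripts/build_site.sh` come from this README.

```yaml
# Illustrative sketch only: build the site with plotting deps, deploy to Pages.
name: site
on:
  push:
    branches: [main]
permissions:
  pages: write
  id-token: write
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python3 -m pip install -r requirements.txt  # matplotlib for plots
      - run: bash scripts/build_site.sh                  # writes build/
      - uses: actions/upload-pages-artifact@v3
        with:
          path: build
      - uses: actions/deploy-pages@v4
```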