ComputeBench

ComputeBench is a benchmark for long, step-by-step arithmetic. It generates datasets, runs models via OpenRouter, and scores how well outputs follow the required step format.

Metrics

  • Exact match rate: output matches expected exactly from Step 1 through Answer.
  • Answer accuracy: final Answer: N matches, even if steps differ.
  • Format OK rate: every line from Step 1 onward is a valid Step or Answer line.
  • Avg prefix match: fraction of expected steps matched in order before the first mismatch.
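The four metrics above can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the line formats ("Step k: ..." and "Answer: N") and the comparison rules are assumptions inferred from the descriptions above.

```python
import re

# Hypothetical line formats, assumed from the metric descriptions above.
STEP_RE = re.compile(r"^Step \d+: .+$")
ANSWER_RE = re.compile(r"^Answer: -?\d+$")

def score(expected: str, actual: str) -> dict:
    """Sketch of the four metrics on one (expected, actual) output pair."""
    exp_lines = [l for l in expected.strip().splitlines() if l.strip()]
    act_lines = [l for l in actual.strip().splitlines() if l.strip()]

    # Exact match: every line identical, Step 1 through Answer.
    exact = exp_lines == act_lines

    # Answer accuracy: only the final "Answer: N" line has to agree.
    exp_ans = next((l for l in reversed(exp_lines) if ANSWER_RE.match(l)), None)
    act_ans = next((l for l in reversed(act_lines) if ANSWER_RE.match(l)), None)
    answer_ok = exp_ans is not None and exp_ans == act_ans

    # Format OK: every output line is a valid Step or Answer line.
    format_ok = all(STEP_RE.match(l) or ANSWER_RE.match(l) for l in act_lines)

    # Prefix match: expected lines matched in order before the first mismatch.
    matched = 0
    for e, a in zip(exp_lines, act_lines):
        if e != a:
            break
        matched += 1
    prefix = matched / len(exp_lines) if exp_lines else 0.0

    return {"exact": exact, "answer": answer_ok,
            "format_ok": format_ok, "prefix": prefix}
```

An output that gets one intermediate step wrong but recovers the final answer would score exact = False, answer = True, and a prefix fraction covering only the steps before the error.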

Quick start

Build the site from existing runs:

bash scripts/build_site.sh

Run a full bench (generate data + call OpenRouter + summarize):

OPENROUTER_API_KEY=... bash scripts/run_bench.sh --model openai/gpt-4o-mini

Score runs manually (graphs require matplotlib):

python3 scripts/score_runs.py runs/run-*.jsonl --out-csv build/summary.csv --plot build/summary

Diagnose failures in a run:

python3 scripts/compare_runs.py --data runs/data-<name>.jsonl runs/run-<name>.jsonl
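The core of a diagnosis like this is pairing each dataset record with its run record and locating the first divergent line. A minimal sketch, assuming (hypothetically) that both files are JSONL keyed by "id", with the dataset carrying "expected" text and the run carrying the model "output" — the real field names may differ:

```python
import json

def first_mismatches(data_lines, run_lines):
    """Map each run id to (line index, expected line, actual line)
    at the first point where the output diverges from the dataset."""
    expected = {r["id"]: r["expected"] for r in map(json.loads, data_lines)}
    report = {}
    for rec in map(json.loads, run_lines):
        exp = expected[rec["id"]].splitlines()
        out = rec["output"].splitlines()
        for i, (e, a) in enumerate(zip(exp, out)):
            if e != a:
                report[rec["id"]] = (i, e, a)  # first divergent line
                break
    return report
```

Feeding it the lines of a data-<name>.jsonl and its matching run-<name>.jsonl yields only the problems that need a human look, which is the spirit of the compare_runs.py step.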

Dependencies

  • Rust toolchain (cargo) for dataset generation and the OpenRouter runner.
  • Python 3.10+ for the scoring/build scripts (CI uses 3.11).
  • Optional plots: python3 -m pip install -r requirements.txt (matplotlib).
  • OpenRouter API key (OPENROUTER_API_KEY) to run models.

Repository layout

  • prompts/: system prompt template.
  • runs/: generated datasets and model outputs (JSONL).
  • scripts/: scoring, site building, and diagnostics.
  • site/index.template.html: HTML template for the report page.
  • build/: generated site output (gitignored).

GitHub Pages

The CI workflow builds the site into build/ and deploys it to GitHub Pages; it installs the Python plotting dependency so graphs render.
