Add automated perturbation benchmark for any paper #41
Open
r-uben wants to merge 3 commits into ChicagoHAI:main
Injects 12 known errors into a clean paper (Nakamura & Steinsson 2018) and measures reviewer recall against ground truth. Error categories: sign flips, parameter errors, definition inconsistencies, subscript swaps, and overstated claims.

Zero-shot result: 92% recall (11/12 detected); the only miss was an overstated claim. All math and parameter errors were caught with substantive explanations.

Includes:
- benchmarks/seed_errors.py: injection + scoring framework
- benchmarks/REPORT.md: updated with seeded benchmark methodology and results
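A minimal sketch of the seed-and-score loop, assuming a hypothetical error record with `find`/`replace` fields and substring-or-fuzzy matching for credit; the actual schema and matching logic live in benchmarks/seed_errors.py:

```python
from difflib import SequenceMatcher

# Hypothetical error records; the real schema is defined in benchmarks/seed_errors.py.
ERRORS = [
    {"id": "sign_flip_1", "find": "prices rise by 2.3%", "replace": "prices fall by 2.3%"},
    {"id": "param_err_1", "find": "beta = 0.96", "replace": "beta = 0.69"},
]

def inject(text: str) -> tuple[str, list[dict]]:
    """Apply each perturbation once and keep the ground-truth manifest."""
    manifest = []
    for err in ERRORS:
        if err["find"] in text:
            text = text.replace(err["find"], err["replace"], 1)
            manifest.append(err)
    return text, manifest

def recall(manifest: list[dict], comments: list[str], threshold: float = 0.6) -> float:
    """Fraction of injected errors that at least one reviewer comment flags,
    via exact substring containment or fuzzy similarity on the perturbed snippet."""
    def hit(err: dict) -> bool:
        snippet = err["replace"].lower()
        return any(
            snippet in c.lower()
            or SequenceMatcher(None, snippet, c.lower()).ratio() >= threshold
            for c in comments
        )
    return sum(hit(e) for e in manifest) / len(manifest) if manifest else 0.0
```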
The progressive method catches the claim overstatement that zero-shot missed, but loses two parameter errors during consolidation. Documents the complementary strengths of both methods and motivates adversarial adjudication (ChicagoHAI#35, ChicagoHAI#36).
Two-stage pipeline for seeded-error evaluation of review methods:
- Stage 1: deterministic regex extraction + LLM selection from candidates
- Stage 2: free-form LLM proposals validated by fuzzy match
- Validation: overlap detection, truncation checks, boundary checks
- Scoring: fuzzy + LLM-based matching of reviewer comments to injected errors

Six error categories: numeric_parameter, operator_or_sign, symbol_binding, index_or_subscript, condition_or_assumption, claim_strengthening.

Also fixes client.py for OpenAI GPT-5/o-series (max_completion_tokens) and adds OPENAI_BASE_URL support for custom endpoints (EU, Azure).

CLI: `openaireview perturb <paper>` and `openaireview score <manifest> <review>`
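A sketch of the client.py compatibility fix, assuming the OpenAI Python SDK v1 (which reads OPENAI_BASE_URL from the environment); the model-prefix check is an assumption about the dispatch, not the PR's exact logic:

```python
import os
from openai import OpenAI

# OpenAI() picks up OPENAI_API_KEY and, in SDK v1, OPENAI_BASE_URL from the
# environment, so custom endpoints (EU, Azure-compatible gateways) only need
# the variable exported.
client = OpenAI(base_url=os.environ.get("OPENAI_BASE_URL"))

def chat(model: str, messages: list[dict], limit: int = 4096) -> str:
    """GPT-5 and o-series models reject max_tokens and require
    max_completion_tokens; older chat models still take max_tokens."""
    kwargs = {"model": model, "messages": messages}
    if model.startswith(("gpt-5", "o1", "o3", "o4")):  # assumed prefix list
        kwargs["max_completion_tokens"] = limit
    else:
        kwargs["max_tokens"] = limit
    response = client.chat.completions.create(**kwargs)
    return response.choices[0].message.content
```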
Summary
Generalizes the seeded-perturbation benchmark (from PR #34) into a reusable module that works on any paper. Addresses @chenhaot's comment asking to "add more papers and make this seeded error part more general."
- Six error categories: `numeric_parameter`, `operator_or_sign`, `symbol_binding`, `index_or_subscript`, `condition_or_assumption`, `claim_strengthening`
- `OPENAI_BASE_URL` support for custom endpoints (EU, Azure)
- `max_completion_tokens` for GPT-5+/o-series models

CLI Usage
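The two subcommands, as given in the commit message:

```
openaireview perturb <paper>
openaireview score <manifest> <review>
```

Presumably `perturb` emits the perturbed paper plus a ground-truth manifest, which `score` then matches a review against.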
Initial Results (Nakamura & Steinsson 2018, QJE)
20 perturbations injected across 6 categories. Two review methods were scored against the manifest: `zero_shot` and `consistency_check`.

The `zero_shot` prompt produces generic peer-review comments (ambiguity, exposition) but misses every injected error. A targeted consistency-checking prompt catches numeric, sign, and direction errors, demonstrating both that the benchmark works and that the current method has a blind spot for verifiable internal inconsistencies.
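For contrast, a consistency-checking prompt along these lines (illustrative wording only; the PR's actual `consistency_check` prompt may differ):

```python
# Illustrative only; not the prompt shipped in this PR.
CONSISTENCY_CHECK_PROMPT = """\
You are auditing a paper for internal inconsistencies, not writing a review.
Report, with line-level evidence:
1. Numeric parameters whose values change between statements.
2. Signs or inequality directions that contradict the surrounding derivation.
3. Symbols or subscripts used inconsistently with their definitions.
Ignore style, novelty, and exposition.
"""
```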
Test plan
- Verified custom endpoint support (`OPENAI_BASE_URL`)
- Verified GPT-5/o-series token parameter (`max_completion_tokens`)