
Add automated perturbation benchmark for any paper#41

Open
r-uben wants to merge 3 commits into ChicagoHAI:main from r-uben:feat/auto-perturbation

Conversation


r-uben commented Mar 13, 2026

Summary

Generalizes the seeded-perturbation benchmark (from PR #34) into a reusable module that works on any paper. Addresses @chenhaot's comment asking to "add more papers and make this seeded error part more general."

  • Two-stage pipeline: (1) deterministic regex extraction identifies candidate spans (equations, numeric values, definitions, assumptions, claims, cross-references, conditions), then LLM picks minimal edits; (2) free-form LLM proposals validated by fuzzy match against the paper text
  • Six error categories: numeric_parameter, operator_or_sign, symbol_binding, index_or_subscript, condition_or_assumption, claim_strengthening
  • Validation layer: overlap detection, truncation checks (unbalanced delimiters), boundary garbage prevention
  • Scoring: fuzzy + LLM-based matching of reviewer comments against injected errors
  • Client fixes: OPENAI_BASE_URL for custom endpoints (EU, Azure); max_completion_tokens for GPT-5+/o-series models

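The deterministic Stage-1 extraction and the truncation check could be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual module: the function names, the two example regexes (covering only the `numeric_parameter` and `operator_or_sign` categories), and the delimiter set are all assumptions.

```python
# Sketch of Stage-1 candidate extraction and the truncation check
# (hypothetical names and patterns; the real module may differ).
import re

# Example patterns for two of the six error categories (assumed).
CANDIDATE_PATTERNS = {
    "numeric_parameter": re.compile(r"\b\d+(?:\.\d+)?%?\b"),
    "operator_or_sign": re.compile(r"[<>=][=<>]?|\+|-(?=\s*\d)"),
}

def extract_candidates(text):
    """Return (category, span, matched_text) for every pattern hit."""
    out = []
    for category, pattern in CANDIDATE_PATTERNS.items():
        for m in pattern.finditer(text):
            out.append((category, m.span(), m.group()))
    return out

def is_balanced(span_text):
    """Truncation check: reject spans with unbalanced delimiters."""
    pairs = {"(": ")", "[": "]", "{": "}"}
    stack = []
    for ch in span_text:
        if ch in pairs:
            stack.append(pairs[ch])
        elif ch in pairs.values():
            if not stack or stack.pop() != ch:
                return False
    return not stack
```

A candidate span that fails `is_balanced` would be discarded before the LLM is asked to pick a minimal edit, which is what keeps truncated equation fragments out of the perturbation set.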
CLI Usage

```bash
# Generate perturbations for any paper
openaireview perturb paper.pdf --model gpt-4.1 --output-dir ./results

# Review the corrupted paper
openaireview review results/paper_corrupted.md --method zero_shot

# Score the review against ground truth
openaireview score results/paper_perturbations.json results/paper_review.json
```
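The fuzzy-match step that backs both Stage-2 validation and `openaireview score` could look roughly like this, assuming a `difflib`-based similarity with an arbitrary 0.9 threshold; the function name and threshold are illustrative, not the PR's actual API:

```python
# Sketch of fuzzy matching a proposed span (or reviewer comment) against
# the paper text. Hypothetical helper; the PR's implementation may differ.
from difflib import SequenceMatcher

def fuzzy_match(proposed_span, paper_text, threshold=0.9):
    """Return the best-matching substring of paper_text, or None if
    similarity falls below the threshold (proposal rejected)."""
    matcher = SequenceMatcher(None, paper_text, proposed_span, autojunk=False)
    m = matcher.find_longest_match(0, len(paper_text), 0, len(proposed_span))
    candidate = paper_text[m.a : m.a + m.size]
    ratio = SequenceMatcher(None, candidate, proposed_span).ratio()
    return candidate if ratio >= threshold else None
```

Anchoring LLM proposals back to a verbatim substring of the paper is what makes the injected edits reproducible and scorable against ground truth.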

Initial Results (Nakamura & Steinsson 2018, QJE)

20 perturbations injected across 6 categories:

| Review Method | Model | Comments | Recall | FP Rate |
|---|---|---|---|---|
| zero_shot | gpt-4.1 | 9 | 0% | 100% |
| zero_shot | Claude Sonnet 4 | 4 | 0% | 100% |
| consistency_check | gpt-4.1 | 11 | 25–30% | 45–55% |

The zero_shot prompt produces generic peer-review comments (ambiguity, exposition) but misses every injected error. A targeted consistency-checking prompt catches numeric, sign, and direction errors — demonstrating that the benchmark works and that the current method has a blind spot for verifiable internal inconsistencies.

Test plan

  • Pipeline runs end-to-end on Nakamura & Steinsson (2018)
  • Scoring correctly matches reviewer comments to perturbations (fuzzy + LLM)
  • Validates against truncated spans, overlapping edits, boundary collisions
  • Client works with OpenAI EU endpoint (OPENAI_BASE_URL)
  • Client works with GPT-5+ models (max_completion_tokens)
  • Test on additional papers (neuroscience paper in progress)
  • Add unit tests for extract/validate/inject

r-uben added 3 commits March 13, 2026 12:36
Injects 12 known errors into a clean paper (Nakamura & Steinsson 2018)
and measures reviewer recall against ground truth. Error categories:
sign flips, parameter errors, definition inconsistencies, subscript
swaps, and overstated claims.

Zero-shot result: 92% recall (11/12 detected), only missing an
overstated claim. All math and parameter errors caught with substantive
explanations.

Includes:
- benchmarks/seed_errors.py: injection + scoring framework
- benchmarks/REPORT.md: updated with seeded benchmark methodology and results

Progressive catches claim overstatement that zero-shot missed but loses
two parameter errors during consolidation. Documents the complementary
strengths of both methods and motivates adversarial adjudication (ChicagoHAI#35, ChicagoHAI#36).

Two-stage pipeline for seeded-error evaluation of review methods:
- Stage 1: deterministic regex extraction + LLM selection from candidates
- Stage 2: free-form LLM proposals validated by fuzzy match
- Validation: overlap detection, truncation checks, boundary checks
- Scoring: fuzzy + LLM-based matching of reviewer comments to injected errors

Six error categories: numeric_parameter, operator_or_sign, symbol_binding,
index_or_subscript, condition_or_assumption, claim_strengthening.

Also fixes client.py for OpenAI GPT-5/o-series (max_completion_tokens)
and adds OPENAI_BASE_URL support for custom endpoints (EU, Azure).

CLI: `openaireview perturb <paper>` and `openaireview score <manifest> <review>`
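The client fix mentioned above could be sketched as a small parameter switch. This is a hedged illustration: the model-name prefixes used to detect the GPT-5/o-series families and the helper name are assumptions, since newer OpenAI models expect `max_completion_tokens` where older ones take `max_tokens`:

```python
# Sketch of the client.py fix: pick the token-limit parameter name by
# model family, and honor OPENAI_BASE_URL for custom endpoints.
# Prefixes below are assumed, not taken from the PR.
import os

def completion_kwargs(model, limit):
    """Build request kwargs with the right token-limit parameter."""
    if model.startswith(("o1", "o3", "o4", "gpt-5")):
        return {"model": model, "max_completion_tokens": limit}
    return {"model": model, "max_tokens": limit}

# Custom endpoint (EU, Azure) when the env var is set, else the default.
base_url = os.environ.get("OPENAI_BASE_URL")
```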
r-uben force-pushed the feat/auto-perturbation branch from ac64e39 to 3b0b334 on March 13, 2026 11:36
