Add automated perturbation benchmark for any paper #41
Open
r-uben wants to merge 3 commits into ChicagoHAI:main
Injects 12 known errors into a clean paper (Nakamura & Steinsson 2018) and measures reviewer recall against ground truth. Error categories: sign flips, parameter errors, definition inconsistencies, subscript swaps, and overstated claims.

Zero-shot result: 92% recall (11/12 detected); the only miss was an overstated claim. All math and parameter errors were caught with substantive explanations.

Includes:
- benchmarks/seed_errors.py: injection + scoring framework
- benchmarks/REPORT.md: updated with seeded benchmark methodology and results
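A minimal sketch of the seed-and-score loop, assuming a hypothetical error record with `find`/`replace` fields and substring-or-fuzzy matching for credit; the actual schema and matching logic live in benchmarks/seed_errors.py:

```python
from difflib import SequenceMatcher

# Hypothetical error records; the real schema is defined in benchmarks/seed_errors.py.
ERRORS = [
    {"id": "sign_flip_1", "find": "prices rise by 2.3%", "replace": "prices fall by 2.3%"},
    {"id": "param_err_1", "find": "beta = 0.96", "replace": "beta = 0.69"},
]

def inject(text: str) -> tuple[str, list[dict]]:
    """Apply each perturbation once and keep the ground-truth manifest."""
    manifest = []
    for err in ERRORS:
        if err["find"] in text:
            text = text.replace(err["find"], err["replace"], 1)
            manifest.append(err)
    return text, manifest

def recall(manifest: list[dict], comments: list[str], threshold: float = 0.6) -> float:
    """Fraction of injected errors that at least one reviewer comment flags,
    via exact substring containment or fuzzy similarity on the perturbed snippet."""
    def hit(err: dict) -> bool:
        snippet = err["replace"].lower()
        return any(
            snippet in c.lower()
            or SequenceMatcher(None, snippet, c.lower()).ratio() >= threshold
            for c in comments
        )
    return sum(hit(e) for e in manifest) / len(manifest) if manifest else 0.0
```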
The progressive method catches the claim overstatement that zero-shot missed, but loses two parameter errors during consolidation. Documents the complementary strengths of both methods and motivates adversarial adjudication (ChicagoHAI#35, ChicagoHAI#36).
Two-stage pipeline for seeded-error evaluation of review methods:
- Stage 1: deterministic regex extraction + LLM selection from candidates
- Stage 2: free-form LLM proposals validated by fuzzy match
- Validation: overlap detection, truncation checks, boundary checks
- Scoring: fuzzy + LLM-based matching of reviewer comments to injected errors

Six error categories: numeric_parameter, operator_or_sign, symbol_binding, index_or_subscript, condition_or_assumption, claim_strengthening.

Also fixes client.py for OpenAI GPT-5/o-series (max_completion_tokens) and adds OPENAI_BASE_URL support for custom endpoints (EU, Azure).

CLI: `openaireview perturb <paper>` and `openaireview score <manifest> <review>`
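A sketch of the client.py compatibility fix, assuming the OpenAI Python SDK v1 (which reads OPENAI_BASE_URL from the environment); the model-prefix check is an assumption about the dispatch, not the PR's exact logic:

```python
import os
from openai import OpenAI

# OpenAI() picks up OPENAI_API_KEY and, in SDK v1, OPENAI_BASE_URL from the
# environment, so custom endpoints (EU, Azure-compatible gateways) only need
# the variable exported.
client = OpenAI(base_url=os.environ.get("OPENAI_BASE_URL"))

def chat(model: str, messages: list[dict], limit: int = 4096) -> str:
    """GPT-5 and o-series models reject max_tokens and require
    max_completion_tokens; older chat models still take max_tokens."""
    kwargs = {"model": model, "messages": messages}
    if model.startswith(("gpt-5", "o1", "o3", "o4")):  # assumed prefix list
        kwargs["max_completion_tokens"] = limit
    else:
        kwargs["max_tokens"] = limit
    response = client.chat.completions.create(**kwargs)
    return response.choices[0].message.content
```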
Summary
Generalizes the seeded-perturbation benchmark (from PR #34) into a reusable module that works on any paper. Addresses @chenhaot's comment asking to "add more papers and make this seeded error part more general."
- Six error categories: `numeric_parameter`, `operator_or_sign`, `symbol_binding`, `index_or_subscript`, `condition_or_assumption`, `claim_strengthening`
- `OPENAI_BASE_URL` support for custom endpoints (EU, Azure)
- `max_completion_tokens` for GPT-5+/o-series models

CLI Usage
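The two subcommands, as given in the commit message:

```
openaireview perturb <paper>
openaireview score <manifest> <review>
```

Presumably `perturb` emits the perturbed paper plus a ground-truth manifest, which `score` then matches a review against.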
Initial Results (Nakamura & Steinsson 2018, QJE)
20 perturbations injected across 6 categories. Two review methods were scored against the manifest: `zero_shot` and `consistency_check`.

The `zero_shot` prompt produces generic peer-review comments (ambiguity, exposition) but misses every injected error. A targeted consistency-checking prompt catches numeric, sign, and direction errors, demonstrating both that the benchmark works and that the current method has a blind spot for verifiable internal inconsistencies.
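For contrast, a consistency-checking prompt along these lines (illustrative wording only; the PR's actual `consistency_check` prompt may differ):

```python
# Illustrative only; not the prompt shipped in this PR.
CONSISTENCY_CHECK_PROMPT = """\
You are auditing a paper for internal inconsistencies, not writing a review.
Report, with line-level evidence:
1. Numeric parameters whose values change between statements.
2. Signs or inequality directions that contradict the surrounding derivation.
3. Symbols or subscripts used inconsistently with their definitions.
Ignore style, novelty, and exposition.
"""
```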
Test plan
- Verified custom endpoint support (`OPENAI_BASE_URL`)
- Verified GPT-5/o-series token parameter (`max_completion_tokens`)