
EvidenceSynthBench

A benchmark for evidence synthesis reasoning

Note

Under active development. The three current task categories are stable and ready for use. Additional categories are planned.

Overview

This benchmark evaluates AI agents on the expert reasoning tasks that underpin scientific evidence synthesis, focusing on systematic reviews. The current release includes 594 tasks across three evaluation categories: assessing risk of bias, determining study inclusion/exclusion, and judging inferential distance from studies to a hypothesis. For these categories, ground truth is drawn from 24 open-access Cochrane systematic reviews spanning 16 clinical areas.

Existing AI-for-science benchmarks targeting scientific literature tend to focus on information retrieval, comprehension, and extraction. EvidenceSynthBench isolates the evaluative reasoning needed to reliably synthesize complex bodies of evidence.

Future categories under development include GRADE assessment of bodies of evidence, heterogeneity assessment for meta-analysis, and external validity reasoning.


Task Types

Task          Input                               Judgement             Tasks
Risk of Bias  Fulltext paper + bias domain        Low / High / Unclear  317
Eligibility   Review objectives + fulltext paper  Include / Exclude     122
Directness    Hypothesis + 3 study abstracts      Most → least direct   155

Risk of Bias (RoB)

Given the full text of an open-access clinical study and a specific bias domain (e.g., "Allocation concealment"), the agent must produce a risk-of-bias judgement (Low / High / Unclear for RoB 1, or Low / High / Some concerns for RoB 2). Ground truth is drawn from Cochrane RoB assessments made with the RoB 1 or RoB 2 tool. Scored by judgement accuracy and weighted kappa.
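The weighted-kappa part of that scoring can be sketched as below. This is an illustrative implementation, not the harness's own: the category ordering and the quadratic disagreement weights are assumptions, and the benchmark's exact weighting scheme may differ.

```python
# Illustrative quadratic weighted kappa for ordinal risk-of-bias labels.
# Assumes the ordering Low risk < Unclear risk < High risk.
def weighted_kappa(truth, preds, categories=("Low risk", "Unclear risk", "High risk")):
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # Observed confusion counts
    obs = [[0.0] * k for _ in range(k)]
    for t, p in zip(truth, preds):
        obs[idx[t]][idx[p]] += 1
    n = len(truth)
    row_tot = [sum(obs[i]) for i in range(k)]
    col_tot = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic disagreement weight
            num += w * obs[i][j]                  # observed weighted disagreement
            den += w * row_tot[i] * col_tot[j] / n  # expected under independence
    return 1.0 - num / den if den else 1.0
```

Values near 1 indicate agreement well beyond chance; 0 is chance-level; negative values are worse than chance.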

Eligibility

Given a review's title and objectives, plus the full text of an open-access clinical study, the agent must decide whether the study should be included in or excluded from the review. The agent receives the review objectives but not the detailed inclusion/exclusion criteria listed in the review. This tests whether it can infer appropriate criteria rather than mechanically applying a given checklist. Scored by decision accuracy (Include/Exclude).

Directness

Given a hypothesis drawn from a Cochrane review and the abstracts of three thematically related studies, the agent must rank them from most to least direct — that is, by the inferential distance needed to connect each study's results to the hypothesis. Each task contains one direct study (included in the Cochrane review), one indirect study (excluded but topically related), and one distal study (more distant from the hypothesis but chosen to be superficially similar to it). Scored by exact-match and pairwise ranking accuracy.
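The two ranking metrics can be sketched as follows, assuming both the prediction and the ground truth are label sequences such as ["A", "C", "B"] (the harness's actual answer format may differ).

```python
from itertools import combinations

def exact_match(pred, truth):
    # 1.0 only when the full three-item ranking is reproduced exactly.
    return 1.0 if list(pred) == list(truth) else 0.0

def pairwise_accuracy(pred, truth):
    # Fraction of label pairs whose relative order agrees with the ground truth.
    pred_pos = {label: i for i, label in enumerate(pred)}
    truth_pos = {label: i for i, label in enumerate(truth)}
    pairs = list(combinations(truth, 2))
    correct = sum(
        (pred_pos[a] < pred_pos[b]) == (truth_pos[a] < truth_pos[b])
        for a, b in pairs
    )
    return correct / len(pairs)
```

With three studies there are three pairs, so pairwise accuracy takes values in {0, 1/3, 2/3, 1}; swapping only the indirect and distal studies, for instance, still scores 2/3.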


Ground-Truth Validation

Every task instance has been audited for validity. Tasks where the ground truth is unfair given the information available to the agent — due to reviewer reliance on unpublished data, criteria not stated in the review objectives, or obvious cases of reviewer oversight — were flagged and excluded.


Getting Started

Installation

Requires Python 3.11+.

git clone https://github.com/blue-eclectus/EvidenceSynthBench.git
cd EvidenceSynthBench
pip install -r requirements.txt

Set API keys in a .env file (only the keys for providers you plan to use):

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
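If you want to verify the file before running anything, a minimal stdlib-only sketch like the following can parse simple KEY=VALUE lines (this helper is not part of the harness, which presumably loads .env itself):

```python
import os
from pathlib import Path

# Hypothetical helper: load simple KEY=VALUE pairs from a .env file.
# Skips blank lines and comments; does not overwrite variables that are
# already set in the process environment.
def load_env(path: str = ".env") -> dict:
    env = {}
    p = Path(path)
    if not p.exists():
        return env
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return env
```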

Downloading Papers

Task instances reference papers by PMID. Abstracts and fulltext PDFs are not included in the repository and must be downloaded before running evaluations.

Step 1: Download abstracts (required for all tasks)

python3 setup_papers.py download --abstracts-only

This fetches abstracts from PubMed for all 450 studies and backfills them into directness task files. Takes a few minutes.

Step 2: Download open-access PDFs (required for RoB and eligibility tasks)

python3 setup_papers.py download

This fetches abstracts (if not already present) and downloads open-access PDFs via OpenAlex and PubMed Central. Some PDFs will not be available due to publisher restrictions on automated downloads; the script writes data/pdfs/missing.json listing any that could not be retrieved. You can download those manually, or import data/studies.ris into a citation manager (e.g. Zotero or EndNote) to bulk-download the missing PDFs, then place them at data/pdfs/{pmid}.pdf.
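A short sketch for checking which PDFs are still outstanding after manual retrieval. This assumes data/pdfs/missing.json is a flat JSON list of PMIDs; the actual schema in your checkout may differ, so treat this as illustrative.

```python
import json
from pathlib import Path

# Sketch (not part of the harness): report PMIDs listed in missing.json
# that still have no {pmid}.pdf placed in the PDF directory.
def missing_pdfs(pdf_dir: str = "data/pdfs") -> list[str]:
    pdf_path = Path(pdf_dir)
    listed = json.loads((pdf_path / "missing.json").read_text())
    return [pmid for pmid in listed if not (pdf_path / f"{pmid}.pdf").exists()]
```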

Step 3: Parse PDFs (required for RoB and eligibility tasks)

GROBID is the included PDF-parsing option, but you can use any PDF-to-text solution.

GROBID parsing option

Start a GROBID server:

docker run -d --name grobid -p 8070:8070 lfoppiano/grobid:0.8.0

Then parse the downloaded PDFs:

python3 setup_papers.py parse

This extracts structured fulltext from each PDF and stores it in the paper store. You can re-run both download and parse safely — they skip already-completed work.

Alternative PDF parsers

The evaluation harness reads fulltext from JSON files in data/paper_store/{pmid}.json. To use your own parser, populate these files with the following structure:

{
  "pmid": "12345678",
  "title": "Paper title",
  "doi": "10.1234/example",
  "abstract": "Abstract text...",
  "fulltext": {
    "sections": {
      "Abstract": "...",
      "1 Introduction": "...",
      "2 Methods": "..."
    },
    "source": "your-parser-name"
  }
}

The fulltext.sections dictionary should map section headings to their text content. After populating the JSON files, you can proceed directly to running evaluations.
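Writing one of these entries from your own parser's output can be sketched as below. The schema follows the structure documented above; the function and parser names are illustrative.

```python
import json
from pathlib import Path

# Sketch: serialise one paper-store entry in the documented layout.
# `sections` maps section headings to text; the remaining fields come
# from your bibliographic metadata.
def write_paper_store_entry(pmid, title, doi, abstract, sections,
                            store_dir="data/paper_store",
                            parser_name="my-parser"):
    entry = {
        "pmid": pmid,
        "title": title,
        "doi": doi,
        "abstract": abstract,
        "fulltext": {"sections": sections, "source": parser_name},
    }
    out_dir = Path(store_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{pmid}.json"
    out_path.write_text(json.dumps(entry, indent=2))
    return out_path
```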

Running Evaluations
# Evaluate on eligibility (public subset)
python3 evaluate.py --task-type eligibility \
  --model claude-haiku-4-5-20251001 \
  --subset eligibility

# Evaluate on RoB with extended thinking
python3 evaluate.py --task-type rob \
  --model claude-sonnet-4-5-20250929 \
  --subset rob \
  --extended-thinking

# Evaluate on directness
python3 evaluate.py --task-type directness \
  --model gpt-5-mini-2025-08-07 \
  --subset directness

# Quick test with a small sample
python3 evaluate.py --task-type eligibility \
  --model claude-haiku-4-5-20251001 \
  --subset eligibility \
  --limit 20

# Score an existing run without re-running
python3 evaluate.py --score-only RUN_ID

Use --seed N for reproducible task ordering across runs. Use --concurrency N (default 4) to control parallel API calls.

Custom Agents

You can evaluate your own agent by implementing a Python file with an Agent class:

class Agent:
    def assess(self, prompt: str) -> str | dict:
        """Return the model's response as a string or dict. Can be sync or async."""
        # Your implementation here
        return response_text

Then run:

python3 evaluate.py --task-type eligibility \
  --agent path/to/my_agent.py \
  --subset eligibility
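A deliberately trivial baseline agent, useful only for smoke-testing the harness wiring before plugging in a real model client, might look like this. The keyword heuristics are assumptions about the prompt wording, not guarantees about how the harness phrases its tasks.

```python
# Baseline Agent: never calls a model, just returns a fixed answer per
# apparent task type based on keywords in the prompt.
class Agent:
    def assess(self, prompt: str) -> str:
        text = prompt.lower()
        if "include" in text and "exclude" in text:
            return "Include"        # eligibility-style prompts
        if "risk of bias" in text:
            return "Unclear risk"   # RoB-style prompts
        return "A, B, C"            # directness fallback: a fixed ranking
```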

Supported Models

The evaluation harness includes built-in adapters for:

  • Anthropic: claude-* (supports extended thinking)
  • OpenAI: gpt-*, o1-*, o3-* (supports reasoning models)
  • Google: gemini-* (supports thinking config for 2.5 models)
  • OpenRouter: openrouter:provider/model

Data Format

RoB Task Instance
{
  "task_id": "CD006689_akinyotu-2018_allocation-concealment",
  "review_id": "CD006689",
  "study_name": "Akinyotu 2018",
  "pmid": "29719927",
  "doi": "10.1002/ijgo.12516",
  "rob_tool": "RoB1",
  "clinical_area": "infectious disease",
  "domain_name": "Allocation concealment (selection bias)",
  "outcome": null,
  "valid_judgements": ["Low risk", "Unclear risk", "High risk"],
  "ground_truth": {
    "judgement": "Low risk",
    "support": "The investigators were masked to allocation because the drugs were pre-packaged on..."
  }
}
Eligibility Task Instance
{
  "task_id": "CD007491_screening_caplan-1999",
  "task_type": "eligibility",
  "review_id": "CD007491",
  "review_title": "Admission avoidance hospital at home",
  "review_objectives": "To determine the effectiveness and cost of managing patients with...",
  "synthesis_type": "meta-analysis",
  "study_name": "Caplan 1999",
  "pmid": "16127109",
  "ground_truth": {
    "decision": "Include",
    "reason": null,
    "reason_category": null
  }
}
Directness Task Instance
{
  "task_id": "CD006689_directness_v2_001",
  "task_type": "directness",
  "review_id": "CD006689",
  "hypothesis": "Intermittent preventive treatment is effective for malaria prevention in...",
  "studies": [
    {"label": "A", "pmid": "12345678"},
    {"label": "B", "pmid": "23456789"},
    {"label": "C", "pmid": "34567890"}
  ],
  "ground_truth": {
    "ranking": ["A", "C", "B"],
    "tiers": {"A": "direct", "B": "distal", "C": "indirect"}
  }
}

Study abstracts are backfilled into directness task files by setup_papers.py download. Fulltext papers for RoB and eligibility tasks are stored separately in data/paper_store/{pmid}.json and joined at evaluation time.
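The join described above can be sketched as follows, using the file layout documented in this README; the function name and the key added to the task dict are illustrative, not part of the harness's API.

```python
import json
from pathlib import Path

# Sketch: load a RoB/eligibility task instance and attach the parsed
# fulltext sections from data/paper_store/{pmid}.json.
def load_task_with_fulltext(task_path, store_dir="data/paper_store"):
    task = json.loads(Path(task_path).read_text())
    paper = json.loads((Path(store_dir) / f"{task['pmid']}.json").read_text())
    task["fulltext_sections"] = paper["fulltext"]["sections"]
    return task
```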

A bibliographic file (data/studies.ris) is included with RIS entries for all studies referenced in the benchmark.


Citation

Technical report forthcoming. For now, please cite the repository:

@misc{evidencesynthbench2026,
  title={EvidenceSynthBench: A Benchmark for Evidence Synthesis Reasoning},
  author={Begun, Michael W.},
  year={2026},
  url={https://github.com/blue-eclectus/EvidenceSynthBench}
}

Acknowledgements

EvidenceSynthBench is built using published, open-access reviews from Cochrane. Abstracts are sourced from PubMed. Open-access primary studies are sourced via Unpaywall / OpenAlex and parsed using GROBID.
