
EvidenceSynthBench

A benchmark for evidence synthesis reasoning

Note

Under active development. The three current task categories are stable and ready for use. Additional categories are planned.

Overview

This benchmark evaluates AI agents on the expert reasoning tasks that underpin scientific evidence synthesis, focusing on systematic reviews. The current release includes 594 tasks across three evaluation categories: assessing risk of bias, determining study inclusion/exclusion, and judging inferential distance from studies to a hypothesis. For these categories, ground truth is drawn from 24 open-access Cochrane systematic reviews spanning 16 clinical areas.

Existing AI-for-science benchmarks targeting scientific literature tend to focus on information retrieval, comprehension, and extraction. EvidenceSynthBench isolates the evaluative reasoning needed to reliably synthesize complex bodies of evidence.

Future categories under development include GRADE assessment of bodies of evidence, heterogeneity assessment for meta-analysis, and external validity reasoning.


Task Types

Task          Input                               Judgement             Tasks
Risk of Bias  Fulltext paper + bias domain        Low / High / Unclear  317
Eligibility   Review objectives + fulltext paper  Include / Exclude     122
Directness    Hypothesis + 3 study abstracts      Most → least direct   155

Risk of Bias (RoB)

Given the full text of an open-access clinical study and a specific bias domain (e.g., "Allocation concealment"), the agent must produce a risk-of-bias judgement (Low / High / Unclear for RoB 1, or Low / High / Some concerns for RoB 2). Ground truth is drawn from Cochrane RoB assessments made with the RoB 1 or RoB 2 tool. Scored by judgement accuracy and weighted kappa.
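The weighted-kappa part of that scoring can be sketched as below. This is an illustrative implementation, not the harness's own: the category ordering and the quadratic disagreement weights are assumptions, and the benchmark's exact weighting scheme may differ.

```python
# Illustrative quadratic weighted kappa for ordinal risk-of-bias labels.
# Assumes the ordering Low risk < Unclear risk < High risk.
def weighted_kappa(truth, preds, categories=("Low risk", "Unclear risk", "High risk")):
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # Observed confusion counts
    obs = [[0.0] * k for _ in range(k)]
    for t, p in zip(truth, preds):
        obs[idx[t]][idx[p]] += 1
    n = len(truth)
    row_tot = [sum(obs[i]) for i in range(k)]
    col_tot = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic disagreement weight
            num += w * obs[i][j]                  # observed weighted disagreement
            den += w * row_tot[i] * col_tot[j] / n  # expected under independence
    return 1.0 - num / den if den else 1.0
```

Values near 1 indicate agreement well beyond chance; 0 is chance-level; negative values are worse than chance.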

Eligibility

Given a review's title and objectives, plus the full text of an open-access clinical study, the agent must decide whether the study should be included in or excluded from the review. The agent receives the review objectives but not the detailed inclusion/exclusion criteria listed in the review. This tests whether it can infer appropriate criteria rather than mechanically applying a given checklist. Scored by decision accuracy (Include/Exclude).

Directness

Given a hypothesis drawn from a Cochrane review and the abstracts of three thematically related studies, the agent must rank them from most to least direct — that is, by the inferential distance needed to connect each study's results to the hypothesis. Each task contains one direct study (included in the Cochrane review), one indirect study (excluded but topically related), and one distal study (more distant from the hypothesis but chosen to be superficially similar to it). Scored by exact-match and pairwise ranking accuracy.
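The two ranking metrics can be sketched as follows, assuming both the prediction and the ground truth are label sequences such as ["A", "C", "B"] (the harness's actual answer format may differ).

```python
from itertools import combinations

def exact_match(pred, truth):
    # 1.0 only when the full three-item ranking is reproduced exactly.
    return 1.0 if list(pred) == list(truth) else 0.0

def pairwise_accuracy(pred, truth):
    # Fraction of label pairs whose relative order agrees with the ground truth.
    pred_pos = {label: i for i, label in enumerate(pred)}
    truth_pos = {label: i for i, label in enumerate(truth)}
    pairs = list(combinations(truth, 2))
    correct = sum(
        (pred_pos[a] < pred_pos[b]) == (truth_pos[a] < truth_pos[b])
        for a, b in pairs
    )
    return correct / len(pairs)
```

With three studies there are three pairs, so pairwise accuracy takes values in {0, 1/3, 2/3, 1}; swapping only the indirect and distal studies, for instance, still scores 2/3.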


Ground-Truth Validation

Every task instance has been audited for validity. Tasks where the ground truth is unfair given the information available to the agent — due to reviewer reliance on unpublished data, criteria not stated in the review objectives, or obvious cases of reviewer oversight — were flagged and excluded.


Getting Started

Installation

Requires Python 3.11+.

git clone https://github.com/blue-eclectus/EvidenceSynthBench.git
cd EvidenceSynthBench
pip install -r requirements.txt

Set API keys in a .env file (only the keys for providers you plan to use):

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
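If you want to verify the file before running anything, a minimal stdlib-only sketch like the following can parse simple KEY=VALUE lines (this helper is not part of the harness, which presumably loads .env itself):

```python
import os
from pathlib import Path

# Hypothetical helper: load simple KEY=VALUE pairs from a .env file.
# Skips blank lines and comments; does not overwrite variables that are
# already set in the process environment.
def load_env(path: str = ".env") -> dict:
    env = {}
    p = Path(path)
    if not p.exists():
        return env
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return env
```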

Downloading Papers

Task instances reference papers by PMID. Abstracts and fulltext PDFs are not included in the repository and must be downloaded before running evaluations.

Step 1: Download abstracts (required for all tasks)

python3 setup_papers.py download --abstracts-only

This fetches abstracts from PubMed for all 450 studies and backfills them into directness task files. Takes a few minutes.

Step 2: Download open-access PDFs (required for RoB and eligibility tasks)

python3 setup_papers.py download

This fetches abstracts (if not already present) and downloads open-access PDFs via OpenAlex and PubMed Central. Some PDFs will not be available due to publisher restrictions on automated downloads; the script writes data/pdfs/missing.json listing any that could not be retrieved. You can download those manually, or import data/studies.ris into a citation manager (e.g. Zotero or EndNote) to bulk-download the missing PDFs, then place them at data/pdfs/{pmid}.pdf.
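A short sketch for checking which PDFs are still outstanding after manual retrieval. This assumes data/pdfs/missing.json is a flat JSON list of PMIDs; the actual schema in your checkout may differ, so treat this as illustrative.

```python
import json
from pathlib import Path

# Sketch (not part of the harness): report PMIDs listed in missing.json
# that still have no {pmid}.pdf placed in the PDF directory.
def missing_pdfs(pdf_dir: str = "data/pdfs") -> list[str]:
    pdf_path = Path(pdf_dir)
    listed = json.loads((pdf_path / "missing.json").read_text())
    return [pmid for pmid in listed if not (pdf_path / f"{pmid}.pdf").exists()]
```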

Step 3: Parse PDFs (required for RoB and eligibility tasks)

GROBID is the included PDF-parsing option, but you can use any PDF-to-text solution.

GROBID parsing option

Start a GROBID server:

docker run -d --name grobid -p 8070:8070 lfoppiano/grobid:0.8.0

Then parse the downloaded PDFs:

python3 setup_papers.py parse

This extracts structured fulltext from each PDF and stores it in the paper store. You can re-run both download and parse safely — they skip already-completed work.

Alternative PDF parsers

The evaluation harness reads fulltext from JSON files in data/paper_store/{pmid}.json. To use your own parser, populate these files with the following structure:

{
  "pmid": "12345678",
  "title": "Paper title",
  "doi": "10.1234/example",
  "abstract": "Abstract text...",
  "fulltext": {
    "sections": {
      "Abstract": "...",
      "1 Introduction": "...",
      "2 Methods": "..."
    },
    "source": "your-parser-name"
  }
}

The fulltext.sections dictionary should map section headings to their text content. After populating the JSON files, you can proceed directly to running evaluations.
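Writing one of these entries from your own parser's output can be sketched as below. The schema follows the structure documented above; the function and parser names are illustrative.

```python
import json
from pathlib import Path

# Sketch: serialise one paper-store entry in the documented layout.
# `sections` maps section headings to text; the remaining fields come
# from your bibliographic metadata.
def write_paper_store_entry(pmid, title, doi, abstract, sections,
                            store_dir="data/paper_store",
                            parser_name="my-parser"):
    entry = {
        "pmid": pmid,
        "title": title,
        "doi": doi,
        "abstract": abstract,
        "fulltext": {"sections": sections, "source": parser_name},
    }
    out_dir = Path(store_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{pmid}.json"
    out_path.write_text(json.dumps(entry, indent=2))
    return out_path
```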

Running Evaluations
# Evaluate on eligibility (public subset)
python3 evaluate.py --task-type eligibility \
  --model claude-haiku-4-5-20251001 \
  --subset eligibility

# Evaluate on RoB with extended thinking
python3 evaluate.py --task-type rob \
  --model claude-sonnet-4-5-20250929 \
  --subset rob \
  --extended-thinking

# Evaluate on directness
python3 evaluate.py --task-type directness \
  --model gpt-5-mini-2025-08-07 \
  --subset directness

# Quick test with a small sample
python3 evaluate.py --task-type eligibility \
  --model claude-haiku-4-5-20251001 \
  --subset eligibility \
  --limit 20

# Score an existing run without re-running
python3 evaluate.py --score-only RUN_ID

Use --seed N for reproducible task ordering across runs. Use --concurrency N (default 4) to control parallel API calls.

Custom Agents

You can evaluate your own agent by implementing a Python file with an Agent class:

class Agent:
    def assess(self, prompt: str) -> str | dict:
        """Return the model's response as a string or dict. Can be sync or async."""
        # Your implementation here
        return response_text

Then run:

python3 evaluate.py --task-type eligibility \
  --agent path/to/my_agent.py \
  --subset eligibility
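A deliberately trivial baseline agent, useful only for smoke-testing the harness wiring before plugging in a real model client, might look like this. The keyword heuristics are assumptions about the prompt wording, not guarantees about how the harness phrases its tasks.

```python
# Baseline Agent: never calls a model, just returns a fixed answer per
# apparent task type based on keywords in the prompt.
class Agent:
    def assess(self, prompt: str) -> str:
        text = prompt.lower()
        if "include" in text and "exclude" in text:
            return "Include"        # eligibility-style prompts
        if "risk of bias" in text:
            return "Unclear risk"   # RoB-style prompts
        return "A, B, C"            # directness fallback: a fixed ranking
```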

Supported Models

The evaluation harness includes built-in adapters for:

  • Anthropic: claude-* (supports extended thinking)
  • OpenAI: gpt-*, o1-*, o3-* (supports reasoning models)
  • Google: gemini-* (supports thinking config for 2.5 models)
  • OpenRouter: openrouter:provider/model

Data Format

RoB Task Instance
{
  "task_id": "CD006689_akinyotu-2018_allocation-concealment",
  "review_id": "CD006689",
  "study_name": "Akinyotu 2018",
  "pmid": "29719927",
  "doi": "10.1002/ijgo.12516",
  "rob_tool": "RoB1",
  "clinical_area": "infectious disease",
  "domain_name": "Allocation concealment (selection bias)",
  "outcome": null,
  "valid_judgements": ["Low risk", "Unclear risk", "High risk"],
  "ground_truth": {
    "judgement": "Low risk",
    "support": "The investigators were masked to allocation because the drugs were pre-packaged on..."
  }
}
Eligibility Task Instance
{
  "task_id": "CD007491_screening_caplan-1999",
  "task_type": "eligibility",
  "review_id": "CD007491",
  "review_title": "Admission avoidance hospital at home",
  "review_objectives": "To determine the effectiveness and cost of managing patients with...",
  "synthesis_type": "meta-analysis",
  "study_name": "Caplan 1999",
  "pmid": "16127109",
  "ground_truth": {
    "decision": "Include",
    "reason": null,
    "reason_category": null
  }
}
Directness Task Instance
{
  "task_id": "CD006689_directness_v2_001",
  "task_type": "directness",
  "review_id": "CD006689",
  "hypothesis": "Intermittent preventive treatment is effective for malaria prevention in...",
  "studies": [
    {"label": "A", "pmid": "12345678"},
    {"label": "B", "pmid": "23456789"},
    {"label": "C", "pmid": "34567890"}
  ],
  "ground_truth": {
    "ranking": ["A", "C", "B"],
    "tiers": {"A": "direct", "B": "distal", "C": "indirect"}
  }
}

Study abstracts are backfilled into directness task files by setup_papers.py download. Fulltext papers for RoB and eligibility tasks are stored separately in data/paper_store/{pmid}.json and joined at evaluation time.
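The join described above can be sketched as follows, using the file layout documented in this README; the function name and the key added to the task dict are illustrative, not part of the harness's API.

```python
import json
from pathlib import Path

# Sketch: load a RoB/eligibility task instance and attach the parsed
# fulltext sections from data/paper_store/{pmid}.json.
def load_task_with_fulltext(task_path, store_dir="data/paper_store"):
    task = json.loads(Path(task_path).read_text())
    paper = json.loads((Path(store_dir) / f"{task['pmid']}.json").read_text())
    task["fulltext_sections"] = paper["fulltext"]["sections"]
    return task
```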

A bibliographic file (data/studies.ris) is included with RIS entries for all studies referenced in the benchmark.


Citation

Technical report forthcoming. For now, please cite the repository:

@misc{evidencesynthbench2026,
  title={EvidenceSynthBench: A Benchmark for Evidence Synthesis Reasoning},
  author={Begun, Michael W.},
  year={2026},
  url={https://github.com/blue-eclectus/EvidenceSynthBench}
}

Acknowledgements

EvidenceSynthBench is built using published, open-access reviews from Cochrane. Abstracts are sourced from PubMed. Open-access primary studies are sourced via Unpaywall / OpenAlex and parsed using GROBID.
