Add ClinicalJargonDataset and ClinicalJargonVerification benchmark task #941

Open
John-Carson wants to merge 4 commits into sunlabuiuc:master from John-Carson:cs598-clinical-jargon

@John-Carson

PyHealth PR Description

Summary

  • Adds ClinicalJargonDataset backed by public MedLingo and CASI benchmark assets.
  • Adds ClinicalJargonVerification, a binary candidate-verification task for public clinical jargon evaluation.
  • Adds docs, example usage, synthetic test resources, and unit tests.

Contributors

Contribution Type

  • Dataset + Task

Original Paper

Implementation Overview

  • ClinicalJargonDataset downloads and normalizes the public MedLingo and CASI assets into a PyHealth dataset.
  • ClinicalJargonVerification converts each benchmark item into paired-text binary classification samples over candidate expansions.
  • The example script demonstrates task-configuration ablations via the benchmark, casi_variant, and medlingo_distractors settings.
  • The tests use only synthetic/demo resources and validate dataset loading, patient parsing, task generation, and sample structure.
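
The conversion of a benchmark item into paired-text binary samples can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `JargonItem` fields, the `make_verification_samples` helper, and the sample keys (`text_a`, `text_b`, `label`) are hypothetical names, the demo row is made up, and the sketch only assumes that a distractor count such as `medlingo_distractors` caps how many incorrect candidates are paired with each item.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JargonItem:
    """Hypothetical benchmark item: context, correct expansion, distractors."""
    item_id: str
    context: str
    correct: str
    distractors: List[str]


def make_verification_samples(item: JargonItem, max_distractors: int = 1) -> List[Dict]:
    """Emit one positive sample plus up to `max_distractors` negative samples."""
    # Positive pair: the context with its correct expansion.
    samples = [{
        "item_id": item.item_id,
        "text_a": item.context,
        "text_b": item.correct,
        "label": 1,
    }]
    # Negative pairs: the same context with incorrect candidate expansions.
    for candidate in item.distractors[:max_distractors]:
        samples.append({
            "item_id": item.item_id,
            "text_a": item.context,
            "text_b": candidate,
            "label": 0,
        })
    return samples


# Made-up demo row, not taken from MedLingo or CASI.
item = JargonItem(
    item_id="demo-1",
    context="Pt c/o SOB on exertion.",
    correct="shortness of breath",
    distractors=["side of bed", "small bowel obstruction"],
)
samples = make_verification_samples(item, max_distractors=1)
for sample in samples:
    print(sample["label"], sample["text_b"])
```

Under this framing, raising the distractor cap produces more negative samples per item, which is one way an ablation over the number of distractors could change the class balance seen by the model.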

Files To Review

  • pyhealth/datasets/clinical_jargon.py
  • pyhealth/datasets/configs/clinical_jargon.yaml
  • pyhealth/tasks/clinical_jargon_verification.py
  • examples/clinical_jargon_clinical_jargon_verification_transformers.py
  • tests/core/test_clinical_jargon.py
  • docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst
  • docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst

Validation

  • python3 -m unittest discover -s 598-DLH/clinical_jargon_project/tests -p 'test_*.py'
  • PYTHONPATH=598-DLH/PyHealth python3 -m unittest 598-DLH/PyHealth/tests/core/test_clinical_jargon.py
  • python3 598-DLH/PyHealth/examples/clinical_jargon_clinical_jargon_verification_transformers.py --model-name hf-internal-testing/tiny-random-bert --benchmark medlingo --medlingo-distractors 1 --epochs 1 --batch-size 2
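
For readers who want to see how the flags in the last command might be wired up, here is a minimal argparse sketch. Only the flag names and the values passed above come from this PR description; the types, defaults, help text, and the `build_parser` helper are assumptions made for illustration, and the actual example script may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the validation command."""
    parser = argparse.ArgumentParser(
        description="Sketch of a CLI for a clinical jargon verification example."
    )
    # Defaults below are assumptions, chosen to match the command shown above.
    parser.add_argument("--model-name", default="hf-internal-testing/tiny-random-bert",
                        help="Hugging Face model identifier.")
    parser.add_argument("--benchmark", default="medlingo",
                        help="Which benchmark subset to use.")
    parser.add_argument("--medlingo-distractors", type=int, default=1,
                        help="Number of incorrect candidates per MedLingo item.")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=2)
    return parser


# Parse the same arguments as the validation command.
args = build_parser().parse_args([
    "--model-name", "hf-internal-testing/tiny-random-bert",
    "--benchmark", "medlingo",
    "--medlingo-distractors", "1",
    "--epochs", "1",
    "--batch-size", "2",
])
print(args)
```

argparse converts dashes in flag names to underscores, so `--medlingo-distractors` is read back as `args.medlingo_distractors`.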

Copilot AI review requested due to automatic review settings April 4, 2026 19:34

Copilot AI left a comment

Pull request overview

Adds a new public clinical jargon benchmark dataset and an associated binary verification task, plus supporting docs, example usage, and unit tests.

Changes:

  • Introduces ClinicalJargonDataset with normalized MedLingo + CASI metadata and a YAML dataset config.
  • Adds ClinicalJargonVerification task that generates paired-text binary samples over candidate expansions.
  • Adds a runnable Transformers example, Sphinx API docs, synthetic test resources, and unit tests.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pyhealth/datasets/clinical_jargon.py | Implements dataset normalization and (currently automatic) remote asset fetching. |
| pyhealth/datasets/configs/clinical_jargon.yaml | Declares the examples table schema for the dataset. |
| pyhealth/datasets/__init__.py | Exposes ClinicalJargonDataset at package import level. |
| pyhealth/tasks/clinical_jargon_verification.py | Implements the candidate-verification task and sample generation. |
| pyhealth/tasks/__init__.py | Exposes ClinicalJargonVerification at package import level. |
| examples/clinical_jargon_clinical_jargon_verification_transformers.py | Demonstrates training/evaluating a Transformers model on the task. |
| tests/core/test_clinical_jargon.py | Adds unit tests covering dataset/task loading and sample structure. |
| test-resources/clinical_jargon/clinical_jargon_examples.csv | Adds synthetic/demo benchmark rows used by tests/examples. |
| docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst | Adds Sphinx API stub for the dataset. |
| docs/api/datasets.rst | Adds dataset entry to the datasets API index. |
| docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst | Adds Sphinx API stub for the task. |
| docs/api/tasks.rst | Adds task entry to the tasks API index. |

@John-Carson John-Carson marked this pull request as draft April 4, 2026 19:50
@John-Carson John-Carson marked this pull request as ready for review April 9, 2026 23:07