feat: Add HiCu model and MIMIC4ICD10Coding task for ICD-10 coding by matthew-ardi · Pull Request #947 · sunlabuiuc/PyHealth

matthew-ardi · 2026-04-05T20:37:32Z

Summary

Adds HiCu model implementing hierarchical curriculum learning for automated ICD coding (Ren et al., ML4H 2022)
Adds MIMIC4ICD10Coding task for extracting discharge notes and ICD-10 diagnosis codes from MIMIC-IV
Includes end-to-end example script with training experiments comparing curriculum vs flat training and ASL vs BCE loss
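As background for the ASL vs BCE comparison, here is a minimal per-label sketch of asymmetric loss in its commonly published form, next to plain binary cross-entropy. It is not the PR's implementation: the function names, the scalar (non-tensor) form, and the default values of `gamma_pos`, `gamma_neg`, and `margin` are assumptions chosen for illustration.

```python
import math


def _sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def bce_loss(logit, target, eps=1e-8):
    """Plain binary cross-entropy for one label."""
    p = _sigmoid(logit)
    if target == 1:
        return -math.log(max(p, eps))
    return -math.log(max(1.0 - p, eps))


def asymmetric_loss(logit, target, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Asymmetric loss for one label (illustrative defaults, not the PR's)."""
    p = _sigmoid(logit)
    if target == 1:
        # Positive labels: focal-style down-weighting with gamma_pos.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative labels: shift the probability down by the margin, so very
    # easy negatives contribute exactly zero, then apply gamma_neg.
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
```

With both gammas and the margin set to zero, this reduces to BCE. With the defaults above, negatives are down-weighted relative to BCE, which is the usual motivation for using it when most labels are absent from any given note.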

Contribution type: Full Pipeline (Task + Model)
Paper: https://arxiv.org/abs/2208.02301

Design decisions

ICD-10 instead of ICD-9: The original paper uses ICD-9 on MIMIC-III. I adapted it to ICD-10 on MIMIC-IV since that's the current coding standard and MIMIC-IV is the newer dataset. The hierarchy builder uses the 22 ICD-10-CM chapters as depth 0, 3-character categories as depth 1, and full codes as depth 2.
3 hierarchy depths instead of 5: The paper uses 5 depths for ICD-9. I found 3 depths sufficient for ICD-10, and this keeps the code simpler without losing the core curriculum learning idea. A follow-up can extend this to 5 depths if they prove beneficial.
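The three depths described above can be sketched roughly as follows. This is an illustration, not the hierarchy builder in `pyhealth/models/hicu.py`: `CHAPTER_RANGES` is a deliberately partial chapter table, and `code_path` and `labels_at_depth` are hypothetical names.

```python
# Illustrative subset of ICD-10-CM chapter ranges; the real table has more rows.
CHAPTER_RANGES = [
    ("A00", "B99"),
    ("C00", "D49"),
    ("D50", "D89"),
    ("E00", "E89"),
    ("I00", "I99"),
    ("J00", "J99"),
]


def code_path(code):
    """Return [chapter, category, full_code] for one ICD-10 code."""
    code = code.replace(".", "").upper()  # accept dotted or undotted codes
    category = code[:3]                   # depth 1: 3-character category
    chapter = "UNKNOWN"                   # depth 0: chapter range
    for low, high in CHAPTER_RANGES:
        if low <= category <= high:       # lexicographic range check
            chapter = f"{low}-{high}"
            break
    return [chapter, category, code]      # depth 2: full code


def labels_at_depth(codes, depth):
    """Project full codes onto depth 0, 1, or 2, deduplicated and sorted."""
    return sorted({code_path(c)[depth] for c in codes})
```

For example, `labels_at_depth(["I5023", "I509", "E119"], 1)` collapses the two heart-failure codes into a single `I50` label, which is the kind of coarser target an earlier curriculum stage would train on.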

Test plan

pytest tests/core/test_hicu.py -v
pytest tests/core/test_mimic4_icd10_coding.py -v
python examples/mimic4_icd10_coding_hicu.py — runs end-to-end on synthetic data
Tested locally on MIMIC-IV in dev mode (1000 patients, 907 samples) with the MPS backend

Copilot

Pull request overview

This PR adds an end-to-end ICD-10 coding pipeline for MIMIC-IV by introducing a new task for extracting discharge-note text + ICD-10 labels, and a new HiCu model implementing hierarchical curriculum learning over an ICD-10 hierarchy.

Changes:

Added MIMIC4ICD10Coding task with simple tokenization and ICD-10-only filtering.
Added HiCu model (MultiResCNN encoder + hierarchical decoder + asymmetric loss) with ICD-10 hierarchy utilities.
Added tests, API docs entries, and an example script demonstrating curriculum vs flat training.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

File	Description
tests/core/test_mimic4_icd10_coding.py	Unit tests covering sample extraction, ICD-10 filtering, dedup, and tokenization behavior.
tests/core/test_hicu.py	Unit tests covering HiCu initialization, forward/backward, depth switching, weight transfer, ASL, and hierarchy builder.
pyhealth/tasks/medical_coding.py	Implements `MIMIC4ICD10Coding` task and `_tokenize_clinical_text` helper.
pyhealth/tasks/__init__.py	Exposes `MIMIC4ICD10Coding` in the public tasks namespace.
pyhealth/models/hicu.py	Adds the HiCu model, hierarchical decoder, encoder, ASL, and ICD-10 hierarchy utilities.
pyhealth/models/__init__.py	Exposes HiCu-related classes in the public models namespace.
examples/mimic4_icd10_coding_hicu.py	End-to-end runnable example for synthetic + real MIMIC-IV, including curriculum experiments.
docs/api/tasks/pyhealth.tasks.MIMIC4ICD10Coding.rst	New API doc page for the ICD-10 coding task.
docs/api/tasks.rst	Adds the ICD-10 coding task to the tasks API index.
docs/api/models/pyhealth.models.HiCu.rst	New API doc page for HiCu and its components.
docs/api/models.rst	Adds HiCu to the models API index.
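The "weight transfer" covered by the HiCu tests refers, as I understand the paper, to initializing each label's decoder parameters at a new depth from its parent's parameters at the previous depth. A minimal sketch under that assumption, using plain dictionaries instead of tensors (`init_child_weights` and both argument shapes are hypothetical, not the PR's API):

```python
import copy


def init_child_weights(parent_weights, child_to_parent):
    """Give every child label its own copy of its parent's weight vector.

    parent_weights: dict mapping parent label -> list of floats
    child_to_parent: dict mapping child label -> parent label
    """
    return {
        child: copy.copy(parent_weights[parent])
        for child, parent in child_to_parent.items()
    }


# Depth-1 categories start from their depth-0 chapter's weights.
chapter_weights = {"I00-I99": [0.1, -0.2], "E00-E89": [0.3, 0.0]}
category_parent = {"I50": "I00-I99", "I10": "I00-I99", "E11": "E00-E89"}
category_weights = init_child_weights(chapter_weights, category_parent)
```

Copying rather than sharing the vectors matters here: siblings start identical but can diverge during training at the finer depth.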

Copilot · 2026-04-05T20:46:25Z

pyhealth/tasks/medical_coding.py

+                {
+                    "patient_id": patient.patient_id,
+                    "text": tokens,
+                    "icd_codes": list(set(icd_codes)),


icd_codes is deduplicated via list(set(icd_codes)), which produces a non-deterministic ordering across runs/processes. This can make sample outputs flaky and harder to debug. Consider using a deterministic dedup (e.g., sorted(set(icd_codes)) or an order-preserving dedup) before returning the sample.
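The two deterministic options named in this comment give different orderings, which a small example with made-up codes illustrates:

```python
icd_codes = ["I10", "E119", "I10", "N179", "E119"]

# Order-preserving dedup: keeps the first occurrence of each code.
first_seen = list(dict.fromkeys(icd_codes))

# Sorted dedup: order no longer depends on the input sequence.
alphabetical = sorted(set(icd_codes))
```

`dict.fromkeys` keeps the order in which codes appear in the record (dictionaries preserve insertion order in Python 3.7+), while `sorted(set(...))` yields the same list regardless of input order. Either removes the run-to-run variation of `list(set(...))`.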

Suggested change

-                    "icd_codes": list(set(icd_codes)),
+                    "icd_codes": list(dict.fromkeys(icd_codes)),

Copilot · 2026-04-05T20:46:25Z

pyhealth/models/hicu.py

+
+        y_true_full = kwargs[self.label_key].to(self.device).float()
+        y_true = self._remap_labels(y_true_full, self.current_depth)
+        loss = self.asl_loss(logits, y_true)
+        y_prob = torch.sigmoid(logits)
+
+        return {"loss": loss, "y_prob": y_prob, "y_true": y_true, "logit": logits}


HiCu.forward unconditionally reads kwargs[self.label_key] and always returns loss/y_true. This deviates from the BaseModel forward contract used by other models in the repo (loss/y_true only when labels are provided) and will raise a KeyError for inference-only calls. Consider returning {logit, y_prob} always, and only computing/adding {loss, y_true} when self.label_key in kwargs.

Suggested change

-        y_true_full = kwargs[self.label_key].to(self.device).float()
-        y_true = self._remap_labels(y_true_full, self.current_depth)
-        loss = self.asl_loss(logits, y_true)
-        y_prob = torch.sigmoid(logits)
-
-        return {"loss": loss, "y_prob": y_prob, "y_true": y_true, "logit": logits}
+        y_prob = torch.sigmoid(logits)
+        results = {"y_prob": y_prob, "logit": logits}
+        if self.label_key in kwargs:
+            y_true_full = kwargs[self.label_key].to(self.device).float()
+            y_true = self._remap_labels(y_true_full, self.current_depth)
+            loss = self.asl_loss(logits, y_true)
+            results["loss"] = loss
+            results["y_true"] = y_true
+        return results
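A regression check for this contract could look something like the sketch below. It uses a torch-free stand-in so it runs on its own: `ToyModel` is hypothetical, lists replace tensors, and plain binary cross-entropy stands in for ASL. Only the control flow (labels optional, `loss` and `y_true` conditional) mirrors the suggestion.

```python
import math


class ToyModel:
    """Hypothetical stand-in that follows the suggested forward contract."""

    label_key = "icd_codes"

    def forward(self, logits, **kwargs):
        y_prob = [1.0 / (1.0 + math.exp(-z)) for z in logits]
        results = {"y_prob": y_prob, "logit": logits}
        if self.label_key in kwargs:
            y_true = [float(v) for v in kwargs[self.label_key]]
            # Plain BCE here; the real model would call its ASL loss.
            loss = -sum(
                t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
                for t, p in zip(y_true, y_prob)
            ) / len(y_true)
            results["loss"] = loss
            results["y_true"] = y_true
        return results


model = ToyModel()
inference = model.forward([0.2, -1.3])                   # no labels: no KeyError
training = model.forward([0.2, -1.3], icd_codes=[1, 0])  # labels: loss included
```

The inference call returns only `y_prob` and `logit`, while the training call additionally returns `loss` and `y_true`.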

…ing, deterministic ordering, visit_id, and memory-efficient label mappings

Agent-Logs-Url: https://github.com/matthew-ardi/PyHealth/sessions/4752d079-651a-4fe1-9faa-3a2025813f50
Co-authored-by: matthew-ardi <25186507+matthew-ardi@users.noreply.github.com>

matthew-ardi added 2 commits April 5, 2026 15:27

HiCu

770363d

revert

df7d117

matthew-ardi marked this pull request as ready for review April 5, 2026 20:40

Copilot AI review requested due to automatic review settings April 5, 2026 20:40

matthew-ardi marked this pull request as draft April 5, 2026 20:41

Copilot started reviewing on behalf of matthew-ardi April 5, 2026 20:41

Copilot AI reviewed Apr 5, 2026

matthew-ardi wants to merge 3 commits into sunlabuiuc:master from matthew-ardi:hicu_auto_icd_coding

3 participants