feat: Add HiCu model and MIMIC4ICD10Coding task for ICD-10 coding #947
matthew-ardi wants to merge 3 commits into sunlabuiuc:master
Pull request overview
This PR adds an end-to-end ICD-10 coding pipeline for MIMIC-IV by introducing a new task for extracting discharge-note text + ICD-10 labels, and a new HiCu model implementing hierarchical curriculum learning over an ICD-10 hierarchy.
Changes:
- Added `MIMIC4ICD10Coding` task with simple tokenization and ICD-10-only filtering.
- Added `HiCu` model (MultiResCNN encoder + hierarchical decoder + asymmetric loss) with ICD-10 hierarchy utilities.
- Added tests, API docs entries, and an example script demonstrating curriculum vs flat training.
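As a rough illustration of the first bullet, the sketch below shows one way simple tokenization and ICD-10-only filtering could look. The names `tokenize_simple` and `keep_icd10` and the token regex are hypothetical and are not taken from this PR. The only outside fact assumed is that MIMIC-IV's `diagnoses_icd` table carries an `icd_version` column with values 9 or 10.

```python
import re

# Hypothetical token pattern; the PR's _tokenize_clinical_text may differ.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize_simple(text):
    """Lowercase the note and keep runs of letters and digits as tokens."""
    return _TOKEN_RE.findall(text.lower())


def keep_icd10(rows):
    """Return codes from (icd_code, icd_version) rows whose version is 10."""
    # str() tolerates the version being stored as either int or string.
    return [code for code, version in rows if str(version) == "10"]


tokens = tokenize_simple("Pt admitted w/ CHF; BP 120/80.")
codes = keep_icd10([("I5020", 10), ("4280", 9), ("E119", "10")])
```

Comparing the version as a string is a defensive choice for this sketch, since the column may arrive as either an integer or text depending on how the table was loaded.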
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/core/test_mimic4_icd10_coding.py | Unit tests covering sample extraction, ICD-10 filtering, dedup, and tokenization behavior. |
| tests/core/test_hicu.py | Unit tests covering HiCu initialization, forward/backward, depth switching, weight transfer, ASL, and hierarchy builder. |
| pyhealth/tasks/medical_coding.py | Implements MIMIC4ICD10Coding task and _tokenize_clinical_text helper. |
| pyhealth/tasks/__init__.py | Exposes MIMIC4ICD10Coding in the public tasks namespace. |
| pyhealth/models/hicu.py | Adds the HiCu model, hierarchical decoder, encoder, ASL, and ICD-10 hierarchy utilities. |
| pyhealth/models/__init__.py | Exposes HiCu-related classes in the public models namespace. |
| examples/mimic4_icd10_coding_hicu.py | End-to-end runnable example for synthetic + real MIMIC-IV, including curriculum experiments. |
| docs/api/tasks/pyhealth.tasks.MIMIC4ICD10Coding.rst | New API doc page for the ICD-10 coding task. |
| docs/api/tasks.rst | Adds the ICD-10 coding task to the tasks API index. |
| docs/api/models/pyhealth.models.HiCu.rst | New API doc page for HiCu and its components. |
| docs/api/models.rst | Adds HiCu to the models API index. |
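The table mentions depth switching and a hierarchy builder. A minimal sketch of the general idea follows, assuming the hierarchy is derived from code prefixes (1 character, 3 characters, then the full code). The prefix lengths and the function names `ancestor_at_depth` and `remap_labels` are assumptions for illustration and may not match the PR's hierarchy utilities.

```python
def ancestor_at_depth(code, depth, prefix_lengths=(1, 3)):
    """Map a full ICD-10 code to its ancestor at `depth` (0-based).

    Depths past the end of `prefix_lengths` return the full code itself.
    """
    if depth < len(prefix_lengths):
        return code[: prefix_lengths[depth]]
    return code


def remap_labels(leaf_codes, depth):
    """A coarse label is positive if any of its descendant leaves is positive."""
    return sorted({ancestor_at_depth(code, depth) for code in leaf_codes})


leaves = ["I5020", "I10", "E119"]
coarse = remap_labels(leaves, 0)  # one-character groups
middle = remap_labels(leaves, 1)  # three-character categories
full = remap_labels(leaves, 2)    # full codes
```

Training on the coarse labels first and moving to deeper levels later is the curriculum idea in outline; how the PR transfers weights between depths is not shown here.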
pyhealth/tasks/medical_coding.py

    {
        "patient_id": patient.patient_id,
        "text": tokens,
        "icd_codes": list(set(icd_codes)),

`icd_codes` is deduplicated via `list(set(icd_codes))`, which produces a non-deterministic ordering across runs/processes. This can make sample outputs flaky and harder to debug. Consider using a deterministic dedup (e.g., `sorted(set(icd_codes))` or an order-preserving dedup) before returning the sample.

Suggested change:

    - "icd_codes": list(set(icd_codes)),
    + "icd_codes": list(dict.fromkeys(icd_codes)),

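The two deterministic options named in the comment above behave differently, which the short sketch below tries to make concrete. The helper names and the sample codes are illustrative only. `dict.fromkeys` keeps first-seen order because Python dictionaries preserve insertion order (guaranteed since Python 3.7).

```python
def dedup_preserve_order(codes):
    """Drop repeats while keeping the order in which codes first appear."""
    return list(dict.fromkeys(codes))


def dedup_sorted(codes):
    """Drop repeats and return the codes in lexicographic order."""
    return sorted(set(codes))


codes = ["I10", "E119", "I10", "A419", "E119"]
first_seen = dedup_preserve_order(codes)
alphabetical = dedup_sorted(codes)
```

Order-preserving dedup keeps whatever ordering the source rows carried, whereas sorting discards it in exchange for an ordering that does not depend on the input order at all.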

    y_true_full = kwargs[self.label_key].to(self.device).float()
    y_true = self._remap_labels(y_true_full, self.current_depth)
    loss = self.asl_loss(logits, y_true)
    y_prob = torch.sigmoid(logits)

    return {"loss": loss, "y_prob": y_prob, "y_true": y_true, "logit": logits}

`HiCu.forward` unconditionally reads `kwargs[self.label_key]` and always returns `loss`/`y_true`. This deviates from the BaseModel forward contract used by other models in the repo (`loss`/`y_true` only when labels are provided) and will raise a `KeyError` for inference-only calls. Consider always returning `{logit, y_prob}`, and only computing and adding `{loss, y_true}` when `self.label_key in kwargs`.

Suggested change:

    - y_true_full = kwargs[self.label_key].to(self.device).float()
    - y_true = self._remap_labels(y_true_full, self.current_depth)
    - loss = self.asl_loss(logits, y_true)
    - y_prob = torch.sigmoid(logits)
    - return {"loss": loss, "y_prob": y_prob, "y_true": y_true, "logit": logits}
    + y_prob = torch.sigmoid(logits)
    + results = {"y_prob": y_prob, "logit": logits}
    + if self.label_key in kwargs:
    +     y_true_full = kwargs[self.label_key].to(self.device).float()
    +     y_true = self._remap_labels(y_true_full, self.current_depth)
    +     loss = self.asl_loss(logits, y_true)
    +     results["loss"] = loss
    +     results["y_true"] = y_true
    + return results

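The control flow in this suggestion can be exercised without PyTorch. The sketch below is a framework-free stand-in, not the PR's `HiCu` class: `ToyModel`, its plain binary cross-entropy placeholder loss, and the list-based inputs are all assumptions made so the example runs on the standard library alone.

```python
import math


class ToyModel:
    """Framework-free stand-in used only to show the optional-label contract."""

    label_key = "icd_codes"

    def _loss(self, probs, labels):
        # Plain binary cross-entropy, standing in for the PR's asymmetric loss.
        eps = 1e-8
        terms = [
            -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
            for p, y in zip(probs, labels)
        ]
        return sum(terms) / len(terms)

    def forward(self, logits, **kwargs):
        # Predictions are always returned.
        y_prob = [1.0 / (1.0 + math.exp(-z)) for z in logits]
        results = {"y_prob": y_prob, "logit": logits}
        # Loss and labels are added only when labels were passed in.
        if self.label_key in kwargs:
            y_true = [float(y) for y in kwargs[self.label_key]]
            results["loss"] = self._loss(y_prob, y_true)
            results["y_true"] = y_true
        return results


model = ToyModel()
inference = model.forward([2.0, -1.0])
training = model.forward([2.0, -1.0], icd_codes=[1, 0])
```

With this shape, an inference-only call returns predictions instead of raising `KeyError`, and a training call returns the same keys plus `loss` and `y_true`.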
…ing, deterministic ordering, visit_id, and memory-efficient label mappings

Agent-Logs-Url: https://github.com/matthew-ardi/PyHealth/sessions/4752d079-651a-4fe1-9faa-3a2025813f50
Co-authored-by: matthew-ardi <25186507+matthew-ardi@users.noreply.github.com>
Summary
- `HiCu` model implementing hierarchical curriculum learning for automated ICD coding (Ren et al., ML4H 2022)
- `MIMIC4ICD10Coding` task for extracting discharge notes and ICD-10 diagnosis codes from MIMIC-IV

Contribution type: Full Pipeline (Task + Model)
Paper: https://arxiv.org/abs/2208.02301
Design decisions
Test plan
- `pytest tests/core/test_hicu.py -v`
- `pytest tests/core/test_mimic4_icd10_coding.py -v`
- `python examples/mimic4_icd10_coding_hicu.py` (runs end-to-end on synthetic data)