Finley-Maple/DeathClaw

Death Prediction Pipeline

Predicts death after age 60 from patient features and disease history recorded before age 60, evaluated across four methods.

Methods

| # | Method | Description |
|---|--------|-------------|
| 1 | Delphi | Generative transformer for health trajectories; predicts the "Death" token probability. Evaluated with DeLong AUC. |
| 2 | Benchmarking (CoxPH) | CoxPH survival model on binary disease features + baseline biomarkers. Evaluated with C-index and time-dependent AUC. |
| 3 | Text Embedding + CoxPH | Convert disease history to natural language, embed with Qwen3-Embedding, combine with baselines, fit CoxPH. |
| 4 | Trajectory Embedding + CoxPH | Delphi-style token + age embeddings (sin/cos), pool across events, combine with baselines, fit CoxPH. |

Directory Structure

├── benchmarking/           # Survival data preprocessing & CoxPH training
│   ├── preprocess_diagnosis.py         # Extract disease features from UKB
│   ├── preprocess_survival.py          # Build survival dataset (event_flag, duration_days)
│   ├── autoprognosis_survival_dataset.csv  # Output: survival dataset
│   └── disease_before60_features.csv       # Output: binary disease flags
├── Delphi/                 # Delphi model, training & evaluation code
│   ├── model.py, train.py, utils.py    # Core Delphi code
│   └── evaluate_auc.py                 # AUC evaluation via DeLong
├── embedding/              # Embedding extraction (methods 3 & 4)
│   ├── qwen_embedding.py              # Qwen text-only embedding (method 3 tokens / method 4 texts)
│   └── trajectory_embedding.py        # Token+age embedding pipeline (method 4)
├── preprocessing/          # Preprocessing for embedding inputs
│   ├── generate_disease_trajectory.py  # Build age-at-diagnosis matrix (disease_trajectory.csv)
│   ├── generate_trajectory_text.py     # Convert matrix → Delphi-style trajectory text per patient
│   └── natural_text_conversion.py      # Convert tabular data → natural-language text per patient
├── evaluation/             # Unified evaluation & comparison
│   ├── cohort_split.py                 # Define shared train/val/test split
│   ├── evaluate_delphi.py              # Evaluate Delphi on shared cohort
│   ├── evaluate_benchmarking.py        # Train & evaluate CoxPH on shared cohort
│   ├── evaluate_embedding_survival.py  # Train & evaluate CoxPH on embeddings
│   └── unified_evaluation.py           # Compare all methods in one table
├── data/                   # Raw & processed data (gitignored)
├── UKB_extraction/         # UK Biobank data extraction tools
├── docs/                   # Proposals, references
└── run_pipeline.sh         # One-command pipeline runner (steps 1–7)

Quick Start (one command)

Run the entire pipeline end-to-end with run_pipeline.sh:

# Local / CPU: 10k sample, Qwen3-Embedding-0.6B (auto-selected)
bash run_pipeline.sh

# Full dataset, auto-selects model based on device
bash run_pipeline.sh --full

# GPU server: full dataset, 8B model
bash run_pipeline.sh --full --embedding-model Qwen/Qwen3-Embedding-8B

# Mid-range GPU: 4B model
bash run_pipeline.sh --full --embedding-model Qwen/Qwen3-Embedding-4B

# Skip preprocessing if data already exists
bash run_pipeline.sh --skip-preprocess --steps 5,6,7

# Skip Delphi (if no checkpoint available)
bash run_pipeline.sh --skip-delphi

Options:

| Flag | Description |
|------|-------------|
| `--full` | Use all participants instead of a 10k sample |
| `--sample-size N` | Custom sample size (default: 10000) |
| `--embedding-model MODEL` | Qwen3-Embedding-0.6B/4B/8B (auto-selected by device) |
| `--token-mode random\|qwen` | Trajectory token embedding mode (default: random) |
| `--skip-preprocess` | Skip steps 1-2 if the CSV files already exist |
| `--skip-delphi` | Skip Delphi evaluation |
| `--steps 1,2,3,...` | Run only specific steps |
| `--device cuda\|cpu` | Force device (auto-detected by default) |
| `--random-state N` | Random seed (default: 42) |

The script logs everything to pipeline_YYYYMMDD_HHMMSS.log and prints the comparison table at the end.

Embedding model load error: if you see `Can't load the configuration of 'Qwen/Qwen3-Embedding-0.6B'`, the code retries with a fresh download and then falls back to sentence-transformers if needed (install the fallback with `pip install sentence-transformers`). Also make sure no local folder named `Qwen` exists in the current working directory (it would shadow the Hugging Face model ID), and that you have network access to Hugging Face.
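The retry-then-fallback behavior described above can be sketched as the following function (names and parameters here are illustrative, not the repo's actual code):

```python
def load_embedder(model_name, hf_loader, st_loader, retries=1):
    """Try the transformers loader, retrying with a forced re-download,
    then fall back to sentence-transformers as a last resort."""
    for attempt in range(retries + 1):
        try:
            # On the retry, force a fresh download to bypass a corrupted cache.
            return hf_loader(model_name, force_download=attempt > 0)
        except OSError:
            continue
    # Requires: pip install sentence-transformers
    return st_loader(model_name)
```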

Pipeline (step by step)

Step 0: UKB data extraction

Extract raw UK Biobank data into data/. See UKB_extraction/ for tooling.

Step 1: Build survival dataset & disease features

These scripts produce the two CSV files that all downstream steps depend on.

python benchmarking/preprocess_diagnosis.py    # → disease_before60_features.csv
python benchmarking/preprocess_survival.py     # → autoprognosis_survival_dataset.csv (10k sample)
# Or use the full dataset:
python benchmarking/preprocess_survival.py --all

Step 2: Build disease trajectory matrix

Generate per-patient age-at-diagnosis for all diseases (needed by method 4).

python preprocessing/generate_disease_trajectory.py   # → data/preprocessed/disease_trajectory.csv
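Conceptually, the matrix has one row per patient and one column per disease code, holding age at diagnosis (NaN when never diagnosed). A minimal pandas sketch, with illustrative column names rather than the script's exact schema:

```python
import pandas as pd

# Long-format diagnoses: one row per (patient, disease) event.
long = pd.DataFrame({
    "eid":  [1, 1, 2],
    "code": ["G43", "E11", "G43"],
    "age_at_diagnosis": [20.3, 55.0, 48.9],
})
# Wide age-at-diagnosis matrix: one row per patient, one column per code,
# NaN where the patient was never diagnosed with that disease.
matrix = long.pivot(index="eid", columns="code", values="age_at_diagnosis")
```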

Step 3: Define shared cohort split

Create a single train/val/test split (70/15/15, stratified) used by all methods.

python evaluation/cohort_split.py              # → evaluation/cohort_split.json
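A 70/15/15 stratified split can be built with two nested `train_test_split` calls; this is a sketch of the idea, not necessarily `cohort_split.py`'s exact implementation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_cohort(patient_ids, event_flags, seed=42):
    """Split patients 70/15/15, stratified on the death event flag."""
    # First carve off the 70% train set.
    train_ids, rest_ids, _, rest_flags = train_test_split(
        patient_ids, event_flags, train_size=0.70,
        stratify=event_flags, random_state=seed)
    # Split the remaining 30% in half: 15% val, 15% test.
    val_ids, test_ids = train_test_split(
        rest_ids, train_size=0.50, stratify=rest_flags, random_state=seed)
    return train_ids, val_ids, test_ids

ids = np.arange(1000)
flags = (ids % 5 == 0).astype(int)  # 20% events, for illustration
train, val, test = split_cohort(ids, flags)
```

Stratifying both splits on the event flag keeps the death rate comparable across train, val, and test.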

Step 4: Generate embedding inputs & Delphi binary data

Method 3 (text embedding): convert trajectory data to disease-history text with age at diagnosis (e.g. "At age 20.3, patient was diagnosed with G43 migraine."). Only disease events are included; demographics and biomarkers are added as numeric features during survival model training.

python preprocessing/natural_text_conversion.py \
    --trajectory-csv data/preprocessed/disease_trajectory.csv \
    --output-csv     data/preprocessed/text_before60.csv \
    --output-dir     data/preprocessed/text_before60
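The sentence template above can be sketched as a small conversion function (disease names and the exact wording are illustrative; the real logic lives in `preprocessing/natural_text_conversion.py`):

```python
def row_to_text(age_at_diagnosis, code_names, cutoff=60.0):
    """Render one patient's pre-60 diagnoses as chronologically ordered sentences."""
    events = sorted((age, code) for code, age in age_at_diagnosis.items()
                    if age is not None and age < cutoff)
    return " ".join(
        f"At age {age:.1f}, patient was diagnosed with {code} {code_names[code]}."
        for age, code in events)

names = {"G43": "migraine", "E11": "type 2 diabetes"}
# I21 has no pre-60 diagnosis age and is skipped.
text = row_to_text({"G43": 20.3, "E11": 55.0, "I21": None}, names)
```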

Method 4 (trajectory embedding): convert trajectory matrix to Delphi-style text.

python preprocessing/generate_trajectory_text.py \
    --output-csv  data/preprocessed/trajectory_before60.csv \
    --output-dir  data/preprocessed/trajectory_before60

Method 1 (Delphi binary data): convert trajectory + demographics to Delphi binary format, aligned with the shared cohort split.

python Delphi/preprocess_delphi_binary.py \
    --output-dir  Delphi/data/ukb_respiratory_data

This generates train.bin, val.bin, test.bin using the same patient splits as all other methods.
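As a hypothetical sketch, a flat nanoGPT-style `.bin` file is just concatenated token IDs dumped to disk; the actual dtype and layout are defined by `Delphi/preprocess_delphi_binary.py`:

```python
import numpy as np
import os
import tempfile

def write_bin(sequences, path, dtype=np.uint16):
    """Concatenate per-patient token sequences and dump them as one raw binary file."""
    flat = np.concatenate([np.asarray(s, dtype=dtype) for s in sequences])
    flat.tofile(path)
    return flat.size

# Round-trip two tiny token sequences through a demo file.
path = os.path.join(tempfile.mkdtemp(), "train_demo.bin")
n_tokens = write_bin([[1, 2, 3], [4, 5]], path)
back = np.fromfile(path, dtype=np.uint16)
```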

Step 5: Compute embeddings

Method 3: embed natural-language texts with Qwen3-Embedding.

# GPU server (8B, 4096-dim):
python embedding/qwen_embedding.py \
    --input-csv   data/preprocessed/text_before60.csv \
    --output-dir  data/preprocessed/embeddings_text \
    --model-name  Qwen/Qwen3-Embedding-8B

# Local / CPU (0.6B, 1024-dim):
python embedding/qwen_embedding.py \
    --input-csv   data/preprocessed/text_before60.csv \
    --output-dir  data/preprocessed/embeddings_text \
    --model-name  Qwen/Qwen3-Embedding-0.6B \
    --no-flash-attn

Method 4: embed trajectory token+age vectors.

# Random token embeddings (CPU, for testing):
python embedding/trajectory_embedding.py \
    --input-csv   data/preprocessed/trajectory_before60.csv \
    --output-dir  data/preprocessed/embeddings_traj

# Or with Qwen token embeddings (GPU):
python embedding/trajectory_embedding.py \
    --input-csv   data/preprocessed/trajectory_before60.csv \
    --output-dir  data/preprocessed/embeddings_traj \
    --token-mode  qwen
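The token + age (sin/cos) embedding with event pooling can be sketched as follows; the embedding dimension, frequency base, and mean pooling are assumptions for illustration, not the pipeline's exact configuration:

```python
import numpy as np

def age_encoding(age_years, dim=64):
    """Sinusoidal (sin/cos) encoding of age, analogous to positional encoding."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000.0 ** (2.0 * i / dim))
    angles = age_years * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def embed_trajectory(event_tokens, event_ages, token_table, dim=64):
    """Sum token and age embeddings per event, then mean-pool across events."""
    vecs = [token_table[tok] + age_encoding(age, dim)
            for tok, age in zip(event_tokens, event_ages)]
    return np.mean(vecs, axis=0)  # one fixed-size vector per patient

rng = np.random.default_rng(42)
token_table = rng.normal(size=(100, 64))  # analogue of --token-mode random
patient_vec = embed_trajectory([3, 17, 42], [20.3, 41.7, 55.0], token_table)
```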

Step 6: Train & evaluate each method

Each evaluation script trains on the shared train split and evaluates on val/test.

# Method 1: Delphi (requires step 4 Delphi binary data)
python evaluation/evaluate_delphi.py \
    --split test \
    --save-preds \
    --horizons-days 365 1825

# Method 2: Benchmarking (CoxPH on binary disease features)
python evaluation/evaluate_benchmarking.py \
    --baseline-mode all

# Method 3: Text Embedding + CoxPH
python evaluation/evaluate_embedding_survival.py \
    --embedding-dir data/preprocessed/embeddings_text \
    --tag patient \
    --method-name text_embedding \
    --baseline-mode all

# Method 4: Trajectory Embedding + CoxPH
python evaluation/evaluate_embedding_survival.py \
    --embedding-dir data/preprocessed/embeddings_traj \
    --tag trajectory \
    --method-name trajectory_embedding \
    --baseline-mode none

Each evaluator shares the same CLI options:

- `--baseline-mode {all,none,custom}` (with `--baseline-cols ...`) to control which survival covariates are concatenated with embeddings.
- `--survival-csv` / `--cohort-json` to point at alternative datasets or splits.
- `--save-preds` to dump per-split risk scores in `evaluation/*/predictions/`.
- Delphi adds `--horizons-days` to override the default quantile-based horizons and aligns its risk outputs with the shared survival metrics.

Step 7: Unified comparison

python evaluation/unified_evaluation.py        # → evaluation/unified_comparison.csv

Requirements

Python 3.9+. Install all dependencies at once:

pip install -r requirements.txt

For GPU (CUDA), install PyTorch with CUDA first:

pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Key dependencies and what uses them:

| Package | Version | Used by |
|---------|---------|---------|
| torch | >=2.4.0 | Delphi, Qwen3-Embedding |
| transformers | >=4.51.0 | Qwen3-Embedding (methods 3 & 4) |
| lifelines | >=0.27.0 | CoxPH models (methods 2, 3, 4) |
| scikit-survival | >=0.22.0 | Time-dependent AUC evaluation |
| numpy | >=1.24.0 | All components |
| pandas | >=2.0.0 | All components |

See also: Delphi/requirements.txt (original Delphi deps), embedding/requirements_qwen.txt (Qwen-specific).

TODOs

  1. traj_before60: add hints for LLM to understand the text
  2. text_before60: add timepoints for alcohol and smoking
