Predict death after age 60 using patient features and disease history before age 60, evaluated across four methods.
| # | Method | Description |
|---|---|---|
| 1 | Delphi | Generative transformer for health trajectories; predicts "Death" token probability. Evaluated with DeLong AUC. |
| 2 | Benchmarking (CoxPH) | CoxPH survival model on binary disease features + baseline biomarkers. Evaluated with C-index and time-dependent AUC. |
| 3 | Text Embedding + CoxPH | Convert disease history to natural language, embed with Qwen3-Embedding, combine with baselines, fit CoxPH. |
| 4 | Trajectory Embedding + CoxPH | Delphi-style token + age embeddings (sin/cos), pool across events, combine with baselines, fit CoxPH. |
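The token+age encoding in method 4 can be sketched roughly as follows. This is a minimal illustration only; the embedding dimension, the frequency base of the sinusoids, and mean pooling are assumed choices, and the actual implementation lives in `embedding/trajectory_embedding.py`:

```python
import numpy as np

def age_encoding(age_years, dim=64):
    """Transformer-style sinusoidal encoding of age: sin/cos at geometric
    frequencies (base 10000 is an assumed choice, not Delphi's documented one)."""
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = age_years * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def pool_trajectory(token_vecs, ages):
    """Add an age encoding to each event's token embedding, then mean-pool
    across events to get one fixed-size vector per patient."""
    dim = token_vecs.shape[1]
    age_vecs = np.stack([age_encoding(a, dim) for a in ages])
    return (token_vecs + age_vecs).mean(axis=0)

# Toy patient: 3 disease events with random 64-dim token embeddings
rng = np.random.default_rng(0)
patient_vec = pool_trajectory(rng.normal(size=(3, 64)), np.array([20.3, 41.0, 55.7]))
print(patient_vec.shape)  # (64,)
```

The pooled vector is what gets concatenated with baseline features and fed to CoxPH in method 4.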
```
├── benchmarking/                          # Survival data preprocessing & CoxPH training
│   ├── preprocess_diagnosis.py            # Extract disease features from UKB
│   ├── preprocess_survival.py             # Build survival dataset (event_flag, duration_days)
│   ├── autoprognosis_survival_dataset.csv # Output: survival dataset
│   └── disease_before60_features.csv      # Output: binary disease flags
├── Delphi/                                # Delphi model, training & evaluation code
│   ├── model.py, train.py, utils.py       # Core Delphi code
│   └── evaluate_auc.py                    # AUC evaluation via DeLong
├── embedding/                             # Embedding extraction (methods 3 & 4)
│   ├── qwen_embedding.py                  # Qwen text-only embedding (method 3 texts / method 4 tokens)
│   └── trajectory_embedding.py            # Token+age embedding pipeline (method 4)
├── preprocessing/                         # Preprocessing for embedding inputs
│   ├── generate_disease_trajectory.py     # Build age-at-diagnosis matrix (disease_trajectory.csv)
│   ├── generate_trajectory_text.py        # Convert matrix → Delphi-style trajectory text per patient
│   └── natural_text_conversion.py         # Convert tabular data → natural-language text per patient
├── evaluation/                            # Unified evaluation & comparison
│   ├── cohort_split.py                    # Define shared train/val/test split
│   ├── evaluate_delphi.py                 # Evaluate Delphi on shared cohort
│   ├── evaluate_benchmarking.py           # Train & evaluate CoxPH on shared cohort
│   ├── evaluate_embedding_survival.py     # Train & evaluate CoxPH on embeddings
│   └── unified_evaluation.py              # Compare all methods in one table
├── data/                                  # Raw & processed data (gitignored)
├── UKB_extraction/                        # UK Biobank data extraction tools
├── docs/                                  # Proposals, references
└── run_pipeline.sh                        # One-command pipeline runner (steps 1–7)
```
Run the entire pipeline end-to-end with `run_pipeline.sh`:

```bash
# Local / CPU: 10k sample, Qwen3-Embedding-0.6B (auto-selected)
bash run_pipeline.sh

# Full dataset, auto-selects model based on device
bash run_pipeline.sh --full

# GPU server: full dataset, 8B model
bash run_pipeline.sh --full --embedding-model Qwen/Qwen3-Embedding-8B

# Mid-range GPU: 4B model
bash run_pipeline.sh --full --embedding-model Qwen/Qwen3-Embedding-4B

# Skip preprocessing if data already exists
bash run_pipeline.sh --skip-preprocess --steps 5,6,7

# Skip Delphi (if no checkpoint available)
bash run_pipeline.sh --skip-delphi
```

Options:
| Flag | Description |
|---|---|
| `--full` | Use all participants instead of a 10k sample |
| `--sample-size N` | Custom sample size (default: 10000) |
| `--embedding-model MODEL` | Qwen3-Embedding-0.6B/4B/8B (auto-selected by device) |
| `--token-mode random\|qwen` | Trajectory token embedding mode (default: random) |
| `--skip-preprocess` | Skip steps 1–2 if CSV files already exist |
| `--skip-delphi` | Skip Delphi evaluation |
| `--steps 1,2,3,...` | Run only specific steps |
| `--device cuda\|cpu` | Force device (auto-detected by default) |
| `--random-state N` | Random seed (default: 42) |
The script logs everything to `pipeline_YYYYMMDD_HHMMSS.log` and prints the comparison table at the end.
**Embedding model load error:** If you see `Can't load the configuration of 'Qwen/Qwen3-Embedding-0.6B'`, the code will retry with a fresh download and then fall back to sentence-transformers if needed. Install the fallback with `pip install sentence-transformers`. Ensure no local folder named `Qwen` exists in the current working directory, and that you have network access to Hugging Face.
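The retry-then-fallback behavior described above follows a simple pattern, sketched below with a hypothetical helper (`load_with_fallback` and the loader labels are illustrative, not the pipeline's actual code):

```python
def load_with_fallback(model_name, loaders):
    """Try each (label, loader) pair in order and return the first success.

    `loaders` might be, e.g. (assumed order, mirroring the behavior above):
      ("transformers (cached)",         lambda n: AutoModel.from_pretrained(n)),
      ("transformers (fresh download)", lambda n: AutoModel.from_pretrained(n, force_download=True)),
      ("sentence-transformers",         lambda n: SentenceTransformer(n)),
    """
    errors = []
    for label, loader in loaders:
        try:
            return label, loader(model_name)
        except Exception as exc:  # remember why this loader failed, try the next
            errors.append(f"{label}: {exc}")
    raise RuntimeError(
        f"All loaders failed for {model_name!r}:\n" + "\n".join(errors)
    )

# Demo with stub loaders (no downloads): the first fails, the second succeeds.
def broken(name):
    raise OSError("Can't load the configuration of " + repr(name))

label, model = load_with_fallback(
    "Qwen/Qwen3-Embedding-0.6B",
    [("broken", broken), ("stub", lambda n: f"embedder({n})")],
)
print(label)  # stub
```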
Extract raw UK Biobank data into data/. See UKB_extraction/ for tooling.
These scripts produce the two CSV files that all downstream steps depend on.
```bash
python benchmarking/preprocess_diagnosis.py  # → disease_before60_features.csv
python benchmarking/preprocess_survival.py   # → autoprognosis_survival_dataset.csv (10k sample)

# Or use the full dataset:
python benchmarking/preprocess_survival.py --all
```

Generate per-patient age-at-diagnosis for all diseases (needed by method 4).
```bash
python preprocessing/generate_disease_trajectory.py  # → data/preprocessed/disease_trajectory.csv
```

Create a single train/val/test split (70/15/15, stratified) used by all methods.
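Such a stratified split can be sketched in a few lines of NumPy. This is an illustration of the idea only; the real script and its `cohort_split.json` output format are not reproduced here:

```python
import numpy as np

def stratified_split(ids, labels, frac=(0.70, 0.15, 0.15), seed=42):
    """Shuffle within each label group (e.g. event_flag) so that train/val/test
    keep the same event ratio, then cut each group 70/15/15."""
    rng = np.random.default_rng(seed)
    ids, labels = np.asarray(ids), np.asarray(labels)
    train, val, test = [], [], []
    for lab in np.unique(labels):
        group = ids[labels == lab].copy()
        rng.shuffle(group)
        n_train = int(round(frac[0] * len(group)))
        n_val = int(round(frac[1] * len(group)))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, val, test

# 200 toy patients, half with the event: split is 140/30/30 and stays balanced
parts = stratified_split(range(200), [i % 2 for i in range(200)])
print([len(p) for p in parts])  # [140, 30, 30]
```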
```bash
python evaluation/cohort_split.py  # → evaluation/cohort_split.json
```

Method 3 (text embedding): convert trajectory data to disease-history text with age at diagnosis (e.g. "At age 20.3, patient was diagnosed with G43 migraine."). Only disease events are included; demographics and biomarkers are added as numeric features during survival model training.
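The conversion can be illustrated with a small hypothetical helper. The row format, column names, and label map below are assumptions for illustration; `preprocessing/natural_text_conversion.py` is the actual implementation:

```python
import math

def row_to_text(row, labels):
    """One sentence per diagnosed disease, ordered by age at diagnosis.

    `row` maps ICD-10 code -> age at diagnosis (NaN/None = never diagnosed);
    `labels` maps code -> human-readable name. Both formats are assumed.
    """
    events = sorted(
        (age, code) for code, age in row.items()
        if age is not None and not math.isnan(age)
    )
    sentences = []
    for age, code in events:
        name = f"{code} {labels[code]}" if code in labels else code
        sentences.append(f"At age {age:.1f}, patient was diagnosed with {name}.")
    return " ".join(sentences)

print(row_to_text({"G43": 20.3, "I10": float("nan")}, {"G43": "migraine"}))
# At age 20.3, patient was diagnosed with G43 migraine.
```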
```bash
python preprocessing/natural_text_conversion.py \
    --trajectory-csv data/preprocessed/disease_trajectory.csv \
    --output-csv data/preprocessed/text_before60.csv \
    --output-dir data/preprocessed/text_before60
```

Method 4 (trajectory embedding): convert the trajectory matrix to Delphi-style text.
```bash
python preprocessing/generate_trajectory_text.py \
    --output-csv data/preprocessed/trajectory_before60.csv \
    --output-dir data/preprocessed/trajectory_before60
```

Method 1 (Delphi binary data): convert trajectory + demographics to Delphi binary format, aligned with the shared cohort split.
```bash
python Delphi/preprocess_delphi_binary.py \
    --output-dir Delphi/data/ukb_respiratory_data
```

This generates `train.bin`, `val.bin`, and `test.bin` using the same patient splits as all other methods.
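To give a rough idea of what such a split file contains, here is a sketch assuming a flat nanoGPT-style token array written raw to disk; the real Delphi layout, which must also carry age information, may well differ:

```python
import os
import tempfile
import numpy as np

def write_split(path, sequences):
    """Concatenate per-patient token sequences into one flat uint16 array and
    write it as a raw .bin (nanoGPT-style; tokens only in this sketch)."""
    flat = np.concatenate([np.asarray(s, dtype=np.uint16) for s in sequences])
    flat.tofile(path)

# Round-trip check with two toy patient trajectories
path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_split(path, [[1, 7, 2], [1, 3, 9, 2]])
back = np.fromfile(path, dtype=np.uint16)
print(back.tolist())  # [1, 7, 2, 1, 3, 9, 2]
```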
Method 3: embed natural-language texts with Qwen3-Embedding.
```bash
# GPU server (8B, 4096-dim):
python embedding/qwen_embedding.py \
    --input-csv data/preprocessed/text_before60.csv \
    --output-dir data/preprocessed/embeddings_text \
    --model-name Qwen/Qwen3-Embedding-8B

# Local / CPU (0.6B, 1024-dim):
python embedding/qwen_embedding.py \
    --input-csv data/preprocessed/text_before60.csv \
    --output-dir data/preprocessed/embeddings_text \
    --model-name Qwen/Qwen3-Embedding-0.6B \
    --no-flash-attn
```

Method 4: embed trajectory token+age vectors.
```bash
# Random token embeddings (CPU, for testing):
python embedding/trajectory_embedding.py \
    --input-csv data/preprocessed/trajectory_before60.csv \
    --output-dir data/preprocessed/embeddings_traj

# Or with Qwen token embeddings (GPU):
python embedding/trajectory_embedding.py \
    --input-csv data/preprocessed/trajectory_before60.csv \
    --output-dir data/preprocessed/embeddings_traj \
    --token-mode qwen
```

Each evaluation script trains on the shared train split and evaluates on val/test.
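Methods 2–4 are scored with the concordance index (plus time-dependent AUC). For intuition, here is a minimal pure-Python sketch of Harrell's C-index; the real evaluators presumably rely on lifelines / scikit-survival rather than code like this:

```python
def concordance_index(durations, events, risks):
    """Harrell's C-index: the fraction of comparable pairs the model orders
    correctly. A pair (i, j) is comparable if the patient with the shorter
    duration had the event; higher risk should mean shorter survival.
    Ties in risk count 0.5; ties in duration are ignored in this sketch."""
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            if events[i] and durations[i] < durations[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ranked toy data: highest risk dies first
print(concordance_index([2, 4, 6], [1, 1, 1], [0.9, 0.5, 0.1]))  # 1.0
```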
```bash
# Method 1: Delphi (requires step 4 Delphi binary data)
python evaluation/evaluate_delphi.py \
    --split test \
    --save-preds \
    --horizons-days 365 1825

# Method 2: Benchmarking (CoxPH on binary disease features)
python evaluation/evaluate_benchmarking.py \
    --baseline-mode all

# Method 3: Text Embedding + CoxPH
python evaluation/evaluate_embedding_survival.py \
    --embedding-dir data/preprocessed/embeddings_text \
    --tag patient \
    --method-name text_embedding \
    --baseline-mode all

# Method 4: Trajectory Embedding + CoxPH
python evaluation/evaluate_embedding_survival.py \
    --embedding-dir data/preprocessed/embeddings_traj \
    --tag trajectory \
    --method-name trajectory_embedding \
    --baseline-mode none
```

Each evaluator now shares the same CLI options:
- `--baseline-mode {all,none,custom}` (and `--baseline-cols ...`) to control which survival covariates are concatenated with embeddings.
- `--survival-csv` / `--cohort-json` to point at alternative datasets or splits.
- `--save-preds` to dump per-split risk scores in `evaluation/*/predictions/`.
- Delphi adds `--horizons-days` to override the default quantile-based horizons and aligns its risk outputs with the shared survival metrics.
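The covariate-selection logic behind `--baseline-mode` can be sketched as follows. The covariate column names here are hypothetical, and this is an illustration of the selection idea rather than the evaluators' actual code:

```python
import numpy as np

BASELINE_COLS = ["age_at_baseline", "sex", "bmi"]  # hypothetical covariate names

def build_features(embeddings, survival_rows, mode="all", custom_cols=None):
    """Pick baseline covariates per --baseline-mode, then append them to the
    embedding matrix that CoxPH is trained on."""
    if mode == "all":
        cols = BASELINE_COLS
    elif mode == "none":
        cols = []
    elif mode == "custom":
        cols = list(custom_cols or [])
    else:
        raise ValueError(f"unknown baseline mode: {mode!r}")
    if not cols:
        return embeddings
    baseline = np.array([[row[c] for c in cols] for row in survival_rows])
    return np.hstack([embeddings, baseline])

rows = [{"age_at_baseline": 55.0, "sex": 1, "bmi": 27.3}]
emb = np.zeros((1, 4))
print(build_features(emb, rows, mode="all").shape)  # (1, 7)
```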
```bash
python evaluation/unified_evaluation.py  # → evaluation/unified_comparison.csv
```

Python 3.9+. Install all dependencies at once:
```bash
pip install -r requirements.txt
```

For GPU (CUDA), install PyTorch with CUDA first:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

Key dependencies and what uses them:
| Package | Version | Used by |
|---|---|---|
| `torch` | >=2.4.0 | Delphi, Qwen3-Embedding |
| `transformers` | >=4.51.0 | Qwen3-Embedding (methods 3 & 4) |
| `lifelines` | >=0.27.0 | CoxPH models (methods 2, 3, 4) |
| `scikit-survival` | >=0.22.0 | Time-dependent AUC evaluation |
| `numpy` | >=1.24.0 | All components |
| `pandas` | >=2.0.0 | All components |
See also: `Delphi/requirements.txt` (original Delphi deps), `embedding/requirements_qwen.txt` (Qwen-specific).
- `traj_before60`: add hints so the LLM can interpret the trajectory text
- `text_before60`: add time points for alcohol and smoking