I trained a U-Net on robotic surgery blood masks, then tested it on laparoscopic frames. Same task, different world. I wanted a straight answer: how much does performance drop, and what does the model actually do wrong?
Short answer: it collapses. Dice goes from about 0.91 on held-out robotic data to about 0.001 on laparoscopic data. The model does not quietly "fail in the dark." It fires blood-like predictions all over the place. On frames with no blood in the label, it still flags blood most of the time. That is a trust problem for any alert system, not just a low score on a leaderboard.
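For concreteness, the Dice score used throughout is the standard overlap metric between a predicted binary mask and the label. A minimal NumPy sketch (illustrative, not the exact implementation in this repo):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary masks.

    eps keeps the score defined when both masks are empty.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```

A Dice of ~0.91 means predicted and labeled blood pixels mostly coincide; ~0.001 means essentially no overlap relative to the sizes of the two masks.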
Robotic animal surgery (HemoSet) looks nothing like human laparoscopy (CholecSeg8k): different camera, different lighting, different tissue. I trained on HemoSet only and treated all of CholecSeg8k as an out-of-domain test set. No cherry-picking.
- In-domain (HemoSet val) Dice ~0.91. Training behaved normally. Curves look sane.
- Out-of-domain (CholecSeg8k) Dice ~0.001. Basically flat.
- Failure modes: almost every blood frame is PARTIAL (some overlap, bad boundaries). FALSE_POSITIVE is huge on no-blood frames. FALSE_NEGATIVE count was zero in my run: the model almost never outputs a completely empty mask on blood frames. So the story is misfire and misalignment, not "missed bleeds only."
- Brightness did not correlate with Dice in any useful way on blood frames. Ground-truth blood coverage (%) had the strongest Spearman correlation with Dice (negative in my tables). I read that as: once predictions are noisy everywhere, naive overlap scores behave very differently on large versus small regions. Your mileage may vary on interpretation.
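The PARTIAL / FALSE_POSITIVE / FALSE_NEGATIVE buckets above can be sketched as a simple per-frame classifier. The 0.5 threshold and the TRUE_NEGATIVE / GOOD bucket names here are my illustrative assumptions, not necessarily the exact logic in `domain_analysis.py`:

```python
import numpy as np

def failure_mode(pred: np.ndarray, gt: np.ndarray,
                 partial_thresh: float = 0.5) -> str:
    """Bucket one frame by how the predicted mask relates to the label.

    partial_thresh is an assumed Dice cutoff between GOOD and PARTIAL.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not gt.any():
        # No labeled blood: any prediction at all is a false alarm.
        return "FALSE_POSITIVE" if pred.any() else "TRUE_NEGATIVE"
    if not pred.any():
        # Labeled blood, completely empty prediction.
        return "FALSE_NEGATIVE"
    inter = np.logical_and(pred, gt).sum()
    d = 2.0 * inter / (pred.sum() + gt.sum())
    return "GOOD" if d >= partial_thresh else "PARTIAL"
```

Run over both test sets, a tally of these buckets reproduces the story above: almost no FALSE_NEGATIVE frames, lots of PARTIAL on blood frames, and FALSE_POSITIVE dominating the no-blood frames.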
Below are the figures I generated in Phase 4 so you can see the gap and the collapse without reading another paragraph.
Laparoscopic frames are dark on purpose. That is the data. The point is not "turn up the brightness." The point is the model never learned blood in a way that survives the switch of domain.
- Put HemoSet under `data/hemoset/` and CholecSeg8k under `data/cholecseg8k/` (see project notes for layout).
- `pip install -r requirements.txt`
- `python -m src.data_pipeline`
- `python -m src.train`
- `python -m src.domain_analysis`
- `python -m src.visualize`
- `streamlit run app.py` for the dashboard.
Training needs a GPU in practice. CholecSeg8k is CC BY-NC-SA 4.0. Check each dataset's license before you reuse anything commercially.
```
surgical-blood-generalization/
├── app.py              # Streamlit UI
├── requirements.txt
├── src/
│   ├── data_pipeline.py
│   ├── train.py
│   ├── models.py
│   ├── domain_analysis.py
│   └── visualize.py
└── outputs/
    ├── figures/        # PNGs from visualize.py
    └── results/        # JSON + CSV metrics
```
If you bolted this exact model onto a live feed, you would get endless false alarms. Surgeons would mute it. I am not claiming clinical validation here. I am claiming: domain shift is not abstract in surgery. A model can look great on the data it was fed and still be useless or harmful when the scene changes.
- One architecture, one training recipe, one split. Not a sweep of encoders or losses.
- HemoSet is small. CholecSeg8k blood pixels are sparse. Metrics are sensitive to class imbalance.
- I am reporting what I measured on disk. If you rerun seeds or code versions, numbers will drift slightly.
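To make the class-imbalance point concrete, here is a toy demo (hypothetical numbers, not drawn from either dataset): the same absolute pixel error crushes Dice when the ground-truth region is sparse, but barely dents it when the region is large.

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

n = 10_000  # pixels in a toy frame
gt_small = np.zeros(n, bool); gt_small[:100] = True     # sparse blood (1%)
gt_large = np.zeros(n, bool); gt_large[:5_000] = True   # large region (50%)

# Each prediction misses 80 GT pixels and adds 80 spurious ones.
pred_small = gt_small.copy()
pred_small[:80] = False; pred_small[100:180] = True
pred_large = gt_large.copy()
pred_large[:80] = False; pred_large[5_000:5_080] = True

d_small = dice(pred_small, gt_small)  # 40/200  = 0.20
d_large = dice(pred_large, gt_large)  # 9840/10000 = 0.984
```

Same 160 wrong pixels per frame, wildly different scores. That is why per-frame Dice on CholecSeg8k, where blood pixels are sparse, is so unforgiving.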
- CholecSeg8k: Hong et al., Kaggle release, CC BY-NC-SA 4.0.
- HemoSet: see the HemoSet site and their terms.
- Code in this repo: MIT (see `LICENSE`).
If you use this repo, cite the original datasets and papers from their pages. I did not write a separate paper for this README.