I trained a U-Net on robotic surgery blood masks, then tested it on laparoscopic frames. Same task, different world. I wanted a straight answer: how much does performance drop, and what does the model actually do wrong?
Short answer: it collapses. Dice goes from about 0.91 on held-out robotic data to about 0.001 on laparoscopic data. The model does not quietly "fail in the dark." It fires blood-like predictions all over the place. On frames with no blood in the label, it still flags blood most of the time. That is a trust problem for any alert system, not just a low score on a leaderboard.
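For concreteness, the Dice score used throughout is the standard overlap metric between a predicted binary mask and the label. A minimal NumPy sketch (illustrative, not the exact implementation in this repo):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary masks.

    eps keeps the score defined when both masks are empty.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```

A Dice of ~0.91 means predicted and labeled blood pixels mostly coincide; ~0.001 means essentially no overlap relative to the sizes of the two masks.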
Robotic animal surgery (HemoSet) looks nothing like human laparoscopy (CholecSeg8k): different camera, different lighting, different tissue. I trained on HemoSet only and treated all of CholecSeg8k as an out-of-domain test set. No cherry-picking.
- In-domain (HemoSet val) Dice ~0.91. Training behaved normally. Curves look sane.
- Out-of-domain (CholecSeg8k) Dice ~0.001. Basically flat.
- Failure modes: almost every blood frame is PARTIAL (some overlap, bad boundaries). FALSE_POSITIVE is huge on no-blood frames. FALSE_NEGATIVE count was zero in my run: the model almost never outputs a completely empty mask on blood frames. So the story is misfire and misalignment, not "missed bleeds only."
- Brightness did not correlate with Dice in any useful way on blood frames. Ground-truth blood coverage (%) had the strongest Spearman correlation with Dice (negative in my tables). I read that as: once predictions are noisy everywhere, naive overlap scores behave very differently on large versus small regions. Your mileage may vary on interpretation.
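The PARTIAL / FALSE_POSITIVE / FALSE_NEGATIVE buckets above can be sketched as a simple per-frame classifier. The 0.5 threshold and the TRUE_NEGATIVE / GOOD bucket names here are my illustrative assumptions, not necessarily the exact logic in `domain_analysis.py`:

```python
import numpy as np

def failure_mode(pred: np.ndarray, gt: np.ndarray,
                 partial_thresh: float = 0.5) -> str:
    """Bucket one frame by how the predicted mask relates to the label.

    partial_thresh is an assumed Dice cutoff between GOOD and PARTIAL.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not gt.any():
        # No labeled blood: any prediction at all is a false alarm.
        return "FALSE_POSITIVE" if pred.any() else "TRUE_NEGATIVE"
    if not pred.any():
        # Labeled blood, completely empty prediction.
        return "FALSE_NEGATIVE"
    inter = np.logical_and(pred, gt).sum()
    d = 2.0 * inter / (pred.sum() + gt.sum())
    return "GOOD" if d >= partial_thresh else "PARTIAL"
```

Run over both test sets, a tally of these buckets reproduces the story above: almost no FALSE_NEGATIVE frames, lots of PARTIAL on blood frames, and FALSE_POSITIVE dominating the no-blood frames.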
Below are the figures I generated in Phase 4 so you can see the gap and the collapse without reading another paragraph.
Laparoscopic frames are dark on purpose. That is the data. The point is not "turn up the brightness." The point is the model never learned blood in a way that survives the switch of domain.
- Put HemoSet under `data/hemoset/` and CholecSeg8k under `data/cholecseg8k/` (see project notes for layout).
- `pip install -r requirements.txt`
- `python -m src.data_pipeline`
- `python -m src.train`
- `python -m src.domain_analysis`
- `python -m src.visualize`
- `streamlit run app.py` for the dashboard.
Training needs a GPU in practice. CholecSeg8k is CC BY-NC-SA 4.0. Check each dataset's license before you reuse anything commercially.
```
surgical-blood-generalization/
├── app.py              # Streamlit UI
├── requirements.txt
├── src/
│   ├── data_pipeline.py
│   ├── train.py
│   ├── models.py
│   ├── domain_analysis.py
│   └── visualize.py
└── outputs/
    ├── figures/        # PNGs from visualize.py
    └── results/        # JSON + CSV metrics
```
If you bolted this exact model onto a live feed, you would get endless false alarms. Surgeons would mute it. I am not claiming clinical validation here. I am claiming: domain shift is not abstract in surgery. A model can look great on the data it was fed and still be useless or harmful when the scene changes.
- One architecture, one training recipe, one split. Not a sweep of encoders or losses.
- HemoSet is small. CholecSeg8k blood pixels are sparse. Metrics are sensitive to class imbalance.
- I am reporting what I measured on disk. If you rerun seeds or code versions, numbers will drift slightly.
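To make the class-imbalance point concrete, here is a toy demo (hypothetical numbers, not drawn from either dataset): the same absolute pixel error crushes Dice when the ground-truth region is sparse, but barely dents it when the region is large.

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

n = 10_000  # pixels in a toy frame
gt_small = np.zeros(n, bool); gt_small[:100] = True     # sparse blood (1%)
gt_large = np.zeros(n, bool); gt_large[:5_000] = True   # large region (50%)

# Each prediction misses 80 GT pixels and adds 80 spurious ones.
pred_small = gt_small.copy()
pred_small[:80] = False; pred_small[100:180] = True
pred_large = gt_large.copy()
pred_large[:80] = False; pred_large[5_000:5_080] = True

d_small = dice(pred_small, gt_small)  # 40/200  = 0.20
d_large = dice(pred_large, gt_large)  # 9840/10000 = 0.984
```

Same 160 wrong pixels per frame, wildly different scores. That is why per-frame Dice on CholecSeg8k, where blood pixels are sparse, is so unforgiving.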
- CholecSeg8k: Hong et al., Kaggle release, CC BY-NC-SA 4.0.
- HemoSet: see the HemoSet site and their terms.
- Code in this repo: MIT (see `LICENSE`).
If you use this repo, cite the original datasets and papers from their pages. I did not write a separate paper for this README.