# FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures
This is the official implementation of the FISBe (FlyLight Instance Segmentation Benchmark) evaluation pipeline. FISBe is the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations.
Download the dataset: https://kainmueller-lab.github.io/fisbe/
The benchmark supports 2D and 3D segmentations and computes a wide range of commonly used evaluation metrics (e.g., AP, F1, coverage). Crucially, it provides specialized error attribution for topological errors (False Merges, False Splits) relevant to filamentous structures.
## Features

- Official Protocol: Implements the exact ranking score ($S$) and matching logic defined in the FISBe paper.
- Topology-Aware: Uses skeleton-based localization (clDice) to handle thin structures robustly.
- Error Attribution: Explicitly quantifies False Merges (FM) and False Splits (FS) via many-to-many matching.
- Flexibility: Supports HDF5 (`.hdf`, `.h5`) and Zarr (`.zarr`) files.
- Modes: Single file, folder evaluation, or 3x stability analysis.
- Partly Labeled Support: Robust evaluation that ignores background conflicts for sparse Ground Truth.
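For intuition, the clDice localization criterion can be sketched in plain NumPy, assuming precomputed binary masks and skeletons (this helper is illustrative, not the package's implementation):

```python
import numpy as np

def cl_dice(pred_mask, gt_mask, pred_skel, gt_skel):
    """clDice: harmonic mean of topology precision and topology sensitivity.

    clPrecision = fraction of the prediction skeleton inside the GT mask,
    clRecall    = fraction of the GT skeleton inside the prediction mask.
    """
    tprec = (pred_skel & gt_mask).sum() / max(pred_skel.sum(), 1)
    tsens = (gt_skel & pred_mask).sum() / max(gt_skel.sum(), 1)
    if tprec + tsens == 0:
        return 0.0
    return 2 * tprec * tsens / (tprec + tsens)

# Toy example: a thin "filament" as a single-pixel line.
gt = np.zeros((5, 5), dtype=bool); gt[2, :] = True
pred = np.zeros((5, 5), dtype=bool); pred[2, :3] = True
# For 1-pixel-wide structures, the mask is its own skeleton.
print(cl_dice(pred, gt, pred, gt))  # 0.75
```

Because it scores skeleton overlap rather than volume overlap, clDice stays informative for thin structures where IoU would be dominated by small boundary errors.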
## Installation

The recommended way to install is using uv (fastest) or micromamba.

Using uv:

```bash
pip install uv
git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
uv venv
uv pip install -e .
```

Using micromamba:

```bash
micromamba create -n evalinstseg python=3.10
micromamba activate evalinstseg
git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
pip install -e .
```

The `evalinstseg` command is automatically available after installation.
## Usage

### Single File

```bash
evalinstseg \
    --res_file tests/pred/sample_01.hdf \
    --res_key volumes/gmm_label_cleaned \
    --gt_file tests/gt/sample_01.zarr \
    --gt_key volumes/gt_instances \
    --split_file assets/sample_list_per_split.txt \
    --out_dir tests/results \
    --app flylight
```

### Folder Evaluation

If you provide a directory path to `--res_file`, the tool will look for matching Ground Truth files in the `--gt_file` folder. Files are matched by name.
```bash
evalinstseg \
    --res_file /path/to/predictions_folder \
    --res_key volumes/gmm_label_cleaned \
    --gt_file /path/to/ground_truth_folder \
    --gt_key volumes/gt_instances \
    --out_dir /path/to/output_folder \
    --app flylight
```

### Stability Analysis

Compute the mean ± std of metrics across exactly 3 different training runs (e.g., different random seeds).
```bash
evalinstseg \
    --stability_mode \
    --run_dirs experiments/seed1 experiments/seed2 experiments/seed3 \
    --gt_file data/ground_truth_folder \
    --out_dir results/stability_report \
    --app flylight
```

Requirements:

- `--run_dirs`: Provide exactly 3 folders.
- `--gt_file`: The folder containing Ground Truth files (filenames must match predictions).
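Conceptually, the stability report reduces to the per-metric mean and standard deviation across the three runs. A minimal sketch (the per-run metric dicts below are made-up values, not real results):

```python
import numpy as np

# Hypothetical per-run metric dicts, e.g. one per seed.
runs = [
    {"avF1": 0.61, "C": 0.70},
    {"avF1": 0.59, "C": 0.68},
    {"avF1": 0.63, "C": 0.72},
]

# For each metric: (mean, std) across runs.
report = {
    key: (float(np.mean([r[key] for r in runs])),
          float(np.std([r[key] for r in runs])))
    for key in runs[0]
}
print(report)
```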
If your ground truth is sparse (not fully dense), use the `--partly` flag. See the Partly Labeled Data Mode section for details on how False Positives are handled.
## Python API

You can integrate the benchmark directly into your Python scripts or notebooks.

```python
from evalinstseg import evaluate_file

# Run evaluation
metrics = evaluate_file(
    res_file="tests/pred/sample_01.hdf",
    gt_file="tests/gt/sample_01.zarr",
    res_key="volumes/labels",
    gt_key="volumes/gt_instances",
    out_dir="output_folder",
    ndim=3,
    app="flylight",  # Applies default FISBe config
    partly=False,    # Set True for sparse GT
)

# Access metrics directly
print("AP:", metrics['confusion_matrix']['avAP'])
print("False Merges:", metrics['general']['FM'])
```

If you already have the arrays loaded in memory:
```python
import numpy as np
from evalinstseg import evaluate_volume

pred_array = np.load(...)  # Shape: (Z, Y, X)
gt_array = np.load(...)

metrics = evaluate_volume(
    gt_labels=gt_array,
    pred_labels=pred_array,
    ndim=3,
    outFn="output_path_prefix",
    localization_criterion="cldice",  # or 'iou'
    assignment_strategy="greedy",
    add_general_metrics=["false_merge", "false_split"],
)
```

For a complete reference of all calculated metrics, see docs/METRICS.md.
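The `greedy` assignment strategy can be illustrated in a few lines: repeatedly accept the highest-scoring (GT, prediction) pair that is still unassigned and above threshold. This is a sketch of the general idea, not the package's exact code:

```python
import numpy as np

def greedy_match(scores, thresh=0.5):
    """One-to-one greedy matching on a (n_gt, n_pred) localization-score matrix."""
    scores = scores.astype(float).copy()
    matches = []
    while True:
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[i, j] < thresh:
            break
        matches.append((int(i), int(j)))
        scores[i, :] = -1.0  # remove GT i from the pool
        scores[:, j] = -1.0  # remove prediction j from the pool
    return matches

scores = np.array([[0.9, 0.2],
                   [0.8, 0.6]])
print(greedy_match(scores))  # [(0, 0), (1, 1)]
```

Note that greedy matching picks pairs in score order, so prediction 0 goes to GT 0 (0.9) and GT 1 falls back to prediction 1 (0.6), even though GT 1 scored higher against prediction 0.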
Note: Some output keys use internal names; see the documentation for the exact mapping to website/leaderboard columns.
## FlyLight Metrics

The `flylight` preset implements the specific metrics described in the FISBe paper for evaluating long-range thin filamentous neuronal structures.

Primary Ranking Score ($S$):

- avF1: Average F1 score across clDice thresholds.
- C (Coverage): Average GT skeleton coverage (assignment via max clPrecision; scoring via clRecall on the union of matches).
- clDiceTP: Average clDice score of matched TPs at threshold 0.5.
- tp: Relative number of TPs at threshold 0.5 (`TP_0.5 / N_GT`).
- FS (False Splits): Sum over GT instances of `max(0, N_assigned_pred - 1)`.
- FM (False Merges): Sum over predictions of `max(0, N_assigned_gt - 1)`.
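Given a many-to-many assignment table, the FS and FM formulas above reduce to simple counts. A minimal sketch (the assignment pairs are illustrative):

```python
from collections import Counter

# Many-to-many assignments as (gt_id, pred_id) pairs (illustrative).
assignments = [(1, 10), (1, 11), (1, 12),  # GT 1 is split across 3 predictions
               (2, 13), (3, 13)]           # prediction 13 merges GT 2 and GT 3

preds_per_gt = Counter(gt for gt, _ in assignments)
gts_per_pred = Counter(pred for _, pred in assignments)

FS = sum(max(0, n - 1) for n in preds_per_gt.values())   # False Splits
FM = sum(max(0, n - 1) for n in gts_per_pred.values())   # False Merges
print(FS, FM)  # 2 1
```

Each extra prediction assigned to a GT instance adds one split; each extra GT instance absorbed by a prediction adds one merge.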
## Partly Labeled Data Mode

FISBe includes 71 partly labeled images where only a subset of neurons is annotated. With `--partly` enabled:

- Logic: Unmatched predictions are only counted as False Positives if they match a foreground GT instance.
- Background Exclusion: Predictions matching background (unlabeled regions) are ignored.
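The background-exclusion rule can be sketched as a filter over unmatched prediction labels, assuming label volumes where 0 marks unlabeled background (the function and variable names are illustrative, not the package's API):

```python
import numpy as np

def filter_false_positives(pred_labels, gt_labels, unmatched_pred_ids):
    """Keep only unmatched predictions that overlap annotated GT foreground."""
    false_positives = []
    for pid in unmatched_pred_ids:
        overlap_gt = gt_labels[pred_labels == pid]
        if np.any(overlap_gt > 0):   # touches a labeled GT instance -> counts as FP
            false_positives.append(pid)
        # otherwise: lies entirely in unlabeled background -> ignored in partly mode
    return false_positives

gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]])
pred = np.array([[2, 2, 0, 3],
                 [2, 2, 0, 3]])
# Prediction 3 lies only in unlabeled background, so it is not penalized.
print(filter_false_positives(pred, gt, unmatched_pred_ids=[2, 3]))  # [2]
```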
## Output Structure

Metrics returned by the API or saved to disk are grouped into category-specific dictionaries:

```text
metrics["confusion_matrix"]
├── TP / FP / FN           # Counts across all images
├── precision / recall     # Standard detection metrics
└── avAP                   # Mean precision × recall proxy

metrics["general"]
├── aggregate_score        # S (Official Ranking Score)
├── avg_gt_skel_coverage   # C (Coverage)
├── FM                     # Global False Merge count
└── FS                     # Global False Split count

metrics["curves"]
└── F1_0.1 … F1_0.9        # Per-threshold performance
```