Implementation of an automatic data processing pipeline for LEGEND (Large Enriched Germanium Experiment for Neutrinoless double beta decay) L200 data, based on Snakemake. For an introduction to Snakemake, see the tutorials in the Snakemake documentation; HSF also provides one.
legend-dataflow orchestrates the full chain of data processing from raw digitiser
output to physics-ready event data. It calibrates and optimises hundreds of detector
channels in parallel before bringing them together to process physics data.
The pipeline processes data through a series of tiers:

| Tier | Description |
|---|---|
| raw | Converted from DAQ format (ORCA/FCIO) to LH5 |
| tcm | Timing correction module |
| dsp | Digital signal processing (pole-zero, energy filters) |
| hit | Energy calibration and pulse shape discrimination |
| psp | Partition-level DSP (averaged over calibration runs) |
| pht | Partition-level HIT with advanced quality control |
| ann | Artificial neural network cuts (coax detectors only) |
| evt | Event-level reconstruction and cross-talk correction |
| skm | Final physics skim |

For each tier, calibration parameters are first derived per-channel from dedicated calibration data, then applied to physics data in parallel.
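
The two-step pattern described above (derive per-channel parameters from calibration data, then apply them to physics data in parallel) can be illustrated with a toy sketch. This is not the pipeline's code: the function names, the linear gain model, the channel names, and all numbers are hypothetical and chosen only to make the pattern concrete.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative reference line energy in keV (an assumption for this toy example).
REFERENCE_ENERGY_KEV = 2614.5


def derive_gain(cal_peak_adc: float) -> float:
    """Step 1: derive one channel's parameter from its calibration data."""
    return REFERENCE_ENERGY_KEV / cal_peak_adc


def apply_gain(task):
    """Step 2: apply a derived parameter to one channel's physics data."""
    gain, adc_values = task
    return [gain * value for value in adc_values]


def process(cal_peaks: dict, physics: dict) -> dict:
    # Derive all per-channel parameters first ...
    gains = {channel: derive_gain(peak) for channel, peak in cal_peaks.items()}
    channels = sorted(physics)
    tasks = [(gains[channel], physics[channel]) for channel in channels]
    # ... then apply them to the physics data, one task per channel, in parallel.
    with ThreadPoolExecutor() as pool:
        return dict(zip(channels, pool.map(apply_gain, tasks)))


# Hypothetical inputs: calibration peak positions and physics amplitudes in ADC units.
cal_peaks = {"ch001": 5229.0, "ch002": 10458.0}
physics = {"ch001": [2000.0, 4000.0], "ch002": [4000.0]}
calibrated = process(cal_peaks, physics)
```

In the real dataflow, Snakemake schedules these steps as separate jobs; the thread pool here only stands in for that parallelism.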
Clone the repository and set up a virtual environment (requires Python 3.11+):

    git clone https://github.com/legend-exp/legend-dataflow.git
    cd legend-dataflow
    uv venv --python 3.12
    source .venv/bin/activate
    uv pip install -e ".[dev]"

Adapt `dataflow-config.yaml` to your site (paths, execution environment), then install the software environment:

    dataflow -v install -s <host> dataflow-config.yaml

where `<host>` is one of the execution environments defined in the config (`bare`, `lngs`, `sator`, `nersc`).
The main target format is:

    [all|valid]-{experiment}-{period}-{run}-{datatype}-{tier}.gen

Examples:

    # Process all physics data from run r001 of period p03 through the SKM tier
    snakemake --profile workflow/profiles/default all-l200-p03-r001-phy-skm.gen
    # Process a specific period and datatype through DSP
    snakemake --profile workflow/profiles/default sel-l200-p03-*-cal-dsp.gen
    # Process multiple runs
    snakemake --profile workflow/profiles/default all-l200-p03-r000_r001-phy-hit.gen

Use `all` to process all data or `valid` to process only analysis-selected data (any keyword present in the `runlists.yaml` file in legend-datasets can be used here). Wildcards (`*`) and multi-value selectors (`_`-separated) are supported for most label components.
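
To make the target naming scheme concrete, here is a minimal sketch that splits a target name into its components. It is not the parser used by legend-dataflow: `Target`, `TARGET_RE`, and `parse_target` are hypothetical names, and the sketch assumes that `_`-separated values form a plain list and that period, run, and datatype are the components that accept them.

```python
import re
from typing import NamedTuple


class Target(NamedTuple):
    selection: str        # "all", "valid", or another runlist keyword
    experiment: str
    period: list[str]
    run: list[str]
    datatype: list[str]
    tier: str


# One "-"-separated field per label component, followed by the ".gen" suffix.
TARGET_RE = re.compile(
    r"^(?P<selection>[^-]+)-(?P<experiment>[^-]+)-(?P<period>[^-]+)-"
    r"(?P<run>[^-]+)-(?P<datatype>[^-]+)-(?P<tier>[^-.]+)\.gen$"
)


def parse_target(name: str) -> Target:
    """Split a target name into components (illustrative only)."""
    match = TARGET_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid target name: {name!r}")
    return Target(
        selection=match["selection"],
        experiment=match["experiment"],
        # Assumption: "_" separates multiple values; "*" is kept as a literal wildcard.
        period=match["period"].split("_"),
        run=match["run"].split("_"),
        datatype=match["datatype"].split("_"),
        tier=match["tier"],
    )


target = parse_target("all-l200-p03-r000_r001-phy-hit.gen")
```

For the example above, `target.run` is `["r000", "r001"]` and `target.tier` is `"hit"`. A name that does not have six components and the `.gen` suffix raises `ValueError`.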
Full documentation is available at the project's ReadTheDocs page:
- User Manual – configuration, profiles, and running the dataflow
- Pipeline Overview – how the processing pipeline works step by step
- Developer Guide – how to extend and contribute to the pipeline
- API Reference – Python API documentation

Related packages:

- legend-pydataobj – LEGEND data objects (LH5 format)
- legend-daq2lh5 – DAQ format conversion to LH5
- dspeed – Digital signal processing
- pygama – Gamma-ray analysis
MIT License. See LICENSE.md.