This repository contains all the code and data necessary to build the wildfire dataset. This dataset is then used to train our ML models.
Install uv with pipx:

```shell
pipx install uv
```

Create a virtualenv and install the dependencies with uv:

```shell
uv sync
```

Activate the uv virtualenv:

```shell
source .venv/bin/activate
```

Get the wildfire datasets with dvc:

```shell
dvc get . data/processed
```

Pull all the data with dvc:

```shell
dvc pull
```

Note: you need to configure your DVC remote and get access to our remote data storage. Please ask somebody from the team to give you access.
Run the pipeline to build the dataset:

```shell
dvc repro
```

Before running the DVC pipeline, you can add new sequences to the raw datasets:

```shell
dvc pull

# Add new wildfire sequences
uv run python scripts/add_data.py --src /path/to/new/wildfire/sequences --type wildfire

# Add new false positive sequences
uv run python scripts/add_data.py --src /path/to/new/fp/sequences --type fp
```

This copies the folders into `data/raw/<type>/data/`, validates naming and structure, assigns stable train/val/test splits (80/10/10 per camera), and updates `data/raw/<type>/registry.json`.

Use `--dry-run` to preview without writing anything.
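A "stable" split means a sequence keeps its train/val/test assignment across runs, so re-adding data never reshuffles existing sequences. One common way to get this is hash-based bucketing; the sketch below is a hypothetical illustration of the idea, not the actual logic in `scripts/add_data.py` (the function name and ID format are made up):

```python
import hashlib

def assign_split(sequence_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Deterministically map a sequence ID to train/val/test.

    Hashing the ID (instead of drawing a random number) makes the
    assignment a pure function of the name, so it is stable across runs.
    """
    # Turn the first 8 hex digits of the hash into a fraction in [0, 1]
    digest = hashlib.sha256(sequence_id.encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0xFFFFFFFF
    if fraction < ratios[0]:
        return "train"
    if fraction < ratios[0] + ratios[1]:
        return "val"
    return "test"

# Over many sequences the buckets converge to roughly 80/10/10
splits = [assign_split(f"cam42_seq{i:04d}") for i in range(1000)]
print({name: splits.count(name) for name in ("train", "val", "test")})
```

Because the split depends only on the sequence ID, two people adding the same sequence independently would get the same assignment.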
Re-add the updated folder(s) to DVC and push to remote storage:

```shell
# For wildfire
uv run dvc add data/raw/wildfire
git add data/raw/wildfire.dvc
dvc push data/raw/wildfire

# For false positives
uv run dvc add data/raw/fp
git add data/raw/fp.dvc
dvc push data/raw/fp
```

Then rebuild the datasets:

```shell
dvc repro
```

The pipeline runs the following stages:

- `build_wf_yolo_dataset`: samples up to 10 labeled images per wildfire sequence and copies them into a YOLO-format dataset (`data/processed/wildfire_yolo/`), split into train/val/test according to `registry.json`.
- `build_fp_yolo_dataset`: samples false positive images using round-robin by max detection score. Quotas: 10% FP for train/val, 50% FP for test. Outputs to `data/processed/fp_yolo/`.
- `merge_yolo_dataset`: merges wildfire and FP images into two final datasets, `data/processed/yolo_train_val/` and `data/processed/yolo_test/`.
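The round-robin selection in `build_fp_yolo_dataset` can be pictured as: order each group's images by max detection score, then take one image per group in turn until the quota is filled. This is a minimal sketch of that pattern, assuming per-camera groups of `(image, score)` pairs; the function and data below are illustrative, not the stage's actual code:

```python
from itertools import zip_longest

def round_robin_sample(groups: dict, quota: int) -> list:
    """Pick up to `quota` images, alternating across groups and
    taking each group's highest-scoring images first."""
    # Sort each group's (image, score) pairs by score, descending
    ordered = [
        [img for img, _ in sorted(items, key=lambda p: p[1], reverse=True)]
        for items in groups.values()
    ]
    picked = []
    for batch in zip_longest(*ordered):  # one round = one image per group
        for img in batch:
            if img is not None:  # this group is exhausted
                picked.append(img)
            if len(picked) == quota:
                return picked
    return picked

groups = {
    "cam_a": [("a1.jpg", 0.9), ("a2.jpg", 0.4)],
    "cam_b": [("b1.jpg", 0.7)],
    "cam_c": [("c1.jpg", 0.8), ("c2.jpg", 0.6)],
}
print(round_robin_sample(groups, quota=4))
# → ['a1.jpg', 'b1.jpg', 'c1.jpg', 'a2.jpg']
```

Alternating across groups keeps any single prolific camera from dominating the false positive set, while the per-group score ordering still favors the hardest examples.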
All dataset versions are tracked via Git tags. Each tag points to a specific dvc.lock, which records the exact content hashes of every output.
```shell
# 1. Produce datasets
uv run dvc repro

# 2. Push data to remote
uv run dvc push

# 3. Commit and tag
git add dvc.lock data/raw/wildfire.dvc data/raw/fp.dvc
git commit -m "dataset: release v1.0.0"
git tag v1.0.0
git push && git push --tags
```

To consume a released dataset from another project:

```shell
# Import locked to a tag (reproducible, updatable)
dvc import https://github.com/pyronear/pyro-dataset data/processed/yolo_train_val --rev v1.0.0
dvc import https://github.com/pyronear/pyro-dataset data/processed/yolo_test --rev v1.0.0

# Update to latest
dvc update yolo_train_val.dvc
dvc update yolo_test.dvc
```
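After importing, it can be useful to sanity-check the dataset before training. The sketch below assumes the conventional YOLO layout of parallel `images/<split>/` and `labels/<split>/` directories with one `.txt` label per image; the exact layout of `data/processed/yolo_train_val/` may differ, and the function name is made up:

```python
from pathlib import Path

def check_yolo_split(root: Path, split: str) -> int:
    """Verify every image in <root>/images/<split> has a matching
    label file in <root>/labels/<split>; return the image count."""
    images = sorted((root / "images" / split).glob("*.jpg"))
    for img in images:
        label = root / "labels" / split / f"{img.stem}.txt"
        assert label.exists(), f"missing label for {img.name}"
    return len(images)

# Example against an imported dataset:
# n = check_yolo_split(Path("data/processed/yolo_train_val"), "train")
# print(f"{n} train images, all with labels")
```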