Pyro Dataset

This repository contains the code and data needed to build the wildfire dataset used to train our ML models.

Setup

🐍 Python dependencies

Install uv with pipx:

pipx install uv

Create a virtualenv and install the dependencies with uv:

uv sync

Activate the virtualenv created by uv:

source .venv/bin/activate

🍜 Data dependencies

Get the wildfire datasets with dvc:

dvc get . data/processed

Pull all the data with dvc:

dvc pull

Note: You need to configure your DVC remote and have access to our remote data storage. Ask a team member to grant you access.

Run the pipeline to build the dataset:

dvc repro

Adding Data

Before running the DVC pipeline, you can add new sequences to the raw datasets.

1. Pull existing raw data

dvc pull

2. Add new sequences

# Add new wildfire sequences
uv run python scripts/add_data.py --src /path/to/new/wildfire/sequences --type wildfire

# Add new false positive sequences
uv run python scripts/add_data.py --src /path/to/new/fp/sequences --type fp

This copies the folders into data/raw/<type>/data/, validates naming and structure, assigns stable train/val/test splits (80/10/10 per camera), and updates data/raw/<type>/registry.json.

Use --dry-run to preview without writing anything.
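
The split logic itself lives in scripts/add_data.py and is not reproduced here. As a rough illustration of what a stable 80/10/10 assignment for a single camera could look like, here is a minimal sketch. The function name assign_splits, the greedy "largest deficit" rule, and the sequence names are assumptions made for this example, not the script's actual implementation.

```python
from collections import Counter

# Target share of each split, in percent (80/10/10 as described above).
TARGETS = {"train": 80, "val": 10, "test": 10}


def assign_splits(existing: dict[str, str], new_sequences: list[str]) -> dict[str, str]:
    """Assign splits to new sequences of ONE camera without touching existing ones.

    Hypothetical helper: each new sequence goes to the split that is furthest
    below its target share, so earlier assignments never change.
    """
    assignments = dict(existing)
    counts = Counter(assignments.values())
    for name in sorted(new_sequences):  # sorted for deterministic results
        if name in assignments:
            continue  # already registered: keep its split
        total = sum(counts[s] for s in TARGETS) + 1
        # Deficit measured with integers to avoid floating-point ties.
        split = max(TARGETS, key=lambda s: TARGETS[s] * total - 100 * counts[s])
        assignments[name] = split
        counts[split] += 1
    return assignments


if __name__ == "__main__":
    # Hypothetical sequence names for one camera.
    registry = assign_splits({}, [f"cam01_seq{i:02d}" for i in range(10)])
    print(Counter(registry.values()))
```

With this rule, ten new sequences on an empty camera end up as 8 train, 1 val, and 1 test, and sequences already present in the registry keep their split when more are added later.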

3. Track and push the new data

Re-add the updated folder(s) to DVC and push to remote storage:

# For wildfire
uv run dvc add data/raw/wildfire
git add data/raw/wildfire.dvc
dvc push data/raw/wildfire

# For false positives
uv run dvc add data/raw/fp
git add data/raw/fp.dvc
dvc push data/raw/fp

4. Run the pipeline

dvc repro

Data Pipeline

Stages

build_wf_yolo_dataset: Samples up to 10 labeled images per wildfire sequence and copies them into a YOLO-format dataset (data/processed/wildfire_yolo/), split into train/val/test according to registry.json.
build_fp_yolo_dataset: Samples false positive images using round-robin by max detection score. Quotas: 10% FP for train/val, 50% FP for test. Outputs to data/processed/fp_yolo/.
merge_yolo_dataset: Merges wildfire and FP images into two final datasets — data/processed/yolo_train_val/ and data/processed/yolo_test/.
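
To make the false positive sampling more concrete, here is a minimal sketch of one way "round-robin by max detection score" could work. It rests on two assumptions that the stage description does not confirm: the round-robin runs across sequences (one image per sequence per round, highest score first), and an "X% FP" quota means FP images make up X% of the merged split. The names fp_quota and round_robin_sample are hypothetical.

```python
from itertools import zip_longest


def fp_quota(n_wildfire: int, fp_share: float) -> int:
    """FP image count so that FPs are `fp_share` of the merged split (assumed meaning)."""
    return round(n_wildfire * fp_share / (1.0 - fp_share))


def round_robin_sample(
    scores_by_sequence: dict[str, list[tuple[str, float]]], quota: int
) -> list[str]:
    """Pick up to `quota` image names, cycling over sequences, best score first."""
    # Rank each sequence's (image_name, score) pairs by descending score.
    ranked = [
        sorted(images, key=lambda item: item[1], reverse=True)
        for images in scores_by_sequence.values()
    ]
    picked: list[str] = []
    # Each "round" holds the next-best image of every sequence (None if exhausted).
    for round_items in zip_longest(*ranked):
        for item in round_items:
            if len(picked) >= quota:
                return picked
            if item is not None:
                picked.append(item[0])
    return picked


if __name__ == "__main__":
    # Hypothetical detection scores for three FP sequences.
    scores = {
        "seq_a": [("a1.jpg", 0.9), ("a2.jpg", 0.5)],
        "seq_b": [("b1.jpg", 0.7)],
        "seq_c": [("c1.jpg", 0.3), ("c2.jpg", 0.8)],
    }
    print(round_robin_sample(scores, fp_quota(n_wildfire=36, fp_share=0.10)))
```

Cycling over sequences keeps a single long sequence from dominating the sample, while the score ordering favours the detections the model was most confident about.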

Dataset Versioning

All dataset versions are tracked via Git tags. Each tag points to a commit whose dvc.lock records the exact content hashes of every output.

Release a new version

# 1. Produce datasets
uv run dvc repro

# 2. Push data to remote
uv run dvc push

# 3. Commit and tag
git add dvc.lock data/raw/wildfire.dvc data/raw/fp.dvc
git commit -m "dataset: release v1.0.0"
git tag v1.0.0
git push && git push --tags
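
If releases happen often, the same steps could be wrapped in a small script. The sketch below is illustrative only: the helper names release_commands and release, the dry-run default, and the assumption that tags always look like v1.0.0 are not part of this repository.

```python
import re
import subprocess

# Assumption: release tags follow the vMAJOR.MINOR.PATCH shape used above.
TAG_PATTERN = re.compile(r"^v\d+\.\d+\.\d+$")


def release_commands(tag: str) -> list[list[str]]:
    """Return the release steps listed above as argument lists for `tag`."""
    if not TAG_PATTERN.match(tag):
        raise ValueError(f"unexpected tag format: {tag!r}")
    return [
        ["uv", "run", "dvc", "repro"],
        ["uv", "run", "dvc", "push"],
        ["git", "add", "dvc.lock", "data/raw/wildfire.dvc", "data/raw/fp.dvc"],
        ["git", "commit", "-m", f"dataset: release {tag}"],
        ["git", "tag", tag],
        ["git", "push"],
        ["git", "push", "--tags"],
    ]


def release(tag: str, dry_run: bool = True) -> None:
    """Print each step; run it only when dry_run is False."""
    for command in release_commands(tag):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    release("v1.0.0")  # dry run: prints the commands without executing them
```

Stopping at the first failure matters here: the tag should only be created once dvc repro and dvc push have both succeeded.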

Use a specific version in another repo

# Import locked to a tag (reproducible, updatable)
dvc import https://github.com/pyronear/pyro-dataset data/processed/yolo_train_val --rev v1.0.0
dvc import https://github.com/pyronear/pyro-dataset data/processed/yolo_test --rev v1.0.0

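# Update to a newer version (an import locked to a tag stays on that tag;
# pass the new tag with --rev, v1.1.0 here is a placeholder)
dvc update --rev v1.1.0 yolo_train_val.dvc
dvc update --rev v1.1.0 yolo_test.dvc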

