Train a gradient-boosted tree model that predicts PM2.5 from weather conditions.
Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
Install dependencies:
python -m pip install -r requirements.txt
Fetch or update the data repository:
./configure
This creates or updates data/, and the CSV files are read from data/csvs/.
Train the model with the default dataset:
python train_pm25_gbt.py
The training script:
- inner-joins air and weather CSVs for the same year on timestamp
- normalizes merged timestamps to the weather CSV format
- appends all yearly merged tables together
- uses the latest year as the test split by default
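
The merge-and-split steps above can be sketched as follows. This is a minimal illustration, not the code in train_pm25_gbt.py: it uses only the standard library, the column names (`timestamp`, `pm25`, `temp_c`) and sample rows are hypothetical, and the timestamp normalization step is left out because the weather CSV format is not described here.

```python
import csv
import io

# Hypothetical column names and sample rows; the real CSV schemas may differ.
AIR_2022 = "timestamp,pm25\n2022-01-01 00:00,12.0\n2022-01-01 01:00,15.5\n"
WEATHER_2022 = "timestamp,temp_c\n2022-01-01 00:00,3.1\n2022-01-01 02:00,2.8\n"
AIR_2023 = "timestamp,pm25\n2023-01-01 00:00,9.0\n"
WEATHER_2023 = "timestamp,temp_c\n2023-01-01 00:00,4.0\n"


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(air_rows, weather_rows, key="timestamp"):
    """Keep only timestamps present in both tables and merge their columns."""
    weather_by_ts = {row[key]: row for row in weather_rows}
    merged = []
    for air in air_rows:
        weather = weather_by_ts.get(air[key])
        if weather is not None:
            merged.append({**air, **weather})
    return merged


def build_dataset(tables_by_year):
    """Join each year's pair of tables, then append all years together."""
    rows = []
    for year in sorted(tables_by_year):
        air_text, weather_text = tables_by_year[year]
        for row in inner_join(read_rows(air_text), read_rows(weather_text)):
            rows.append({"year": year, **row})
    return rows


def split_by_year(rows, test_year=None):
    """Hold out one year for testing; default to the latest year present."""
    if test_year is None:
        test_year = max(row["year"] for row in rows)
    train = [r for r in rows if r["year"] != test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test


dataset = build_dataset({
    2022: (AIR_2022, WEATHER_2022),
    2023: (AIR_2023, WEATHER_2023),
})
train_rows, test_rows = split_by_year(dataset)
print(len(train_rows), len(test_rows))
```

Holding out a whole year, rather than sampling rows at random, keeps the test set strictly later in time than the training data, which is usually the more honest evaluation for time-series measurements.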
Train with explicit test year and time features:
python train_pm25_gbt.py --test-year 2023 --add-time-features
Train without writing models/pm25_gbt.joblib:
python train_pm25_gbt.py --no-save
Export the merged dataset before training:
python train_pm25_gbt.py --merged-out merged_pm25_weather.csv
Use a custom CSV directory:
python train_pm25_gbt.py --data-dir /path/to/csvs
python train_pm25_gbt.py --data-dir /path/to/csvs
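
For orientation, the options used above could be declared with `argparse` roughly like this. The flag names come from the commands in this README; the types, defaults, and help strings are assumptions, and the actual parser in train_pm25_gbt.py may differ.

```python
import argparse


def build_parser():
    """Declare the command-line flags shown in this README (assumed types and defaults)."""
    parser = argparse.ArgumentParser(
        description="Train a gradient-boosted tree model that predicts PM2.5."
    )
    # Assumed default: the README says CSV files are read from data/csvs/.
    parser.add_argument("--data-dir", default="data/csvs",
                        help="directory containing the air and weather CSVs")
    # Assumed: None means "use the latest year", matching the documented default.
    parser.add_argument("--test-year", type=int, default=None,
                        help="year to hold out as the test split")
    parser.add_argument("--add-time-features", action="store_true",
                        help="add features derived from the timestamp")
    parser.add_argument("--no-save", action="store_true",
                        help="skip writing models/pm25_gbt.joblib")
    parser.add_argument("--merged-out", default=None,
                        help="path for exporting the merged dataset as CSV")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--test-year", "2023", "--add-time-features"])
print(args.test_year, args.add_time_features, args.no_save)
```

`argparse` converts dashes in flag names to underscores, so `--test-year` is read back as `args.test_year`.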