Repository for final project of the course Deep Learning in Physics. It contains extensions for photometric classification of astronomical transients, built on top of the avocado framework for the PLAsTiCC dataset.
This repository implements three classifiers as alternatives to or improvements over the gradient-boosted decision tree in Boone (2019):
- Residual MLP trained on avocado GP features
- Transformer trained directly on raw augmented light curves
- Transformer trained on GP-interpolated light curves
Other attempted pipelines and models are also present, but they do not necessarily perform well or run end to end.
This branch (main) contains the dlip_plasticc package with additional classifiers on top of Avocado. For a reproduction of Avocado results, please visit the reproduction branch, which keeps the repository structure of Avocado. For a redshift weighting analysis, please visit the redshift weighting branch.
The full repository is structured as follows.
```
dlip_plasticc/
├── avocado_settings.json        # avocado path configuration (loaded at runtime)
├── configs/
│   └── default.toml             # default training and path configuration
├── jobs/
│   ├── submit_chunks.sh         # SLURM array job for GP-fitting the training set
│   └── submit_chunks_test.sh    # SLURM array job for GP-fitting the test set
├── notebooks/                   # notebooks to generate the main results
│   ├── train_predict_score_mlp_plasticc.ipynb
│   ├── train_predict_score_transformer_plasticc.ipynb
│   └── train_predict_score_transformer_plasticc_GP.ipynb
├── scripts/
│   ├── run_chunk                # process a single GP training chunk
│   ├── run_chunk_test           # process a single GP test chunk
│   ├── merge_chunks             # merge per-chunk .h5 files into one
│   ├── build_plasticc_augment_gp_chunks
│   └── train_classifier
├── src/
│   └── dlip_plasticc/
│       ├── config.py            # configuration loading and avocado settings
│       ├── features/
│       │   ├── adapters.py      # feature format adapters
│       │   └── sequence.py      # sequence featurizers for raw light curves
│       ├── models/
│       │   ├── mlp.py           # residual MLP classifier
│       │   └── transformer.py   # transformer classifier
│       └── pipelines/
│           ├── blend.py         # prediction blending utilities (not used in the main pipeline)
│           ├── gp_fit.py        # GP grid featurizer for transformer input
│           ├── predict.py       # prediction pipeline utilities
│           └── score.py         # flat, redshift, and Kaggle scoring metrics
├── pyproject.toml               # package metadata and dependencies
└── README.md
```
Requirements: Python 3.10 or higher.

Note: It is advised to use a Conda or virtual environment for all dependencies. For example:

```shell
conda create -n plasticc python=3.10 -y
conda activate plasticc
```
Avocado must be installed manually. To avoid compatibility conflicts, please install this fork:

```shell
pip install git+https://github.com/stuitje/avocado.git
```

Alternatively, clone the fork and install it in editable mode:

```shell
git clone https://github.com/stuitje/avocado.git
cd avocado
pip install -e .
```

Then install this package:

```shell
git clone https://github.com/stuitje/dlip_plasticc.git
cd dlip_plasticc
pip install -e .
```

To use this code, the necessary data must first be downloaded, preprocessed, augmented and, depending on your choice of model, featurised using Avocado's pipeline.
Following the instructions in the avocado documentation:

Create and move to a new working directory for avocado. All of the datasets, classifiers and predictions will be stored in this directory, so consider avoiding your home folder if you're on an HPC cluster (use, for example, a scratch/ directory). For example:

```shell
mkdir ~/plasticc
```

Update avocado_settings.json with the appropriate paths to the directories (created above) where you store your data, predictions, classifiers and features.
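As an illustration, avocado_settings.json might look like the following. The key names and paths here are only a sketch — consult the avocado documentation for the exact schema:

```json
{
    "data_directory": "/scratch/<user>/plasticc/data",
    "features_directory": "/scratch/<user>/plasticc/features",
    "classifiers_directory": "/scratch/<user>/plasticc/classifiers",
    "predictions_directory": "/scratch/<user>/plasticc/predictions"
}
```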
This downloads the PLAsTiCC dataset from Zenodo and preprocesses it automatically:

```shell
cd ~/plasticc   # or any other folder name
avocado_download_plasticc
```

Here the data is augmented to create a number of redshift realisations of each object. Adjust the number of augmentations as desired. Our results are based on 15 augmentations; Avocado uses 100. This can take a couple of hours to run.
Note: If you are on an HPC cluster, it is recommended to run this in e.g. a SLURM job.

```shell
avocado_augment plasticc_train plasticc_augment \
    --num_augments 15 \
    --num_chunks 100
```

To use e.g. the MLP on the featurised dataset, featurisation is necessary. This will take a long time (full test set, sequentially, ~100 hours), especially if you used many augmentations in the step above. Consider parallelising, and submit it as a SLURM job when on an HPC cluster. Consider featurising only part of the test set, at least at first.
```shell
avocado_featurize plasticc_train     # usually not necessary, as we use the augmented dataset
avocado_featurize plasticc_test
avocado_featurize plasticc_augment
```
To featurise only a part (e.g. 10%) of the test set:
```shell
for chunk in $(seq 0 49); do
    avocado_featurize plasticc_test \
        --num_chunks 500 \
        --chunk $chunk
done
```
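The chunk arithmetic behind the loop above can be sketched in Python. This is purely illustrative — `chunk_bounds` is not avocado's implementation, and the test-set size is approximate:

```python
# Illustrative sketch of splitting N objects into near-equal chunks.
# NOT avocado's actual implementation; the object count is approximate.

def chunk_bounds(n_objects: int, num_chunks: int, chunk: int) -> tuple[int, int]:
    """Half-open index range [start, end) covered by one chunk."""
    start = (n_objects * chunk) // num_chunks
    end = (n_objects * (chunk + 1)) // num_chunks
    return start, end

n = 3_492_890  # rough size of the PLAsTiCC test set

# Chunks 0..49 out of 500 cover 10% of the objects
covered = sum(
    end - start
    for start, end in (chunk_bounds(n, 500, c) for c in range(50))
)
```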
To use a transformer on GP-fitted light curves, the raw light curves must first be interpolated onto a uniform time grid using GP regression. This step is optional and only required if you want to train the GP transformer.
The GP fitting is parallelised across chunks using SLURM job arrays. Each chunk is processed independently and written to its own .h5 file; the per-chunk files are then merged into a single file at the end.
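The idea behind the GP step can be sketched in plain NumPy: a zero-mean GP posterior mean evaluated on a uniform grid. This is an illustration only — the actual pipeline (gp_fit.py, following Boone 2019) fits a 2D GP across time and wavelength, and the kernel, data, and hyperparameters below are made up:

```python
import numpy as np

# Minimal 1D sketch of GP interpolation onto a uniform time grid.
# Illustration only: kernel choice and all values are invented.

def rbf_kernel(t1, t2, length_scale=15.0):
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_interpolate(t_obs, flux_obs, flux_err, t_grid, length_scale=15.0):
    """GP posterior mean: K_* (K + diag(err^2))^{-1} y."""
    K = rbf_kernel(t_obs, t_obs, length_scale) + np.diag(flux_err ** 2)
    K_star = rbf_kernel(t_grid, t_obs, length_scale)
    return K_star @ np.linalg.solve(K, flux_obs)

# Irregularly sampled "light curve" with a Gaussian bump at t = 50
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 100.0, size=30))
flux_obs = np.exp(-0.5 * ((t_obs - 50.0) / 10.0) ** 2) + 0.02 * rng.normal(size=30)
flux_err = np.full(30, 0.02)

# Evaluate on a uniform grid, ready to feed a sequence model
t_grid = np.linspace(0.0, 100.0, 128)
flux_grid = gp_interpolate(t_obs, flux_obs, flux_err, t_grid)
```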
Training set:

```shell
sbatch jobs/submit_chunks.sh
```

Once all jobs have completed, check for missing chunks and merge:

```shell
# check all chunks are present
ls /scratch/.../data/chunks/chunk_*.h5 | wc -l   # should print 100

# merge into a single file
scripts/merge_chunks
```

Test set (first 100 chunks = 20% of the test set):

```shell
sbatch jobs/submit_chunks_test.sh

# merge when done (update CHUNK_DIR and NUM_CHUNKS in the script first)
scripts/merge_chunks
```

Note: GP fitting the full training set takes approximately 12 core-hours in total, but only ~5 minutes of wall time when parallelised across 100 SLURM jobs. If you do not have access to a SLURM cluster, you can run the chunks sequentially using scripts/run_chunk and scripts/run_chunk_test directly.
Default model settings are in configs/default.toml. An avocado_settings.json file must be present in the working directory when running notebooks or scripts.
Note: In most notebooks, the default configuration parameters are overridden, so updating this file is not needed.
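The override pattern the notebooks use can be sketched as follows. Both the helper and the key names here are hypothetical, not part of the package API:

```python
# Hypothetical override helper: start from the defaults in
# configs/default.toml and override a few values per notebook.
# Neither the helper nor the keys below are the package's actual API.

def override(defaults: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` into a copy of `defaults`."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = override(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"training": {"epochs": 50, "lr": 1e-3}, "paths": {"features": "features/"}}
config = override(defaults, {"training": {"lr": 3e-4}})
```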
Training and evaluation notebooks are in the notebooks/ directory:
| Notebook | Description |
|---|---|
| train_predict_score_mlp_plasticc.ipynb | Train the residual MLP on avocado features |
| train_predict_score_transformer_plasticc.ipynb | Train the transformer on raw light curves |
| train_predict_score_transformer_plasticc_GP.ipynb | Train the transformer on GP-interpolated light curves |
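For reference, the Kaggle metric computed by pipelines/score.py is a class-weighted multi-class log loss. A minimal sketch, using uniform class weights for illustration (the competition metric assigns specific weights per class):

```python
import numpy as np

def weighted_log_loss(y_true, y_prob, weights=None, eps=1e-15):
    """Class-weighted multi-class log loss (sketch).

    y_true: (N,) integer class labels; y_prob: (N, C) probabilities.
    Uniform weights by default; the Kaggle metric uses class-specific
    weights.
    """
    c = y_prob.shape[1]
    weights = np.ones(c) if weights is None else np.asarray(weights, float)
    p = np.clip(y_prob, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)  # renormalise after clipping
    per_class = np.zeros(c)
    counts = np.zeros(c)
    for k in range(c):
        mask = y_true == k
        counts[k] = mask.sum()
        if counts[k] > 0:
            per_class[k] = -np.log(p[mask, k]).mean()
    present = counts > 0
    return (weights[present] * per_class[present]).sum() / weights[present].sum()
```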
Repository: https://github.com/stuitje/dlip_plasticc
Boone, K. (2019). Avocado: Photometric Classification of Astronomical Transients with Gaussian Process Augmentation. The Astronomical Journal, 158(6), 257. https://doi.org/10.3847/1538-3881/ab5182