The goal of this project is to test out various self-supervised / unsupervised pretraining strategies for high energy physics jets. It is a minimal implementation of the paper Is Tokenization Needed for Masked Particle Modelling?. It is also a stripped-down version of the original repository, which still contained older code used during testing and development.
This repo uses PyTorch and Lightning to facilitate training and evaluation of the model, Hydra for configuration management, and Weights and Biases for logging.
Additionally, some models require FlashAttention to operate, which necessitates running on Nvidia Ampere (or later) GPUs.
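Before launching a FlashAttention-dependent model, it can be worth checking the GPU's compute capability up front. A minimal sketch (the function name is illustrative, not part of this repo; at runtime you would feed it `torch.cuda.get_device_capability(0)`):

```python
def is_ampere_or_newer(capability):
    """FlashAttention kernels require compute capability 8.0 (Ampere) or later."""
    major, minor = capability
    return (major, minor) >= (8, 0)

# Example runtime check (requires a CUDA build of PyTorch):
# import torch
# assert is_ampere_or_newer(torch.cuda.get_device_capability(0))
```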
├── configs # All configuration files for hydra
│ ├── callbacks
│ ├── datamodule
│ ├── experiment
│ ├── hydra
│ ├── model
│ ├── export.yaml
│ └── train.yaml
├── data # Scripts for generating the data files
├── docker # Dockerfile and other docker related files
├── mltools # mltools submodule
├── plots # output plots
├── resources # Preprocessors and discretisers
├── scripts # All executable scripts other than data setup
├── workflow # Workflow examples using snakemake
├── pyproject.toml
├── README.md
├── requirements.txt
└── src # All source code
├── callbacks
├── data
└── models
To run this project, first clone the repository.
This project relies on a custom submodule called mltools stored here on CERN GitLab.
This is a collection of useful functions, layers and networks for deep learning developed by the RODEM group at UNIGE.
If you didn't clone the project with the --recursive flag you can pull the submodule using:
git submodule update --init --recursive
This project is set up to use the CERN GitLab CI/CD to automatically build a Docker image based on the docker/Dockerfile and requirements.txt when a commit is pushed.
The latest images can be found here.
To install the project locally, you can use the requirements.txt file to install the necessary packages.
However, installing FlashAttention requires the packages ninja and packaging to already be installed!
pip install packaging ninja
pip install -r requirements.txt
The dataset is generated using the data scripts.
Two datasets are used, which first have to be manually downloaded using the following links
- JetClass Dataset
- Btagging Dataset (coming soon)
Once the files are downloaded, they need to be converted from ROOT to the more usable HDF format using the following scripts:
- data/make_jetclass.py
- data/make_btag.py
Additionally, the JetClass dataset is very large and must therefore be merged into a single HDF file for fast streaming access during training:
data/combine_jetclass.py
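The merging step conceptually concatenates the per-file arrays into one large HDF5 dataset. A minimal sketch of that logic using h5py (the dataset key `jets` and the function name are placeholders; the real script defines its own layout):

```python
import h5py


def combine_files(in_paths, out_path, key="jets", chunk=10_000):
    """Concatenate one dataset from many HDF5 files into a single file,
    copying in chunks so memory use stays bounded."""
    # Infer the per-sample shape and dtype from the first file
    with h5py.File(in_paths[0], "r") as f:
        shape, dtype = f[key].shape, f[key].dtype

    # Total number of rows across all input files
    total = 0
    for p in in_paths:
        with h5py.File(p, "r") as f:
            total += f[key].shape[0]

    with h5py.File(out_path, "w") as out:
        dset = out.create_dataset(key, shape=(total, *shape[1:]), dtype=dtype)
        row = 0
        for p in in_paths:
            with h5py.File(p, "r") as f:
                src = f[key]
                for start in range(0, src.shape[0], chunk):
                    stop = min(start + chunk, src.shape[0])
                    dset[row + start : row + stop] = src[start:stop]
                row += src.shape[0]
```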
Configuration is done through Hydra, which not only allows one to configure a class, but also to choose which class to use via its instantiation method.
The main configuration file that composes training is train.yaml.
- This file sets up all the paths to save the network, the trainer and logger.
- It also imports additional yaml files from the other folders:
  - model: Chooses which model to train
  - callbacks: Chooses which callbacks to use
  - datamodule: Chooses which datamodule to use
  - experiment: Overwrites any of the config values before composition
There are two main classes for the datasets: StreamDataset and MapDataset, which also have corresponding datamodules.
- The StreamDataset is used for the JetClass dataset, which is too large to fit into memory and must be streamed from disk. For speed, the StreamDataset __getitem__ method returns a full batch slice from the HDF file. One downside is that the dataset is not shuffled between epochs, but this is not too much of an issue due to its size.
- The MapDataset is used for the Btagging dataset, which is small enough to fit into memory. This is a more traditional PyTorch dataset, which can be shuffled between epochs.
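The batch-slice idea behind the streaming dataset can be sketched as follows. This is a simplified, illustrative class (the real StreamDataset reads from an HDF5 file rather than an in-memory array):

```python
import numpy as np


class BatchSliceDataset:
    """Map-style dataset whose __getitem__ returns a whole batch of rows,
    mimicking how a streaming dataset slices contiguous chunks from disk."""

    def __init__(self, data: np.ndarray, batch_size: int):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches (including a final partial one), not samples
        return (len(self.data) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.data[start : start + self.batch_size]
```

When used with a PyTorch DataLoader, such a dataset is typically paired with batch_size=None so that the already-batched slices are not collated a second time.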
If a model has a predict_step method, then outputs can be saved using the scripts/export.py script.
The predict_step method should return a dictionary of tensors with shape (batch_size, *).
These will be stacked and saved as separate keys in an output HDF file.
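The stacking step can be illustrated like this (a sketch of the idea, not the export script's actual code): the per-batch dictionaries of arrays are concatenated key by key along the batch axis before being written out.

```python
import numpy as np


def stack_predictions(batches):
    """Concatenate a list of {key: array} batch outputs along the batch axis."""
    keys = batches[0].keys()
    return {k: np.concatenate([b[k] for b in batches], axis=0) for k in keys}
```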
It is not advisable to directly change any of the existing yaml files. Instead, change the behavior of the run using an experiment file.
For example, the experiment file example.yaml contains:
# @package _global_
defaults:
- override /model: ssfm
encoder_config:
  dim: 64
decoder_config:
  dim: 64
This would change the model to use the ssfm model and set the encoder and decoder dimensions to 64.
It can be loaded via the command line using the following command:
python scripts/train.py experiment=example
Alternatively, specific parameters can be set via the command line:
python scripts/train.py model=ssfm encoder_config.dim=64 decoder_config.dim=64
The workflow directory contains a snakemake file that can be used to run an experiment containing multiple steps.
- This is useful, for example, for running the pretraining and then the finetuning command in sequence for many SSL models.
- The current configuration is designed explicitly for the UNIGE HPC cluster and will need to be modified for other systems.
- It uses SLURM as the executor plugin and requires an apptainer image to have been built.
- Snakemake's updates are not backwards compatible, so specific versions are required to run the workflow.
Install the required packages using:
pip install snakemake-executor-plugin-slurm==0.4.1 snakemake==8.4.1
Then run the workflow using:
snakemake --snakefile workflow/example.smk --workflow-profile workflow
To just build the DAG and not run the workflow, append the following to the above command:
... -e dryrun --dag | dot -Tpng > workflow/example.png
This creates the following image:
This project is licensed under the MIT License. See the LICENSE file for details.

