This Python package is designed for training, validating, and deploying machine learning models for drug discovery applications, including bioactivity and property prediction. Specifically, it allows you to train and evaluate graph neural network (GNN) and SE(3) equivariant Transformer models for both regression and classification tasks.
- The models jointly use atom features, edge features, 3D conformations, and global molecular features.
- They can be trained as a single model or as an ensemble of models.
- They support multi-task learning, enabling prediction of multiple endpoints concurrently.
- They accommodate both target-based and ligand-based prediction tasks.
- The loss function supports both classification and regression tasks.
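As a sketch of the expected input for a ligand-based run, the training file is a CSV with a SMILES column, an ID column, and one column per task. The column and task names below (`pIC50`, `is_active`) are illustrative, not fixed by the package; the actual names are supplied via `data.ligand_field`, `data.ligand_id_field`, and `data.tasks`:

```python
import csv
import io

# Illustrative multi-task rows: a SMILES column, a molecule ID column,
# one regression label and one classification label (one column per task).
rows = [
    {"smiles": "CCO", "name": "mol_1", "pIC50": 5.2, "is_active": 1},
    {"smiles": "c1ccccc1", "name": "mol_2", "pIC50": 3.9, "is_active": 0},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["smiles", "name", "pIC50", "is_active"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

A run over such a file would then declare both tasks on the command line, e.g. `data.tasks='{pIC50:regression,is_active:classification}'` (again, with your own column names).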
git clone git@github.com:ashtawy/optimol.git
cd optimol
conda env create -f environment.yaml -n optimol
conda activate optimol
export PYTHONNOUSERSITE=1
pip install --no-deps -e .
# or, install directly from GitHub:
pip install git+https://github.com/ashtawy/optimol.git
This quick benchmark experiment uses experimental microsomal clearance data from the Therapeutics Data Commons (TDC) to test a ligand-based model.
To train and test the model, run:
# assuming you copied ./data/tdc_microsome_train_val.csv and ./data/tdc_microsome_test.csv from this repo to /tmp/tdc_clearance
optimol_train data=mol2d \
data.train_data=/tmp/tdc_clearance/tdc_microsome_train_val.csv \
data.test_data=/tmp/tdc_clearance/tdc_microsome_test.csv \
data.ligand_field=smiles \
data.ligand_id_field=name \
data.tasks='{Y_log:regression}' \
model.ensemble_size=5 \
hydra.run.dir=/tmp/tdc_clearance/model_and_evals

To use the model later for scoring, run:
# assuming you copied ./data/tdc_microsome_test.csv from this repo to /tmp/tdc_clearance
optimol_score --config-path /tmp/tdc_clearance/model_and_evals/.hydra \
--config-name config.yaml \
ckpt_path=/tmp/tdc_clearance/model_and_evals \
task_name=eval \
data.test_data=/tmp/tdc_clearance/tdc_microsome_test.csv \
logger=null \
data.ligand_field=smiles \
data.ligand_id_field=name \
hydra.run.dir=/tmp/tdc_clearance/predictions

The directory structure of the project looks like this:
├── .github <- Github Actions workflows
│
├── configs <- Hydra configs
│ ├── callbacks <- Callbacks configs
│ ├── data <- Data configs
│ ├── debug <- Debugging configs
│ ├── experiment <- Experiment configs
│ ├── extras <- Extra utilities configs
│ ├── hparams_search <- Hyperparameter search configs
│ ├── hydra <- Hydra configs
│ ├── local <- Local configs
│ ├── logger <- Logger configs
│ ├── model <- Model configs
│ ├── paths <- Project paths configs
│ ├── trainer <- Trainer configs
│ │
│ ├── eval.yaml <- Main config for evaluation
│ └── train.yaml <- Main config for training
│
├── data <- Project data
│
├── logs <- Logs generated by hydra and lightning loggers
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description,
│ e.g. `1.0-jqp-initial-data-exploration.ipynb`.
│
├── scripts <- Shell scripts
│
├── optimol <- Source code
│ ├── data <- Data scripts
│ ├── models <- Model scripts
│ ├── utils <- Utility scripts
│ │
│ ├── eval.py <- Run evaluation
│ └── train.py <- Run training
│
├── tests <- Tests of any kind
│
├── .env.example <- Example of file for storing private environment variables
├── .gitignore <- List of files ignored by git
├── .pre-commit-config.yaml <- Configuration of pre-commit hooks for code formatting
├── .project-root <- File for inferring the position of project root directory
├── environment.yaml <- File for installing conda environment
├── Makefile <- Makefile with commands like `make train` or `make test`
├── pyproject.toml <- Configuration options for testing and linting
├── requirements.txt <- File for installing python dependencies
├── setup.py <- File for installing project as a package
└── README.md
Override any config parameter from command line
python train.py trainer.max_epochs=20 model.optimizer.lr=1e-4

Note: You can also add new parameters with the `+` sign:
python train.py +model.new_param="owo"
# Override many configs at once from the CLI by removing an entire section and replacing it with a new one
python train.py --config-path /home/hashtawy/configs --config-name train.yaml \
data=mol2d \
data.train_data=/data/experiments/exp1/train_mols.csv \
data.test_data=/data/experiments/exp1/test_mols.csv \
data.tasks='{is_active:classification}' \
data.ligand_field=smiles \
data.ligand_id_field=name \
hydra.run.dir=/data/experiments/exp1/output_dir \
model=lbgnn \
model.ensemble_size=1 \
trainer.max_epochs=10

Train on CPU, GPU, multi-GPU and TPU
# train on CPU
python train.py trainer=cpu
# train on 1 GPU
python train.py trainer=gpu
# train on TPU
python train.py +trainer.tpu_cores=8
# train with DDP (Distributed Data Parallel) (4 GPUs)
python train.py trainer=ddp trainer.devices=4
# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2
# simulate DDP on CPU processes
python train.py trainer=ddp_sim trainer.devices=2
# accelerate training on mac
python train.py trainer=mps

Evaluate checkpoint on test dataset
python eval.py ckpt_path="/path/to/ckpt/name.ckpt"

# evaluate an ensemble from a saved training run
python optimol/ens_eval.py --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47/.hydra --config-name config.yaml hydra.run.dir=/tmp/test2 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0
python optimol/ens_eval.py --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47/.hydra --config-name config.yaml hydra.run.dir=/tmp/test3 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47/.hydra --config-name config.yaml hydra.run.dir=/tmp/test4 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null
# adock raw
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-16_22-41-51/.hydra --config-name config.yaml hydra.run.dir=/tmp/test5 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-16_22-41-51 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null
# adock normalized
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-17_14-57-32/.hydra --config-name config.yaml hydra.run.dir=/tmp/test5 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-17_14-57-32 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null
# binding affinity (BA) & poses
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-05_14-12-54/.hydra --config-name config.yaml hydra.run.dir=/tmp/test8 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-05_14-12-54 task_name=eval data.test_data_dir=/data/datasets/pdbbind/v2020/hf_datasets/core2016_pK_and_poses logger=null
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-06_17-45-51/.hydra --config-name config.yaml hydra.run.dir=/data/datasets/bindingdb/version_240608/adock/sim_0.5_max_16_poses_preds ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-06_17-45-51 task_name=eval data.test_data_dir=/data/datasets/bindingdb/version_240608/adock/hf_datasets/sim_0.5_max_16_poses logger=null
# test featurization w/o docking
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml hydra.run.dir=/tmp/test_docking ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/datasets/pdbbind/v2020/redocked_2020/train_test_splits/core2016_pK_sample5.csv logger=null data.target_field=protein data.ligand_field=ligand data.complex_id_field=pdb_id
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/datasets/pdbbind/v2020/redocked_2020/train_test_splits/core2016_pK_sample5.csv logger=null data.target_field=protein data.ligand_field=ligand data.complex_id_field=pdb_id data.ligand_dock.output_dir=/tmp/test_w_docking hydra.run.dir=/tmp/test_w_docking data.ligand_dock.enabled=True data.ligand_prep.enabled=True data.ligand_prep.output_dir=/tmp/test_w_docking
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/datasets/pdbbind/v2020/redocked_2020/train_test_splits/core2016_pK_sample5.csv logger=null data.target_field=protein data.ligand_field=ligand data.complex_id_field=pdb_id data.ligand_dock.enabled=True data.ligand_dock.n_poses=5 data.ligand_prep.enabled=True data.ligand_prep.n_confs=5 hydra.run.dir=/home/hashtawy/tmp/test_prep_n_dock
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/screens/xtarget/data/molecules.csv logger=null data.target_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_protein.pdb data.ligand_field=smiles data.complex_id_field=name data.ligand_dock.enabled=True data.ligand_dock.n_poses=2 data.ligand_prep.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.ligand_prep.enabled=True data.ligand_prep.n_confs=2 data.ligand_dock.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.target_field=protein hydra.run.dir=/home/hashtawy/tmp/xtarget_output1
optimol_score --config-path /data/models/optimol/models/artifacts/2024-08-06/2024-08-06_17-45-51/.hydra --config-name config.yaml ckpt_path=/data/models/optimol/models/artifacts/2024-08-06/2024-08-06_17-45-51 task_name=eval data.test_data=/data/screens/xtarget/data/molecules.csv logger=null data.target_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_protein.pdb data.ligand_field=smiles data.complex_id_field=name data.ligand_dock.enabled=True data.ligand_dock.n_poses=5 data.ligand_prep.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.ligand_prep.enabled=True data.ligand_prep.n_confs=5 data.ligand_dock.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.target_field=protein hydra.run.dir=/home/hashtawy/tmp/xtarget_output2 data.ligand_dock.docking_engines=qvina_w_gpu
Note: The checkpoint can be either a path or a URL.
All PyTorch Lightning modules are dynamically instantiated from module paths specified in the config. Example model config:
_target_: optimol.models.gnn_module.GnnLitModule
optimizer:
_target_: torch.optim.Adam
_partial_: true
lr: 0.0005
weight_decay: 0.00001
scheduler:
_target_: torch.optim.lr_scheduler.ReduceLROnPlateau
_partial_: true
mode: min
factor: 0.1
patience: 100000
ensemble_size: 5
net:
_target_: optimol.models.components.gnn.GNN
n_global_features: 0 #112
n_atom_types: 17
n_atom_embeddings: 36
# etc.

Using this config, we can instantiate the object with the following line:
model = hydra.utils.instantiate(config.model)

The config system makes experimenting with different models straightforward: simply define a new model's module path and parameters in a config file.
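Under the hood, Hydra resolves the `_target_` path to a class and calls it with the remaining keys as keyword arguments; `_partial_: true` wraps the target in `functools.partial` instead, so it can be finished later (as with the optimizer above, which still needs the model's parameters). A simplified, stdlib-only sketch of that idea, not Hydra's actual implementation:

```python
import importlib
from functools import partial

def instantiate(cfg: dict):
    """Toy version of hydra.utils.instantiate for flat configs."""
    cfg = dict(cfg)  # don't mutate the caller's config
    module_path, _, cls_name = cfg.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), cls_name)
    if cfg.pop("_partial_", False):
        # defer instantiation: the caller supplies the remaining args later
        return partial(target, **cfg)
    return target(**cfg)

# A partial, optimizer-style config: the final argument is supplied later,
# just as the model's parameters are passed to a partial optimizer.
opt_cfg = {"_target_": "collections.Counter", "_partial_": True}
make_counter = instantiate(opt_cfg)
print(make_counter("aab"))  # the object is only built at call time
```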
You can easily swap between different models and datamodules using command line flags:
python train.py model=sbgnn

An example pipeline managing the instantiation logic: optimol/train.py.
Location: configs/train.yaml
The main project config contains the default training configuration.
It determines how the config is composed when you simply execute `python train.py`.
# order of defaults determines the order in which configs override each other
defaults:
- _self_
- data: mol2d.yaml
- model: lbgnn.yaml
- callbacks: default.yaml
- logger: null # set logger here or use command line (e.g. `python train.py logger=csv`)
- trainer: default.yaml
- paths: default.yaml
- extras: default.yaml
- hydra: default.yaml
# experiment configs allow for version control of specific hyperparameters
# e.g. best hyperparameters for given model and datamodule
- experiment: null
# config for hyperparameter optimization
- hparams_search: null
# optional local config for machine/user specific settings
# it's optional since it doesn't need to exist and is excluded from version control
- optional local: default.yaml
# debugging config (enable through command line, e.g. `python train.py debug=default`)
- debug: null
# task name, determines output directory path
task_name: "train"
# tags to help you identify your experiments
# you can overwrite this in experiment configs
# overwrite from command line with `python train.py tags="[first_tag, second_tag]"`
# appending lists from command line is currently not supported :(
# https://github.com/facebookresearch/hydra/issues/1547
tags: ["dev"]
# set False to skip model training
train: True
# evaluate on test set, using best model weights achieved during training
# lightning chooses best weights based on the metric specified in checkpoint callback
test: True
# simply provide checkpoint path to resume training
ckpt_path: null
# seed for random number generators in pytorch, numpy and python.random
seed: null

Basic workflow
- Write your PyTorch Lightning module (see models/gnn_module.py for example)
- Write your PyTorch Lightning datamodule (see data/mol2d_datamodule.py for example)
- Run training:
python optimol/train.py
Hydra creates a new output directory for every executed run.
Default logging structure:
├── logs
│ ├── task_name
│ │ ├── runs # Logs generated by single runs
│ │ │ ├── YYYY-MM-DD_HH-MM-SS # Datetime of the run
│ │ │ │ ├── .hydra # Hydra logs
│ │ │ │ ├── csv # Csv logs
│ │ │ │ ├── wandb # Weights&Biases logs
│ │ │ │ ├── checkpoints # Training checkpoints
│ │ │ │ └── ... # Any other thing saved during training
│ │ │ └── ...
│ │ │
│ │ └── multiruns # Logs generated by multiruns
│ │ ├── YYYY-MM-DD_HH-MM-SS # Datetime of the multirun
│ │ │ ├──1 # Multirun job number
│ │ │ ├──2
│ │ │ └── ...
│ │ └── ...
│ │
│ └── debugs # Logs generated when debugging config is attached
│ └── ...
You can change this structure by modifying paths in hydra configuration.
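Assuming Hydra's default run-directory template, the timestamped directory name in the tree above is just the launch time formatted as `%Y-%m-%d_%H-%M-%S`:

```python
from datetime import datetime

# Reproduce a run-directory name for a known launch time; the format
# string matches the YYYY-MM-DD_HH-MM-SS pattern shown in the tree above.
launch_time = datetime(2024, 7, 4, 5, 39, 47)
run_dir = f"logs/train/runs/{launch_time:%Y-%m-%d_%H-%M-%S}"
print(run_dir)  # logs/train/runs/2024-07-04_05-39-47
```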
The template comes with generic tests implemented with pytest.
# run all tests
pytest
# run tests from specific file
pytest tests/test_train.py
# run all tests except the ones marked as slow
pytest -k "not slow"

Most of the implemented tests don't check for specific outputs; they simply verify that executing the commands doesn't throw exceptions. Run them once in a while to catch regressions early during development.
Currently, the tests cover cases like:
- running 1 train, val and test step
- running 1 epoch on 1% of data, saving ckpt and resuming for the second epoch
- running 2 epochs on 1% of data, with DDP simulated on CPU
And many others. You should be able to modify them easily for your use case.
There is also a @RunIf decorator that lets you run tests only if certain conditions are met, e.g. a GPU is available or the system is not Windows. See the examples.
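The idea behind @RunIf can be sketched with the stdlib alone: a decorator that turns the test into a no-op when its condition is false. The real decorator builds on pytest's skip machinery; this is an illustrative stand-in, and the function names here are hypothetical:

```python
import functools
import platform

def run_if(condition, reason=""):
    """Skip the wrapped test when `condition` is false (stand-in for @RunIf)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not condition:
                print(f"SKIPPED {fn.__name__}: {reason}")
                return None  # test body never runs
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@run_if(platform.system() != "Windows", reason="POSIX-only test")
def test_posix_paths():
    return "ran"

result = test_posix_paths()  # "ran" on non-Windows, None (skipped) on Windows
```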
The template comes with CI workflows implemented in GitHub Actions:
- .github/workflows/test.yaml: runs all tests with pytest
- .github/workflows/code-quality-main.yaml: runs pre-commit on the main branch for all files
- .github/workflows/code-quality-pr.yaml: runs pre-commit on pull requests for modified files only
Lightning supports multiple ways of doing distributed training. The most common one is DDP, which spawns a separate process for each GPU and averages gradients between them. To learn about other approaches, read the Lightning docs.
You can run DDP training with 4 GPUs like this:
python train.py trainer=ddp

Note: When using DDP you have to be careful how you write your models - read the docs.
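The gradient averaging that DDP performs can be illustrated in plain Python: each worker computes gradients on its own data shard, the per-parameter gradients are averaged across workers (an all-reduce), and every replica applies the same update so the copies stay in sync. A toy sketch with hand-rolled numbers in place of autograd:

```python
# Each "worker" holds the gradient it computed on its own data shard.
worker_grads = [
    [0.2, -0.4, 0.1],  # worker 0
    [0.4, -0.2, 0.3],  # worker 1
    [0.0, -0.6, 0.2],  # worker 2
    [0.2, -0.4, 0.0],  # worker 3
]

# All-reduce (mean): every worker ends up with the same averaged gradient,
# so all model replicas stay identical after the optimizer step.
n = len(worker_grads)
avg_grad = [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

# One SGD step applied identically on every replica.
weights = [1.0, 1.0, 1.0]
lr = 0.1
weights = [w - lr * g for w, g in zip(weights, avg_grad)]
print(avg_grad, weights)
```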
The simplest way is to pass a datamodule attribute directly to the model on initialization:
# ./optimol/train.py
datamodule = hydra.utils.instantiate(config.data)
model = hydra.utils.instantiate(config.model, some_param=datamodule.some_param)

Note: This is not a very robust solution, since it assumes all your datamodules have a `some_param` attribute available.
Similarly, you can pass a whole datamodule config as an init parameter:
# ./optimol/train.py
model = hydra.utils.instantiate(config.model, dm_conf=config.data, _recursive_=False)

You can also pass a datamodule config parameter to your model through variable interpolation:
# ./configs/model/my_model.yaml
_target_: optimol.models.my_module.MyLitModule
lr: 0.01
some_param: ${data.some_param}

Another approach is to access the datamodule in the LightningModule directly through the Trainer:
# ./optimol/models/gnn_module.py
def on_train_start(self):
self.some_param = self.trainer.datamodule.some_param

Note: This only works after training starts, since before then the trainer is not yet available in the LightningModule.
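The `${data.some_param}` syntax used above is OmegaConf variable interpolation: at resolution time the value is looked up at the dotted path in the composed config. A minimal stdlib sketch of that lookup (OmegaConf itself handles nesting, defaults, and errors far more thoroughly):

```python
import re

# A composed config where the model borrows a value from the data section.
config = {
    "data": {"some_param": 128},
    "model": {"some_param": "${data.some_param}", "lr": 0.01},
}

def resolve(value, root):
    """Replace a ${dotted.path} reference with the value found in `root`."""
    if not isinstance(value, str):
        return value
    match = re.fullmatch(r"\$\{([\w.]+)\}", value)
    if match is None:
        return value  # plain string, no interpolation
    node = root
    for key in match.group(1).split("."):
        node = node[key]
    return node

resolved = resolve(config["model"]["some_param"], config)
print(resolved)  # 128
```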
Optimol is distributed under the following license.
All Rights Reserved
Copyright (c) 2024 Optimol
This software and associated documentation files (the "Software") are proprietary and confidential.
No part of this Software may be reproduced, distributed, or transmitted in any form or by any means,
including photocopying, recording, or other electronic or mechanical methods, without the prior
written permission of the copyright holder.
Unauthorized copying, modification, distribution, public display, or use of this Software for any
purpose is strictly prohibited. The Software is protected by copyright law and international treaties.
This Software is provided by the copyright holders and contributors "AS IS" and any express or
implied warranties, including, but not limited to, the implied warranties of merchantability and
fitness for a particular purpose are disclaimed.
For licensing inquiries, please contact hashtawy@optimol.ai or hossam.ashtawy@gmail.com.