ashtawy/optimol

📌  Introduction

This Python package is designed for training, validating, and deploying machine learning models for drug discovery applications, including bioactivity and property prediction. Specifically, it allows you to train and evaluate graph neural network (GNN) and SE(3) equivariant Transformer models for both regression and classification tasks.

The current GNN and SE(3) equivariant Transformer models offer several capabilities:

  1. They simultaneously utilize atom features, edge features, 3D conformations, and global molecular features.
  2. They can be trained as a single model or as an ensemble of models.
  3. They support multi-task learning, enabling prediction of multiple tasks concurrently.
  4. They accommodate both target-based and ligand-based prediction tasks.
  5. Their loss function supports both classification and regression tasks.

Install using conda

Development

git clone git@github.com:ashtawy/optimol.git
cd optimol
conda env create -f environment.yaml -n optimol
conda activate optimol
export PYTHONNOUSERSITE=1
pip install --no-deps -e .

Production

pip install git+https://github.com/ashtawy/optimol.git

Usage

This is a quick benchmark experiment that tests a ligand-based model on experimental microsomal clearance data from the Therapeutics Data Commons (TDC).

To train and test the model, run:

# assuming you copied ./data/tdc_microsome_train_val.csv & ./data/tdc_microsome_test.csv from this repo to /tmp/tdc_clearance
optimol_train data=mol2d \
            data.train_data=/tmp/tdc_clearance/tdc_microsome_train_val.csv \
            data.test_data=/tmp/tdc_clearance/tdc_microsome_test.csv \
            data.ligand_field=smiles \
            data.ligand_id_field=name \
            data.tasks='{Y_log:regression}' \
            model.ensemble_size=5 \
            hydra.run.dir=/tmp/tdc_clearance/model_and_evals
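
The training command expects plain CSV input; the column names follow the flags above (`data.ligand_field=smiles`, `data.ligand_id_field=name`, and a `Y_log` column for the regression task). A minimal sketch of a conforming file, with made-up molecules and dummy values (not real clearance data):

```python
import csv
import os

# Columns inferred from the optimol_train flags above:
# data.ligand_field=smiles, data.ligand_id_field=name, data.tasks='{Y_log:regression}'
rows = [
    {"name": "mol_1", "smiles": "CCO", "Y_log": 1.23},       # dummy value for illustration
    {"name": "mol_2", "smiles": "c1ccccc1", "Y_log": 0.87},  # dummy value for illustration
]

os.makedirs("/tmp/tdc_clearance", exist_ok=True)
with open("/tmp/tdc_clearance/tiny_train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "smiles", "Y_log"])
    writer.writeheader()
    writer.writerows(rows)
```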

To use the model later for scoring, run:

# assuming you copied ./data/tdc_microsome_test.csv from this repo to /tmp/tdc_clearance
optimol_score --config-path /tmp/tdc_clearance/model_and_evals/.hydra  \
        --config-name config.yaml \
        ckpt_path=/tmp/tdc_clearance/model_and_evals \
        task_name=eval \
        data.test_data=/tmp/tdc_clearance/tdc_microsome_test.csv \
        logger=null \
        data.ligand_field=smiles \
        data.ligand_id_field=name \
        hydra.run.dir=/tmp/tdc_clearance/predictions

Train a Ligand-Based GNN model on Experimental Microsomal Clearance Data

Project Structure

The directory structure of the project looks like this:

├── .github                   <- Github Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── data                     <- Data configs
│   ├── debug                    <- Debugging configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hparams_search           <- Hyperparameter search configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   ├── trainer                  <- Trainer configs
│   │
│   ├── eval.yaml             <- Main config for evaluation
│   └── train.yaml            <- Main config for training
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra and lightning loggers
│
├── notebooks              <- Jupyter notebooks. Naming convention is a number (for ordering),
│                             the creator's initials, and a short `-` delimited description,
│                             e.g. `1.0-jqp-initial-data-exploration.ipynb`.
│
├── scripts                <- Shell scripts
│
├── optimol                   <- Source code
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   ├── eval.py                  <- Run evaluation
│   └── train.py                 <- Run training
│
├── tests                  <- Tests of any kind
│
├── .env.example              <- Example of file for storing private environment variables
├── .gitignore                <- List of files ignored by git
├── .pre-commit-config.yaml   <- Configuration of pre-commit hooks for code formatting
├── .project-root             <- File for inferring the position of project root directory
├── environment.yaml          <- File for installing conda environment
├── Makefile                  <- Makefile with commands like `make train` or `make test`
├── pyproject.toml            <- Configuration options for testing and linting
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md

🚀  Quickstart

Quick usage tips

Override any config parameter from command line
python train.py trainer.max_epochs=20 model.optimizer.lr=1e-4

Note: You can also add new parameters with the `+` sign.

python train.py +model.new_param="owo"

Note: You can also replace entire config groups and override many parameters in a single command:

# swap whole config sections (data, model) and override individual fields from the CLI
python train.py --config-path /home/hashtawy/configs --config-name train.yaml data=mol2d data.train_data=/data/experiments/exp1/train_mols.csv data.test_data=/data/experiments/exp1/test_mols.csv data.tasks=null data.tasks='{is_active:classification}' data.ligand_field=smiles data.ligand_id_field=name hydra.run.dir=/data/experiments/exp1/output_dir model=lbgnn model.ensemble_size=1 trainer.max_epochs=10
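
Conceptually, each `a.b.c=value` token walks the composed config tree and replaces a leaf, and the `+` prefix allows creating a key that doesn't exist yet. A toy illustration of that merge logic (not Hydra's actual implementation):

```python
# Toy sketch of how dotted CLI overrides like `trainer.max_epochs=20`
# update a nested config dict. Hydra/OmegaConf do much more (type checking,
# interpolation, struct mode); this only shows the core idea.
def apply_override(cfg: dict, override: str) -> None:
    key, _, raw = override.partition("=")
    *parents, leaf = key.lstrip("+").split(".")  # `+` allows adding new keys
    node = cfg
    for parent in parents:
        node = node.setdefault(parent, {})
    # naive value parsing: int, then float, else keep the raw string
    for cast in (int, float):
        try:
            node[leaf] = cast(raw)
            return
        except ValueError:
            pass
    node[leaf] = raw

cfg = {"trainer": {"max_epochs": 100}, "model": {"optimizer": {"lr": 0.0005}}}
for ov in ["trainer.max_epochs=20", "model.optimizer.lr=1e-4", "+model.new_param=owo"]:
    apply_override(cfg, ov)
```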
Train on CPU, GPU, multi-GPU and TPU
# train on CPU
python train.py trainer=cpu

# train on 1 GPU
python train.py trainer=gpu

# train on TPU
python train.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python train.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python train.py trainer=mps
Evaluate checkpoint on test dataset
python eval.py ckpt_path="/path/to/ckpt/name.ckpt"

The commands below are example invocations against specific training runs; replace the run directories and dataset paths with your own:
python optimol/ens_eval.py --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47/.hydra --config-name config.yaml hydra.run.dir=/tmp/test2 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0

python optimol/ens_eval.py --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47/.hydra --config-name config.yaml hydra.run.dir=/tmp/test3 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null

optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47/.hydra --config-name config.yaml hydra.run.dir=/tmp/test4 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-04_05-39-47 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null

# adock raw
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-16_22-41-51/.hydra --config-name config.yaml hydra.run.dir=/tmp/test5 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-16_22-41-51 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null

# adock normalized
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-17_14-57-32/.hydra --config-name config.yaml hydra.run.dir=/tmp/test5 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-07-17_14-57-32 task_name=eval data.test_data_dir=/data/datasets/enamine/real_db/version_240222/hf_datasets/exp_shard_0 logger=null

# BA & Poses
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-05_14-12-54/.hydra --config-name config.yaml hydra.run.dir=/tmp/test8 ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-05_14-12-54 task_name=eval data.test_data_dir=/data/datasets/pdbbind/v2020/hf_datasets/core2016_pK_and_poses logger=null

optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-06_17-45-51/.hydra --config-name config.yaml hydra.run.dir=/data/datasets/bindingdb/version_240608/adock/sim_0.5_max_16_poses_preds ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-06_17-45-51 task_name=eval data.test_data_dir=/data/datasets/bindingdb/version_240608/adock/hf_datasets/sim_0.5_max_16_poses logger=null

# test featurization w/o docking
optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml hydra.run.dir=/tmp/test_docking ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/datasets/pdbbind/v2020/redocked_2020/train_test_splits/core2016_pK_sample5.csv logger=null data.target_field=protein data.ligand_field=ligand data.complex_id_field=pdb_id

optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/datasets/pdbbind/v2020/redocked_2020/train_test_splits/core2016_pK_sample5.csv logger=null data.target_field=protein data.ligand_field=ligand data.complex_id_field=pdb_id data.ligand_dock.output_dir=/tmp/test_w_docking hydra.run.dir=/tmp/test_w_docking data.ligand_dock.enabled=True data.ligand_prep.enabled=True data.ligand_prep.output_dir=/tmp/test_w_docking


optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/datasets/pdbbind/v2020/redocked_2020/train_test_splits/core2016_pK_sample5.csv logger=null data.target_field=protein data.ligand_field=ligand data.complex_id_field=pdb_id data.ligand_dock.enabled=True data.ligand_dock.n_poses=5 data.ligand_prep.enabled=True data.ligand_prep.n_confs=5 hydra.run.dir=/home/hashtawy/tmp/test_prep_n_dock

optimol_score --config-path /home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04/.hydra --config-name config.yaml ckpt_path=/home/hashtawy/projects/light_hydra/logs/train/runs/2024-08-28_04-31-04 task_name=eval data.test_data=/data/screens/xtarget/data/molecules.csv logger=null data.target_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_protein.pdb data.ligand_field=smiles data.complex_id_field=name data.ligand_dock.enabled=True data.ligand_dock.n_poses=2 data.ligand_prep.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.ligand_prep.enabled=True data.ligand_prep.n_confs=2 data.ligand_dock.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.target_field=protein hydra.run.dir=/home/hashtawy/tmp/xtarget_output1


optimol_score --config-path /data/models/optimol/models/artifacts/2024-08-06/2024-08-06_17-45-51/.hydra --config-name config.yaml ckpt_path=/data/models/optimol/models/artifacts/2024-08-06/2024-08-06_17-45-51 task_name=eval data.test_data=/data/screens/xtarget/data/molecules.csv logger=null data.target_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_protein.pdb data.ligand_field=smiles data.complex_id_field=name data.ligand_dock.enabled=True data.ligand_dock.n_poses=5 data.ligand_prep.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.ligand_prep.enabled=True data.ligand_prep.n_confs=5 data.ligand_dock.crystal_ligand_path=/data/datasets/pdbbind/v2020/v2020-general-PL/xtarget/xtarget_ligand.sdf data.target_field=protein hydra.run.dir=/home/hashtawy/tmp/xtarget_output2 data.ligand_dock.docking_engines=qvina_w_gpu

Note: The checkpoint can be either a path or a URL.

How It Works

All PyTorch Lightning modules are dynamically instantiated from module paths specified in the config. Example model config:

_target_: optimol.models.gnn_module.GnnLitModule

optimizer:
  _target_: torch.optim.Adam
  _partial_: true
  lr: 0.0005
  weight_decay: 0.00001

scheduler:
  _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
  _partial_: true
  mode: min
  factor: 0.1
  patience: 100000

ensemble_size: 5

net:
  _target_: optimol.models.components.gnn.GNN
  n_global_features: 0 #112
  n_atom_types: 17
  n_atom_embeddings: 36
  # etc etc

Using this config we can instantiate the object with the following line:

model = hydra.utils.instantiate(config.model)
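
Under the hood, `_target_` names a dotted import path, the remaining keys become constructor kwargs, and `_partial_: true` defers the call. A simplified sketch of that behavior (Hydra's real `instantiate` also handles nested configs, interpolation, and `_recursive_`), demoed with a stdlib class to stay self-contained:

```python
import functools
import importlib

def instantiate(conf: dict):
    """Simplified sketch of hydra.utils.instantiate: import the `_target_`
    dotted path and call it with the remaining keys as kwargs. With
    `_partial_: true`, return a functools.partial instead of an instance."""
    conf = dict(conf)  # avoid mutating the caller's config
    module_path, _, attr_name = conf.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    if conf.pop("_partial_", False):
        return functools.partial(target, **conf)
    return target(**conf)

# demo with a stdlib target instead of optimol/torch classes
counter = instantiate({"_target_": "collections.Counter", "spam": 3})
partial_counter = instantiate({"_target_": "collections.Counter", "_partial_": True, "eggs": 2})
```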

The config system makes experimenting with different models straightforward - simply define a new model's module path and parameters in a config file.

You can easily swap between different models and datamodules using command line flags:

python train.py model=sbgnn

Example pipeline managing the instantiation logic: optimol/train.py.


Main Config

Location: configs/train.yaml
Main project config contains default training configuration.
It determines how the config is composed when simply executing the command `python train.py`.

Show main project config
# order of defaults determines the order in which configs override each other
defaults:
  - _self_
  - data: mol2d.yaml
  - model: lbgnn.yaml
  - callbacks: default.yaml
  - logger: null # set logger here or use command line (e.g. `python train.py logger=csv`)
  - trainer: default.yaml
  - paths: default.yaml
  - extras: default.yaml
  - hydra: default.yaml

  # experiment configs allow for version control of specific hyperparameters
  # e.g. best hyperparameters for given model and datamodule
  - experiment: null

  # config for hyperparameter optimization
  - hparams_search: null

  # optional local config for machine/user specific settings
  # it's optional since it doesn't need to exist and is excluded from version control
  - optional local: default.yaml

  # debugging config (enable through command line, e.g. `python train.py debug=default`)
  - debug: null

# task name, determines output directory path
task_name: "train"

# tags to help you identify your experiments
# you can overwrite this in experiment configs
# overwrite from command line with `python train.py tags="[first_tag, second_tag]"`
# appending lists from command line is currently not supported :(
# https://github.com/facebookresearch/hydra/issues/1547
tags: ["dev"]

# set False to skip model training
train: True

# evaluate on test set, using best model weights achieved during training
# lightning chooses best weights based on the metric specified in checkpoint callback
test: True

# simply provide checkpoint path to resume training
ckpt_path: null

# seed for random number generators in pytorch, numpy and python.random
seed: null

Workflow

Basic workflow

  1. Write your PyTorch Lightning module (see models/gnn_module.py for example)
  2. Write your PyTorch Lightning datamodule (see data/mol2d_datamodule.py for example)
  3. Run training:
    python optimol/train.py

Logs

Hydra creates a new output directory for every executed run.

Default logging structure:

├── logs
│   ├── task_name
│   │   ├── runs                        # Logs generated by single runs
│   │   │   ├── YYYY-MM-DD_HH-MM-SS       # Datetime of the run
│   │   │   │   ├── .hydra                  # Hydra logs
│   │   │   │   ├── csv                     # Csv logs
│   │   │   │   ├── wandb                   # Weights&Biases logs
│   │   │   │   ├── checkpoints             # Training checkpoints
│   │   │   │   └── ...                     # Any other thing saved during training
│   │   │   └── ...
│   │   │
│   │   └── multiruns                   # Logs generated by multiruns
│   │       ├── YYYY-MM-DD_HH-MM-SS       # Datetime of the multirun
│   │       │   ├── 1                       # Multirun job number
│   │       │   ├── 2
│   │       │   └── ...
│   │       └── ...
│   │
│   └── debugs                          # Logs generated when debugging config is attached
│       └── ...

You can change this structure by modifying paths in hydra configuration.


Tests

The template comes with generic tests implemented with pytest.

# run all tests
pytest

# run tests from specific file
pytest tests/test_train.py

# run all tests except the ones marked as slow
pytest -k "not slow"

Most of the implemented tests don't check for any specific output; they simply verify that executing the commands doesn't throw exceptions. Running them once in a while helps catch regressions early during development.

Currently, the tests cover cases like:

  • running 1 train, val and test step
  • running 1 epoch on 1% of data, saving ckpt and resuming for the second epoch
  • running 2 epochs on 1% of data, with DDP simulated on CPU

And many others. You should be able to modify them easily for your use case.

There is also a @RunIf decorator, which allows you to run tests only if certain conditions are met, e.g. a GPU is available or the system is not Windows. See the examples.
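
@RunIf in the template is built on pytest markers; the stdlib's `unittest.skipIf` captures the same idea. A minimal, self-contained sketch of condition-based test skipping (the class and test names here are hypothetical):

```python
import sys
import unittest

class TestPlatformSpecific(unittest.TestCase):
    # analogous in spirit to skipping a test unless the platform supports it,
    # e.g. @RunIf-style "skip on Windows" conditions
    @unittest.skipIf(sys.platform == "win32", "not supported on Windows")
    def test_posix_only(self):
        self.assertTrue(True)

# run the suite programmatically; skipped tests still count as a successful run
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestPlatformSpecific)
)
```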


Continuous Integration

The template comes with CI workflows implemented in GitHub Actions:

  • .github/workflows/test.yaml: running all tests with pytest
  • .github/workflows/code-quality-main.yaml: running pre-commits on main branch for all files
  • .github/workflows/code-quality-pr.yaml: running pre-commits on pull requests for modified files only

Distributed Training

Lightning supports multiple ways of doing distributed training. The most common is DDP, which spawns a separate process for each GPU and averages gradients between them. To learn about other approaches, read the Lightning docs.

You can run DDP training with 4 GPUs like this:

python train.py trainer=ddp trainer.devices=4

Note: When using DDP you have to be careful how you write your models - read the docs.


Accessing Datamodule Attributes In Model

The simplest way is to pass datamodule attribute directly to model on initialization:

# ./optimol/train.py
datamodule = hydra.utils.instantiate(config.data)
model = hydra.utils.instantiate(config.model, some_param=datamodule.some_param)

Note: This is not a very robust solution, since it assumes all your datamodules have a some_param attribute available.

Similarly, you can pass a whole datamodule config as an init parameter:

# ./optimol/train.py
model = hydra.utils.instantiate(config.model, dm_conf=config.data, _recursive_=False)

You can also pass a datamodule config parameter to your model through variable interpolation:

# ./configs/model/my_model.yaml
_target_: optimol.models.my_module.MyLitModule
lr: 0.01
some_param: ${data.some_param}
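
The `${data.some_param}` syntax is OmegaConf variable interpolation, resolved when the config value is accessed. A toy illustration of what the resolver does (this is not OmegaConf's implementation):

```python
import re

def resolve(value, root: dict):
    """Toy version of OmegaConf interpolation: replace ${a.b.c} with the
    value found at that dotted path in `root`. Non-reference values pass
    through unchanged."""
    if not isinstance(value, str):
        return value
    match = re.fullmatch(r"\$\{([\w.]+)\}", value)
    if not match:
        return value
    node = root
    for key in match.group(1).split("."):
        node = node[key]
    return node

cfg = {
    "data": {"some_param": 42},
    "model": {"lr": 0.01, "some_param": "${data.some_param}"},
}
resolved = resolve(cfg["model"]["some_param"], cfg)
```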

Another approach is to access datamodule in LightningModule directly through Trainer:

# ./optimol/models/gnn_module.py
def on_train_start(self):
  self.some_param = self.trainer.datamodule.some_param

Note: This only works after training starts, since the trainer won't yet be available in the LightningModule before then.


License

Optimol is distributed under the following license.

All Rights Reserved

Copyright (c) 2024 Optimol

This software and associated documentation files (the "Software") are proprietary and confidential. 
No part of this Software may be reproduced, distributed, or transmitted in any form or by any means, 
including photocopying, recording, or other electronic or mechanical methods, without the prior 
written permission of the copyright holder.

Unauthorized copying, modification, distribution, public display, or use of this Software for any 
purpose is strictly prohibited. The Software is protected by copyright law and international treaties.

This Software is provided by the copyright holders and contributors "AS IS" and any express or 
implied warranties, including, but not limited to, the implied warranties of merchantability and 
fitness for a particular purpose are disclaimed.

For licensing inquiries, please contact hashtawy@optimol.ai or hossam.ashtawy@gmail.com.


About

Train structure-based and ligand-based deep learning models for predicting potency, ADMET, and general molecular properties
