CatPred: A Comprehensive Framework for Deep Learning In Vitro Enzyme Kinetic Parameters



🚨 Announcements 📢

  • ✅ 28th Feb 2025 - Published in Nature Communications
  • ✅ 27th Dec 2024 - Updated repository with scripts to reproduce results from the manuscript.
  • 🚧 TODO
    • Add prediction codes for models using 3D-structural features.
    • Add instructions to install CatPred using a Docker image.



๐ŸŒ Google Colab Interface

For ease of use without any hardware requirements, a Google Colab interface is available here: tiny.cc/catpred. It contains sample data, instructions, and installation all in the Colab notebook.


💻 Local Installation

If you would like to install the package on a local machine, please follow the instructions below.

๐Ÿ–ฅ๏ธ System Requirements

  • For prediction: Any machine running a Linux-based operating system is recommended.
  • For training: A Linux-based operating system on a GPU-enabled machine is recommended.

Both training and prediction have been tested on Ubuntu 20.04.5 LTS with NVIDIA A10 and CUDA Version: 12.0.

To train or predict with GPUs, you will need:

  • CUDA >= 11.7
  • cuDNN

📥 Installation

Installation requires conda, so first install Miniconda from https://conda.io/miniconda.html.

Then proceed with the steps below to complete the installation. If creating the environment with conda seems to be taking too long, you can instead run conda install -c conda-forge mamba and then replace conda with mamba in each of the steps below.

Note for machines with GPUs: You may need to manually install a GPU-enabled version of PyTorch by following the instructions here. If PyTorch is not using your GPU after following the instructions below, check which version is installed in your environment using conda list | grep torch or similar. If the PyTorch line includes cpu, uninstall it with conda remove pytorch and reinstall a GPU-enabled version using the instructions at the link above.
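As a quick complement to conda list, the installed build can also be identified from torch.__version__, which carries a local tag such as +cu118 or +cpu. A minimal helper sketch (the function name is illustrative, not part of CatPred):

```python
def is_cpu_build(torch_version: str) -> bool:
    """Return True if the version string indicates a CPU-only PyTorch build.

    GPU builds carry a local tag like '2.1.0+cu118'; CPU builds use
    '2.1.0+cpu'. Some conda CPU builds carry no tag at all, so also
    check torch.cuda.is_available() at runtime.
    """
    return torch_version.split("+")[-1] == "cpu"

# Example usage (run inside the catpred environment):
#   import torch; print(is_cpu_build(torch.__version__))
print(is_cpu_build("2.1.0+cu118"))  # False
print(is_cpu_build("2.1.0+cpu"))    # True
```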

Installing and Downloading Pre-trained Models (~10 mins)

mkdir catpred_pipeline catpred_pipeline/results
cd catpred_pipeline
wget -c --tries=5 --timeout=30 https://catpred.s3.us-east-1.amazonaws.com/capsule_data_update.tar.gz || \
wget -c --tries=5 --timeout=30 https://catpred.s3.amazonaws.com/capsule_data_update.tar.gz
tar -xzf capsule_data_update.tar.gz
git clone https://github.com/maranasgroup/catpred.git
cd catpred
conda env create -f environment.yml
conda activate catpred
pip install -e .

stride is Linux-only and optional for the default demos. If needed for your workflow, install it separately on Linux:

conda install -c kimlab stride

🔮 Prediction

The Jupyter Notebook batch_demo.ipynb and the Python script demo_run.py show the usage of pre-trained models for prediction.

Input CSV requirements for demo_run.py and batch prediction:

  • Required columns: SMILES, sequence, pdbpath.
  • pdbpath must be unique per unique sequence. Reusing the same pdbpath for different sequences can produce incorrect cached embeddings.
  • Reusing the same pdbpath for repeated measurements of the same sequence is supported.
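A quick sanity check for this constraint takes only a few lines; this validator sketch is illustrative, not part of the CatPred codebase:

```python
from collections import defaultdict

def find_pdbpath_conflicts(rows):
    """Return pdbpath values shared by more than one distinct sequence
    (reuse within the same sequence is fine)."""
    seqs_per_path = defaultdict(set)
    for row in rows:
        seqs_per_path[row["pdbpath"]].add(row["sequence"])
    return {p: sorted(s) for p, s in seqs_per_path.items() if len(s) > 1}

rows = [
    {"SMILES": "CCO", "sequence": "ACDEF", "pdbpath": "seq_a"},  # ok
    {"SMILES": "CCN", "sequence": "ACDEF", "pdbpath": "seq_a"},  # ok: repeated measurement
    {"SMILES": "CCC", "sequence": "LMNPQ", "pdbpath": "seq_a"},  # conflict
]
print(find_pdbpath_conflicts(rows))  # {'seq_a': ['ACDEF', 'LMNPQ']}
```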

The helper script used to build protein records is:

python ./scripts/create_pdbrecords.py --data_file <input.csv> --out_file <input.json.gz>

CatPred currently expects one sequence per row. Multi-protein complexes (e.g., heteromers/homodimers) are not explicitly modeled as separate chains in the default prediction workflow.

For released benchmark datasets, the number of entries with 3D structure can be smaller than the total sequence/substrate pairs; 3D-derived artifacts are available only for the subset with valid structure mapping.

๐ŸŒ Web API (Optional)

CatPred also provides an optional FastAPI service for prediction workflows.

Install web dependencies:

pip install -e ".[web]"

Run the API:

catpred_web --host 0.0.0.0 --port 8000

Endpoints:

  • GET /health - liveness check.
  • GET /ready - backend configuration/readiness.
  • POST /predict - run inference.

By default, the API is hardened for service use:

  • input_file requests are disabled (use input_rows instead).
  • request-time overrides of repo_root / python_executable are disabled.
  • results_dir is constrained under CATPRED_API_RESULTS_ROOT.
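The results_dir constraint amounts to a path-containment check: resolve the requested directory against the configured root and reject anything that escapes it. A sketch of the idea (illustrative, not the service's actual code):

```python
from pathlib import Path

def resolve_results_dir(results_dir: str, root: str) -> Path:
    """Resolve results_dir strictly under the configured root,
    rejecting absolute paths and '..' escapes."""
    root_path = Path(root).resolve()
    # Joining with an absolute results_dir replaces root_path entirely,
    # so the relative_to() check below catches that case too.
    candidate = (root_path / results_dir).resolve()
    try:
        candidate.relative_to(root_path)
    except ValueError:
        raise ValueError(f"results_dir escapes {root_path}")
    return candidate

print(resolve_results_dir("batch1", "/srv/catpred-results"))
# resolve_results_dir("../elsewhere", "/srv/catpred-results")  # raises ValueError
```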

Minimal POST /predict example for local inference using input_rows:

curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "parameter": "kcat",
    "checkpoint_dir": "../data/pretrained/reproduce_checkpoints/kcat",
    "input_rows": [
      {"SMILES": "CCO", "sequence": "ACDEFGHIK", "pdbpath": "seq_a"},
      {"SMILES": "CCN", "sequence": "LMNPQRSTV", "pdbpath": "seq_b"}
    ],
    "results_dir": "batch1",
    "backend": "local"
  }'
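For scripted use, the same request can be built in Python with only the standard library. This sketch constructs the payload and request object but leaves the actual call commented out; endpoint and fields mirror the curl example above:

```python
import json
from urllib import request

payload = {
    "parameter": "kcat",
    "checkpoint_dir": "../data/pretrained/reproduce_checkpoints/kcat",
    "input_rows": [
        {"SMILES": "CCO", "sequence": "ACDEFGHIK", "pdbpath": "seq_a"},
        {"SMILES": "CCN", "sequence": "LMNPQRSTV", "pdbpath": "seq_b"},
    ],
    "results_dir": "batch1",
    "backend": "local",
}
body = json.dumps(payload).encode()
req = request.Request(
    "http://127.0.0.1:8000/predict",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment once the API is running locally:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```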

You can keep local inference as default and optionally enable Modal as another backend:

export CATPRED_DEFAULT_BACKEND=local
export CATPRED_MODAL_ENDPOINT="https://<your-modal-endpoint>"
export CATPRED_MODAL_TOKEN="<optional-token>"
export CATPRED_MODAL_FALLBACK_TO_LOCAL=1

Use "backend": "modal" in /predict requests to route through Modal. If fallback is enabled (env var above or request field fallback_to_local), failed modal requests can automatically reroute to local inference.
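The fallback behaviour described above can be pictured as a small dispatch function. This is an illustrative sketch of the routing logic, not the service's actual code; run_backend is a caller-supplied stand-in for real inference:

```python
def route_predict(run_backend, backend="local", fallback_to_local=False):
    """Dispatch to the requested backend; if the modal backend fails
    and fallback is enabled, reroute to local inference."""
    if backend != "modal":
        return run_backend("local")
    try:
        return run_backend("modal")
    except Exception:
        if fallback_to_local:
            return run_backend("local")
        raise

def fake_runner(name):
    """Stand-in backend: modal is down, local works."""
    if name == "modal":
        raise RuntimeError("modal endpoint unreachable")
    return f"ok:{name}"

print(route_predict(fake_runner))                                   # ok:local
print(route_predict(fake_runner, "modal", fallback_to_local=True))  # ok:local
```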

Optional API environment variables:

# Root directories used by API path constraints
export CATPRED_API_INPUT_ROOT="/absolute/path/for/input-csvs"
export CATPRED_API_RESULTS_ROOT="/absolute/path/for/results"
export CATPRED_API_CHECKPOINT_ROOT="/absolute/path/for/checkpoints"

# Enable only for trusted local workflows (not recommended for public deployments)
export CATPRED_API_ALLOW_INPUT_FILE=1
export CATPRED_API_ALLOW_UNSAFE_OVERRIDES=1

# Request limits
export CATPRED_API_MAX_INPUT_ROWS=1000
export CATPRED_API_MAX_INPUT_FILE_BYTES=5000000

Deserialization hardening controls:

# Trusted roots used by secure loaders (colon-separated list on Unix)
export CATPRED_TRUSTED_DESERIALIZATION_ROOTS="/srv/catpred:/srv/catpred-data"

# Backward-compatible default is enabled (1). Set to 0 to block unsafe pickle-based loading.
# Use 0 only after validating your artifacts are safe-load compatible.
export CATPRED_ALLOW_UNSAFE_DESERIALIZATION=1
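The trusted-roots control is again a containment test, applied per configured root. A sketch of parsing and checking (illustrative; the real loader logic may differ):

```python
import os

def parse_trusted_roots(value):
    """Split the colon-separated roots list (Unix convention) into
    absolute, normalized paths."""
    return [os.path.abspath(p) for p in value.split(":") if p]

def is_trusted(path, roots):
    """Return True if path lies under any trusted root
    (component-wise, so /srv/catpred-extra does not match /srv/catpred)."""
    path = os.path.abspath(path)
    return any(os.path.commonpath([path, r]) == r for r in roots)

roots = parse_trusted_roots("/srv/catpred:/srv/catpred-data")
print(is_trusted("/srv/catpred/checkpoints/kcat.pt", roots))  # True
print(is_trusted("/tmp/evil.pt", roots))                      # False
```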

🧪 Fine-Tuning On Custom Data

You can fine-tune CatPred on your own regression targets using train.py.

  1. Prepare train/val/test CSVs with at least:
    • SMILES
    • sequence
    • pdbpath (unique per unique sequence)
    • one numeric target column (for example: log10kcat_max)
  2. Build a protein-records file that covers all pdbpath values in your splits:

python ./scripts/create_pdbrecords.py --data_file <combined_or_train_csv> --out_file <protein_records.json.gz>

  3. Train:

python train.py \
  --protein_records_path <protein_records.json.gz> \
  --data_path <train.csv> \
  --separate_val_path <val.csv> \
  --separate_test_path <test.csv> \
  --dataset_type regression \
  --smiles_columns SMILES \
  --target_columns <target_column_name> \
  --add_esm_feats \
  --loss_function mve \
  --save_dir <output_checkpoint_dir>

For working end-to-end examples, see the training commands in scripts such as scripts/reproduce_figS10_catpred.sh.
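As a concrete starting point, the following sketch writes a minimal training CSV with the required columns; the rows and the log10kcat_max target name are illustrative placeholders, not real measurements:

```python
import csv

# Hypothetical rows: each needs SMILES, sequence, pdbpath
# (unique per sequence), and one numeric target column.
rows = [
    {"SMILES": "CCO", "sequence": "ACDEFGHIK", "pdbpath": "seq_a",
     "log10kcat_max": 1.23},
    {"SMILES": "CCN", "sequence": "LMNPQRSTV", "pdbpath": "seq_b",
     "log10kcat_max": -0.45},
]
with open("train.csv", "w", newline="") as fh:
    writer = csv.DictWriter(
        fh, fieldnames=["SMILES", "sequence", "pdbpath", "log10kcat_max"])
    writer.writeheader()
    writer.writerows(rows)
```

The same layout works for the validation and test splits passed via --separate_val_path and --separate_test_path.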

🔄 Reproducing Publication Results

We provide three ways to reproduce the results of the publication.

1. Quick Method ⚡

Estimated run time: a few minutes

Run using:

./reproduce_quick.sh

For all results pertaining to CatPred, UniKP, DLKcat, and Baseline models, this method uses only pre-computed predictions and analyses to reproduce the results of the publication, including all main and supplementary figures.

2. Prediction Method 🛠️

Estimated run time: Up to a day depending on your GPU

Run using:

./reproduce_prediction.sh

For results pertaining to CatPred, this method uses pre-trained models to perform predictions on test sets. For results pertaining to UniKP, DLKcat, and Baseline, it uses only pre-computed predictions and analyses. Together these reproduce the results of the publication, including all main and supplementary figures.

3. Training Method ๐Ÿ‹๏ธ

Estimated run time: up to 12-14 days, depending on your GPU

Run using:

./reproduce_training.sh

For all results pertaining to CatPred, UniKP, DLKcat, and Baseline models, this method trains everything from scratch. It then uses the trained checkpoints to make predictions and analyzes them to reproduce the results of the publication, including all main and supplementary figures.


๐Ÿ™ Acknowledgements

We thank the authors of the following open-source repositories:

  • Chemprop - Much of the functionality in this codebase is adapted from the Chemprop library.
  • Rotary PyTorch - The rotary positional embeddings used in Seq-Attn are from Rotary PyTorch.
  • Progres - Protein graph embedding search using pre-trained EGNN models is from Progres.

📜 License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.


📖 Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

@article {Boorla2024.03.10.584340,
	author = {Veda Sheersh Boorla and Costas D. Maranas},
	title = {CatPred: A comprehensive framework for deep learning in vitro enzyme kinetic parameters kcat, Km and Ki},
	elocation-id = {2024.03.10.584340},
	year = {2024},
	doi = {10.1101/2024.03.10.584340},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/03/26/2024.03.10.584340},
	eprint = {https://www.biorxiv.org/content/early/2024/03/26/2024.03.10.584340.full.pdf},
	journal = {bioRxiv}
}
