- 28th Feb 2025 - Published in Nature Communications
- 27th Dec 2024 - Updated repository with scripts to reproduce results from the manuscript.
- TODO
- Add prediction codes for models using 3D-structural features.
- Add instructions to install CatPred using a Docker image.
For ease of use without any hardware requirements, a Google Colab interface is available at tiny.cc/catpred. The notebook includes sample data, instructions, and the complete installation.
If you would like to install the package on a local machine, please follow the instructions below.
- For prediction: Any machine running a Linux-based operating system is recommended.
- For training: A Linux-based operating system on a GPU-enabled machine is recommended.
Both training and prediction have been tested on Ubuntu 20.04.5 LTS with NVIDIA A10 and CUDA Version: 12.0.
To train or predict with GPUs, you will need:
- CUDA >= 11.7
- cuDNN
Both options require conda, so first install Miniconda from https://conda.io/miniconda.html.
Then proceed to either option below to complete the installation. If creating the environment with conda is taking too long, you can try installing mamba with `conda install -c conda-forge mamba` and then replacing `conda` with `mamba` in each of the steps below.
Note for machines with GPUs: you may need to manually install a GPU-enabled version of PyTorch by following the instructions here. If PyTorch is not using your GPU after following the instructions below, check which version is installed in your environment using `conda list | grep torch` or similar. If the PyTorch entry includes `cpu`, uninstall it with `conda remove pytorch` and reinstall a GPU-enabled version using the instructions at the link above.
```bash
mkdir catpred_pipeline catpred_pipeline/results
cd catpred_pipeline
wget -c --tries=5 --timeout=30 https://catpred.s3.us-east-1.amazonaws.com/capsule_data_update.tar.gz || \
wget -c --tries=5 --timeout=30 https://catpred.s3.amazonaws.com/capsule_data_update.tar.gz
tar -xzf capsule_data_update.tar.gz
git clone https://github.com/maranasgroup/catpred.git
cd catpred
conda env create -f environment.yml
conda activate catpred
pip install -e .
```

`stride` is Linux-only and optional for the default demos. If needed for your workflow, install it separately on Linux:

```bash
conda install -c kimlab stride
```

The Jupyter notebook `batch_demo.ipynb` and the Python script `demo_run.py` demonstrate how to use the pre-trained models for prediction.
Input CSV requirements for `demo_run.py` and batch prediction:

- Required columns: `SMILES`, `sequence`, `pdbpath`.
- `pdbpath` must be unique per unique sequence. Reusing the same `pdbpath` for different sequences can produce incorrect cached embeddings.
- Reusing the same `pdbpath` for repeated measurements of the same sequence is supported.
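A minimal input CSV illustrating these rules (SMILES, sequences, and `pdbpath` values below are made-up placeholders) might look like:

```csv
SMILES,sequence,pdbpath
CCO,MKTAYIAKQR,prot_a
CCN,MKTAYIAKQR,prot_a
CCC,MLDWNTTQSV,prot_b
```

Here both rows for sequence `MKTAYIAKQR` reuse `prot_a` (repeated measurements of the same sequence), while the distinct sequence gets its own `prot_b`.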
The helper script used to build protein records is:

```bash
python ./scripts/create_pdbrecords.py --data_file <input.csv> --out_file <input.json.gz>
```

CatPred currently expects one sequence per row. Multi-protein complexes (e.g., heteromers or homodimers) are not explicitly modeled as separate chains in the default prediction workflow.
For the released benchmark datasets, the number of entries with 3D structure can be smaller than the total number of sequence/substrate pairs; 3D-derived artifacts are available only for the subset with valid structure mapping.
CatPred also provides an optional FastAPI service for prediction workflows.
Install web dependencies:

```bash
pip install -e ".[web]"
```

Run the API:

```bash
catpred_web --host 0.0.0.0 --port 8000
```

Endpoints:

- `GET /health` - liveness check.
- `GET /ready` - backend configuration/readiness.
- `POST /predict` - run inference.
By default, the API is hardened for service use:
- `input_file` requests are disabled (use `input_rows` instead).
- Request-time overrides of `repo_root`/`python_executable` are disabled.
- `results_dir` is constrained under `CATPRED_API_RESULTS_ROOT`.
Minimal `POST /predict` example for local inference using `input_rows`:

```bash
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "parameter": "kcat",
    "checkpoint_dir": "../data/pretrained/reproduce_checkpoints/kcat",
    "input_rows": [
      {"SMILES": "CCO", "sequence": "ACDEFGHIK", "pdbpath": "seq_a"},
      {"SMILES": "CCN", "sequence": "LMNPQRSTV", "pdbpath": "seq_b"}
    ],
    "results_dir": "batch1",
    "backend": "local"
  }'
```

You can keep local inference as the default and optionally enable Modal as another backend:

```bash
export CATPRED_DEFAULT_BACKEND=local
export CATPRED_MODAL_ENDPOINT="https://<your-modal-endpoint>"
export CATPRED_MODAL_TOKEN="<optional-token>"
export CATPRED_MODAL_FALLBACK_TO_LOCAL=1
```

Use `"backend": "modal"` in `/predict` requests to route through Modal. If fallback is enabled (via the environment variable above or the request field `fallback_to_local`), failed Modal requests can automatically reroute to local inference.
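If you prefer a Python client over curl, a minimal sketch of assembling the `/predict` request body follows. This helper is not part of the CatPred API itself; the endpoint URL, checkpoint path, and row values are placeholders.

```python
# Sketch (assumption: not shipped with CatPred): assemble the JSON body
# for a POST /predict request. All paths and values are placeholders.
import json

def build_predict_payload(rows, checkpoint_dir, results_dir,
                          parameter="kcat", backend="local",
                          fallback_to_local=False):
    """Build the JSON body for POST /predict."""
    payload = {
        "parameter": parameter,
        "checkpoint_dir": checkpoint_dir,
        "input_rows": rows,
        "results_dir": results_dir,
        "backend": backend,
    }
    # Only meaningful for the Modal backend: ask the API to fall back
    # to local inference if the Modal request fails.
    if backend == "modal" and fallback_to_local:
        payload["fallback_to_local"] = True
    return json.dumps(payload)
```

You could then send the result with, for example, `requests.post("http://127.0.0.1:8000/predict", data=..., headers={"Content-Type": "application/json"})`.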
Optional API environment variables:

```bash
# Root directories used by API path constraints
export CATPRED_API_INPUT_ROOT="/absolute/path/for/input-csvs"
export CATPRED_API_RESULTS_ROOT="/absolute/path/for/results"
export CATPRED_API_CHECKPOINT_ROOT="/absolute/path/for/checkpoints"

# Enable only for trusted local workflows (not recommended for public deployments)
export CATPRED_API_ALLOW_INPUT_FILE=1
export CATPRED_API_ALLOW_UNSAFE_OVERRIDES=1

# Request limits
export CATPRED_API_MAX_INPUT_ROWS=1000
export CATPRED_API_MAX_INPUT_FILE_BYTES=5000000
```

Deserialization hardening controls:

```bash
# Trusted roots used by secure loaders (colon-separated list on Unix)
export CATPRED_TRUSTED_DESERIALIZATION_ROOTS="/srv/catpred:/srv/catpred-data"
# The backward-compatible default is enabled (1). Set to 0 to block unsafe pickle-based loading.
# Use 0 only after validating that your artifacts are safe-load compatible.
export CATPRED_ALLOW_UNSAFE_DESERIALIZATION=1
```

You can fine-tune CatPred on your own regression targets using `train.py`.
- Prepare train/val/test CSVs with at least:
  - `SMILES`
  - `sequence`
  - `pdbpath` (unique per unique sequence)
  - one numeric target column (for example, `log10kcat_max`)
- Build a protein-records file that covers all `pdbpath` values in your splits:

  ```bash
  python ./scripts/create_pdbrecords.py --data_file <combined_or_train_csv> --out_file <protein_records.json.gz>
  ```

- Train:
  ```bash
  python train.py \
    --protein_records_path <protein_records.json.gz> \
    --data_path <train.csv> \
    --separate_val_path <val.csv> \
    --separate_test_path <test.csv> \
    --dataset_type regression \
    --smiles_columns SMILES \
    --target_columns <target_column_name> \
    --add_esm_feats \
    --loss_function mve \
    --save_dir <output_checkpoint_dir>
  ```

For working end-to-end examples, see the training commands in scripts such as `scripts/reproduce_figS10_catpred.sh`.
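One way to ensure the protein-records file covers every `pdbpath` across your splits is to concatenate the splits before running `create_pdbrecords.py`. A minimal sketch using pandas (this helper and the file names are illustrative, not part of CatPred):

```python
# Sketch: merge train/val/test splits into one CSV so that
# create_pdbrecords.py sees every pdbpath. File names are placeholders.
import pandas as pd

def combine_splits(split_paths, out_path):
    """Concatenate split CSVs, keeping one row per unique pdbpath."""
    combined = pd.concat(
        [pd.read_csv(p) for p in split_paths], ignore_index=True
    ).drop_duplicates(subset="pdbpath")
    combined.to_csv(out_path, index=False)
    return combined
```

You could then run `python ./scripts/create_pdbrecords.py --data_file combined.csv --out_file protein_records.json.gz` on the combined file.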
We provide three ways to reproduce the results of the publication.
Estimated run time: a few minutes

Run using:

```bash
./reproduce_quick.sh
```

For all results pertaining to the CatPred, UniKP, DLKcat, and Baseline models, this method uses only pre-trained predictions and analyses to reproduce the results of the publication, including all main and supplementary figures.
Estimated run time: up to a day, depending on your GPU

Run using:

```bash
./reproduce_prediction.sh
```

For results pertaining to CatPred, this method uses pre-trained models to perform predictions on the test sets. For results pertaining to UniKP, DLKcat, and Baseline, it uses only pre-trained predictions and analyses to reproduce the results of the publication, including all main and supplementary figures.
Estimated run time: up to 12-14 days, depending on your GPU

Run using:

```bash
./reproduce_training.sh
```

For all results pertaining to the CatPred, UniKP, DLKcat, and Baseline models, this method trains everything from scratch. It then uses the trained checkpoints to make predictions and analyzes them to reproduce the results of the publication, including all main and supplementary figures.
We thank the authors of the following open-source repositories:
- Chemprop - Much of the functionality in this codebase is inspired by the Chemprop library.
- Rotary PyTorch - The rotary positional embeddings for Seq-Attn are from Rotary PyTorch.
- Progres - Protein graph embedding search uses pre-trained EGNN models from Progres.
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
If you find the models useful in your research, we ask that you cite the relevant paper:
```bibtex
@article{Boorla2024.03.10.584340,
  author = {Veda Sheersh Boorla and Costas D. Maranas},
  title = {CatPred: A comprehensive framework for deep learning in vitro enzyme kinetic parameters kcat, Km and Ki},
  elocation-id = {2024.03.10.584340},
  year = {2024},
  doi = {10.1101/2024.03.10.584340},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/03/26/2024.03.10.584340},
  eprint = {https://www.biorxiv.org/content/early/2024/03/26/2024.03.10.584340.full.pdf},
  journal = {bioRxiv}
}
```