catalystforyou/SynCraft-Core

SynCraft: Guiding Large Language Models to Predict Edit Sequences for Molecular Synthesizability Optimization

SynCraft is a reasoning-based framework that reframes synthesizability optimization not as a sequence translation task, but as a precise structural editing problem. Leveraging the emergent reasoning capabilities of Large Language Models (LLMs), SynCraft navigates the "synthesis cliff" where minimal structural modifications yield significant gains in synthetic feasibility.

By predicting executable sequences of atom-level edits rather than generating SMILES strings directly, SynCraft circumvents the syntactic fragility of LLMs while harnessing their chemical intuition.

Overview

Key Features

  • Generative Editing via In-Context Reasoning: Decouples strategic planning from chemical execution. The LLM acts as a chemical strategist, reasoning about synthetic liabilities before prescribing edits.
  • Discrete Editing Action Space: Uses a precise JSON-based command set (DEL_ATOM, ADD_BOND, MUTATE_ATOM, etc.) to modify molecular graphs deterministically, ensuring validity.
  • Interaction-Aware Optimization: Incorporates 3D protein-ligand interaction data (via AutoDock Vina and PLIP) into the prompting strategy to preserve critical pharmacophores during optimization.
  • Synthesis Cliff Navigation: Focuses on minimal, high-impact edits to transform "unsynthesizable" molecules into accessible analogs without destroying the original scaffold.

Structure

SynCraft-Core/
├── assets/                 # Data assets
│   ├── reasoning.json      # Golden examples with reasoning traces
│   ├── RIPK1.txt           # Example molecule lists
│   └── unsolved.json       # Input datasets
├── config/                 # Configuration files
├── notebooks/              # Jupyter notebooks for analysis
├── scripts/                # Shell scripts for running experiments
│   ├── inference_enhanced.sh
│   └── inference_bioactivity_constrain.sh
├── src/                    # Source code
│   ├── inference_enhanced.py             # Standard inference script
│   ├── inference_bioactivity_constrain.py # Interaction-aware inference
│   ├── utils.py                          # Core editing & reconstruction logic
│   ├── extract_interaction.py            # PLIP interaction analysis
│   └── docking_utils.py                  # Vina docking wrappers
└── vina/                   # Vina executables and receptor files

Installation

Prerequisites

Python Dependencies

Install the required Python packages:

pip install rdkit litellm loguru meeko openbabel tqdm numpy syntheseus

Environment Setup

SynCraft uses litellm to interface with LLMs (e.g., Gemini, DeepSeek). Set the API key for the provider you plan to use as an environment variable:

export GEMINI_API_KEY='your-gemini-api-key'
# or
export DEEPSEEK_API_KEY='your-deepseek-api-key'
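
A missing key usually surfaces only when the first LLM request fails, so a small preflight check can save a wasted run. The sketch below is not part of the repository; the function names `available_providers` and `require_provider` are hypothetical, and only the two environment variable names come from the instructions above.

```python
import os

# Provider -> environment variable, as named in the setup instructions above.
PROVIDER_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


def require_provider(env=None):
    """Fail early with a readable message if no key is configured."""
    providers = available_providers(env)
    if not providers:
        wanted = " or ".join(PROVIDER_KEYS.values())
        raise RuntimeError(f"No LLM API key found; set {wanted}.")
    return providers
```

Calling `require_provider()` at the top of a run script would stop immediately when neither key is exported.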

Usage

1. Standard Synthesizability Optimization

To run the standard optimization pipeline, which restores synthesizability through chemical reasoning:

cd scripts
bash inference_enhanced.sh

Under the hood (src/inference_enhanced.py):

  • Loads unsynthesizable molecules.
  • Retrieves similar "golden examples" (pairs of unsynthesizable $\to$ synthesizable molecules) for few-shot prompting.
  • Prompts the LLM to reason about synthetic liabilities and generate a JSON edit sequence.
  • Applies the edits deterministically to produce the result.
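
Because the LLM reply mixes free-text reasoning with the JSON edit sequence, the edits have to be pulled out of the reply before they can be applied. The following is a minimal sketch of one way to do that, not the code in src/inference_enhanced.py. It assumes the edits arrive as a JSON list of objects and that each object names its operation under an `"action"` key; that key, the `"atom"` field, and the function name `extract_edits` are all hypothetical. The action names are the ones listed in this README.

```python
import json

# Action names as listed in this README.
ALLOWED_ACTIONS = {
    "DEL_ATOM", "MUTATE_ATOM", "ADD_ATOM", "ADD_BOND", "DEL_BOND",
    "CHANGE_BOND", "SET_CHIRAL", "SET_BOND_STEREO",
}


def extract_edits(reply: str):
    """Return the first JSON list of edit objects found in an LLM reply."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch != "[":
            continue
        try:
            # Try to decode a JSON value starting at this bracket.
            edits, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            continue  # e.g. a bracket atom such as "[N+]" in the reasoning text
        if isinstance(edits, list) and all(isinstance(e, dict) for e in edits):
            break
    else:
        raise ValueError("no JSON edit list found in reply")

    # Reject anything outside the known action space (hypothetical "action" key).
    for edit in edits:
        if edit.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {edit.get('action')!r}")
    return edits
```

Rejecting unknown actions at this stage keeps malformed replies away from the graph-editing code, where they would be harder to diagnose.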

Key Arguments:

  • --dataset: The dataset key in the input JSON.
  • --model: The LLM to use, as a litellm model identifier (e.g., gemini/gemini-2.5-pro).
  • --few-shot-k: Number of few-shot examples to use (default: 5).
  • --pass-k: Number of parallel inference passes per molecule.

2. Interaction-Aware Optimization

To optimize molecules while preserving binding interactions (requires Vina and receptor files):

cd scripts
bash inference_bioactivity_constrain.sh

Under the hood (src/inference_bioactivity_constrain.py):

  • Docks the input molecule into the target receptor.
  • Analyzes interactions (H-bonds, $\pi$-stacking, etc.) using PLIP.
  • Injects these constraints into the LLM prompt (e.g., "Atom 5 forms a critical Hydrogen Bond...").
  • Prompts the LLM to generate edits that respect these biological constraints.
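
The constraint-injection step amounts to turning per-atom interaction records into prompt text. The sketch below illustrates that idea only. The `Interaction` record, its field names, and `constraints_to_prompt` are hypothetical and do not reflect the output format of PLIP or of src/extract_interaction.py; the residue label in the usage note is a placeholder, not real docking data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    atom_map: int   # atom-map number of the ligand atom
    kind: str       # e.g. "Hydrogen Bond", "pi-stacking"
    residue: str    # receptor residue label


def constraints_to_prompt(interactions):
    """Render interaction records as constraint lines for the LLM prompt."""
    if not interactions:
        return "No protein-ligand interactions were detected."
    lines = ["Preserve the following protein-ligand interactions:"]
    # Sort by atom-map number so the prompt is stable across runs.
    for it in sorted(interactions, key=lambda i: i.atom_map):
        lines.append(f"- Atom {it.atom_map} forms a critical {it.kind} with {it.residue}.")
    return "\n".join(lines)
```

For example, `constraints_to_prompt([Interaction(5, "Hydrogen Bond", "ASP42")])` yields a header line followed by "- Atom 5 forms a critical Hydrogen Bond with ASP42."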

Configuration: Ensure your receptor files (.pdbqt, .pdb, config.txt) are correctly placed in the vina/ directory and referenced in the script.

Methodology

The Edit Action Space

SynCraft defines a compact action space $\mathcal{A}$ where operations are referenced via unique atom-map numbers:

  • DEL_ATOM: Removes a specific atom.
  • MUTATE_ATOM: Changes the atomic element.
  • ADD_ATOM: Introduces a new atom.
  • ADD_BOND / DEL_BOND: Creates or removes bonds.
  • CHANGE_BOND: Modifies bond order/aromaticity.
  • SET_CHIRAL / SET_BOND_STEREO: Defines stereochemistry.
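
To make the action space concrete, here is a toy executor that applies an edit sequence to a plain-dictionary molecular graph keyed by atom-map number. It is an illustration under stated assumptions, not the logic in src/utils.py: the graph representation and the JSON field names (`"action"`, `"atom"`, `"atoms"`, `"element"`, `"order"`) are hypothetical, and the stereochemistry actions are left out.

```python
import copy


def apply_edits(mol, edits):
    """Apply edits to a toy graph and return a new graph (input is untouched).

    mol = {"atoms": {map_num: element}, "bonds": {frozenset({a, b}): order}}
    """
    mol = copy.deepcopy(mol)
    atoms, bonds = mol["atoms"], mol["bonds"]
    for edit in edits:
        action = edit["action"]
        if action == "DEL_ATOM":
            atom = edit["atom"]
            del atoms[atom]
            # Deleting an atom also removes every bond that touches it.
            for key in [k for k in bonds if atom in k]:
                del bonds[key]
        elif action == "MUTATE_ATOM":
            if edit["atom"] not in atoms:
                raise KeyError(f"no atom {edit['atom']}")
            atoms[edit["atom"]] = edit["element"]
        elif action == "ADD_ATOM":
            if edit["atom"] in atoms:
                raise ValueError(f"atom {edit['atom']} already exists")
            atoms[edit["atom"]] = edit["element"]
        elif action in ("ADD_BOND", "CHANGE_BOND"):
            a, b = edit["atoms"]
            if a not in atoms or b not in atoms:
                raise KeyError(f"bond references a missing atom: {a}, {b}")
            key = frozenset((a, b))
            # ADD_BOND needs a new bond; CHANGE_BOND needs an existing one.
            if (key in bonds) == (action == "ADD_BOND"):
                raise ValueError(f"{action} invalid for atoms {a}, {b}")
            bonds[key] = edit["order"]
        elif action == "DEL_BOND":
            del bonds[frozenset(edit["atoms"])]
        else:
            raise ValueError(f"unsupported action: {action}")
    return mol
```

Because every edit refers to atom-map numbers rather than positions in a SMILES string, the same sequence always produces the same graph, and an invalid reference fails loudly instead of yielding a malformed molecule.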

Workflow

  1. Retrieval: Finds similar "Synthesis Cliff" examples.
  2. Reasoning: The LLM analyzes the molecule and articulates a plan.
  3. Execution: The plan (JSON) is executed by the deterministic toolkit (src/utils.py).
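
The retrieval step can be pictured as a top-k nearest-neighbor lookup over the golden examples. The README does not state which similarity measure is used, so the sketch below assumes Tanimoto similarity over precomputed fingerprints, represented here as Python sets of on-bit indices (in practice these might come from a cheminformatics toolkit). The names `tanimoto` and `retrieve_examples` are hypothetical.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def retrieve_examples(query_fp, examples, k=5):
    """Return the ids of the k examples most similar to the query.

    examples: iterable of (example_id, fingerprint_bit_set) pairs.
    """
    top = heapq.nlargest(k, examples, key=lambda ex: tanimoto(query_fp, ex[1]))
    return [ex_id for ex_id, _ in top]
```

The default `k=5` mirrors the documented default of `--few-shot-k`; results are returned from most to least similar.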
