SynCraft: Guiding Large Language Models to Predict Edit Sequences for Molecular Synthesizability Optimization
SynCraft is a reasoning-based framework that reframes synthesizability optimization not as a sequence translation task, but as a precise structural editing problem. Leveraging the emergent reasoning capabilities of Large Language Models (LLMs), SynCraft navigates the "synthesis cliff" where minimal structural modifications yield significant gains in synthetic feasibility.
By predicting executable sequences of atom-level edits rather than generating SMILES strings directly, SynCraft circumvents the syntactic fragility of LLMs while harnessing their chemical intuition.
- Generative Editing via In-Context Reasoning: Decouples strategic planning from chemical execution. The LLM acts as a chemical strategist, reasoning about synthetic liabilities before prescribing edits.
- Discrete Editing Action Space: Uses a precise JSON-based command set (DEL_ATOM, ADD_BOND, MUTATE_ATOM, etc.) to modify molecular graphs deterministically, ensuring validity.
- Interaction-Aware Optimization: Incorporates 3D protein-ligand interaction data (via AutoDock Vina and PLIP) into the prompting strategy to preserve critical pharmacophores during optimization.
- Synthesis Cliff Navigation: Focuses on minimal, high-impact edits to transform "unsynthesizable" molecules into accessible analogs without destroying the original scaffold.
SynCraft-Core/
├── assets/ # Data assets
│ ├── reasoning.json # Golden examples with reasoning traces
│ ├── RIPK1.txt # Example molecule lists
│ └── unsolved.json # Input datasets
├── config/ # Configuration files
├── notebooks/ # Jupyter notebooks for analysis
├── scripts/ # Shell scripts for running experiments
│ ├── inference_enhanced.sh
│ └── inference_bioactivity_constrain.sh
├── src/ # Source code
│ ├── inference_enhanced.py # Standard inference script
│ ├── inference_bioactivity_constrain.py # Interaction-aware inference
│ ├── utils.py # Core editing & reconstruction logic
│ ├── extract_interaction.py # PLIP interaction analysis
│ └── docking_utils.py # Vina docking wrappers
└── vina/ # Vina executables and receptor files
- Python 3.8+
- AutoDock Vina (for interaction-aware mode)
- PLIP (for interaction-aware mode)
- OpenBabel
Install the required Python packages:
pip install rdkit litellm loguru meeko openbabel tqdm numpy syntheseus
SynCraft uses litellm to interface with LLMs (e.g., Gemini, DeepSeek). You must set your API keys in your environment variables:
export GEMINI_API_KEY='your-gemini-api-key'
# or
export DEEPSEEK_API_KEY='your-deepseek-api-key'
To run the standard optimization pipeline, which focuses on restoring synthesizability using chemical reasoning:
cd scripts
bash inference_enhanced.sh
Under the hood (src/inference_enhanced.py):
- Loads unsynthesizable molecules.
- Retrieves similar "golden examples" (pairs of unsynthesizable $\to$ synthesizable molecules) for few-shot prompting.
- Prompts the LLM to reason about synthetic liabilities and generate a JSON edit sequence.
- Applies the edits deterministically to produce the result.
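The retrieval and prompting steps above can be sketched as follows. This is a minimal illustration, not the code in src/inference_enhanced.py: the character-bigram Tanimoto score is a stdlib-only stand-in for whatever molecular similarity the repository actually uses, and the record fields (`source`, `reasoning`, `edits`), the function names, the prompt wording, and the toy SMILES are all assumptions.

```python
import json

# Toy records, chosen only to exercise the code (not real synthesis-cliff pairs).
# The field names are hypothetical.
POOL = [
    {"source": "CCO", "reasoning": "toy trace A",
     "edits": [{"op": "MUTATE_ATOM", "atom": 2, "element": "N"}]},
    {"source": "c1ccccc1O", "reasoning": "toy trace B",
     "edits": [{"op": "DEL_ATOM", "atom": 6}]},
    {"source": "CC(=O)O", "reasoning": "toy trace C",
     "edits": [{"op": "DEL_BOND", "begin": 1, "end": 2}]},
]


def bigrams(smiles: str) -> set:
    """Character bigrams of a SMILES string (crude stand-in for a fingerprint)."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a: str, b: str) -> float:
    """Tanimoto (Jaccard) similarity between the bigram sets of two strings."""
    sa, sb = bigrams(a), bigrams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_examples(query: str, pool: list, k: int = 5) -> list:
    """Return the k pool records whose source SMILES is most similar to query."""
    ranked = sorted(pool, key=lambda ex: tanimoto(query, ex["source"]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: list) -> str:
    """Assemble a few-shot prompt that asks for reasoning, then a JSON edit list."""
    parts = ["Identify the synthetic liabilities, then output a JSON list of edits."]
    for ex in examples:
        parts.append(
            f"Input: {ex['source']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Edits: {json.dumps(ex['edits'])}"
        )
    parts.append(f"Input: {query}\nReasoning:")
    return "\n\n".join(parts)
```

Calling `build_prompt(query, retrieve_examples(query, POOL, k=5))` yields the text that would be sent to the model; a real pipeline would swap the bigram score for a proper fingerprint similarity.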
Key Arguments:
- --dataset: The dataset key in the input JSON.
- --model: The LLM model to use (e.g., gemini/gemini-2.5-pro).
- --few-shot-k: Number of few-shot examples to use (default: 5).
- --pass-k: Number of parallel inference passes per molecule.
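For orientation, the four flags could be declared with argparse roughly as below. Only the flag names and the --few-shot-k default of 5 come from this README; making --dataset required, the --model default, and the --pass-k default of 1 are assumptions made for the sketch, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the flags listed above (defaults partly assumed)."""
    parser = argparse.ArgumentParser(description="SynCraft inference (sketch)")
    # Assumption: the dataset key is mandatory.
    parser.add_argument("--dataset", required=True,
                        help="Dataset key in the input JSON")
    # Assumption: the README's example model is used as the default.
    parser.add_argument("--model", default="gemini/gemini-2.5-pro",
                        help="LLM to use")
    # Default of 5 is stated in the README.
    parser.add_argument("--few-shot-k", type=int, default=5,
                        help="Number of few-shot examples")
    # Assumption: a single pass when the flag is omitted.
    parser.add_argument("--pass-k", type=int, default=1,
                        help="Number of parallel inference passes per molecule")
    return parser
```

argparse maps dashed flag names to underscored attributes, so the parsed values are read as `args.few_shot_k` and `args.pass_k`.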
To optimize molecules while preserving binding interactions (requires Vina and receptor files):
cd scripts
bash inference_bioactivity_constrain.sh
Under the hood (src/inference_bioactivity_constrain.py):
- Docks the input molecule into the target receptor.
- Analyzes interactions (H-bonds, $\pi$-stacking, etc.) using PLIP.
- Injects these constraints into the LLM prompt (e.g., "Atom 5 forms a critical Hydrogen Bond...").
- Prompts the LLM for edits that respect these biological constraints.
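The constraint-injection step might look like the sketch below, which turns a list of interaction records into prompt text. The record layout (`atom_idx`, `type`, `residue`) is a hypothetical intermediate format, not PLIP's actual report schema, and the residue labels are placeholders rather than results from a real docking run.

```python
def format_constraints(interactions: list) -> str:
    """Render interaction records as one constraint sentence per ligand atom.

    Each record is a dict with hypothetical keys: atom_idx, type, residue.
    """
    lines = []
    for item in sorted(interactions, key=lambda rec: rec["atom_idx"]):
        lines.append(
            f"- Atom {item['atom_idx']} forms a critical {item['type']} "
            f"with {item['residue']}; do not delete or mutate it."
        )
    return "Preserve the following interactions:\n" + "\n".join(lines)


def protected_atoms(interactions: list) -> set:
    """Indices of ligand atoms that take part in any recorded interaction."""
    return {item["atom_idx"] for item in interactions}


# Placeholder records for illustration only.
EXAMPLE_INTERACTIONS = [
    {"atom_idx": 12, "type": "Pi-Stacking", "residue": "RES_B"},
    {"atom_idx": 5, "type": "Hydrogen Bond", "residue": "RES_A"},
]
```

The text returned by `format_constraints(EXAMPLE_INTERACTIONS)` would be appended to the prompt, and `protected_atoms` gives a set that a caller could use to reject edit sequences touching those atoms.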
Configuration:
Ensure your receptor files (.pdbqt, .pdb, config.txt) are correctly placed in the vina/ directory and referenced in the script.
SynCraft defines a compact action space:
- DEL_ATOM: Removes a specific atom.
- MUTATE_ATOM: Changes the atomic element.
- ADD_ATOM: Introduces a new atom.
- ADD_BOND / DEL_BOND: Creates or removes bonds.
- CHANGE_BOND: Modifies bond order/aromaticity.
- SET_CHIRAL / SET_BOND_STEREO: Defines stereochemistry.
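To make the action space concrete, here is a toy interpreter that applies an edit sequence to a plain-dictionary molecular graph. It is a sketch, not the logic in src/utils.py: the JSON keys (`op`, `atom`, `element`, `begin`, `end`, `order`) are an assumed schema, the graph representation is invented for the example, and the stereochemistry commands are omitted because this toy graph carries no stereo information.

```python
import copy


def _bond_key(edit: dict, atoms: dict) -> frozenset:
    """Validate a bond's endpoints and return an order-independent key."""
    key = frozenset((edit["begin"], edit["end"]))
    if len(key) != 2 or any(idx not in atoms for idx in key):
        raise ValueError(f"invalid bond endpoints in {edit}")
    return key


def apply_edits(mol: dict, edits: list) -> dict:
    """Apply edits in order to a copy of mol.

    mol = {"atoms": {index: element}, "bonds": {frozenset({i, j}): order}}
    Atom indices are never renumbered, so later edits can keep using the
    indices of the input molecule.
    """
    out = copy.deepcopy(mol)
    atoms, bonds = out["atoms"], out["bonds"]
    next_idx = max(atoms, default=-1) + 1  # index given to the next ADD_ATOM
    for edit in edits:
        op = edit["op"]
        if op == "DEL_ATOM":
            idx = edit["atom"]
            del atoms[idx]  # KeyError if the atom does not exist
            for key in [k for k in bonds if idx in k]:
                del bonds[key]  # drop bonds that touched the removed atom
        elif op == "MUTATE_ATOM":
            if edit["atom"] not in atoms:
                raise KeyError(edit["atom"])
            atoms[edit["atom"]] = edit["element"]
        elif op == "ADD_ATOM":
            atoms[next_idx] = edit["element"]
            next_idx += 1
        elif op == "ADD_BOND":
            key = _bond_key(edit, atoms)
            if key in bonds:
                raise ValueError(f"bond already exists: {edit}")
            bonds[key] = edit.get("order", 1)
        elif op == "CHANGE_BOND":
            key = _bond_key(edit, atoms)
            if key not in bonds:
                raise ValueError(f"no such bond: {edit}")
            bonds[key] = edit["order"]
        elif op == "DEL_BOND":
            del bonds[_bond_key(edit, atoms)]
        else:
            raise ValueError(f"unsupported op: {op}")
    return out


# Three-atom chain C-C-O used as a toy input.
TOY_MOL = {
    "atoms": {0: "C", 1: "C", 2: "O"},
    "bonds": {frozenset((0, 1)): 1, frozenset((1, 2)): 1},
}
TOY_EDITS = [
    {"op": "MUTATE_ATOM", "atom": 2, "element": "N"},
    {"op": "ADD_ATOM", "element": "C"},  # receives index 3
    {"op": "ADD_BOND", "begin": 2, "end": 3, "order": 1},
    {"op": "DEL_ATOM", "atom": 0},
]
```

Because every command either succeeds or raises, the same edit list always produces the same graph, which is the property the deterministic toolkit relies on. Keeping indices stable after deletions is a design choice of this sketch; a real implementation would also need valence checks and sanitization.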
- Retrieval: Finds similar "Synthesis Cliff" examples.
- Reasoning: The LLM analyzes the molecule and articulates a plan.
- Execution: The plan (JSON) is executed by the deterministic toolkit (src/utils.py).
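Between the reasoning and execution steps, the JSON plan has to be separated from the model's free-text reasoning and checked before anything is applied. One possible way to do that is sketched below; the reply layout (prose followed by a JSON array of objects with an `op` key) and the function name are assumptions, while the allowed command names are the ones listed in this README.

```python
import json

ALLOWED_OPS = {
    "DEL_ATOM", "MUTATE_ATOM", "ADD_ATOM", "ADD_BOND",
    "DEL_BOND", "CHANGE_BOND", "SET_CHIRAL", "SET_BOND_STEREO",
}


def extract_edit_plan(reply: str) -> list:
    """Return the first JSON array of edit objects found in an LLM reply.

    Raises ValueError if no plan is found or a command name is unknown.
    """
    decoder = json.JSONDecoder()
    for pos, char in enumerate(reply):
        if char != "[":
            continue
        try:
            candidate, _ = decoder.raw_decode(reply, pos)
        except json.JSONDecodeError:
            continue  # this bracket did not start valid JSON
        # Skip bracketed text such as "[1]" that is not a list of objects.
        if candidate and all(isinstance(item, dict) for item in candidate):
            for item in candidate:
                if item.get("op") not in ALLOWED_OPS:
                    raise ValueError(f"unknown edit command: {item.get('op')}")
            return candidate
    raise ValueError("no edit plan found in reply")


# Made-up reply used only to exercise the parser.
EXAMPLE_REPLY = (
    "Reasoning: the heteroatom at position [2] is the liability.\n"
    'Edits: [{"op": "MUTATE_ATOM", "atom": 2, "element": "N"}]'
)
```

Rejecting unknown commands at this stage keeps malformed model output away from the editing toolkit, so a failed pass can simply be retried or discarded.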