Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model

[Project Page] [Paper] [Data] [Mol-Llama-2-7b-chat Weights] [Mol-Llama-3.1-8B-Instruct Weights]

Why Mol-LLaMA?

Mol-LLaMA is designed to elucidate fundamental properties of molecules with explainability and reasoning ability.

Instruction Dataset: Mol-LLaMA-Instruct
- Broad knowledge centered on molecules including structural, chemical, and biological features of molecules.
- Three data types considering the causality among molecular features: Detailed Structural Descriptions, Structure-to-Feature Relationships, and Comprehensive Conversations.

Model Architecture
1. Molecular encoders: Pretrained 2D encoder (MoleculeSTM) and 3D encoder (Uni-Mol)
2. Blending Module: Combining complementary information from 2D and 3D encoders via cross-attention
3. Q-Former: Embed molecular representations into query tokens based on SciBERT
4. LoRA: Adapters for fine-tuning LLMs
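
To make the blending step more concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the repository's implementation: the actual module presumably uses learned projections and multiple heads in PyTorch, and which encoder supplies the queries versus the keys and values is an assumption made here for illustration. The toy embeddings are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, no learned projections.

    queries: list of d-dim vectors (assumed here: 2D-encoder token embeddings)
    keys, values: lists of vectors (assumed here: 3D-encoder token embeddings)
    Returns one blended vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy embeddings, invented for illustration only.
tokens_2d = [[1.0, 0.0], [0.0, 1.0]]
tokens_3d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
blended = cross_attention(tokens_2d, tokens_3d, tokens_3d)
print(blended)
```

Each output row is a convex combination of the value vectors, so every query token ends up carrying information drawn from the other modality.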

Dependencies

Create an environment with Python 3.10 and PyTorch 2.4.1. Then run the following commands to install the requirements:

pip install -r requirements.txt

pip install flash-attn --no-build-isolation
conda install pyg -c pyg -y
conda install pytorch-scatter -c pyg -y
conda install openbabel -c conda-forge -y
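
After installation, a quick sanity check can confirm that the packages are importable. The sketch below is not part of the repository; the import names (`torch`, `flash_attn`, `torch_geometric`, `torch_scatter`, `openbabel`) are the usual module names for the packages installed above and are assumptions here, so adjust them if your environment differs.

```python
import importlib.util
import sys

# Assumed import names for the packages installed above.
REQUIRED_MODULES = ["torch", "flash_attn", "torch_geometric", "torch_scatter", "openbabel"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def python_matches(major, minor):
    """Check whether the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


missing = find_missing(REQUIRED_MODULES)
print("Python 3.10:", python_matches(3, 10))
print("Missing modules:", missing if missing else "none")
```

This only checks that the modules can be located; it does not verify versions or that CUDA extensions such as flash-attn load correctly.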

Dataset Preparation

To train Mol-LLaMA, please download the dataset into the data directory from the following link: Mol-LLaMA-Instruct

Model Preparation

The pretrained models, including LLaMA, SciBERT, MoleculeSTM, and Uni-Mol, will be downloaded automatically via Hugging Face. The relevant lines can be found in models/mol_llama.py and models/mol_llama_encoder.py.

Training

To train the projectors including the blending module and Q-Former, please run the following command:

python stage1.py

The training and data configurations can be found in configs/stage1/. Checkpoints will be saved in checkpoints/stage1 by default.

To instruction-tune, please run the following command:

python stage2.py

The training and data configurations can be found in configs/stage2/. Checkpoints will be saved in checkpoints/stage2 by default. We use the checkpoint from the 10th epoch.
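
If you want to run both stages back to back, a small wrapper like the following can help. This is a hypothetical convenience script, not part of the repository; it simply invokes stage1.py and then stage2.py and stops if either fails. Whether stage 2 picks up the stage 1 checkpoint automatically is not stated here, so check configs/stage2/ before relying on it.

```python
import subprocess
import sys

# The two training scripts named above, in the order they should run.
STAGES = ["stage1.py", "stage2.py"]


def build_commands(stages=STAGES, python=sys.executable):
    """Build one command (as an argument list) per training stage."""
    return [[python, script] for script in stages]


def run_all(commands, dry_run=True):
    """Run each command in order; with dry_run=True, only print what would run."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits with an error,
            # so stage 2 never starts after a failed stage 1.
            subprocess.run(cmd, check=True)


# Dry run by default; pass dry_run=False to actually launch training.
run_all(build_commands(), dry_run=True)
```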

Playground

Example inference code is provided in playground.py, which takes playground_inputs.json as input. To run inference, use the command below:

python playground.py

Evaluation on PAMPA

To reproduce the results on the PAMPA task, please run the following command:

python -m pampa.inference

Citation

If you find our work useful, please consider citing it.

@misc{kim2025molllama,
    title={Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model},
    author={Dongki Kim and Wonbin Lee and Sung Ju Hwang},
    year={2025},
    eprint={2502.13449},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
