Rethinking Molecule Synthesizability with Chain-of-Reaction

This is the official code repository for the paper "Rethinking Molecule Synthesizability with Chain-of-Reaction".

Abstract: A well-known pitfall of molecular generative models is that they are not guaranteed to generate synthesizable molecules. There have been considerable attempts to address this problem, but given the exponentially large combinatorial space of synthesizable molecules, existing methods have shown limited coverage of the space and poor molecular optimization performance. To tackle these problems, we introduce ReaSyn, a generative framework for synthesizable projection where the model explores the neighborhood of given molecules in the synthesizable space by generating pathways that result in synthesizable analogs. To fully utilize the chemical knowledge contained in the synthetic pathways, we propose a novel perspective that views synthetic pathways akin to reasoning paths in large language models (LLMs). Specifically, inspired by chain-of-thought (CoT) reasoning in LLMs, we introduce the chain-of-reaction (CoR) notation that explicitly states reactants, reaction types, and intermediate products for each step in a pathway. With the CoR notation, ReaSyn can get dense supervision in every reaction step to explicitly learn chemical reaction rules during supervised training and perform step-by-step reasoning. In addition, to further enhance the reasoning capability of ReaSyn, we propose reinforcement learning (RL)-based finetuning and goal-directed test-time compute scaling tailored for synthesizable projection. ReaSyn achieves the highest reconstruction rate and pathway diversity in synthesizable molecule reconstruction and the highest optimization performance in synthesizable goal-directed molecular optimization, and significantly outperforms previous synthesizable projection methods in synthesizable hit expansion. These results highlight ReaSyn's superior ability to navigate combinatorially-large synthesizable chemical space.

Find the Model Card++ for ReaSyn here.

Installation

Run the following command to install dependencies:

conda env create -f env.yml
conda activate reasyn

Data Preparation

Reaction Templates

We use the same 115 reaction templates as SynFormer. Save the template file as data/rxn_templates/comprehensive.txt.

Enamine Building Blocks

The building blocks used in the paper come from the Enamine US Stock catalog, which is available from Enamine upon request.
After obtaining the data from Enamine, save it as data/building_blocks/building_blocks.txt.
Then, run the following command to preprocess the data:

python scripts/preprocess.py --model-config configs/train.yml

Alternatively, you can directly use preprocessed building block data.
To resolve pickle path compatibility, first clone the SynFormer repository into the top-level directory for ReaSyn (/ReaSyn):

git clone https://github.com/wenhao-gao/synformer.git
cd synformer
pip install --no-deps -e .
pip install scikit-learn==1.6.0 # 1.6.0 is required to load fpindex.pkl
cd ..

Then, download the preprocessed data. Place fpindex.pkl and matrix.pkl in the folder data/processed/comp_2048.
Then, run the following command:

python scripts/convert_processed_1.py
pip install scikit-learn==1.2.2 # 1.2.2 is required for hit expansion later
pip uninstall synformer         # optional; you may delete the synformer package
rm -rf synformer                # optional; you may delete the synformer folder
python scripts/convert_processed_2.py
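Because the two steps above pin different scikit-learn versions, it is easy to end up with the wrong one installed. The following is a minimal sketch of a version check, not part of the repository: the step names `load_fpindex` and `hit_expansion` and both helper functions are hypothetical, and only the version numbers come from the comments in the commands above.

```python
from importlib import metadata

# Version pins copied from the comments in the commands above.
# The step names are hypothetical labels used only in this sketch.
REQUIRED_SKLEARN = {
    "load_fpindex": "1.6.0",   # required to load fpindex.pkl
    "hit_expansion": "1.2.2",  # required for hit expansion later
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_sklearn(step):
    """Print and return whether the installed scikit-learn matches the pin for `step`."""
    required = REQUIRED_SKLEARN[step]
    found = installed_version("scikit-learn")
    ok = found == required
    status = "OK" if ok else "MISMATCH"
    print(f"[{status}] {step}: need scikit-learn=={required}, found {found}")
    return ok


if __name__ == "__main__":
    check_sklearn("load_fpindex")
```

Calling `check_sklearn("load_fpindex")` before `convert_processed_1.py` and `check_sklearn("hit_expansion")` before hit expansion gives an early warning instead of a pickle loading error.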

ZINC250k Building Blocks

For the synthesizable molecule reconstruction task on ZINC250k, we provide additional building blocks in data/building_blocks/building_blocks_zinc250k.txt.
These are the molecules from ZINC250k that have more than 18 heavy atoms.
Run the following command to preprocess the data:

python scripts/preprocess.py --model-config configs/preprocess_zinc250k.yml

Training

We provide the trained model checkpoint. Place model.ckpt in the data/trained_model directory.
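Before training or inference, it can help to confirm that the files named so far sit where the scripts are told to look. The sketch below only checks the paths mentioned in this README and assumes it is run from the repository root; `missing_files` is a hypothetical helper, not a repository function. Depending on the route you took, some entries are not needed: building_blocks.txt matters only if you preprocess the Enamine data yourself, and this sketch does not know whether the conversion scripts leave the two .pkl files in place.

```python
from pathlib import Path

# Paths named in this README, relative to the repository root.
EXPECTED_FILES = [
    "data/rxn_templates/comprehensive.txt",
    "data/building_blocks/building_blocks.txt",  # only if preprocessing yourself
    "data/processed/comp_2048/fpindex.pkl",      # only if using preprocessed data
    "data/processed/comp_2048/matrix.pkl",       # only if using preprocessed data
    "data/trained_model/model.ckpt",
]


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```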

Supervised Learning

Run the following command to perform supervised training of ReaSyn:

torchrun --nnodes $NUM_NODES --nproc_per_node $SUBMIT_GPUS \
         --master_addr $MASTER_ADDR --master_port $MASTER_PORT --node_rank $NODE_RANK \
         scripts/train.py -n ${exp_name}

We used 2 nodes with 8 NVIDIA A100 GPUs each. Training for 500k steps took 5 to 6 days.

RL Finetuning

Run the following command to perform RL finetuning of ReaSyn:

torchrun --nproc_per_node ${num_gpus} scripts/finetune.py -n ${exp_name} -m ${model_path}

We used 4 NVIDIA A100 GPUs. Finetuning for 1k steps took 5 hours.
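To budget compute on different hardware, the figures reported for the two training stages can be converted into GPU-hours and average step rates. This is plain arithmetic on the numbers stated above and assumes the reported durations are wall-clock time; the function names are illustrative only.

```python
def gpu_hours(num_gpus, wall_hours):
    """Total GPU-hours consumed by a run."""
    return num_gpus * wall_hours


def steps_per_second(steps, wall_hours):
    """Average optimizer steps per second over the whole run."""
    return steps / (wall_hours * 3600)


# Supervised training: 2 nodes x 8 A100 GPUs, 500k steps, 5 to 6 days.
supervised_gpus = 2 * 8
for days in (5, 6):
    hours = days * 24
    print(
        f"supervised, {days} days: "
        f"{gpu_hours(supervised_gpus, hours):.0f} GPU-hours, "
        f"{steps_per_second(500_000, hours):.2f} steps/s"
    )

# RL finetuning: 4 A100 GPUs, 1k steps, 5 hours.
print(
    f"finetuning: {gpu_hours(4, 5):.0f} GPU-hours, "
    f"{1 / steps_per_second(1_000, 5):.0f} s/step"
)
```

Under these assumptions, supervised training costs roughly 1920 to 2304 A100 GPU-hours at about one step per second, while RL finetuning costs about 20 GPU-hours at about 18 seconds per step.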

Inference

Synthesizable Molecule Reconstruction

Our paper evaluated ReaSyn on three test sets.
For the Enamine and ChEMBL test sets, place enamine_smiles_1k.txt and chembl_filtered_1k.txt from SynFormer in the data folder.
For the ZINC250k test set, we provide data/test_zinc250k.txt.

Run the following command to conduct synthesizable molecule reconstruction:

python scripts/sample.py -m ${model_path} -i ${testset_path} -o ${output_path}
# python scripts/sample.py -m ${model_path} -i data/enamine_smiles_1k.txt -o results/enamine.txt
# python scripts/sample.py -m ${model_path} -i data/chembl_filtered_1k.txt -o results/chembl.txt
# python scripts/sample.py -m ${model_path} -i data/test_zinc250k.txt -o results/zinc250k.txt --add_bb_path data/processed/zinc250k_2048/fpindex.pkl
python scripts/eval_recon.py ${output_path}

We recommend using multiple GPUs for parallelized synthesizable molecule reconstruction.
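To run all three test sets in one go, the commented commands above can be generated programmatically. The sketch below only assembles argument lists from the flags and paths shown in this README; `TEST_SETS` and `build_commands` are hypothetical names, and the `results` output directory follows the commented examples.

```python
import shlex

# Input files and extra flags, copied from the commented commands above.
TEST_SETS = {
    "enamine": {"input": "data/enamine_smiles_1k.txt", "extra": []},
    "chembl": {"input": "data/chembl_filtered_1k.txt", "extra": []},
    "zinc250k": {
        "input": "data/test_zinc250k.txt",
        "extra": ["--add_bb_path", "data/processed/zinc250k_2048/fpindex.pkl"],
    },
}


def build_commands(model_path, results_dir="results"):
    """Return sample/eval argument lists for each test set, in run order."""
    commands = []
    for name, spec in TEST_SETS.items():
        output = f"{results_dir}/{name}.txt"
        commands.append(
            ["python", "scripts/sample.py", "-m", model_path,
             "-i", spec["input"], "-o", output, *spec["extra"]]
        )
        commands.append(["python", "scripts/eval_recon.py", output])
    return commands


if __name__ == "__main__":
    # Print the commands for inspection; nothing is executed here.
    for cmd in build_commands("data/trained_model/model.ckpt"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it, which stops at the first failing command.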

Synthesizable Goal-directed Optimization of TDC Oracles

Run the following command to conduct synthesizable goal-directed optimization of TDC oracles:

python scripts/optimize_tdc.py -m ${model_path} -o ${oracle} --use_regressor

Synthesizable Hit Expansion

Run the following command to conduct synthesizable hit expansion:

python scripts/sample.py -m ${model_path} -i data/jnk3_hit.txt -o ${output_path} --reward_model jnk3 --exhaustiveness 128
python scripts/eval_hit.py ${output_path}

(Optional) Filtering Pathways

We also provide an option to filter out generated pathways that lead to molecules users want to avoid (e.g., toxic molecules). An example catalog of toxic molecules is provided in data/mols_to_filter.txt. Set the mols_to_filter and filter_sim arguments to discard synthetic pathways whose product has a Tanimoto similarity greater than filter_sim to the molecules listed in mols_to_filter.
For example:

python scripts/sample.py -m ${model_path} -i ${testset_path} -o ${output_path} --mols_to_filter data/mols_to_filter.txt --filter_sim ${filter_sim}
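The filtering rule can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: fingerprints are represented here as sets of on-bit indices, the toy fingerprints are made up, and the sketch assumes a product is rejected as soon as its similarity to any single molecule in the avoid list exceeds `filter_sim`.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / union


def keep_product(product_fp, avoid_fps, filter_sim):
    """Keep a product only if no avoid-list molecule exceeds filter_sim in similarity."""
    return all(tanimoto(product_fp, fp) <= filter_sim for fp in avoid_fps)


# Toy fingerprints (made-up bit indices, for illustration only).
avoid = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
candidates = {
    "near_duplicate": frozenset({1, 2, 3, 5}),  # shares 3 of 5 bits with avoid[0]
    "unrelated": frozenset({20, 21, 22}),       # shares no bits with the avoid list
}

kept = [name for name, fp in candidates.items() if keep_product(fp, avoid, 0.5)]
print(kept)  # ['unrelated']
```

With `filter_sim` set to 0.5, the near duplicate is dropped because its similarity to the first avoid-list entry is 3/5 = 0.6. A higher `filter_sim` makes the filter more permissive.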

License

Citation

If you find this repository or our paper useful, please cite our work.

@article{lee2025reasyn,
  title     = {Rethinking Molecule Synthesizability with Chain-of-Reaction},
  author    = {Lee, Seul and Kreis, Karsten and Veccham, Srimukh Prasad and Liu, Meng and Reidenbach, Danny and Paliwal, Saee and Nie, Weili and Vahdat, Arash},
  journal   = {arXiv},
  year      = {2025}
}

NVIDIA-Digital-Bio/ReaSyn
