Revolutionizing Spectroscopic Analysis Using Sequence-to-Sequence Models I: From Infrared Spectra to Molecular Structures
Ethan J. French, Xianqi Deng, Siqi Chen, Cheng-Wei Ju, Xi Cheng, Lijun Zhang, Xiao Liu, Hui Guan, and Zhou Lin
For English, see README.en.md
For Spanish, see README.es.md
For Chinese, see README.zh.md
Infrared (IR) spectroscopy reveals molecular and material features via their characteristic vibrational frequencies in an efficient and sensitive manner, and has thus become one of the most popular analytical tools in broad areas involving chemical discovery, including material synthesis, drug design, pharmacokinetics, safety screening, pollutant sensing, and observational astronomy. However, in situ molecular or material identification from spectral signals remains a resource-intensive challenge and requires professional training because of the complexity of tracing effects back to causes. Motivated by the recent success of sequence-to-sequence (Seq2Seq) models in deep learning, we developed a direct, accurate, effortless, and physics-informed protocol to realize such an in situ spectrum-to-structure translation, and provided a proof of concept of our models using IR spectra. We expressed both the input IR spectrum and the output molecular structure as alphanumerical sequences, treated them as two sentences describing the same molecule in two different languages, and translated them into each other using Seq2Seq models built from recurrent neural networks (RNNs) and Transformers. Trained and validated on a curated data set of 198,091 organic molecules from the QM9 and PC9 databases, our Seq2Seq models achieved state-of-the-art accuracies of up to 0.611, 0.850, 0.804, and > 0.972 in generating target molecular identities, chemical formulas, structural frameworks, and functional groups from IR spectra alone. Our study sets the stage for a revolutionary way to analyze molecular or material spectra by replacing human labor with rapid and accurate deep learning approaches.
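The core idea above, writing both the spectrum and the structure as token sequences, can be sketched in a few lines. The binning scheme, token names, and helper functions below are illustrative assumptions, not the repository's actual preprocessing:

```python
import re

# Illustrative sketch only: express an IR spectrum and a molecular string as
# token sequences suitable for a Seq2Seq model. The quantization grid and
# token vocabulary here are hypothetical.

def spectrum_to_tokens(intensities, n_levels=10):
    """Quantize normalized absorbance values (0..1) into discrete tokens."""
    tokens = []
    for x in intensities:
        level = min(int(x * n_levels), n_levels - 1)  # clamp x == 1.0
        tokens.append(f"I{level}")
    return tokens

def selfies_to_tokens(selfies_string):
    """Split a SELFIES string into its bracketed symbols, e.g. [C][O]."""
    return re.findall(r"\[[^\]]*\]", selfies_string)

# A Seq2Seq model learns to map the source sequence to the target sequence.
source = spectrum_to_tokens([0.02, 0.75, 1.0, 0.31])
target = selfies_to_tokens("[C][=O][O]")
print(source)  # ['I0', 'I7', 'I9', 'I3']
print(target)  # ['[C]', '[=O]', '[O]']
```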
conda create -n ir python=3.12
conda activate ir
cd Spectrum2Structure/
# pip install pip==24.0
pip install -r requirements.txt
Note: Please keep pip ≤ 24.0; pip > 24.0 cannot install pytorch_lightning == 1.6.3. If you install pytorch_lightning != 1.6.3, please adjust the 'gpus' setting of the trainer in train.py.
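As a convenience (not part of the official setup), the pip version constraint can be checked before installing; the `pip_ok` helper below is a hypothetical guard using `sort -V` for version comparison:

```shell
# pip_ok VERSION prints "ok" when VERSION <= 24.0, "too-new" otherwise.
pip_ok() {
    if [ "$(printf '%s\n' "24.0" "$1" | sort -V | tail -n1)" = "24.0" ]; then
        echo "ok"
    else
        echo "too-new"
    fi
}

pip_ok 23.3.1   # prints: ok
pip_ok 25.1     # prints: too-new
# In practice: pip_ok "$(pip --version | awk '{print $2}')"
```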
The data is available on Figshare at: 10.6084/m9.figshare.28754678
The data should then be saved to the data directory
# download or copy data to this default directory
cd Spectrum2Structure/data
The pre-trained models can be obtained via PyTorch Hub:
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
or via Figshare at: 10.6084/m9.figshare.28795676.
Then save the model weights to the checkpoints directory:
cd Spectrum2Structure/
mkdir checkpoints
cd checkpoints/
# download or copy model weight to this default directory
unzip smiles_checkpoints.zip
Others - Google Drive
Seq2Seq Models - SELFIES Format
Seq2Seq Models - Mixture Molecule
Seq2Seq Models - SMILES Format
Note: Since the Transformer model utilizes Batch Normalization on the spectral data, please ensure that the batch size is set to 256 when using any Transformer-based model weights.
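The batch-size requirement comes from Batch Normalization computing its statistics over the batch dimension. A minimal NumPy sketch (not the repository's model code) illustrates why the same samples are normalized differently under a different batch size:

```python
import numpy as np

def batch_norm(batch, eps=1e-5):
    """Normalize each feature using statistics computed over the batch axis."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
spectra = rng.normal(size=(256, 8))   # 256 spectra, 8 features (illustrative)

full = batch_norm(spectra)            # statistics from all 256 samples
partial = batch_norm(spectra[:32])    # statistics from only 32 samples

# The same first 32 samples come out different because the batch statistics
# differ, which is why these weights expect batch_size == 256.
print(np.allclose(full[:32], partial))  # False
```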
Available models for different datasets
SELFIES Format Dataset: lstm, gru, gpt, transformer
Mixture Molecule Dataset: lstm-mixture, gru-mixture, gpt-mixture, transformer-mixture
SMILES Format Dataset: lstm-smiles, gru-smiles, gpt-smiles, transformer-smiles
cd Spectrum2Structure/
chmod +x run_train.sh
./run_train.sh <model>
Or you can train the model with custom settings:
cd Spectrum2Structure/
python train.py --model <model> --mode <mode> \
--hidden_dim <hidden_dim> --dropout <dropout> --layers <layers> --heads <heads> \
--batch_size <batch_size> --max_epochs <max_epochs> --lr <lr> --weight_decay <weight_decay> \
--use_gpu <use_gpu> --calculate_prediction <calculate_prediction>
Optional arguments:
--model                 Model architecture: LSTM, GRU, GPT, Transformer (default: Transformer)
--mode                  Encoding mode: selfies, smiles, or mixture (default: selfies)
--hidden_dim            Size of the hidden units (default: 768)
--dropout               Dropout probability (default: 0.1)
--layers                Number of hidden layers (default: 6)
--heads                 Number of attention heads (default: 6)
--batch_size            Batch size for training (default: 256)
--max_epochs            Number of epochs to train (default: 80)
--lr                    Initial learning rate (default: 1e-4)
--weight_decay          Weight decay coefficient (default: 1e-5)
--use_gpu               Whether to use GPU for training (default: True)
--calculate_prediction  Whether to calculate predictions (default: True)
--seed                  Random seed for reproducibility (default: 78438379)
Training the Transformer model on the SMILES format dataset:
cd Spectrum2Structure/
chmod +x run_train.sh
./run_train.sh transformer-smiles
Custom settings:
cd Spectrum2Structure/
python train.py --model Transformer --mode smiles \
--hidden_dim 768 --dropout 0.1 --layers 6 --heads 6 \
--batch_size 256 --max_epochs 95 --lr 1e-4 --weight_decay 1e-5 \
--use_gpu True --calculate_prediction True
Available tasks for different datasets
SELFIES Format Dataset: eval, generation, topk
Mixture Molecule Dataset: eval, generation
SMILES Format Dataset: eval
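For the topk task, the underlying evaluation idea can be sketched as counting a prediction correct when the true structure appears anywhere among a model's k highest-ranked candidates. The helper below is a hypothetical illustration, not the repository's metric code:

```python
def topk_accuracy(candidates_per_sample, targets, k=10):
    """Fraction of samples whose target appears in the first k candidates.

    candidates_per_sample: list of candidate lists, ranked best-first.
    targets: list of ground-truth strings (e.g. canonical SMILES/SELFIES).
    """
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(candidates_per_sample, targets)
    )
    return hits / len(targets)

# Two samples: the first is a top-1 hit, the second only a top-2 hit.
cands = [["[C][O]", "[C][C]"], ["[N]", "[C][N]"]]
truth = ["[C][O]", "[C][N]"]
print(topk_accuracy(cands, truth, k=1))  # 0.5
print(topk_accuracy(cands, truth, k=2))  # 1.0
```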
cd Spectrum2Structure/
chmod +x run_test.sh
./run_test.sh <model> <checkpoint> <mode> [output_file]
Or you can test the model with custom settings:
cd Spectrum2Structure/
python test.py --model <model> --mode <mode> --task <task> \
--checkpoints <checkpoints> --batch_size <batch_size> \
--use_gpu <use_gpu> --output_file <output_file>
Optional arguments:
--mode          Encoding mode: selfies, smiles, or mixture (default: selfies)
--task          Task type: eval, generation, or topk (default: eval)
--model         Model architecture: LSTM, GRU, GPT, Transformer (default: Transformer)
--checkpoints   Path to the model checkpoint (required)
--batch_size    Batch size for testing (default: 256)
--use_gpu       Whether to use GPU for inference (default: True)
--seed          Random seed for reproducibility (default: 78438379)
--output_file   Custom output file name (optional; default: None)
Getting the Transformer model's top-k results:
cd Spectrum2Structure/
chmod +x run_test.sh
./run_test.sh transformer checkpoints/Transformer-epoch=96-step=30070.ckpt topk
Custom settings:
cd Spectrum2Structure/
python test.py --model Transformer --mode selfies --task topk \
--checkpoints checkpoints/Transformer-epoch=96-step=30070.ckpt --batch_size 256 --use_gpu True
For anything code-related, please open an issue. Otherwise, please use the Discussions tab.
If you find this model and code are useful in your work, please cite:
@article{french2025revolutionizing,
title={Revolutionizing Spectroscopic Analysis Using Sequence-to-Sequence Models I: From Infrared Spectra to Molecular Structures},
author={French, Ethan and Deng, Xianqi and Chen, Siqi and Ju, Cheng-Wei and Cheng, Xi and Zhang, Lijun and Liu, Xiao and Guan, Hui and Lin, Zhou},
journal={ChemRxiv preprint doi:10.26434/chemrxiv-2025-n4q84},
year={2025}
}
H.G. and Z.L. acknowledge financial support from the University of Massachusetts Amherst under their Start-Up Funds and the ADVANCE Collaborative Grant. All authors acknowledge the UMass/URI Unity Cluster and MIT SuperCloud for providing high-performance computing (HPC) resources.
