Allergens are a major concern in protein safety, especially with the growing use of recombinant proteins in medical products. Traditional allergenicity tests are costly and time-consuming, prompting the need for efficient bioinformatics solutions. In this study, we developed an enhanced deep learning model that classifies proteins as allergenic or non-allergenic based on their sequences. Our method extracts features using two protein language models and combines them in a deep neural network, followed by ensemble modeling to improve performance. The proposed model achieved strong results: 97.91% sensitivity, 97.69% specificity, 97.80% accuracy, and a 99% AUC using five-fold cross-validation.
bioRxiv DOI: https://doi.org/10.1101/2024.08.09.607419
You can try the AllerTrans model directly on Hugging Face Spaces: https://huggingface.co/spaces/sfaezella/AllerTrans
- feature-extraction
  - 1. ESM-v2-embeddings.ipynb: Extracts embeddings using the ESM-v2 model. Takes protein sequences in FASTA format as input.
  - 2. ProtT5-embeddings.ipynb: Extracts embeddings using the ProtT5 model. Takes protein sequences in FASTA format as input.
  - 3. AAC-feature-vectors.ipynb: Generates amino acid composition (AAC) feature vectors. Takes protein sequences in FASTA format as input.
- modeling
  - classic-machine-learning.ipynb: Training and evaluation of classic machine learning models (SVM, RF, XGBoost, and KNN). This notebook also tests the effect of hyperparameter tuning and of the autoencoder.
  - nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors.
  - single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
  - 1D-CNN.ipynb: Training and evaluation of a one-dimensional CNN (convolutional neural network) model.
- model-checkpoints
  - Contains saved checkpoints of the trained models required for the nonlinear-DNN notebook.
The dataset used in this study comprises the public AlgPred 2.0 train and validation sets, which are available here.
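As an illustration of the simplest of the three feature types, the sketch below parses FASTA text and computes an amino acid composition (AAC) vector for each sequence. It is not the code from AAC-feature-vectors.ipynb: the helper names `read_fasta` and `aac_vector`, the alphabet ordering, and the choice to ignore non-standard residues when normalizing are all assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids; this ordering is an assumption of the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def aac_vector(sequence):
    """Return the 20-dimensional fraction of each standard amino acid.

    Non-standard symbols (e.g. X, B, Z) are ignored when normalizing,
    which is an assumption of this sketch.
    """
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical two-line record, used only to show the call pattern.
example = ">seq1\nACDA\nAC\n"
features = {h: aac_vector(s) for h, s in read_fasta(example).items()}
```

Each resulting vector has 20 entries that sum to 1 for any sequence containing at least one standard residue, so it can be concatenated with, or used alongside, the language-model embeddings.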
- Feature Extraction:
  - Navigate to the feature-extraction folder and run the notebooks to extract the necessary feature vectors from protein sequences. Input protein sequences in FASTA format.
- Model Training and Evaluation:
  - Navigate to the modeling folder.
  - Open and run the nonlinear-DNN.ipynb notebook to train and evaluate the deep neural network model. Ensure the required model checkpoints are available in the model-checkpoints folder.
  - For other models, run the respective notebooks (classic-machine-learning.ipynb, single-layer-LSTM.ipynb, 1D-CNN.ipynb).
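When comparing the notebooks' results against the reported numbers, it can help to recompute sensitivity, specificity, and accuracy from the predicted labels. The following is a minimal sketch, assuming binary labels where 1 means allergenic and 0 means non-allergenic; the function name `classification_metrics` and the toy labels are hypothetical and are not taken from the repository. AUC is not covered here because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and accuracy for binary labels.

    Assumes 1 = allergenic (positive class) and 0 = non-allergenic.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    return {
        # Fraction of true allergens that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of non-allergens that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Fraction of all predictions that were correct.
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Toy labels for illustration only; these are not results from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Under five-fold cross-validation, one reasonable approach is to compute these metrics once per fold and then average them, although the notebooks may aggregate differently.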