A Deep Learning Method for Predicting the Allergenicity of Protein Sequences

faezesarlakifar/AllerTrans

AllerTrans

Overview

Allergens are a major concern in protein safety, especially with the growing use of recombinant proteins in medical products. Traditional allergenicity tests are costly and time-consuming, which creates a need for efficient bioinformatics solutions. In this study, we developed an enhanced deep learning model that classifies proteins as allergenic or non-allergenic based on their sequences. Our method extracts features with two protein language models, combines them in a deep neural network, and then applies ensemble modeling to improve performance. Under five-fold cross-validation, the proposed model achieved 97.91% sensitivity, 97.69% specificity, 97.80% accuracy, and a 99% AUC.
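For readers unfamiliar with the reported metrics, the following is a minimal sketch of how sensitivity, specificity, and accuracy are computed from binary predictions. It is not taken from the repository notebooks: the function name `binary_metrics` and the toy labels are hypothetical, and the label convention (1 = allergenic, 0 = non-allergenic) is an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for 0/1 labels.

    Assumed convention (not confirmed by the repository): 1 = allergenic,
    0 = non-allergenic.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    return {
        # Fraction of allergenic proteins that were correctly flagged
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of non-allergenic proteins that were correctly cleared
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Fraction of all predictions that were correct
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Toy labels, made up purely for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a five-fold cross-validation setting, these values would typically be computed per fold on the held-out split and then averaged.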

bioRxiv DOI: https://doi.org/10.1101/2024.08.09.607419

Online Prediction Tool

You can try the AllerTrans model directly on Hugging Face Spaces: https://huggingface.co/spaces/sfaezella/AllerTrans

A comprehensive flowchart that includes all of our experiments

[Figure: Experiments' Flowchart]

Repository Structure

Folders

  • feature-extraction

  • modeling

    • classic-machine-learning.ipynb: Classic machine learning models' training and evaluation, including SVM, RF, XGBoost, and KNN. This notebook also tests the effect of hyperparameter tuning and the autoencoder.
    • nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors.
    • single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
    • 1D-CNN.ipynb: Training and evaluation of a 1-dimensional CNN (Convolutional Neural Network) model.
  • model-checkpoints

    • Contains saved checkpoints of the trained models required for the nonlinear-DNN notebook.
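
The AAC (amino acid composition) feature vectors mentioned above can be illustrated with a short sketch. This is the standard definition of AAC (the fraction of each of the 20 standard amino acids in a sequence), not the repository's own implementation. The function name `aac_vector`, the residue ordering, and the choice to ignore non-standard residues are all assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order.
# The ordering used by the repository's notebooks may differ (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_vector(sequence):
    """Return a 20-dimensional amino acid composition vector.

    Each entry is the fraction of the corresponding standard residue.
    Non-standard symbols (e.g. X, B, Z) are ignored, which is an
    assumption of this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical short peptide, for illustration only
vec = aac_vector("MKVLAAGA")
print(dict(zip(AMINO_ACIDS, vec)))
```

A vector like this could be concatenated with per-protein language model embeddings before being passed to a classifier, although the exact combination used in the notebooks is not shown here.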

Dataset

The dataset used in this study comprises the public AlgPred 2.0 training and validation sets, which are publicly available from the AlgPred 2.0 web server.

Usage

  1. Feature Extraction:

    • Navigate to the feature-extraction folder and run the notebooks to extract the necessary feature vectors from the protein sequences. The input protein sequences must be in FASTA format.
  2. Model Training and Evaluation:

    • Navigate to the modeling folder.
    • Open and run the nonlinear-DNN.ipynb notebook to train and evaluate the deep neural network model. Make sure the required model checkpoints are available in the model-checkpoints folder.
    • For other models, run the respective notebooks (classic-machine-learning.ipynb, single-layer-LSTM.ipynb, 1D-CNN.ipynb).
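
Since the feature-extraction step expects FASTA input, the following minimal sketch shows one way to read FASTA records into (header, sequence) pairs using only the Python standard library. It is an illustration, not the code used in the notebooks: the function name `read_fasta` and the example records are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle.

    Header lines start with '>'; sequence lines that follow are joined
    until the next header. Blank lines are skipped.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record, if any
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical in-memory FASTA content, for illustration only
example = io.StringIO(">seq1 hypothetical protein\nMKVLA\nAGA\n>seq2\nGSHM\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

To read from disk instead, pass an open file object, for example `with open(path) as fh: records = list(read_fasta(fh))`, where `path` points to your own FASTA file.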