AllerTrans

A Deep Learning Method for Predicting the Allergenicity of Protein Sequences

Overview

Allergens are a major concern in protein safety, especially with the growing use of recombinant proteins in medical products. Traditional allergenicity tests are costly and time-consuming, which motivates efficient bioinformatics solutions. In this study, we developed an enhanced deep learning model that classifies proteins as allergenic or non-allergenic based on their sequences. Our method extracts features using two protein language models and combines them in a deep neural network, followed by ensemble modeling to improve performance. Under five-fold cross-validation, the proposed model achieved 97.91% sensitivity, 97.69% specificity, 97.80% accuracy, and an AUC of 99%.

bioRxiv DOI: https://doi.org/10.1101/2024.08.09.607419
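
For readers unfamiliar with the metrics quoted in the overview, the following is a minimal sketch of how sensitivity, specificity, and accuracy are computed from binary predictions. It is not taken from the repository's notebooks: the function name `binary_metrics` and the label convention (1 = allergen, 0 = non-allergen) are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for binary labels.

    Assumed convention (not from the repository): 1 = allergen, 0 = non-allergen.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # allergens found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-allergens found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed allergens

    # Guard against empty classes so the sketch never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the study.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(truth, preds))
```

In a five-fold cross-validation setting, these values would typically be computed per fold and then averaged; how the study aggregates them is described in the paper rather than here.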

Online Prediction Tool

You can try the AllerTrans model directly on Hugging Face Spaces: https://huggingface.co/spaces/sfaezella/AllerTrans

A comprehensive flowchart that includes all of our experiments

Repository Structure

Folders

feature-extraction
- 1. ESM-v2-embeddings.ipynb: Extracts embeddings using the ESM-v2 model. Input: protein sequences in FASTA format.
- 2. ProtT5-embeddings.ipynb: Extracts embeddings using the ProtT5 model. Input: protein sequences in FASTA format.
- 3. AAC-feature-vectors.ipynb: Generates amino acid composition (AAC) feature vectors. Input: protein sequences in FASTA format.
modeling
- classic-machine-learning.ipynb: Training and evaluation of classic machine learning models, including SVM, RF, XGBoost, and KNN. This notebook also tests the effect of hyperparameter tuning and of the autoencoder.
- nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors.
- single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
- 1D-CNN.ipynb: Training and evaluation of a 1-dimensional CNN (Convolutional Neural Network) model.
model-checkpoints
- Contains saved checkpoints of the trained models required for the nonlinear-DNN notebook.
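
As an illustration of the amino acid composition (AAC) features mentioned above, here is a minimal sketch of how such a vector can be computed: one entry per standard amino acid, holding that residue's relative frequency in the sequence. This is not the code from AAC-feature-vectors.ipynb. The function name `aac_vector`, the alphabetical residue order, and the choice to ignore non-standard residues are all assumptions.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_vector(sequence):
    """Return a 20-dimensional amino acid composition vector.

    Each entry is the fraction of residues in `sequence` equal to the
    corresponding letter in AMINO_ACIDS. Non-standard symbols (e.g. X, B, Z)
    are ignored, which is an assumption of this sketch.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues at all: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vec = aac_vector("AACD")
    # Print only the non-zero entries of the composition vector.
    print({aa: v for aa, v in zip(AMINO_ACIDS, vec) if v > 0})
```

Unlike the ESM-v2 and ProtT5 embeddings, this representation has a fixed length of 20 regardless of sequence length and needs no pretrained model.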

Dataset

This study uses the public AlgPred 2.0 training and validation sets, which are available here.

Usage

Feature Extraction:
- Navigate to the feature-extraction folder and run the notebooks to extract the necessary feature vectors from protein sequences. Provide the input protein sequences in FASTA format.
Model Training and Evaluation:
- Navigate to the modeling folder.
- Open and run the nonlinear-DNN.ipynb notebook to train and evaluate the deep neural network model. Ensure the required model checkpoints are available in the model-checkpoints folder.
- For other models, run the respective notebooks (classic-machine-learning.ipynb, single-layer-LSTM.ipynb, 1D-CNN.ipynb).
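
Since the feature-extraction notebooks expect FASTA input, the sketch below shows one way to read FASTA records into (header, sequence) pairs using only the standard library. It is an illustrative helper, not the parser used in the notebooks; the function name `read_fasta` is hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples.

    `lines` can be any iterable of strings, such as an open file handle.
    Sequences that span several lines are joined into one string.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Toy sequences for illustration only.
    example = [">seq1 example protein", "MKT", "AYI", "", ">seq2", "GAV"]
    for name, seq in read_fasta(example):
        print(name, len(seq))
```

To read from disk, pass an open file handle, for example `read_fasta(open(path))` with `path` pointing at your own FASTA file.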

