This repository provides the source code and example datasets needed to reproduce the test environment for each model introduced in the paper "Simple Ensemble of Sequence-only Protein Embeddings Outperforms Multi-data Predictors for Bacterial Essential Genes", along with gene essentiality datasets for individual strains. The example code and data cover protein sequence embedding and essential gene prediction, and the same code can be run on your own data with minor modifications.
- Protein Sequence Is All You Need: Predict bacterial essential genes from their protein sequences alone, without integrating complex multi-feature data.
- Extended Bacterial Essential Gene Dataset: Experimental essentiality data (features: 'essentiality', 'protein_seq', 'dna_seq', 'genome_id', 'locus_tag', etc.) of over 280,000 bacterial genes collected from 79 studies.
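
As an illustration of the "simple ensemble" idea in the paper title, the following is a minimal sketch of soft voting: per-gene probabilities from several classifiers are averaged and the mean is thresholded. This is an assumption about one common way to combine models, not necessarily the procedure used in the paper or notebooks. The function name `soft_vote`, the 0.5 threshold, and all scores are hypothetical; only the embedding names come from the example options in this README.

```python
from statistics import fmean


def soft_vote(probs_by_model, threshold=0.5):
    """Average per-gene probabilities across models, then threshold the mean.

    probs_by_model maps a model name to a list of predicted probabilities,
    one per gene, in the same gene order for every model.
    """
    lengths = {len(p) for p in probs_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("every model must score the same number of genes")
    # zip(*...) regroups the scores gene by gene instead of model by model
    mean_probs = [fmean(scores) for scores in zip(*probs_by_model.values())]
    labels = [int(p >= threshold) for p in mean_probs]  # 1 = predicted essential
    return mean_probs, labels


# Hypothetical scores for three genes, one list per embedding version
probs = {
    "clstm": [0.9, 0.2, 0.6],
    "esm2": [0.8, 0.1, 0.4],
    "bert": [0.7, 0.3, 0.5],
    "t5": [0.8, 0.2, 0.3],
}
mean_probs, labels = soft_vote(probs)
print(labels)
```

With these made-up scores, only the first gene has a mean probability above the threshold. A weighted average or majority vote would be equally plausible alternatives.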
- data/raw_data/: Essential gene datasets (including non-essential genes) for each strain.
- data/test_exam/: Example test datasets consisting of genes from the E. coli Keio collection.
- models/: Models to predict essential genes ('classifier ~') or encode protein sequences ('embed_custom').
- results/: Model evaluation, prediction results, and model training history.
- sources/: Jupyter notebooks for sequence embedding ('emb ~') or model testing and prediction ('test ~').
- Clone the repository:
    git clone https://github.com/sblabkribb/essprotseq.git
    cd essprotseq
- Install dependencies:
    pip install -r requirements.txt
- Set options (data_path, etc.) in each notebook:
    # Set options (example of 'test-indiv_class.ipynb')
    embed_ver = ["clstm", "esm2", "bert", "t5"]
    data_path = "../data/test_exam/"
    model_path = "../models/classifier_indiv/"
    result_path = "../results/"
- Run the notebook.
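
Because the option paths are relative, a notebook can fail partway through if it is launched from an unexpected working directory. The sketch below shows one way to check the configured directories before running anything. The helper `missing_dirs` is hypothetical and not part of the repository; the option names and paths are copied from the example above and assume the working directory is `sources/`.

```python
from pathlib import Path


def missing_dirs(options, base=Path(".")):
    """Return the option names whose directory does not exist under base."""
    return [name for name, rel in options.items() if not (base / rel).is_dir()]


# Paths as set in the example options (relative to the notebook's directory)
options = {
    "data_path": "../data/test_exam/",
    "model_path": "../models/classifier_indiv/",
    "result_path": "../results/",
}

missing = missing_dirs(options)
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("All configured directories found.")
```

Pasting this into the first cell of a notebook, after the options are set, reports path problems before any embedding or model loading starts.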
To cite this work, please reference:
Seongbo Heo et al. "Simple Ensemble of Sequence-only Protein Embeddings Outperforms Multi-data Predictors for Bacterial Essential Genes" Synthetic Biology Research Center, KRIBB.
This project was supported by the Korea Research Institute of Bioscience and Biotechnology (KRIBB) and the National Research Foundation of Korea.