raghavagps/antifp2

AntiFP2: Antifungal Protein Prediction Toolkit

This toolkit uses a fine-tuned ESM2 model to classify protein sequences as antifungal or non-antifungal, with optional integration of BLAST and MERCI motif scanning. It offers three standalone prediction scripts:

  1. ESM2-only classification using embeddings and a trained classifier.
  2. ESM2 + BLAST + MERCI for combined evidence-based prediction.
  3. MetaPipeline: Annotate contigs with Prokka and predict antifungal proteins using ESM2.

Overview

These scripts classify protein sequences as antifungal or non-antifungal using a fine-tuned ESM2 model. Optionally, predictions can be refined by incorporating:

  • BLAST results (from precomputed TSV files).
  • MERCI motif hits (from .locate files).

Installation

Install dependencies using pip:

pip install torch==2.4.1 biopython==1.85 pandas fair-esm==2.0.0 huggingface_hub==0.30.2

Or using Conda:

conda install -c conda-forge -c bioconda -c defaults pytorch==2.4.1 biopython==1.85 pandas fair-esm==2.0.0 huggingface_hub==0.30.2

Clone this repository:

git clone https://github.com/patrik-ackerman/antifp2.git
cd antifp2

(Optional) Create and activate the Conda environment defined in environment.yml:

conda env create -f environment.yml
conda activate antifp2

Scripts and Usage

1. ESM2-Only Prediction

python3 antifp2_ESM2.py --fasta input.fasta --output esm2_results.csv

Arguments:

  • --fasta: Input FASTA file containing protein sequences.
  • --output: Output CSV file to save predictions.
  • --threshold: (Optional) Classification threshold (default: 0.5).

Output:

A CSV file with the following columns:

  • ID: Sequence identifier.
  • probability: Predicted probability of being an antifungal protein.
  • prediction: Binary prediction (1 for antifungal, 0 for non-antifungal) based on the threshold.
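
As a minimal sketch of how this output might be consumed downstream, the snippet below reads results in the documented column layout with Python's standard csv module and applies a stricter cutoff to the probability column. The helper name select_antifungal, the sample rows, and the use of >= for the comparison are assumptions made for illustration; they are not taken from antifp2_ESM2.py.

```python
import csv
import io


def select_antifungal(csv_text, threshold=0.5):
    """Return the IDs whose probability meets the threshold (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: a sequence counts as antifungal when probability >= threshold.
    return [row["ID"] for row in reader if float(row["probability"]) >= threshold]


# Made-up rows in the documented column layout (not real predictions).
sample = "ID,probability,prediction\nseq1,0.91,1\nseq2,0.12,0\nseq3,0.64,1\n"

print(select_antifungal(sample, threshold=0.8))
```

To apply the same idea to a real run, the text of the results file (for example esm2_results.csv) could be read with open(...).read() and passed in place of sample.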

Help:

python3 antifp2_ESM2.py --help

2. ESM2 + BLAST + MERCI Prediction

python3 antifp2_ESM_blast.py \
    --fasta input.fasta \
    --outdir output_directory \
    --threshold 0.5 \
    --no-cleanup

Arguments:

  • --fasta: Input FASTA file containing protein sequences.
  • --outdir: Output directory to save predictions and intermediate files (optional; defaults to input FASTA directory).
  • --threshold: (Optional) Classification threshold (default: 0.5).
  • --no-cleanup: Flag to keep intermediate files like BLAST and MERCI outputs.

Description:

This script runs the fine-tuned ESM2 model to predict antifungal proteins, then adjusts predictions based on BLAST and MERCI motif hits:

  • Keeps only sequences that are 50 to 3000 amino acids long and contain only standard amino acids.
  • Runs MERCI motif scanning and BLASTp search against a configured database.
  • Adjusts probabilities by adding 0.5 for BLAST matches to known positives and subtracting 0.5 for BLAST matches to known negatives.
  • Adjusts probabilities by adding 0.5 for motif hits.
  • Clips combined probabilities between 0 and 1.
  • Outputs a CSV file with columns: ID, probability, blast_adjustment, motif_adjustment, combined, prediction.
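
The adjustment and clipping steps above can be sketched as follows. This is an illustrative reading of the description, not the script's actual code: the function name combine_scores is hypothetical, the adjustments are passed in as plain numbers instead of being derived from BLAST or MERCI output, and the use of >= against the threshold is an assumption.

```python
def combine_scores(probability, blast_adjustment=0.0, motif_adjustment=0.0,
                   threshold=0.5):
    """Add the evidence adjustments to the ESM2 probability and clip to [0, 1].

    Hypothetical helper: returns (combined, prediction).
    """
    combined = probability + blast_adjustment + motif_adjustment
    # Clip so the combined score stays a valid probability-like value.
    combined = min(1.0, max(0.0, combined))
    # Assumption: the positive class is assigned when combined >= threshold.
    return combined, int(combined >= threshold)


# A low ESM2 score rescued by a BLAST hit to a known positive.
print(combine_scores(0.25, blast_adjustment=0.5))
# A high ESM2 score with both bonuses is clipped at the upper bound.
print(combine_scores(0.90, blast_adjustment=0.5, motif_adjustment=0.5))
# A penalty that drives the sum below zero is clipped at the lower bound.
print(combine_scores(0.40, blast_adjustment=-0.5))
```

Because of the clipping, any penalty at least as large as the ESM2 probability forces the combined score to 0 when no bonus applies, so such a BLAST match to a known negative acts as a veto.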

Outputs:

  • <prefix>_predictions.csv: Final prediction CSV file with adjusted probabilities and binary predictions.
  • rejected_log.txt: Log of sequences rejected during filtering.

Example:

python3 antifp2_ESM_blast.py --fasta proteins.fasta --outdir results --threshold 0.6

3. MetaPipeline: Prokka + ESM2 Prediction

python3 antifp2_meta.py \
    --contigs contigs.fa \
    --outdir results_dir \
    --threshold 0.5 \
    --threads 8

Arguments:

  • --contigs: Input contigs FASTA file.
  • --outdir: Output directory to save prediction results and intermediate files.
  • --threshold: (Optional) Classification threshold (default: 0.5).
  • --threads: (Optional) Number of threads to use for Prokka (default: all available).
  • --no-cleanup: (Optional) Retain Prokka intermediate files.
  • --metagenome: (Optional) Enable Prokka's metagenome mode.

Workflow:

  1. Prokka Annotation: Annotates input contigs to predict coding sequences.
  2. Sequence Filtering: Removes proteins that are shorter than 50 or longer than 3000 amino acids, or that contain non-standard amino acids.
  3. Prediction: Runs the fine-tuned ESM2 model on valid sequences.
  4. Extraction: Saves predicted antifungal sequences to a FASTA file.
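
The filtering step in this workflow can be sketched as below, assuming that "standard amino acids" means the 20 canonical one-letter codes and that the length bounds are inclusive. The helper name rejection_reason, the reason strings, and the case-insensitive comparison are assumptions for illustration; the format of the real rejected_log.txt is not specified here.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def rejection_reason(sequence, min_len=50, max_len=3000):
    """Return None if the sequence passes the filter, else a short reason.

    Hypothetical helper; the reason strings are made up for this sketch.
    """
    seq = sequence.upper()  # Assumption: lowercase residues are acceptable.
    if len(seq) < min_len:
        return "too short"
    if len(seq) > max_len:
        return "too long"
    if set(seq) - STANDARD_AMINO_ACIDS:
        return "non-standard residues"
    return None


# Toy sequences: one too short, one valid, one with an ambiguous residue (X).
for name, seq in [("a", "MKT"), ("b", "A" * 120), ("c", "A" * 60 + "X")]:
    print(name, rejection_reason(seq))
```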

Outputs (in --outdir):

  • *_metapred.csv: CSV file with prediction results (columns: ID, probability, prediction).
  • *_antifp2.fasta: FASTA file of positively predicted antifungal proteins.
  • rejected_log.txt: Log of sequences excluded during filtering.
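
For readers who want to rebuild a FASTA of positives from the CSV with a different cutoff, the following sketch shows one way to pull selected records out of FASTA text using only the standard library. The function name extract_positive_records is hypothetical, and treating the first whitespace-delimited token of each header as the record ID is an assumption about how IDs in the CSV map to FASTA headers.

```python
def extract_positive_records(fasta_text, positive_ids):
    """Return FASTA text containing only records whose ID is in positive_ids."""
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumption: the record ID is the first token after '>'.
            header = line[1:].split()
            keep = bool(header) and header[0] in positive_ids
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Toy input: two records, only p1 is treated as a predicted positive.
fasta = ">p1 hypothetical protein\nMKT\nAAA\n>p2\nGGG\n"
print(extract_positive_records(fasta, {"p1"}), end="")
```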

Help:

python3 antifp2_meta.py --help

Citation

If you use this tool, please cite the following resources:
