BioGenies/AMPBenchmark

AMPBenchmark

AMPBenchmark is part of our initiative to improve benchmarking standards in the field of antimicrobial peptide (AMP) prediction.

How to use the public data?

  1. Download the benchmark sequence data:
  2. Download the training sequence data for all methods and replications:
  3. Train your model on each of the training data sets. The class of a sequence is denoted by AMP=1 for AMPs and AMP=0 for negative samples; see the Sequence data section for details.
  4. Benchmark the trained models against our data. Make sure to use the subset of sequences for the appropriate replication (the replication number is denoted by, e.g., rep=1; see the Sequence data section for details).
  5. Submit the results in the format described below to the AMPBenchmark web server.
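Step 4 above can be sketched in Python as follows. This is a minimal illustration, not part of the AMPBenchmark tooling: the helper names `read_fasta` and `group_by_replication` are hypothetical, the two peptide sequences are made up, and the code assumes that every header ends with a replication token such as `rep1` (the `rep=1` spelling is tolerated as well).

```python
import re
from collections import defaultdict

# Assumption: the replication number is the trailing "_repN" (or "_rep=N") token.
REP_RE = re.compile(r"_rep=?(\d+)$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_replication(records):
    """Map replication number -> list of (header, sequence) records."""
    groups = defaultdict(list)
    for header, sequence in records:
        match = REP_RE.search(header)
        if match is None:
            raise ValueError(f"no replication token in header: {header!r}")
        groups[int(match.group(1))].append((header, sequence))
    return dict(groups)


# Headers follow the README examples; the sequences are invented placeholders.
example = """>DBAASP_10718_AMP=1_rep1
GLFDIVKKVV
>Seq1896_sampling_method=Gabere&Noble_AMP=0_rep4
MKTAYIAKQR
""".splitlines()

by_rep = group_by_replication(read_fasta(example))
```

A model trained on replication 1 of a training set would then be evaluated only on `by_rep[1]`.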

Data submission format

ID training_sampling AMP_probability
DBAASP_10018_AMP=1_rep1 dbAMP 0.97
DBAASP_3217_AMP=1_rep1 dbAMP 0.61
  • ID: must contain the sequence ID, as provided in the FASTA headers of the input sequences.
  • training_sampling: must contain the name of the negative sampling method used to train the model. Possible values are: AMAP, AmpGram, ampir-mature, AMPlify, AMPScannerV2, CS-AMPPred, dbAMP, Gabere&Noble, iAMP-2L, Wang-et-al, Witten&Witten. Remember that a proper benchmark requires you to train your model on every provided sampling method and to evaluate each trained model on the benchmark sequences of all sampling methods, using the appropriate replication.
  • AMP_probability: has to be in the range between 0 and 1.
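The three rules above can be checked before uploading. The sketch below is only an illustration: `validate_row` and `format_submission` are hypothetical helpers, and the space-separated layout is an assumption copied from the example rows, so the web server may expect a different delimiter.

```python
# Sampling method names exactly as listed in the README.
ALLOWED_SAMPLING = {
    "AMAP", "AmpGram", "ampir-mature", "AMPlify", "AMPScannerV2",
    "CS-AMPPred", "dbAMP", "Gabere&Noble", "iAMP-2L", "Wang-et-al",
    "Witten&Witten",
}


def validate_row(seq_id, training_sampling, amp_probability):
    """Return a list of problems found in one row (empty if the row is fine)."""
    problems = []
    if not seq_id:
        problems.append("empty ID")
    if training_sampling not in ALLOWED_SAMPLING:
        problems.append(f"unknown training_sampling: {training_sampling!r}")
    try:
        probability = float(amp_probability)
    except (TypeError, ValueError):
        problems.append(f"AMP_probability is not a number: {amp_probability!r}")
    else:
        # NaN fails this comparison too, so it is reported as out of range.
        if not 0.0 <= probability <= 1.0:
            problems.append(f"AMP_probability outside [0, 1]: {probability}")
    return problems


def format_submission(rows):
    """Build the submission text; the separator is an assumed single space."""
    lines = ["ID training_sampling AMP_probability"]
    for seq_id, sampling, probability in rows:
        problems = validate_row(seq_id, sampling, probability)
        if problems:
            raise ValueError(f"{seq_id}: {'; '.join(problems)}")
        lines.append(f"{seq_id} {sampling} {float(probability)}")
    return "\n".join(lines)


submission = format_submission([
    ("DBAASP_10018_AMP=1_rep1", "dbAMP", 0.97),
    ("DBAASP_3217_AMP=1_rep1", "dbAMP", 0.61),
])
```

Raising on the first invalid row is a design choice made here for brevity; collecting all problems and reporting them together may be more convenient for large result files.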

Example data for a random classifier can be downloaded from Dropbox.

Sequence data

The input data is hosted on Dropbox and GitHub. Note that this single file contains the data for all replications; each replication should be used separately, together with the matching replication of the training sets.

The training data sets are hosted on Dropbox and follow the same naming convention.

There are two types of input sequences:

  • positive sequences (e.g., DBAASP_10718_AMP=1_rep1): IDinDBAASP_class_replicateID.
  • negative sequences (e.g., Seq1896_sampling_method=Gabere&Noble_AMP=0_rep4): IDandSamplingMethod_class_replicateID.
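The two header layouts can be split into their fields with a single regular expression. This is a sketch based only on the two example IDs shown above; `HEADER_RE` and `parse_header` are hypothetical names, and headers in the real files that deviate from these examples would need a different pattern.

```python
import re

# Assumed layout: <name>[_sampling_method=<method>]_AMP=<0|1>_rep<N>
HEADER_RE = re.compile(
    r"^(?P<name>.+?)"
    r"(?:_sampling_method=(?P<sampling>.+?))?"
    r"_AMP=(?P<label>[01])"
    r"_rep=?(?P<rep>\d+)$"
)


def parse_header(header):
    """Split a sequence ID into name, sampling method, class label and replication."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    return {
        "name": match.group("name"),
        # None for positive sequences, which carry no sampling method.
        "sampling": match.group("sampling"),
        "label": int(match.group("label")),
        "rep": int(match.group("rep")),
    }


positive = parse_header("DBAASP_10718_AMP=1_rep1")
negative = parse_header("Seq1896_sampling_method=Gabere&Noble_AMP=0_rep4")
```

The `label` field gives the class needed for training, and `rep` identifies the replication a sequence belongs to.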

AMP sequences are derived from the DBAASP database.

MD5 checksum of AMPBenchmark_public.fasta: 58f1424c057aaeb64bc632cad6038cad.
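A downloaded copy can be compared against this checksum with Python's standard `hashlib` module. The function name `md5_of_file` and the local file path in the usage comment are assumptions made for this sketch.

```python
import hashlib

# Checksum published in this README for AMPBenchmark_public.fasta.
EXPECTED_MD5 = "58f1424c057aaeb64bc632cad6038cad"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the file was saved in the current directory:
#   if md5_of_file("AMPBenchmark_public.fasta") != EXPECTED_MD5:
#       raise SystemExit("checksum mismatch: download the file again")
```

Note that the changelog records a later fix to data processing, so a mismatch may also mean that the file was updated after the checksum was published.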

Citation

Katarzyna Sidorczuk, Przemysław Gagat, Filip Pietluch, Jakub Kała, Dominik Rafacz, Laura Bąkała, Jadwiga Słowik, Rafał Kolenda, Stefan Rödiger, Legana C H W Fingerhut, Ira R Cooke, Paweł Mackiewicz, Michał Burdukiewicz, Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data, Briefings in Bioinformatics, 2022, bbac343, https://doi.org/10.1093/bib/bbac343.

Important links

Contact

If you have any questions, suggestions or comments, contact Michal Burdukiewicz.

Changelog

  • 2023/01/11: fixed data processing.