Building a Gold-Standard Protein Sequence Dataset for Functional Annotation #5

Closed
rababerladuseladim opened this issue Sep 18, 2019 · 1 comment


rababerladuseladim commented Sep 18, 2019

Abstract

Metaproteomics is the analysis of proteins in samples composed of multiple organisms. One major use case is investigating the functional composition of a sample. Several tools can link identified sequences to functional information (e.g. Unipept, Prophane, MetaGOmics). Unfortunately, the performance of these tools is hard to assess because data with known ground truth at the functional level is lacking. The target benchmark dataset would consist of a diverse range of peptides/proteins with high-quality, experimentally validated functional annotations. Two obstacles need to be overcome to create such a dataset: (1) protein inference, which is more complicated in metaproteomics than in single-organism proteomics because peptides can match homologues within the same organism and across multiple organisms, and (2) the low annotation level of proteins in the metaproteomic context (many proteins have no function assigned to them, not even a putative one). We plan to develop a concept for how the ideal gold-standard dataset should be composed and to generate it accordingly. Based on this dataset, a functional benchmark of the aforementioned tools can be initiated.

Work plan

  • Compile a sequence database of proteins with experimentally validated functions
  • Generate simulated peptide identification lists from the database that closely resemble the characteristics of metaproteomics results
  • Specify benchmarking criteria
  • (Potentially) benchmark existing tools against the generated data
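
The second step of the work plan could start from an in silico digest of the compiled database. The following is a minimal sketch, not the project's method: it assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), and the function names, length limits, and toy sequences are all hypothetical. The peptide-to-protein index shows how shared peptides, the source of the protein inference problem described in the abstract, can be tracked as ground truth.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence, missed_cleavages=0, min_len=6, max_len=30):
    """Return the set of peptides from a simplified in silico tryptic digest.

    Assumption: cleavage occurs after K or R unless the next residue is P.
    """
    # Zero-width split at every position preceded by K/R and not followed by P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = set()
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def build_peptide_index(proteins, **digest_kwargs):
    """Map each peptide to the sorted list of protein IDs that contain it."""
    index = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_digest(sequence, **digest_kwargs):
            index[peptide].add(protein_id)
    return {peptide: sorted(ids) for peptide, ids in index.items()}


# Toy sequences (hypothetical, not taken from any database).
toy_proteins = {
    "protA": "MKTAYIAKQRGLSPEELR",
    "protB": "AAAKTAYIAKGGGR",
}
peptide_index = build_peptide_index(toy_proteins, min_len=3)
shared = {p: ids for p, ids in peptide_index.items() if len(ids) > 1}
```

In this toy case `shared` contains the single peptide `TAYIAK`, which maps to both proteins. A realistic simulation would additionally need to model which peptides are actually identified, which this sketch does not attempt.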

Technical details

  • Datasets are derived from reference databases such as SwissProt
  • Tools to be benchmarked:
    • Unipept
    • Prophane
    • MetaGOmics
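
One possible way to restrict a reference database to proteins with validated functions is to keep only annotations backed by experimental evidence. The sketch below assumes that annotations are available as (accession, GO term, evidence code) tuples and that the Gene Ontology experimental evidence codes (EXP, IDA, IPI, IMP, IGI, IEP) are an acceptable definition of "validated". Both are assumptions rather than decisions stated in this proposal, and the accessions and record layout are hypothetical.

```python
from collections import defaultdict

# Gene Ontology evidence codes that denote experimental support.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


def experimentally_validated(annotations, allowed_codes=EXPERIMENTAL_CODES):
    """Group GO terms by accession, keeping only allowed evidence codes.

    `annotations` is an iterable of (accession, go_id, evidence_code) tuples.
    Proteins without any experimentally supported term are dropped.
    """
    validated = defaultdict(set)
    for accession, go_id, evidence_code in annotations:
        if evidence_code in allowed_codes:
            validated[accession].add(go_id)
    return dict(validated)


# Toy records with hypothetical accessions.
toy_annotations = [
    ("ACC001", "GO:0005524", "IDA"),  # experimental: kept
    ("ACC001", "GO:0016301", "IEA"),  # electronic inference: dropped
    ("ACC002", "GO:0005524", "IEA"),  # only electronic: protein dropped
    ("ACC003", "GO:0016301", "IMP"),  # experimental: kept
]
gold_standard = experimentally_validated(toy_annotations)
```

Whether high-throughput experimental codes should also count as validated is a design choice left open here; passing a different `allowed_codes` set would cover that case.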

Contact information

Henning Schiebenhoefer - Robert Koch-Institut (Germany) - [email protected]

RalfG (Member) commented Oct 8, 2019

This hackathon project will be merged with #1 by @pverscha. See #1 for more info.

RalfG closed this as completed Oct 8, 2019