Explore Alternative Metrics for Comprehensive Evaluation of Quantitative Performance When Using Mixed-Species Datasets #13
hollenstein announced in Hackathon proposals
Title
Alternative Metrics for Quantitative Proteomics Evaluation
Abstract
Proteomics experiments typically aim not only to identify but also to quantify protein content across samples. Assessing quantitative performance is crucial for evaluating instruments, acquisition methods, and data processing algorithms. Benchmarking is typically done with mixed-species proteome samples that contain varying amounts of the individual species, which creates a quantification ground truth. Evaluation of the benchmark results, however, often focuses on only a few global accuracy and precision metrics.
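For illustration, a minimal sketch of such global metrics is shown below: it compares observed protein log2 fold changes against the expected species ratios and reports a median error (accuracy) and spread (precision) per species. The column names, species labels, and expected ratios are assumptions made for the example, not values from any specific benchmark dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical expected log2 fold changes per species in a mixed-species
# benchmark (e.g. constant human background, yeast and E. coli spiked in
# at different ratios). The actual values depend on the sample design.
EXPECTED_LOG2FC = {"human": 0.0, "yeast": 1.0, "ecoli": -2.0}

def global_accuracy_precision(proteins: pd.DataFrame) -> pd.DataFrame:
    """Summarize accuracy and precision per species.

    `proteins` is assumed to contain the columns 'species',
    'intensity_a' and 'intensity_b' (protein intensities in the two
    conditions of the benchmark).
    """
    log2fc = np.log2(proteins["intensity_b"] / proteins["intensity_a"])
    observed = proteins.assign(log2fc=log2fc)

    rows = []
    for species, group in observed.groupby("species"):
        expected = EXPECTED_LOG2FC[species]
        rows.append({
            "species": species,
            "n_proteins": len(group),
            # accuracy: deviation of the median observed ratio from the truth
            "median_error": group["log2fc"].median() - expected,
            # precision: spread of the observed ratios
            "std_log2fc": group["log2fc"].std(),
        })
    return pd.DataFrame(rows)
```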
In this project, I propose to explore alternative, more detailed metrics for evaluating quantitative performance on mixed-species benchmark datasets. To investigate the characteristics and usefulness of these metrics, we will implement and compare different algorithms for summarizing ions to proteins. We will use a collection of ground truth datasets with different characteristics: measured on various MS instruments, acquired in both DIA and DDA mode, and analyzed with multiple programs. To easily apply the protein summarization algorithms across all datasets and automatically calculate the various performance metrics, we will create a flexible data processing pipeline in Python.
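As an example of the kind of summarization module the pipeline would compare, here is a rough sketch of a simple "mean of the top-n ions" summarization; the long-format table layout and column names are assumptions made for illustration only.

```python
import pandas as pd

def summarize_top_n(ions: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Example summarization: mean of the n most intense ions per protein.

    `ions` is assumed to be a long-format table with the columns
    'protein', 'sample' and 'intensity'. Other summarization modules
    (e.g. sum of all ions, median, MaxLFQ-style approaches) would expose
    the same call signature so they can be swapped in the pipeline.
    """
    top_ions = (
        ions.sort_values("intensity", ascending=False)
            .groupby(["protein", "sample"])
            .head(n)
    )
    return (
        top_ions.groupby(["protein", "sample"], as_index=False)["intensity"]
                .mean()
                .rename(columns={"intensity": "protein_intensity"})
    )
```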
This project aims to provide more detailed performance assessments in quantitative proteomics that will facilitate method development and the evaluation of strengths and weaknesses of data acquisition and processing pipelines.
Project Plan
During the hackathon we will focus on the four major tasks outlined below. Before we start, we will decide on common interfaces for the modules of the Python pipeline, which will allow us to work on tasks 1-3 in parallel. At the end, in task 4, we will integrate all components and use the results to discuss the usefulness of the different metrics and, if time permits, plan how to improve existing metrics and consider potential additional ones.
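To make the idea of common module interfaces concrete, the sketch below shows one possible way to define them; the actual interfaces will be agreed on at the start of the hackathon, so all names and signatures here are purely illustrative.

```python
from typing import Protocol
import pandas as pd

class SummarizationMethod(Protocol):
    """Turns an ion-level table into protein-level intensities."""
    name: str
    def summarize(self, ions: pd.DataFrame) -> pd.DataFrame: ...

class PerformanceMetric(Protocol):
    """Scores protein-level quantities against the mixed-species ground truth."""
    name: str
    def evaluate(self, proteins: pd.DataFrame,
                 expected_log2fc: dict[str, float]) -> pd.DataFrame: ...

def run_pipeline(datasets, methods, metrics):
    """Apply every summarization method to every dataset and collect all metrics."""
    results = []
    for dataset_name, ions, expected_log2fc in datasets:
        for method in methods:
            proteins = method.summarize(ions)
            for metric in metrics:
                scores = metric.evaluate(proteins, expected_log2fc)
                results.append(scores.assign(
                    dataset=dataset_name, method=method.name, metric=metric.name))
    return pd.concat(results, ignore_index=True)
```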
Note: An alternative approach for evaluating the performance metrics is to use only one protein summarization algorithm, but to add varying amounts of noise and bias to the reported ion intensities. One could then investigate how well the effects of these manipulations are reflected in the calculated metrics. If time permits, we may implement additional pipeline modules that introduce such errors into the ion intensities and include them in task 4 when running the pipeline and characterizing the metrics.
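Such an error-injection module could be as simple as the sketch below, which perturbs ion intensities with multiplicative log-normal noise and an optional constant bias; the parameter values and the column name are placeholders for illustration.

```python
import numpy as np
import pandas as pd

def add_noise_and_bias(ions: pd.DataFrame, noise_sd: float = 0.2,
                       bias_log2: float = 0.0, seed: int = 0) -> pd.DataFrame:
    """Perturb ion intensities to test how well the metrics detect it.

    Multiplicative noise is drawn on the log2 scale with standard
    deviation `noise_sd`; `bias_log2` shifts all intensities by a
    constant log2 offset. The 'intensity' column name is an assumption.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=bias_log2, scale=noise_sd, size=len(ions))
    perturbed = ions.copy()
    perturbed["intensity"] = perturbed["intensity"] * np.power(2.0, noise)
    return perturbed
```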
After the Hackathon
Technical Details
Contact information
David M. Hollenstein
University of Vienna, Austria
Mass Spectrometry Facility of the Max Perutz Labs (Part of Vienna BioCenter)
[email protected]