The proteomics quantification format, extending mzTab for large scale datasets.

bigbio/quantms.io

quantms.io

quantms is a Nextflow pipeline for the analysis of quantitative proteomics data. The pipeline is based on the OpenMS framework and DIA-NN, and it is designed to analyze large-scale experiments. The main outputs of the quantms workflow are the following:

  • mzTab files with the identification and quantification information.
  • MSstats input file with the peptide quantification values needed for the MSstats analysis.
  • MSstats output file with the differential expression values for each protein.
  • The input SDRF of the pipeline if available.

While all of these formats are well-known standards in the proteomics community, they are difficult to use in big data analysis projects. They are also difficult to extend, and they do not readily provide multiple views of the underlying data. For example, for big datasets it is extremely hard to retrieve the identified peptides and features and their corresponding intensities from mzTab. It is equally difficult to get the protein quantification values for a given sample.

Here, we aim to formalize and develop a more standardized format that better represents identification and quantification results and also enables new use cases for proteomics data analysis. The main use cases for the format are:

  • Fast and easy visualization of the identification and quantification results.
  • Easy integration with other omics data.
  • Easy integration with sample metadata.
  • AI/ML model development based on identification and quantification results.
  • Easy data retrieval for big datasets and large-scale collections of proteomics data.

Note: We are not trying to replace the mzTab format, but to provide a new format that enables AI-related use cases. Most of the features of the mzTab format will be included in the new format.

Data model

quantms.io can be seen as a multiple-view representation of the results of a proteomics data analysis. Each view can be serialized in different formats depending on the use case. The data model of quantms.io defines two main things: the view and how the view is serialized.

  • The data model view defines the structure, fields and properties included in a view, for example for each peptide, PSM, feature or protein.
  • The data serialization defines the format in which the view will be serialized and what features of serialization will be supported, for example compression, indexing or slicing.
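
| view         | file class        | serialization format | definition   | example              |
|--------------|-------------------|----------------------|--------------|----------------------|
| psm          | psm_file          | parquet              | psm          | psm example          |
| feature      | feature_file      | parquet              | feature      | feature example      |
| absolute     | absolute_file     | tsv                  | absolute     | absolute example     |
| differential | differential_file | tsv                  | differential | differential example |
| sdrf         | sdrf_file         | tsv                  | metadata     | sdrf example         |
| project      | -                 | json                 | project      | --                   |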

Note: Views can be extended and new views can be added to the format.
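
As a rough illustration of the split between a view and its serialization, the table above can be expressed as a small lookup. This is only a sketch: the names `ViewSpec`, `VIEWS`, `serialization_for` and `views_with_format` are hypothetical and are not part of quantms.io or any of its tools. Only the view names, file classes and formats are taken from the table.

```python
from typing import Dict, List, NamedTuple, Optional


class ViewSpec(NamedTuple):
    """One row of the view table (hypothetical helper, not a quantms.io API)."""
    file_class: Optional[str]  # None where the table shows "-"
    serialization: str         # "parquet", "tsv" or "json"


# Values copied from the view table in this README.
VIEWS: Dict[str, ViewSpec] = {
    "psm": ViewSpec("psm_file", "parquet"),
    "feature": ViewSpec("feature_file", "parquet"),
    "absolute": ViewSpec("absolute_file", "tsv"),
    "differential": ViewSpec("differential_file", "tsv"),
    "sdrf": ViewSpec("sdrf_file", "tsv"),
    "project": ViewSpec(None, "json"),
}


def serialization_for(view: str) -> str:
    """Return the serialization format listed for a view."""
    try:
        return VIEWS[view].serialization
    except KeyError:
        raise ValueError(f"unknown view: {view!r}") from None


def views_with_format(fmt: str) -> List[str]:
    """Return the views listed with the given serialization format, in table order."""
    return [name for name, spec in VIEWS.items() if spec.serialization == fmt]
```

Because views can be extended, a lookup like this would need a new entry whenever a view is added. Keeping the mapping in one place makes that a one-line change.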

Introduction to quantms.io

A quantms.io file is a collection of views aggregated into a .qms folder. Inside that folder, a file called project.json MUST be present. Please read about the project view for more information.
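
The project.json requirement can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming only what is stated above: the collection is a folder, and a file named project.json must sit directly inside it. The function name `load_project` is hypothetical, and the sketch does not validate the contents of project.json beyond checking that it parses as JSON.

```python
import json
from pathlib import Path


def load_project(folder):
    """Load project.json from a quantms.io folder, failing if it is missing.

    Hypothetical helper: it checks only that the folder exists and that
    project.json is present and is valid JSON.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"not a folder: {folder}")

    project_file = folder / "project.json"
    if not project_file.is_file():
        raise FileNotFoundError(f"project.json MUST be present in {folder}")

    with project_file.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A caller would pass the path to the folder and receive the parsed JSON object, or an exception if the folder does not satisfy the requirement.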

The introduction to the format, its concepts and more detailed topics about serialization can be read here.

How to contribute

External contributors, researchers and the proteomics community are more than welcome to contribute to this project.

Contribute to the specification: you can contribute ideas or refinements by opening an issue in the issue tracker or submitting a pull request.

Core contributors and collaborators

The project is run by different groups:

  • Yasset Perez-Riverol (PRIDE Team, European Bioinformatics Institute - EMBL-EBI, U.K.)

IMPORTANT: If you contribute to this specification, please make sure to add your name to the list of contributors.

Code of Conduct

As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.

How to cite

Copyright notice

This information is free; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This information is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this work; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
