This is a command line program written in Java. It is a module of PathwayMatcher.
This module gathers reference biological data necessary to perform pathway search and analysis, and creates static mapping files that are loaded during execution of PathwayMatcher.
The extractor has two main components: one maps genetic variants to genes and proteins, and the other maps proteins and proteoforms to pathways.
The necessary mappings for the pathway search are:
- SNP --> Gene name
- SNP --> Protein (UniProt accession)
- Protein --> Proteoforms
- Protein --> Reactions
- Proteoform --> Reactions
- Reactions --> Pathways
- Pathways --> Top Level Pathways
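The chain of mappings listed above can be pictured as a series of one-to-many lookups, where the output of one step is the input of the next. The following is a minimal sketch of that idea, not the PathwayMatcher implementation: the class name, the method names and all identifiers (`snpA`, `proteinA`, `reaction1`, ...) are hypothetical placeholders, and plain `java.util` maps stand in for whatever data structures the module actually uses.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MappingChainSketch {

    /** Follows one mapping step: collects the values of every key that is present. */
    static Set<String> step(Map<String, Set<String>> mapping, Set<String> keys) {
        Set<String> result = new TreeSet<>();
        for (String key : keys) {
            result.addAll(mapping.getOrDefault(key, Collections.emptySet()));
        }
        return result;
    }

    /** Chains SNP -> protein -> reaction -> pathway -> top level pathway. */
    static Set<String> topLevelPathwaysForSnp(
            String rsId,
            Map<String, Set<String>> snpToProteins,
            Map<String, Set<String>> proteinsToReactions,
            Map<String, Set<String>> reactionsToPathways,
            Map<String, Set<String>> pathwaysToTopLevelPathways) {
        Set<String> proteins = step(snpToProteins, Collections.singleton(rsId));
        Set<String> reactions = step(proteinsToReactions, proteins);
        Set<String> pathways = step(reactionsToPathways, reactions);
        return step(pathwaysToTopLevelPathways, pathways);
    }

    /** Builds a one-to-many map from alternating key, value arguments (placeholder data only). */
    static Map<String, Set<String>> mapOf(String... keyValuePairs) {
        Map<String, Set<String>> map = new HashMap<>();
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            map.computeIfAbsent(keyValuePairs[i], k -> new TreeSet<>()).add(keyValuePairs[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        // All identifiers below are made-up placeholders, not real rsIds or accessions.
        Set<String> result = topLevelPathwaysForSnp(
                "snpA",
                mapOf("snpA", "proteinA"),
                mapOf("proteinA", "reaction1", "proteinA", "reaction2"),
                mapOf("reaction1", "pathwayX", "reaction2", "pathwayY"),
                mapOf("pathwayX", "topLevelT", "pathwayY", "topLevelT"));
        System.out.println(result); // prints [topLevelT]
    }
}
```

Because every step is a set-valued lookup, an input that is missing from any mapping simply yields an empty result instead of an error.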
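The necessary mappings for the interaction networks are:
- Protein --> Complexes
- Protein --> Entity sets
- Proteoform --> Complexes
- Proteoform --> Entity sets
The module is made up of the following components: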
- VepFolderProcessor: Creates table files that map genetic variants to gene names and UniProt [1] protein accessions using the Variant Effect Predictor [2].
  - Input: No file is needed as input.
  - Output: Tables with the mapping from genetic variants to gene names and Swiss-Prot entries (UniProt), one table for each chromosome: 1.gz, 2.gz,...,22.gz
- Extractor: Creates the mapping files to go from gene names, proteins and proteoforms to the reactions and pathways of Reactome [3].
  - Input:
    - A running instance of Neo4j with the Reactome graph database loaded.
    - The tables generated with VepFolderProcessor: 1.gz, 2.gz,...,22.gz
  - Output: Serialized files ready to be used by PathwayMatcher:
    - Entity lists:
      - proteins.gz
      - reactions.gz
      - pathways.gz
    - Static mappings for pathway search:
      - Pairs of chromosome and base pair to protein UniProt accessions: chrBpToProteins1.gz,...,chrBpToProteins22.gz
      - SNP rsIds to protein UniProt accessions: rsIdsToProteins1.gz,...,rsIdsToProteins22.gz
      - Gene names to protein UniProt accessions: genesToProteins.gz
      - Ensembl protein identifiers to UniProt accessions: ensemblToProteins.gz
      - Protein UniProt accessions to proteoforms: proteinsToProteoforms.gz
      - Protein UniProt accessions to reactions: proteinsToReactions.gz
      - Proteoforms to reactions: proteoformsToReactions.gz
      - Pathways to top level pathways: pathwaysToTopLevelPathways.gz
    - Static mappings for interaction networks:
      - Protein UniProt accessions to the complexes they can form: proteinsToComplexes.gz
      - Protein UniProt accessions to entity sets: proteinsToSets.gz
      - Proteoforms to complexes: proteoformsToComplexes.gz
      - Proteoforms to entity sets: proteoformsToSets.gz
- ExtractorPeptides: Gathers the 'Proteotypic Peptide' set from ProteomeTools [4] into a single list file. This is an extra command line application that was used as support during the development of PathwayMatcher. It is not needed for the main functionality.
- ExtractorPsiMod: HTTP client application that gathers the available modifications from PSI-MOD [5], the community standard for representation of protein modification data. This is also an extra command line application that is not needed for the main functionality, but it is useful when a user wants to get the list of available modifications programmatically.
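The serialized `.gz` mapping files produced by the Extractor are described above only as "serialized files", so their exact on-disk format is not specified here. The sketch below assumes, purely for illustration, that such a file is a Java object written with built-in object serialization and compressed with gzip. The class name, the method names and the placeholder identifiers are hypothetical, and the real files may use a different encoding.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.TreeSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzSerializationSketch {

    /** Writes an object with Java serialization through a gzip stream (assumed format). */
    static void write(Path file, Serializable object) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(object);
        }
    }

    /** Reads back an object that was stored with write(). */
    static Object read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(Files.newInputStream(file)))) {
            return in.readObject();
        }
    }

    /** Stores the object in a temporary .gz file, loads it again and removes the file. */
    static Object roundTrip(Serializable object) {
        try {
            Path file = Files.createTempFile("mappingSketch", ".gz");
            try {
                write(file, object);
                return read(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content standing in for a mapping such as proteins to reactions.
        HashMap<String, TreeSet<String>> mapping = new HashMap<>();
        mapping.computeIfAbsent("proteinA", k -> new TreeSet<>()).add("reaction1");
        mapping.computeIfAbsent("proteinA", k -> new TreeSet<>()).add("reaction2");

        Object restored = roundTrip(mapping);
        System.out.println(mapping.equals(restored)); // prints true
    }
}
```

Storing the mappings in a precomputed, compressed form like this is one way to let a program load them at start-up without querying the graph database at run time.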
[1] UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Research 46, 2699-2699, doi:10.1093/nar/gky092 (2018).
[2] McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biology 17, 122, doi:10.1186/s13059-016-0974-4 (2016).
[3] Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Research 46, D649-D655, doi:10.1093/nar/gkx1132 (2018).
[4] Desiere, F. et al. The PeptideAtlas project. Nucleic Acids Research 34, D655-D658 (2006).
[5] Montecchi-Palazzi, L. et al. The PSI-MOD community standard for representation of protein modification data. Nature Biotechnology 26, 864, doi:10.1038/nbt0808-864 (2008).