This is a package for mapping post-translational modifications (PTMs, obtained from ProteomeScout) to the gene, transcript, and exon that code for each PTM.
This package is a work in progress and does not yet have a setup.py file or a pip/conda installable release, so we recommend working with a local copy of this repository (for example, by using conda develop).
git clone https://github.com/NaegleLab/ExonPTMapper
conda create -n splicing
conda activate splicing
conda develop ExonPTMapper  # requires conda-build to be installed
To run this pipeline, data from both ProteomeScout and Ensembl need to be downloaded. Follow these instructions to get the appropriate data.
To download the data that the ProteomeScoutAPI works with, follow the steps below:
- Navigate to ProteomeScout. (This location will need to be updated once ProteomeScout moves off the WashU servers.)
- Go to 'Downloads'
- Download the zip file for mammalian PTMs
- Extract contents of zip file to 'ps_data_dir'
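The extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_ps_data` and the archive file name in the usage comment are hypothetical and not part of ExonPTMapper.

```python
import zipfile
from pathlib import Path


def extract_ps_data(zip_path, ps_data_dir):
    """Extract a downloaded ProteomeScout archive into ps_data_dir.

    Creates ps_data_dir if it does not exist and returns the sorted
    list of member names found in the archive.
    """
    target = Path(ps_data_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


# Example usage (the archive name is a placeholder for the file you downloaded):
# extract_ps_data("proteomescout_mammalia.zip", "/path/to/ps_data_dir")
```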
In addition, you will need to obtain the ProteomeScoutAPI and make a local copy of that repository. Eventually this step will be streamlined so that the API is downloaded automatically for use with the package. Download the API here.
From Ensembl, you will need to manually download the exon and coding sequences. To download them, use BioMart:
- Go to Ensembl
- Navigate to the BioMart tab
- Choose the latest Ensembl Genes version
- Under 'Filters', go to the gene tab. Check 'Transcript type' and select 'protein_coding'.
- Navigate to 'Attributes'
- For exon information, click Sequences -> Exon sequences. Under header information, check 'Exon stable ID'. Call this file 'exon_sequences.fasta.gz'.
- For coding sequences, click Sequences -> Coding sequences. Call this file 'coding_sequences.fasta.gz'.
- Click 'Results' (upper left corner)
- Download the compressed file and save it in processed_data_dir
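To check that a downloaded file is intact before running the pipeline, you can read it back with a few lines of Python. This is a minimal sketch using only the standard library; `read_fasta_gz` is a hypothetical helper, not a function provided by ExonPTMapper, and it keeps each header line exactly as BioMart wrote it.

```python
import gzip


def read_fasta_gz(path):
    """Parse a gzip-compressed FASTA file into a {header: sequence} dict."""
    records = {}
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: store the previous one first.
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            else:
                chunks.append(line)
    # Store the final record, if the file contained any.
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Example usage (the directory is a placeholder):
# exons = read_fasta_gz("/path/to/processed_data_dir/exon_sequences.fasta.gz")
# print(len(exons), "exon records read")
```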
The pipeline will automatically download any other necessary meta information about the exons, transcripts, and genes.
Optionally, you can append TRIFID functional scores from APPRIS to the transcript meta information by providing the scores file. You will need to download this file directly from APPRIS; only the TRIFID scores are needed.
To configure ExonPTMapper, you need to tell the package where your data files and the ProteomeScoutAPI are located. Open config.py and change the following variables:
- 'api_dir': location where ProteomeScoutAPI is stored
- 'ps_data_dir': directory which contains the ProteomeScout data files
- 'source_data_dir': where data from various databases (like ensembl) is saved
- 'processed_data_dir': directory which will contain the processed data files created by the pipeline
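As an illustration, the edited variables might look like the sketch below. The variable names are the ones listed above, but every path is a placeholder, and it is an assumption that config.py expects plain string paths (check the file itself for the expected form, such as trailing slashes). The helper `missing_directories` is hypothetical and only reports which configured directories do not exist yet.

```python
from pathlib import Path

# Variable names come from the list above; every path is a placeholder.
api_dir = "/path/to/ProteomeScoutAPI"
ps_data_dir = "/path/to/ps_data"
source_data_dir = "/path/to/source_data"
processed_data_dir = "/path/to/processed_data"


def missing_directories(settings):
    """Return the names of settings whose directory does not exist."""
    return [name for name, path in settings.items() if not Path(path).is_dir()]


settings = {
    "api_dir": api_dir,
    "ps_data_dir": ps_data_dir,
    "source_data_dir": source_data_dir,
    "processed_data_dir": processed_data_dir,
}
print("Missing directories:", missing_directories(settings))
```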
config.py performs several roles:
- Loads ProteomeScoutAPI
- Processes the Ensembl-to-UniProt mapping file into a usable pandas dataframe (called 'translator')
- Loads the 'available_transcripts.json' file, if it has been generated by processing.py previously. This file contains a list that indicates which transcripts have amino acid sequences matching those of their corresponding UniProt record.
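The "load only if it already exists" behavior described for 'available_transcripts.json' can be sketched as follows. This is an illustration rather than the code in config.py: the helper name `load_available_transcripts` is hypothetical, and it is an assumption that the file is stored in processed_data_dir.

```python
import json
from pathlib import Path


def load_available_transcripts(processed_data_dir):
    """Return the list stored in available_transcripts.json, or None.

    None signals that the file has not been generated yet.
    """
    path = Path(processed_data_dir) / "available_transcripts.json"
    if not path.is_file():
        return None
    with path.open() as handle:
        return json.load(handle)
```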
Loading in data and running the entire mapping pipeline only requires the use of one master function:
from ExonPTMapper import mapping
mapper = mapping.run_mapping()
This will load any data found within the processed_data_dir indicated in the config file. Based on the available information, it will start the mapping/projection procedure at a point that is not redundant (i.e. it will not calculate/obtain any information that already exists). If you wish to start the process from scratch, you may set restart = True.
The pipeline will also automatically save new data files in the processed_data_dir.
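The resume-or-restart behavior described above follows a common caching pattern. The sketch below illustrates that pattern in general terms; it is not ExonPTMapper's implementation. The function `load_or_compute`, the step name in the usage comment, and the use of JSON as the storage format are all assumptions made for the example.

```python
import json
from pathlib import Path


def load_or_compute(name, compute, processed_data_dir, restart=False):
    """Return the saved result for a step, computing and saving it if needed.

    If a file for `name` already exists and restart is False, the saved
    result is loaded and `compute` is not called. Otherwise `compute` is
    called and its result is written to processed_data_dir.
    """
    path = Path(processed_data_dir) / f"{name}.json"
    if path.is_file() and not restart:
        with path.open() as handle:
            return json.load(handle)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        json.dump(result, handle)
    return result


# Example usage (step name and directory are placeholders):
# exons = load_or_compute("exon_info", lambda: {"exon1": 120}, "/path/to/processed_data_dir")
```

With this pattern, a second call with the same name reads the saved file instead of recomputing, while restart=True forces the step to run again and overwrite the saved result.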