Python package for processing exon information from Ensembl/Biomart, mapping this information to PTM information from ProteomeScout, and doing downstream analysis

NaegleLab/ExonPTMapper

This is a package for mapping post-translational modifications (PTMs, obtained from ProteomeScout) to the gene, transcript, and exon that code for each PTM.

Generating data files

Setting up your environment

There is currently no setup.py file and no pip- or conda-installable package, as this is a work in progress, so we recommend working with a local copy of this repository (for example, by using conda develop).

git clone https://github.com/NaegleLab/ExonPTMapper
conda create -n splicing
conda activate splicing
conda develop ExonPTMapper  # requires conda-build to be installed

Downloading the necessary data

To run this pipeline, data from both ProteomeScout and Ensembl need to be downloaded. Follow the instructions below to get the appropriate data.

ProteomeScout Data

To download the data that the ProteomeScout API will work with, follow the steps below:

  1. Navigate to ProteomeScout. (This step will need to be updated once ProteomeScout is off the WashU servers.)
  2. Go to 'Downloads'
  3. Download the zip file for mammalian PTMs
  4. Extract contents of zip file to 'ps_data_dir'
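
Step 4 can also be done from Python. The following is a minimal sketch using only the standard library; the function name `extract_proteomescout_zip` and the archive file name in the usage line are hypothetical, and the actual name of the downloaded zip file depends on ProteomeScout.

```python
import zipfile
from pathlib import Path


def extract_proteomescout_zip(zip_path, ps_data_dir):
    """Extract a downloaded ProteomeScout archive into ps_data_dir.

    Hypothetical helper, not part of ExonPTMapper.
    """
    target = Path(ps_data_dir)
    # Create the target directory if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return the extracted entry names so the result can be checked by eye
    return sorted(p.name for p in target.iterdir())
```

For example, `extract_proteomescout_zip("mammalian_ptms.zip", "ps_data_dir")` would unpack the archive and list what was extracted, assuming the download was saved under that (made-up) name.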

In addition, you will need to obtain the ProteomeScoutAPI and make a local copy of that repository. Eventually this will be streamlined so that the API is downloaded automatically for use with the package. Download the API here.

Ensembl Data

From Ensembl, you will need to manually download the exon and coding sequences. To download them, use BioMart:

  1. Go to Ensembl
  2. Navigate to the BioMart tab
  3. Choose the latest Ensembl Genes version
  4. Under 'Filters', go to the gene tab. Check 'Transcript type' and select 'protein_coding'.
  5. Navigate to attributes
    1. For exon information, click sequences -> Exon sequences. Under header information, check 'Exon stable ID'. Call this file 'exon_sequences.fasta.gz'.
    2. For coding sequences, click sequences -> Coding sequences. Call this file 'coding_sequences.fasta.gz'
  6. Click 'Results' (upper left corner)
  7. Download compressed file and save in processed_data_dir
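
Before running the pipeline, it can help to confirm that the downloaded files are readable. The sketch below parses a gzip-compressed FASTA file with the standard library only; `read_fasta_gz` is a hypothetical helper, not part of ExonPTMapper. It keeps each header line as-is, since the header content depends on the attributes selected in BioMart.

```python
import gzip


def read_fasta_gz(path):
    """Return {header: sequence} for a gzip-compressed FASTA file."""
    records = {}
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: store the previous one first
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            else:
                chunks.append(line)
    # Store the final record
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Calling `len(read_fasta_gz("exon_sequences.fasta.gz"))` gives the number of records, which is a quick way to spot an empty or truncated download.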

The pipeline will automatically download any other necessary meta information about the exons, transcripts, and genes.

APPRIS (optional)

Optionally, you can append TRIFID functional scores from APPRIS to the transcript meta information by providing a file of these scores. You will need to download the scores directly from APPRIS; only the TRIFID score information is needed.

Setting up config.py

To configure ExonPTMapper, you need to tell the package where your data files and the API are located. Open config.py and change the following variables:

  1. 'api_dir': location where ProteomeScoutAPI is stored
  2. 'ps_data_dir': directory which contains the ProteomeScout data files
  3. 'source_data_dir': where data from various databases (like ensembl) is saved
  4. 'processed_data_dir': directory which will contain the processed data files created by the pipeline
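
As an illustration, the four variables might be set as follows. This is a sketch, not the contents of the actual config.py: every path is a placeholder, and `missing_directories` is a hypothetical helper added here only to catch typos in the paths before the pipeline runs.

```python
from pathlib import Path

# Placeholder values: replace each with the real location on your machine.
api_dir = "/path/to/ProteomeScoutAPI"
ps_data_dir = "/path/to/ps_data_dir"
source_data_dir = "/path/to/source_data_dir"
processed_data_dir = "/path/to/processed_data_dir"


def missing_directories(**paths):
    """Return the names of the settings whose directory does not exist."""
    return [name for name, path in paths.items() if not Path(path).is_dir()]


missing = missing_directories(
    api_dir=api_dir,
    ps_data_dir=ps_data_dir,
    source_data_dir=source_data_dir,
    processed_data_dir=processed_data_dir,
)
print("Missing directories:", missing)
```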

config.py performs several roles:

  1. Loads ProteomeScoutAPI
  2. Processes the Ensembl-to-UniProt mapping file into a usable pandas dataframe (called 'translator')
  3. Loads the 'available_transcripts.json' file, if it has been generated by processing.py previously. This file contains a list that indicates which transcripts have amino acid sequences matching those of their corresponding UniProt record.
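
The "load it only if it exists" behaviour in item 3 can be sketched as below. This is not the code in config.py: the helper name is hypothetical, and the sketch assumes the JSON file sits in processed_data_dir, which the text above does not state.

```python
import json
from pathlib import Path


def load_available_transcripts(processed_data_dir):
    """Return the saved transcript list, or None if it has not been generated.

    Assumes (unconfirmed) that the file is stored in processed_data_dir.
    """
    path = Path(processed_data_dir) / "available_transcripts.json"
    if not path.is_file():
        return None
    with path.open() as handle:
        return json.load(handle)
```

Returning `None` rather than raising an error lets later steps decide whether the list still needs to be generated.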

Mapping and Projecting PTMs onto Alternative Transcripts

Loading in data and running the entire mapping pipeline only requires the use of one master function:

from ExonPTMapper import mapping

mapper = mapping.run_mapping()

This will load any data found within the processed_data_dir indicated in the config file. Based on the available information, it will start the mapping/projection procedure at a point that is not redundant (i.e. it will not calculate/obtain any information that already exists). If you wish to start the process from scratch, you may set restart = True.
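
The idea of resuming from existing files can be illustrated with a small, generic sketch. It is not the logic inside `run_mapping`: the function `steps_to_run`, the step names, and the file names in the usage note are all hypothetical and only show how a `restart` flag could override the check for existing outputs.

```python
from pathlib import Path


def steps_to_run(processed_data_dir, step_outputs, restart=False):
    """Return the names of steps whose output file is missing.

    step_outputs maps a step name to the file that step writes
    (hypothetical names). With restart=True every step is returned,
    regardless of which files already exist.
    """
    base = Path(processed_data_dir)
    return [
        step
        for step, filename in step_outputs.items()
        if restart or not (base / filename).is_file()
    ]
```

With `step_outputs = {"exons": "exons.csv", "ptms": "ptms.csv"}` and only `exons.csv` present, the function returns `["ptms"]`; with `restart=True` it returns both step names.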

The pipeline will also automatically save new data files in the processed_data_dir.
