Dependencies
+-
+
numpy==1.26.*
+pandas==2.2.*
+gseapy==1.1.*
+tqdm==4.66.*
+seaborn==0.13.*
+biopython==1.83.*
+xlrd==2.0.*
+
diff --git a/.buildinfo b/.buildinfo new file mode 100644 index 0000000..55009ed --- /dev/null +++ b/.buildinfo @@ -0,0 +1,4 @@ +# Sphinx build info version 1 +# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. +config: c2505eda61830c7fd38f6e8ae0a85121 +tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..cad11e1 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,2 @@ + +ptm_pose/Resource_Files/ptm_coordinates.csv filter=lfs diff=lfs merge=lfs -text diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md new file mode 100644 index 0000000..76c5d2b --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -0,0 +1,34 @@ +--- +name: Bug report +about: Create a report to help us improve +title: '' +labels: '' +assignees: '' + +--- + +#### Description +A clear and concise description of what the issue is about. + +#### Screenshots +![Downhill Windmills](http://i.giphy.com/KO8AG2EByqkFi.gif) + +#### Files +A list of relevant files for this issue. This will help people navigate the project and offer some clues of where to start. + +#### To Reproduce +Steps to reproduce the behavior: +1. Go to '...' +2. Click on '....' +3. Scroll down to '....' +4. See error + +#### Expected behavior +A clear and concise description of what you expected to happen. + + +#### Tasks +Include specific tasks in the order they need to be done in. 
Include links to specific lines of code where the task should happen at, if known +- [ ] Task 1 +- [ ] Task 2 +- [ ] Task 3 diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md new file mode 100644 index 0000000..df4910a --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -0,0 +1,26 @@ +--- +name: Feature request +about: Suggest an idea for this project +title: '' +labels: '' +assignees: '' + +--- + +**Is your feature request related to a problem? Please describe.** +A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] + +**Describe the solution you'd like** +A clear and concise description of what you want to happen. + +**Describe alternatives you've considered** +A clear and concise description of any alternative solutions or features you've considered. + +#### Tasks +Include specific tasks in the order they need to be done in. Include links to specific lines of code where the task should happen at. +- [ ] Task 1 +- [ ] Task 2 +- [ ] Task 3 + +**Additional context** +Add any other context or screenshots about the feature request here. 
diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..21550d8 --- /dev/null +++ b/.gitignore @@ -0,0 +1,141 @@ +# package specific files +ptm_pose/Resource_Files/translator.csv +ptm_pose/Resource_Files/uniprot_to_gene_name.json +ptm_pose/Resource_Files/nonconstitutive_ptm_list.txt +ptm_pose/Resource_Files/background_annotations/ +Test_datasets/ +Testing.ipynb +.pybiomart.sqlite + +#sql lite(pybiomart generates) +*.sqllite + +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[cod] +*$py.class + +# C extensions +*.so + +# Distribution / packaging +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +pip-wheel-metadata/ +share/python-wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# PyInstaller +# Usually these files are written by a python script from a template +# before PyInstaller builds the exe, so as to inject date/other infos into it. +*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +*.py,cover +.hypothesis/ +.pytest_cache/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 +db.sqlite3-journal + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# IPython +profile_default/ +ipython_config.py + +# pyenv +.python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don't work, or not +# install all needed dependencies. +#Pipfile.lock + +# PEP 582; used by e.g. 
github.com/David-OConnor/pyflow +__pypackages__/ + +# Celery stuff +celerybeat-schedule +celerybeat.pid + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/Dependencies.html b/Dependencies.html new file mode 100644 index 0000000..bad7d12 --- /dev/null +++ b/Dependencies.html @@ -0,0 +1,447 @@ + + + + + + + +
+ + + +numpy==1.26.*
pandas==2.2.*
gseapy==1.1.*
tqdm==4.66.*
seaborn==0.13.*
biopython==1.83.*
xlrd==2.0.*
In this notebook, we will explore the role of ESRP1 expression in prostate cancer, where it is commonly amplified and correlated with worsened prognosis. We will obtain splicing quantification across the TCGA-PRAD cohort using data from TCGASpliceSeq, and project PTMs onto the splice events that were identified by SpliceSeq. We will then explore the ways ESRP1 expression may drive functional changes by altering PTM inclusion and flanking sequences. The analysis here corresponds to Figures 4 and 5 of our manuscript.
+This notebook is divided into the following sections: 1. Load ESRP1 expression data from CBioPortal 2. Project PTMs onto splice events and identify events that are correlated with ESRP1 expression 3. Explore the functional consequence of ESRP1-correlated PTMs
+While this is not a part of PTM-POSE, in order to explore the role of ESRP1 expression in prostate cancer, we first need to know which patients express high or low levels of ESRP1. We can get this directly through CBioPortal’s API (which requires the bravado Python package). Alternatively, you can download the data from the CBioPortal website and upload it here.
+[1]:
+
from bravado.client import SwaggerClient
+import pandas as pd
+
+#initialize swagger client
+cbioportal = SwaggerClient.from_url('https://www.cbioportal.org/api/v2/api-docs',
+ config={"validate_requests":False,"validate_responses":False,"validate_swagger_spec": False})
+
+for a in dir(cbioportal):
+ cbioportal.__setattr__(a.replace(' ', '_').lower(), cbioportal.__getattr__(a))
+
+# ESRP1 Entrez Gene ID = 54845
+gene_id = 54845
+
+#download rna sequencing data for ESRP1
+study_id = 'prad_tcga_pan_can_atlas_2018'
+expression_data = cbioportal.Molecular_Data.getAllMolecularDataInMolecularProfileUsingGET(molecularProfileId = study_id + '_rna_seq_v2_mrna',
+ sampleListId = study_id + '_all', entrezGeneId = gene_id).result()
+#extract expression data and normalize by z-score
+sample_id = [samp.sampleId for samp in expression_data]
+rsem = [samp.value for samp in expression_data]
+rsem = pd.Series(rsem, index = sample_id)
+rsem_zscore = (rsem - rsem.mean())/rsem.std()
+
+#extract high and low patients (absolute z-score > 1)
+high_patients = rsem_zscore[rsem_zscore > 1].index
+low_patients = rsem_zscore[rsem_zscore < -1].index
+
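The stratification step above can be checked on a handful of values without calling the API. The following is a minimal sketch in plain Python: the sample IDs and RSEM values are made up for illustration, and `split_by_zscore` is a hypothetical helper, not part of PTM-POSE or CBioPortal. It uses the sample standard deviation, which is also the default for pandas `Series.std()`.

```python
from statistics import mean, stdev


def split_by_zscore(values, threshold=1.0):
    """Return (high, low) lists of sample IDs whose z-score is beyond +/- threshold."""
    mu = mean(values.values())
    sigma = stdev(values.values())  # sample standard deviation (ddof = 1)
    zscores = {sample: (value - mu) / sigma for sample, value in values.items()}
    high = [sample for sample, z in zscores.items() if z > threshold]
    low = [sample for sample, z in zscores.items() if z < -threshold]
    return high, low


# Hypothetical expression values keyed by made-up sample IDs
toy_rsem = {"S1": 10.0, "S2": 12.0, "S3": 11.0, "S4": 30.0, "S5": 1.0}
high_samples, low_samples = split_by_zscore(toy_rsem)
print(high_samples, low_samples)
```

Samples with an absolute z-score of 1 or less fall into neither group, which mirrors the `> 1` and `< -1` filters used in the cell above.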
[ ]:
+
+
Here is an example of running PTM-POSE on MATS analysis of RNA sequencing data from ESRP1 knockdown experiments performed by Yang et al., 2016.
+First, let’s focus on skipped exon events.
+To identify differentially included PTMs as a result of ESRP1 knockdown, we need three pieces of information for each splice event: 1. Chromosome 2. DNA strand 3. Start and end coordinates of the event (either hg19 or hg38)
+Optionally, we can also provide: 1. Gene name 2. Event ID 3. Delta PSI for the event 4. Significance of the event
+With PTM-POSE, we need to indicate where to find this information within the splice data
+[1]:
+
import pandas as pd
+
+SE_data = pd.read_excel('../../ESRP1_data/Yang2016/esrp1_knockdown_data_Yang2016.xlsx', sheet_name='rMATS ESRP KD', header = 2).iloc[0:179]
+
+
+# required column information
+chromosome_col = 'chr'
+strand_col = 'strand'
+region_start_col = 'exonStart_0base'
+region_end_col = 'exonEnd'
+
+# optional column information (None if nothing is provided and will not be appended to the output)
+gene_col = 'geneSymbol'
+event_id_col = None #not in the data
+dPSI_col = 'meanDeltaPSI'
+sig_col = 'FDR'
+
+#look at the data
+SE_data[[gene_col, chromosome_col, strand_col, region_start_col, region_end_col, dPSI_col, sig_col]].head()
+
[1]:
+
+ | geneSymbol | +chr | +strand | +exonStart_0base | +exonEnd | +meanDeltaPSI | +FDR | +
---|---|---|---|---|---|---|---|
0 | +SPAG9 | +chr17 | +- | +49053223 | +49053262 | +0.227 | +0 | +
1 | +ARHGAP17 | +chr16 | +- | +24950684 | +24950918 | +0.413 | +0 | +
2 | +ITGA6 | +chr2 | ++ | +173366499 | +173366629 | +-0.361 | +0 | +
3 | +KRAS | +chr12 | +- | +25368370 | +25368494 | +-0.068 | +0 | +
4 | +TCIRG1 | +chr11 | ++ | +67817953 | +67818131 | +0.368 | +0 | +
The strand can be provided either as ‘+’ and ‘-’ or as 1 and -1 to indicate the forward and reverse strands; the code will convert the strand to integer format (1 or -1) when running.
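Before projecting, it can help to confirm that the required columns are present and to see what the strand conversion amounts to. The sketch below is an illustration only: `missing_columns` and `strand_to_int` are hypothetical helpers written for this example, not PTM-POSE functions, and the two events are copied from the table above.

```python
REQUIRED_COLUMNS = ("chr", "strand", "exonStart_0base", "exonEnd")


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    return [col for col in required if col not in available]


def strand_to_int(strand):
    """Map '+'/'-' (or 1/-1, given as int or string) to the integer 1 or -1."""
    text = str(strand).strip()
    if text in ("+", "1"):
        return 1
    if text in ("-", "-1"):
        return -1
    raise ValueError(f"Unrecognized strand value: {strand!r}")


# Two events copied from the table above
events = [
    {"geneSymbol": "SPAG9", "chr": "chr17", "strand": "-",
     "exonStart_0base": 49053223, "exonEnd": 49053262},
    {"geneSymbol": "ITGA6", "chr": "chr2", "strand": "+",
     "exonStart_0base": 173366499, "exonEnd": 173366629},
]

missing = missing_columns(events[0].keys())
strands = [strand_to_int(event["strand"]) for event in events]
print(missing, strands)
```

Raising on unexpected strand values, rather than guessing, makes malformed rows visible before any coordinates are compared.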
+If this is the first time running PTM-POSE, you will need to download ptm_coordinates. Setting save = True stores the coordinates locally (about 60 MB) so that you do not need to download them again; set save = False if you prefer not to save them.
+[3]:
+
from ptm_pose import pose_config
+pose_config.ptm_coordinates = pose_config.download_ptm_coordinates(save = True)
+
We can then use the project module of PTM-POSE to identify PTMs that can be found in these regions. This dataset uses the hg19 genome build, so we need to specify this using the ‘coordinate_type’ parameter.
+[2]:
+
from ptm_pose import project
+
+splice_data, spliced_ptms = project.project_ptms_onto_splice_events(SE_data, chromosome_col = chromosome_col, strand_col = strand_col, region_start_col = region_start_col, region_end_col = region_end_col, gene_col = gene_col, event_id_col = event_id_col, dPSI_col = dPSI_col, sig_col = sig_col, coordinate_type = 'hg19')
+
+Translator file not found. Downloading mapping information between UniProt and Gene Names from pybiomart
+
+Projecting PTMs onto splice events using hg19 coordinates.: 100%|██████████| 179/179 [00:03<00:00, 48.82it/s]
+
+PTMs projection successful (475 identified).
+
+
+
+
From this, there are two outputs: 1. The original splice dataframe with additional PTM information added
+[4]:
+
splice_data[[gene_col, chromosome_col, strand_col, region_start_col, region_end_col, dPSI_col, sig_col] + ['PTMs', 'Number of PTMs Affected', 'Number of Unique PTM Sites by Position', 'Event Length', 'PTM Density (PTMs/bp)']].head()
+
[4]:
+
+ | geneSymbol | +chr | +strand | +exonStart_0base | +exonEnd | +meanDeltaPSI | +FDR | +PTMs | +Number of PTMs Affected | +Number of Unique PTM Sites by Position | +Event Length | +PTM Density (PTMs/bp) | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +SPAG9 | +17 | +- | +49053223 | +49053262 | +0.227 | +0 | +NaN | +0 | +0 | +39 | +0.0 | +
1 | +ARHGAP17 | +16 | +- | +24950684 | +24950918 | +0.413 | +0 | +Q68EM7_S575.0 (Phosphorylation)/Q68EM7_S570.0 ... | +6 | +1 | +234 | +0.004274 | +
2 | +ITGA6 | +2 | ++ | +173366499 | +173366629 | +-0.361 | +0 | +P23229_Ynan (Phosphorylation)/P23229_Tnan (Pho... | +7 | +4 | +130 | +0.030769 | +
3 | +KRAS | +12 | +- | +25368370 | +25368494 | +-0.068 | +0 | +P01116_C186 (Methylation)/P01116_C180 (Palmito... | +3 | +2 | +124 | +0.016129 | +
4 | +TCIRG1 | +11 | ++ | +67817953 | +67818131 | +0.368 | +0 | +NaN | +0 | +0 | +178 | +0.0 | +
2. A new dataframe in which each PTM, along with additional information about it, has its own row
[5]:
+
spliced_ptms.head()
+
[5]:
+
+ | dPSI | +Significance | +Gene | +Source of PTM | +UniProtKB Accession | +Residue | +PTM Position in Canonical Isoform | +Gene Location (hg19) | +Modification | +Modification Class | +Proximity to Region Start (bp) | +Proximity to Region End (bp) | +Proximity to Splice Boundary (bp) | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +0.413 | +0.0 | +ARHGAP17 | +Q68EM7-1_S575 | +Q68EM7 | +S | +575.0 | +24950686.0 | +Phosphoserine | +Phosphorylation | +2.0 | +232.0 | +2.0 | +
1 | +0.413 | +0.0 | +ARHGAP17 | +Q68EM7-1_S570 | +Q68EM7 | +S | +570.0 | +24950701.0 | +Phosphoserine | +Phosphorylation | +17.0 | +217.0 | +17.0 | +
2 | +0.413 | +0.0 | +ARHGAP17 | +Q68EM7-1_S560 | +Q68EM7 | +S | +560.0 | +24950731.0 | +Phosphoserine | +Phosphorylation | +47.0 | +187.0 | +47.0 | +
3 | +0.413 | +0.0 | +ARHGAP17 | +Q68EM7-1_S553 | +Q68EM7 | +S | +553.0 | +24950752.0 | +Phosphoserine | +Phosphorylation | +68.0 | +166.0 | +68.0 | +
4 | +0.413 | +0.0 | +ARHGAP17 | +Q68EM7-1_S547 | +Q68EM7 | +S | +547.0 | +24950770.0 | +Phosphoserine | +Phosphorylation | +86.0 | +148.0 | +86.0 | +
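For the rows shown in the two outputs, the numeric columns appear to follow directly from the coordinates: 'Event Length' equals the region end minus the region start, 'PTM Density (PTMs/bp)' equals the number of unique PTM sites divided by the event length, and the proximity columns are the distances from the PTM to each end of the region. The sketch below reproduces the ARHGAP17 values under that reading. The helper names are hypothetical, and the relationships are inferred from the displayed values rather than taken from PTM-POSE's source.

```python
def event_length(start, end):
    """Length of the spliced region in base pairs."""
    return end - start


def ptm_density(n_unique_sites, start, end):
    """Unique PTM sites per base pair of the spliced region."""
    return n_unique_sites / event_length(start, end)


def proximities(ptm_location, start, end):
    """Distance to region start, to region end, and to the nearer boundary."""
    to_start = ptm_location - start
    to_end = end - ptm_location
    return to_start, to_end, min(to_start, to_end)


# ARHGAP17 event and its first PTM (Q68EM7 S575), values from the tables above
start, end = 24950684, 24950918
length = event_length(start, end)
density = ptm_density(1, start, end)
prox = proximities(24950686, start, end)
print(length, round(density, 6), prox)
```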
For MATS data, there is also a built-in function that runs PTM-POSE on all event types:
+In addition to differential inclusion of PTMs, some PTMs may experience altered flanking sequences. We can use the flanking_sequences module of PTM-POSE to identify PTMs for which this happens. You will need to provide the same layers of information, plus the genomic coordinates of the regions flanking the spliced region.
+[12]:
+
from ptm_pose import flanking_sequences
+
+first_flank_start_col = 'firstFlankingES'
+first_flank_end_col='firstFlankingEE'
+second_flank_start_col = 'secondFlankingES'
+second_flank_end_col = 'secondFlankingEE'
+
+flanks = flanking_sequences.get_flanking_changes_from_splice_data(SE_data, chromosome_col = chromosome_col, strand_col = strand_col, first_flank_start_col = first_flank_start_col, first_flank_end_col=first_flank_end_col, second_flank_start_col = second_flank_start_col, second_flank_end_col = second_flank_end_col , spliced_region_start_col = region_start_col, spliced_region_end_col = region_end_col, dPSI_col=dPSI_col, sig_col = sig_col, event_id_col = event_id_col, coordinate_type = 'hg19')
+
+c:\Users\Sam\miniconda3\envs\testing_pose\Lib\site-packages\Bio\pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
+ warnings.warn(
+
[3]:
+
flanks.head()
+
[3]:
+
+ | Event ID | +Source of PTM | +Residue | +PTM Position in Canonical Isoform | +Inclusion Sequence | +Exclusion Sequence | +Region | +Translation Success | +Matched | +
---|---|---|---|---|---|---|---|---|---|
0 | +3 | +P01116-2_T148;P01116-1_T148 | +T | +148 | +ETSAKtRQESG | +ETSAKtRQGC* | +Second | +True | +False | +
1 | +3 | +P01116-1_K147;P01116-2_K147 | +K | +147 | +IETSAkTRQES | +IETSAkTRQGC | +Second | +True | +False | +
0 | +8 | +Q9UPQ0-1_S746 | +S | +746 | +LPNLNsQGVAW | +LPNLNsQGGFS | +First | +True | +False | +
1 | +8 | +Q9UPQ0-10_S750;Q9UPQ0-6_S596;Q9UPQ0-1_S750 | +S | +750 | +PSQVDsPSSEK | +ILKVDsPSSEK | +Second | +True | +False | +
0 | +11 | +P62847-1_K129 | +K | +NaN | +NVGAGkKSVSW | +NVGAGkKAEGV | +First | +True | +False | +
We can also do additional comparisons, such as comparing sequence identity and looking for matching ELM motifs.
+[ ]:
+
flanks = flanking_sequences.compare_flanking_sequences(flanks)
+flanks = flanking_sequences.compare_inclusion_motifs(flanks)
+flanks[['Source of PTM','Sequence Identity', 'Altered Positions','Residue Changes', 'Altered Flank Side', 'Motif only in Inclusion', 'Motif only in Exclusion']].head()
+
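To make the idea of comparing flanking sequences concrete, here is a minimal position-by-position comparison of an inclusion and an exclusion sequence. This is an assumption-laden sketch: `compare_flanks` is a hypothetical helper, it requires equal-length sequences centered on the modified residue, and it does not use alignment, so its numbers need not match what PTM-POSE reports. The example sequences are taken from the flanks table above.

```python
def compare_flanks(inclusion, exclusion):
    """Compare two equal-length flanking sequences centered on the PTM residue.

    Returns (identity, altered_offsets, side), where offsets are relative to
    the central residue (negative = N-terminal, positive = C-terminal).
    """
    if len(inclusion) != len(exclusion):
        raise ValueError("Sequences must have the same length")
    center = len(inclusion) // 2
    altered = [
        i - center
        for i, (a, b) in enumerate(zip(inclusion, exclusion))
        if a != b
    ]
    identity = 1 - len(altered) / len(inclusion)
    if not altered:
        side = None
    elif all(offset > 0 for offset in altered):
        side = "C-term"
    elif all(offset < 0 for offset in altered):
        side = "N-term"
    else:
        side = "Both"
    return identity, altered, side


# KRAS T148 (event 3) and Q9UPQ0 S750 (event 8) from the table above
kras = compare_flanks("ETSAKtRQESG", "ETSAKtRQGC*")
limch = compare_flanks("PSQVDsPSSEK", "ILKVDsPSSEK")
print(kras)
print(limch)
```

In the KRAS example the three changed positions all sit downstream of the modified threonine, while in the second example they sit upstream of the serine.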
Once we have PTMs impacted by splicing, we can also annotate them with additional information. This can be done using the annotate module of PTM-POSE, and can be used with outputs from either the project module (differentially included PTMs) or the flanking_sequences module (PTMs with altered flanking sequences).
+Currently, there are functions for appending information from: 1. PhosphoSitePlus (function, biological process, disease association, interactions, and kinase-substrate) 2. PTMsigDB (iKiP db, perturbations) 3. RegPhos (kinase-substrate) 4. PTMcode (inter- and intraprotein interactions) 5. PTMInt (interactions) 6. DEPOD (phosphatase-substrate) 7. ELM (interactions, motifs)
+[3]:
+
from ptm_pose import annotate
+
+#where to find PhosphoSitePlus data
+psp_regulatory_file = '/PhosphoSitePlus/Regulatory_sites.gz'
+psp_disease_file = '/PhosphoSitePlus/Disease-associated_sites.gz'
+psp_kinase_file = '/Database_Information/PhosphoSitePlus/Kinase_Substrate_Dataset.gz'
+
+#where to find ELM data
+
+#PhosphoSitePlus data (due to licensing restrictions, must be downloaded manually from PhosphoSitePlus and the file path provided)
+spliced_ptms = annotate.add_PSP_regulatory_site_data(spliced_ptms, psp_regulatory_file)
+spliced_ptms = annotate.add_PSP_disease_association(spliced_ptms, psp_disease_file)
+spliced_ptms = annotate.add_PSP_kinase_substrate_data(spliced_ptms, psp_kinase_file)
+
+#ELM interactions (will be faster if file is downloaded manually from ELM and the file path provided)
+spliced_ptms = annotate.add_ELM_interactions(spliced_ptms)
+
+#PTMint interactions
+spliced_ptms = annotate.add_PTMint_data(spliced_ptms)
+
+#PTMcode interactions (will be faster/more reliable if file is downloaded manually from PTMcode and the file path provided)
+ptm_code_interprotein = '/PTMcode2_associations_between_proteins.txt.gz'
+
+#DEPOD phosphatase data
+spliced_ptms = annotate.add_DEPOD_phosphatase_data(spliced_ptms)
+
+#RegPhos data
+spliced_ptms = annotate.add_RegPhos_data(spliced_ptms)
+
+#annotate ptms
+spliced_ptms = annotate.annotate_ptms(spliced_ptms)
+
+PhosphoSitePlus regulatory_site information added:
+ ->6 PTMs in dataset found associated with a molecular function
+ ->7 PTMs in dataset found associated with a biological process
+ ->2 PTMs in dataset found associated with a protein interaction
+PhosphoSitePlus disease associations added: 1 PTM sites in dataset found associated with a disease in PhosphoSitePlus
+PhosphoSitePlus kinase-substrate interactions added: 6 phosphorylation sites in dataset found associated with a kinase in PhosphoSitePlus
+ELM interaction instances added: 1 PTMs in dataset found associated with at least one known ELM instance
+PTMInt data added: 2 PTMs in dataset found with PTMInt interaction information
+PTMcode interprotein interactions added: 27 PTMs in dataset found with PTMcode interprotein interaction information
+DEPOD Phosphatase substrates added: 0 PTMs in dataset found with Phosphatase substrate information
+RegPhos kinase-substrate data added: 3 PTMs in dataset found with kinase-substrate information
+
+c:\Users\Sam\miniconda3\envs\testing_pose\Lib\site-packages\ptm_pose\annotate.py:558: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
+ regphos = pd.read_csv('http://140.138.144.141/~RegPhos/download/RegPhos_Phos_human.txt', sep = '\t')
+
Once we have all of this information, we can start to assess how PTMs are impacted by splicing. Let’s first get an idea of how many PTMs have annotations from each of the various sources.
+[6]:
+
from ptm_pose import analyze
+
+analyze.show_available_annotations(spliced_ptms, figsize = (5,5))
+
Several PTMs have previously been annotated with specific biological processes; let’s take a look at those:
+[10]:
+
annotations, annotation_counts = analyze.get_ptm_annotations(spliced_ptms, annotation_type = 'Process', database = 'PhosphoSitePlus')
+annotations.head()
+
[10]:
+
+ | Gene | +UniProtKB Accession | +Residue | +PTM Position in Canonical Isoform | +Modification Class | +PSP:ON_PROCESS | +
---|---|---|---|---|---|---|
145 | +CEACAM1 | +P13688 | +S | +461.0 | +Phosphorylation | +apoptosis, altered | +
184 | +YAP1 | +P46937 | +K | +342.0 | +Ubiquitination | +carcinogenesis, altered | +
217 | +TSC2 | +P49815 | +S | +981.0 | +Phosphorylation | +carcinogenesis, inhibited; cell growth, inhibi... | +
395 | +SPHK2 | +Q9NRA0 | +S | +387.0 | +Phosphorylation | +cell motility, altered | +
407 | +SPHK2 | +Q9NRA0 | +T | +614.0 | +Phosphorylation | +cell motility, altered | +
[11]:
+
annotation_counts
+
[11]:
+
+ | PSP:ON_PROCESS | +count | +
---|---|---|
0 | +cell motility, altered | +3 | +
1 | +cell growth, induced | +2 | +
2 | +apoptosis, altered | +1 | +
3 | +carcinogenesis, altered | +1 | +
4 | +carcinogenesis, inhibited | +1 | +
5 | +cell growth, inhibited | +1 | +
6 | +autophagy, inhibited | +1 | +
7 | +signaling pathway regulation | +1 | +
8 | +cytoskeletal reorganization | +1 | +
9 | +cell adhesion, inhibited | +1 | +
Below you will find different ways you might choose to analyze the PTMs identified by PTM-POSE:
+[59]:
+
from ptm_pose import plots as pose_plots
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+altered_flanks = pd.read_csv('altered_flanks.csv')
+
[70]:
+
pose_plots.modification_breakdown(spliced_ptms = spliced_ptms, altered_flanks = altered_flanks)
+
[1]:
+
from ptm_pose import analyze
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+altered_flanks = pd.read_csv('altered_flanks.csv')
+combined_output = analyze.combine_outputs(spliced_ptms, altered_flanks)
+
+Some annotations in spliced ptms dataframe not found in altered flanks dataframe. These annotations will be ignored. To avoid this, make sure to add annotations to both dataframes, or annotate the combined dataframe.
+
[3]:
+
analyze.show_available_annotations(spliced_ptms)
+
+---------------------------------------------------------------------------
+AttributeError Traceback (most recent call last)
+Cell In[3], line 3
+ 1 from ptm_pose import plots as pose_plots
+----> 3 analyze.show_available_annotations()
+
+AttributeError: module 'ptm_pose.analyze' has no attribute 'show_available_annotations'
+
Often, we will want to dig deeper into the specific functions, processes, interactions, etc. associated with the proteins in our dataset. First, we can look at the annotations currently available for analysis, based on annotations that have been appended using the annotate module:
+[4]:
+
from ptm_pose import analyze
+from ptm_pose import plots as pose_plots
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+altered_flanks = pd.read_csv('altered_flanks.csv')
+combined_output = analyze.combine_outputs(spliced_ptms, altered_flanks)
+
+annot_categories = analyze.get_annotation_categories(combined_output)
+annot_categories
+
+Some annotations in spliced ptms dataframe not found in altered flanks dataframe. These annotations will be ignored. To avoid this, make sure to add annotations to both dataframes, or annotate the combined dataframe.
+
[4]:
+
+ | database | +annotation_type | +column | +
---|---|---|---|
5 | +Combined | +Interactions | +Combined:Interactions | +
8 | +Combined | +Kinase | +Combined:Kinase | +
1 | +DEPOD | +Phosphatase | +DEPOD:Phosphatase | +
2 | +ELM | +Interactions | +ELM:Interactions | +
0 | +PhosphoSitePlus | +Interactions | +PSP:ON_PROT_INTERACT | +
3 | +PhosphoSitePlus | +Disease | +PSP:Disease_Association | +
4 | +PhosphoSitePlus | +Process | +PSP:ON_PROCESS | +
6 | +PhosphoSitePlus | +Function | +PSP:ON_FUNCTION | +
7 | +RegPhos | +Kinase | +RegPhos:Kinase | +
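The table above is effectively a lookup from a (database, annotation type) pair to the column that holds that annotation. As a small illustration, the mapping can be written out as a dictionary and queried directly. The dictionary is transcribed from the table, while `find_column` and `databases_with` are hypothetical helpers for this example and not part of PTM-POSE.

```python
# (database, annotation_type) -> column name, transcribed from the table above
ANNOTATION_COLUMNS = {
    ("Combined", "Interactions"): "Combined:Interactions",
    ("Combined", "Kinase"): "Combined:Kinase",
    ("DEPOD", "Phosphatase"): "DEPOD:Phosphatase",
    ("ELM", "Interactions"): "ELM:Interactions",
    ("PhosphoSitePlus", "Interactions"): "PSP:ON_PROT_INTERACT",
    ("PhosphoSitePlus", "Disease"): "PSP:Disease_Association",
    ("PhosphoSitePlus", "Process"): "PSP:ON_PROCESS",
    ("PhosphoSitePlus", "Function"): "PSP:ON_FUNCTION",
    ("RegPhos", "Kinase"): "RegPhos:Kinase",
}


def find_column(database, annotation_type):
    """Return the column for a database/annotation pair, or None if unavailable."""
    return ANNOTATION_COLUMNS.get((database, annotation_type))


def databases_with(annotation_type):
    """List the databases that provide a given annotation type."""
    return sorted(db for db, kind in ANNOTATION_COLUMNS if kind == annotation_type)


process_column = find_column("PhosphoSitePlus", "Process")
kinase_sources = databases_with("Kinase")
print(process_column, kinase_sources)
```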
This will tell us what database information is available, the types of information from that database, and the column associated with that information. Let’s take a closer look at the biological process information from PhosphoSitePlus:
+[5]:
+
ptms_with_annotation, annotation_counts = analyze.get_ptm_annotations(spliced_ptms, database = "PhosphoSitePlus", annotation_type = 'Process')
+print('Specific PTMs with annotation:')
+ptms_with_annotation
+
+Specific PTMs with annotation:
+
[5]:
+
+ | Gene | +UniProtKB Accession | +Residue | +PTM Position in Canonical Isoform | +Modification Class | +PSP:ON_PROCESS | +dPSI | +Significance | +Impact | +
---|---|---|---|---|---|---|---|---|---|
0 | +BCAR1 | +P56945 | +Y | +267.0 | +Phosphorylation | +cell growth, induced | +-0.07 | +0.0458775672499 | +Excluded | +
1 | +BCAR1 | +P56945 | +Y | +287.0 | +Phosphorylation | +cell growth, induced | +-0.07 | +0.0458775672499 | +Excluded | +
2 | +BIN1 | +O00499 | +T | +348.0 | +Phosphorylation | +signaling pathway regulation | +-0.112 | +0.0233903490744 | +Excluded | +
3 | +CEACAM1 | +P13688 | +S | +461.0 | +Phosphorylation | +apoptosis, altered | +0.525 | +1.73943268451e-09 | +Included | +
4 | +CTTN | +Q14247 | +K | +272.0 | +Acetylation | +cell motility, inhibited | +0.09 | +0.0355211287599 | +Included | +
5 | +CTTN | +Q14247 | +S | +298.0 | +Phosphorylation | +cell motility, altered; cytoskeletal reorganiz... | +0.09 | +0.0355211287599 | +Included | +
6 | +SPHK2 | +Q9NRA0 | +S | +387.0 | +Phosphorylation | +cell motility, altered | +0.253 | +0.0129400018182 | +Included | +
7 | +SPHK2 | +Q9NRA0 | +T | +614.0 | +Phosphorylation | +cell motility, altered | +0.253 | +0.0129400018182 | +Included | +
8 | +TSC2 | +P49815 | +S | +981.0 | +Phosphorylation | +carcinogenesis, inhibited; cell growth, inhibi... | +-0.219 | +4.18472157275e-05 | +Excluded | +
9 | +YAP1 | +P46937 | +K | +342.0 | +Ubiquitination | +carcinogenesis, altered | +-0.188;-0.161 | +0.000211254197372;4.17884655686e-07 | +Excluded | +
From this, we note a total of 10 impacted PTMs from 7 genes that have biological process information available. While we could manually scan the table for common processes, we can also inspect the annotation counts object to see the most common processes, including a breakdown by the type of impact (included [dPSI > 0], excluded [dPSI < 0], or altered flanking sequence):
+[6]:
+
print('Number of PTMs associated with each annotation:')
+annotation_counts
+
+Number of PTMs associated with each annotation:
+
[6]:
+
+ | All Impacted | +Included | +Excluded | +Altered Flank | +
---|---|---|---|---|
PSP:ON_PROCESS | ++ | + | + | + |
cell motility, altered | +3 | +3.0 | +0.0 | +0.0 | +
cell growth, induced | +2 | +0.0 | +2.0 | +0.0 | +
signaling pathway regulation | +2 | +0.0 | +2.0 | +0.0 | +
apoptosis, altered | +1 | +1.0 | +0.0 | +0.0 | +
cell motility, inhibited | +1 | +1.0 | +0.0 | +0.0 | +
cytoskeletal reorganization | +1 | +1.0 | +0.0 | +0.0 | +
cell adhesion, inhibited | +1 | +1.0 | +0.0 | +0.0 | +
carcinogenesis, inhibited | +1 | +0.0 | +1.0 | +0.0 | +
cell growth, inhibited | +1 | +0.0 | +1.0 | +0.0 | +
autophagy, inhibited | +1 | +0.0 | +1.0 | +0.0 | +
carcinogenesis, altered | +1 | +0.0 | +1.0 | +0.0 | +
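As a rough illustration of how a count table like the one above can be derived, the following pandas sketch splits a semicolon-separated annotation column and tallies PTMs per term and impact. The column names mirror the output above, but the toy rows and the `count_annotations` helper are hypothetical; this is not PTM-POSE's own implementation.

```python
import pandas as pd

# Toy rows modeled on the table above (illustrative values only)
ptms = pd.DataFrame({
    "Gene": ["CTTN", "CTTN", "SPHK2", "BCAR1"],
    "PSP:ON_PROCESS": [
        "cell motility, inhibited",
        "cell motility, altered; cytoskeletal reorganization",
        "cell motility, altered",
        "cell growth, induced",
    ],
    "Impact": ["Included", "Included", "Included", "Excluded"],
})

def count_annotations(df, column):
    """Hypothetical helper: count PTMs per annotation term, split by impact."""
    # One row per (PTM, term): split on ';', then strip stray whitespace
    exploded = (
        df.assign(term=df[column].str.split(";"))
        .explode("term")
        .reset_index(drop=True)
    )
    exploded["term"] = exploded["term"].str.strip()
    # Cross-tabulate terms against impact type and add an overall total
    counts = pd.crosstab(exploded["term"], exploded["Impact"])
    counts.insert(0, "All Impacted", counts.sum(axis=1))
    return counts.sort_values("All Impacted", ascending=False)

counts = count_annotations(ptms, "PSP:ON_PROCESS")
print(counts)
```

A PTM carrying several terms contributes to each of them, which is why the row totals can exceed the number of PTMs.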
Finally, you may prefer to visualize this information as a figure. Here, we can plot the top 10 most common biological processes for the included, excluded, and altered flanking sequence impacts. Notably, we can plot either the annotations as output above (which include the directionality of the PTM role) or collapse this information into similar groups (e.g. “cell motility, altered” and “cell motility, inhibited” would be grouped as “cell motility”). Here, we will plot the full information on the left and the collapsed information on the right:
+[10]:
+
import matplotlib.pyplot as plt
+
+fig, ax = plt.subplots(ncols = 2, figsize = (6, 3))
+fig.subplots_adjust(wspace = 2)
+pose_plots.plot_annotations(combined_output, ax = ax[0], collapse_on_similar = False, database = 'PhosphoSitePlus', annot_type = 'Process', top_terms = 10)
+ax[0].set_title('Full Annotation')
+pose_plots.plot_annotations(combined_output, ax = ax[1], collapse_on_similar = True, database = 'PhosphoSitePlus', annot_type = 'Process', top_terms = 10)
+ax[1].set_title('Collapsed Annotation')
+
[10]:
+
+Text(0.5, 1.0, 'Collapsed Annotation')
+
Of note, you can also choose to only show collapsed annotation information for analyze.get_ptm_annotations() by setting collapse_on_similar=True in the function call, as we have done for the plot on the right.
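To make the idea of collapsing concrete, here is a minimal sketch that assumes the directionality is whatever follows the first comma in a term, as in the PhosphoSitePlus process annotations shown above. The `collapse_annotation` helper is hypothetical and only illustrates the grouping; the library's own rules may differ.

```python
def collapse_annotation(term):
    """Hypothetical helper: keep the text before the first comma."""
    return term.split(",")[0].strip()

terms = [
    "cell motility, altered",
    "cell motility, inhibited",
    "cell growth, induced",
    "signaling pathway regulation",
]

# Terms that differ only in directionality fall into the same group
collapsed = sorted({collapse_annotation(t) for t in terms})
print(collapsed)
```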
In some cases, you may want to identify PTM-specific annotations that appear more commonly than might be expected based on how often the annotation appears across the entire proteome. We have provided a function to perform this analysis, analyze.annotation_enrichment(). By default, this function will compare the annotations found in your data to the annotations found in the entire proteome (based on the ptm_coordinates dataframe), but you can also choose to perform enrichment analysis by significance. Here, we will perform enrichment analysis using the entire proteome as the background. First, let’s look at the available annotations for enrichment analysis:
[3]:
+
from ptm_pose import analyze
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+altered_flanks = pd.read_csv('altered_flanks.csv')
+combined_output = analyze.combine_outputs(spliced_ptms, altered_flanks)
+
+annot_categories = analyze.get_annotation_categories(combined_output)
+annot_categories
+
+Some annotations in spliced ptms dataframe not found in altered flanks dataframe. These annotations will be ignored. To avoid this, make sure to add annotations to both dataframes, or annotate the combined dataframe.
+
[3]:
+
+ | database | +annotation_type | +column | +
---|---|---|---|
4 | +Combined | +Interactions | +Combined:Interactions | +
5 | +Combined | +Kinase | +Combined:Kinase | +
2 | +DEPOD | +Phosphatase | +DEPOD:Phosphatase | +
3 | +ELM | +Interactions | +ELM:Interactions | +
0 | +PhosphoSitePlus | +Process | +PSP:ON_PROCESS | +
1 | +PhosphoSitePlus | +Interactions | +PSP:ON_PROT_INTERACT | +
6 | +PhosphoSitePlus | +Disease | +PSP:Disease_Association | +
8 | +PhosphoSitePlus | +Function | +PSP:ON_FUNCTION | +
7 | +RegPhos | +Kinase | +RegPhos:Kinase | +
We would like to know if the PTMs have been implicated in any biological processes more than expected by chance. We can perform enrichment analysis on the biological process annotations from PhosphoSitePlus. To maximize the ability of the hypergeometric test to capture these results, we will use the collapsed annotation information (ignores directionality of PTM role):
+[4]:
+
enrichment = analyze.annotation_enrichment(combined_output, database = 'PhosphoSitePlus', annotation_type = 'Process', collapse_on_similar=True)
+enrichment
+
+Using pregenerated background information on all PTMs in the proteome.
+
[4]:
+
+ | Fraction Impacted | +p-value | +Adjusted p-value | +PTM | +
---|---|---|---|---|
PSP:ON_PROCESS | ++ | + | + | + |
cell motility | +5/1078 | +0.052579 | +0.420633 | +ABI1_S392;CTTN_K272;CTTN_S298;SPHK2_S387;SPHK2... | +
cell adhesion | +2/324 | +0.122466 | +0.489864 | +CTTN_S298;MPZL1_Y241 | +
cell growth | +4/1793 | +0.427134 | +1.000000 | +BCAR1_Y267;BCAR1_Y287;BCAR1_Y306;TSC2_S981 | +
autophagy | +1/306 | +0.434215 | +0.868429 | +TSC2_S981 | +
cytoskeletal reorganization | +2/796 | +0.435637 | +0.868429 | +ABI1_S392;CTTN_S298 | +
apoptosis | +2/1179 | +0.644065 | +0.868429 | +CEACAM1_S461;CEACAM1_T457 | +
signaling pathway regulation | +2/1206 | +0.656208 | +0.868429 | +BIN1_T348;TSC2_S981 | +
carcinogenesis | +2/1501 | +0.768091 | +0.868429 | +TSC2_S981;YAP1_K342 | +
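For readers unfamiliar with the hypergeometric test mentioned above, the sketch below computes an upper-tail p-value with the standard library only. The function name and the foreground and background sizes in the example are assumptions chosen for illustration; only the 5 and 1078 are taken from the "cell motility" row, so the printed value is not expected to reproduce the table.

```python
from math import comb

def hypergeom_pvalue(k, n_drawn, n_annotated, n_total):
    """P(X >= k) when drawing n_drawn PTMs without replacement from n_total,
    of which n_annotated carry the annotation of interest."""
    upper = min(n_drawn, n_annotated)
    # Sum the probability mass for k, k+1, ..., upper annotated PTMs
    tail = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_total, n_drawn)

# 5 impacted PTMs out of 1078 proteome-wide PTMs annotated with "cell motility".
# The foreground size (100) and background size (20000) are hypothetical.
p = hypergeom_pvalue(k=5, n_drawn=100, n_annotated=1078, n_total=20000)
print(f"p-value: {p:.4f}")
```

Testing many annotations at once calls for a multiple-testing correction, which is what the "Adjusted p-value" column reports.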
We can also plot the annotations and include which annotations are enriched (p-value < 0.05) in the plot:
+[ ]:
+
print('not yet implemented')
+
In addition to looking at the annotations associated with the PTMs, we can also look at the genes that contain impacted PTMs. We can perform gene set enrichment analysis using the EnrichR module of gseapy to identify whether any gene sets are enriched in the PTM dataset, as well as break the results down by the type of modification. Here, we will use the analyze.gene_set_enrichment() function to perform this analysis. First, we load the spliced PTM and altered flank data and combine them:
[1]:
+
from ptm_pose import analyze
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+altered_flanks = pd.read_csv('altered_flanks.csv')
+combined_output = analyze.combine_outputs(spliced_ptms, altered_flanks)
+
+Some annotations in spliced ptms dataframe not found in altered flanks dataframe. These annotations will be ignored. To avoid this, make sure to add annotations to both dataframes, or annotate the combined dataframe.
+
[2]:
+
enrichr_results = analyze.gene_set_enrichment(combined = combined_output, gene_sets = ['GO_Biological_Process_2023', 'Reactome_2022'])
+
[3]:
+
enrichr_results.head()
+
[3]:
+
+ | Gene_set | +Term | +Overlap | +P-value | +Adjusted P-value | +Old P-value | +Old Adjusted P-value | +Odds Ratio | +Combined Score | +Genes | +Type | +Genes with Differentially Included PTMs only | +Genes with PTM with Altered Flanking Sequence only | +Genes with Both | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +GO_Biological_Process_2023 | +Regulation Of Neurogenesis (GO:0050767) | +5/67 | +0.000018 | +0.011675 | +0 | +0 | +17.392181 | +189.619722 | +YAP1;APLP2;DOCK7;NUMB;NF2 | +Differentially Included + Altered Flanking Seq... | +YAP1;APLP2 | +NF2 | +DOCK7;NUMB | +
1 | +GO_Biological_Process_2023 | +Enzyme-Linked Receptor Protein Signaling Pathw... | +6/124 | +0.000031 | +0.011675 | +0 | +0 | +11.055131 | +114.642865 | +CSF1;FGFR3;FGFR2;PTPRF;BCAR1;MPZL1 | +Differentially Included + Altered Flanking Seq... | +FGFR2;CSF1;FGFR3 | ++ | MPZL1;BCAR1;PTPRF | +
2 | +GO_Biological_Process_2023 | +Protein Localization To Cell-Cell Junction (GO... | +3/15 | +0.000048 | +0.011675 | +0 | +0 | +52.901596 | +525.813416 | +TJP1;LSR;SCRIB | +Differentially Included + Altered Flanking Seq... | ++ | LSR | +SCRIB;TJP1 | +
3 | +GO_Biological_Process_2023 | +Regulation Of Cell Migration (GO:0030334) | +10/434 | +0.000049 | +0.011675 | +0 | +0 | +5.280579 | +52.425684 | +TJP1;CEACAM1;CSF1;ADAM15;LIMCH1;APLP2;NUMB;ITG... | +Differentially Included + Altered Flanking Seq... | +APLP2;CSF1;ITGA6 | +NF2 | +ADAM15;NUMB;LIMCH1;BCAR1;TJP1;CEACAM1 | +
4 | +GO_Biological_Process_2023 | +Integrin-Mediated Signaling Pathway (GO:0007229) | +5/85 | +0.000058 | +0.011675 | +0 | +0 | +13.466712 | +131.282293 | +CEACAM1;ADAM15;ITGA6;CD47;BCAR1 | +Differentially Included + Altered Flanking Seq... | +ITGA6;CD47 | ++ | ADAM15;CEACAM1;BCAR1 | +
The result is the standard output of gseapy, with additional columns listing the specific genes in each gene set that have differentially included PTM sites, PTM sites with altered flanking sequences, or both. We can also plot the output of the gene set enrichment analysis:
+[4]:
+
from ptm_pose import plots as pose_plots
+
+pose_plots.plot_EnrichR_pies(enrichr_results, top_terms = 10)
+
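The three gene columns in the EnrichR output above partition each term's genes by the type of impact. A minimal sketch of that partition with Python sets follows; the `split_genes_by_impact` helper and its labels are hypothetical, and the gene lists are taken from the first row of the table above.

```python
def split_genes_by_impact(term_genes, included_genes, flank_genes):
    """Hypothetical helper: partition a term's genes by the type of PTM impact."""
    term = set(term_genes)
    included = term & set(included_genes)
    flank = term & set(flank_genes)
    return {
        "Differentially included only": sorted(included - flank),
        "Altered flank only": sorted(flank - included),
        "Both": sorted(included & flank),
    }

# Genes from the "Regulation Of Neurogenesis" row above
term_genes = ["YAP1", "APLP2", "DOCK7", "NUMB", "NF2"]
genes_with_included_ptms = ["YAP1", "APLP2", "DOCK7", "NUMB"]
genes_with_altered_flanks = ["NF2", "DOCK7", "NUMB"]

groups = split_genes_by_impact(term_genes, genes_with_included_ptms, genes_with_altered_flanks)
print(groups)
```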
[1]:
+
from ptm_pose import analyze
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+
[10]:
+
interaction_graph, network_data = analyze.get_interaction_network(spliced_ptms, node_type = 'Gene')
+network_stats = analyze.get_interaction_stats(interaction_graph)
+
+PhosphoSitePlus regulatory site data found and added
+Combined kinase-substrate data found and added
+PTMInt data found and added
+ELM data found and added
+
[20]:
+
network_data.head()
+
[20]:
+
+ | Modified Gene | +Interacting Gene | +Residue | +Type | +Source | +dPSI | +Regulation Change | +
---|---|---|---|---|---|---|---|
0 | +ADAM15 | +HCK | +Y735;Y715 | +REGULATES | +PSP/RegPhos | +0.181;-0.052 | ++;- | +
1 | +ADAM15 | +LCK | +Y715 | +REGULATES | +PSP/RegPhos | +0.181;-0.052 | ++;- | +
2 | +ADAM15 | +SRC | +Y735;Y715 | +REGULATES | +PSP/RegPhos | +0.181;-0.052 | ++;- | +
3 | +BCAR1 | +SRC | +Y267;Y287 | +REGULATES | +PSP/RegPhos | +-0.07 | +- | +
4 | +BIN1 | +MAPT | +T348 | +INDUCES | +PhosphoSitePlus;PTMInt | +-0.112 | +- | +
[25]:
+
analyze.summarize_protein_network(protein = 'TSC2', interaction_graph = interaction_graph, network_data = network_data, network_stats = network_stats)
+
+Decreased interaction likelihoods: AKT1, YWHAE, YWHAZ
+Number of interactions: 3 (Rank: 2)
+Centrality measures - Degree = 0.2 (Rank: 2)
+ Betweenness = 0.028571428571428574 (Rank: 3)
+ Closeness = 0.2 (Rank: 3)
+
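To give a sense of what the degree centrality reported above measures, here is a small standard-library sketch that assumes the usual definition (number of neighbors divided by the number of other nodes). The edge list is taken from the first rows of network_data shown earlier, so the values differ from the full network; this is not how PTM-POSE computes its statistics.

```python
from collections import defaultdict

# (modified gene, interacting gene) pairs from the first rows of network_data
edges = [
    ("ADAM15", "HCK"),
    ("ADAM15", "LCK"),
    ("ADAM15", "SRC"),
    ("BCAR1", "SRC"),
    ("BIN1", "MAPT"),
]

# Build an undirected adjacency map
neighbors = defaultdict(set)
for modified, interactor in edges:
    neighbors[modified].add(interactor)
    neighbors[interactor].add(modified)

# Degree centrality: fraction of the other nodes each node is connected to
n_nodes = len(neighbors)
degree_centrality = {
    node: len(linked) / (n_nodes - 1) for node, linked in neighbors.items()
}

# Rank nodes from most to least connected
ranked = sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)
for node, value in ranked:
    print(f"{node}: {value:.3f}")
```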
[17]:
+
from ptm_pose import plots as pose_plots
+
+pose_plots.plot_interaction_network(interaction_graph, network_data, network_stats = network_stats)
+
[13]:
+
from ptm_pose import plots as pose_plots
+
+network_stats = analyze.get_interaction_stats(interaction_graph)
+pose_plots.plot_network_centrality(network_stats, network_data, top_N = 10, modified_color = 'coral', interacting_color = 'grey')
+
While we provide functions for performing enrichment of known kinase substrates from databases like PhosphoSitePlus, RegPhos, and PTMsigDB, these resources are limited by the small fraction of phosphorylation sites with a validated kinase (<5%). For this purpose, we have adapted a previously developed algorithm called KSTAR (Kinase Substrate to Activity Relationships) for use with spliced PTM data, which harnesses kinase-substrate predictions to expand the overall number of phosphorylation sites that can be used as evidence. This is particularly important because many of the spliced PTMs in your dataset may be less well studied and may not have any annotated kinases.
+In order to perform KSTAR analysis, you will first need to download KSTAR networks from the following figshare.
+Once you have downloaded the networks, all you need is your PTM data.
+[2]:
+
from ptm_pose import analyze
+import pandas as pd
+
+# Load spliced ptm and altered flank data
+spliced_ptms = pd.read_csv('spliced_ptms.csv')
+
[30]:
+
kstar_enrichment = analyze.kstar_enrichment(spliced_ptms, network_dir = '../../../../Database_Information/NETWORKS/NetworKIN/', phospho_type = 'Y')
+kstar_enrichment.run_kstar_enrichment()
+kstar_enrichment.return_enriched_kinases()
+
You can also run the same analysis for serine/threonine kinases:
+[34]:
+
kstar_enrichment = analyze.kstar_enrichment(spliced_ptms, network_dir = '../../../../Database_Information/NETWORKS/NetworKIN/', phospho_type = 'ST')
+kstar_enrichment.run_kstar_enrichment()
+kstar_enrichment.return_enriched_kinases()
+
[34]:
+
+array(['PRKG2', 'MAPK14', 'PRKCH', 'PRKCG', 'PRKD1', 'PRKCE', 'ROCK1',
+ 'TTK'], dtype=object)
+
[72]:
+
from ptm_pose import flanking_sequences as fs
+import pandas as pd
+
+# Load altered flank data
+altered_flanks = pd.read_csv('altered_flanks.csv')
+
[73]:
+
altered_flanks = fs.compare_flanking_sequences(altered_flanks)
+altered_flanks[['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'Inclusion Sequence', 'Exclusion Sequence', 'Sequence Identity', 'Altered Positions', 'Residue Change', 'Altered Flank Side']].head()
+
[73]:
+
+ | UniProtKB Accession | +Residue | +PTM Position in Canonical Isoform | +Modification Class | +Inclusion Sequence | +Exclusion Sequence | +Sequence Identity | +Altered Positions | +Residue Change | +Altered Flank Side | +
---|---|---|---|---|---|---|---|---|---|---|
0 | +P01116 | +T | +148 | +Phosphorylation | +ETSAKtRQESG | +ETSAKtRQGC* | +NaN | +NaN | +NaN | +NaN | +
1 | +P01116 | +K | +147 | +Acetylation | +IETSAkTRQES | +IETSAkTRQGC | +0.818182 | +[4.0, 5.0] | +[E->G, S->C] | +C-term only | +
2 | +P01116 | +K | +147 | +Ubiquitination | +IETSAkTRQES | +IETSAkTRQGC | +0.818182 | +[4.0, 5.0] | +[E->G, S->C] | +C-term only | +
3 | +Q9UPQ0 | +S | +746 | +Phosphorylation | +LPNLNsQGVAW | +LPNLNsQGGFS | +0.727273 | +[3.0, 4.0, 5.0] | +[V->G, A->F, W->S] | +C-term only | +
4 | +Q9UPQ0 | +S | +750 | +Phosphorylation | +PSQVDsPSSEK | +ILKVDsPSSEK | +0.727273 | +[-5.0, -4.0, -3.0] | +[P->I, S->L, Q->K] | +N-term only | +
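The metrics in the table above can be reproduced with a few lines of Python. The sketch below assumes equal-length flanking sequences with the PTM at the center, and it skips sequences containing a stop codon ('*'). The `compare_flanks` helper and the "Both" label are hypothetical, not the library's implementation.

```python
def compare_flanks(inclusion, exclusion):
    """Hypothetical helper: compare two flanking sequences centered on the PTM."""
    # Skip stop codons and unequal lengths, which are harder to interpret
    if "*" in inclusion or "*" in exclusion or len(inclusion) != len(exclusion):
        return None
    center = len(inclusion) // 2
    # Positions are reported relative to the modified residue (position 0)
    altered = [
        i - center
        for i, (a, b) in enumerate(zip(inclusion, exclusion))
        if a.upper() != b.upper()
    ]
    changes = [f"{inclusion[p + center]}->{exclusion[p + center]}" for p in altered]
    if not altered:
        side = None
    elif all(p > 0 for p in altered):
        side = "C-term only"
    elif all(p < 0 for p in altered):
        side = "N-term only"
    else:
        side = "Both"
    return {
        "Sequence Identity": 1 - len(altered) / len(inclusion),
        "Altered Positions": altered,
        "Residue Change": changes,
        "Altered Flank Side": side,
    }

result = compare_flanks("IETSAkTRQES", "IETSAkTRQGC")
print(result)
```

For the second row of the table this gives positions 4 and 5, the changes E->G and S->C, and an identity of 9/11, matching the values shown.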
Note, we only calculate these metrics for cases where altered flanking sequences do not cause a stop codon to be introduced, as this is harder to interpret (such as for the first PTM in the list). The above table will indicate the positions in the flanking sequence that are altered, how similar the altered flanking sequence is to the original flanking sequence, and the specific residue change that takes place. We can also plot some of this information to get a better sense of the distribution of altered flanking sequences:
[48]:
+
from ptm_pose import plots as pose_plots
+
+pose_plots.location_of_altered_flanking_residues(altered_flanks)
+
We can even create the same plot for specific modification types or residues, as well as label the specific residue changes that occur:
+[76]:
+
pose_plots.location_of_altered_flanking_residues(altered_flanks, modification_class='Acetylation')
+
If we want to dig deeper, we can look at the specific changes that are occurring, although this is only recommended with a selected subset of PTMs, such as those that may have a functional impact:
+[102]:
+
pose_plots.alterations_matrix(altered_flanks.head(10))
+
[120]:
+
altered_flanks = analyze.compare_inclusion_motifs(altered_flanks)
+
[126]:
+
sh2_motif_changes = analyze.identify_change_to_specific_motif(altered_flanks, elm_motif_name = '14-3-3', modification_class = 'Phosphorylation', residues = ['S','T'])
+
[127]:
+
sh2_motif_changes
+
[127]:
+
+ | Gene | +UniProtKB Accession | +Residue | +PTM Position in Canonical Isoform | +Modification Class | +Inclusion Sequence | +Exclusion Sequence | +Motif only in Inclusion | +Motif only in Exclusion | +Altered Positions | +Residue Change | +
---|---|---|---|---|---|---|---|---|---|---|---|
22 | +MLPH | +Q9BV36 | +S | +337 | +Phosphorylation | +RGRASsESQDL | +RGRASsESQGS | +LIG_14-3-3_CanoR_1 | +NaN | +[4.0, 5.0] | +[D->G, L->S] | +
23 | +MLPH | +Q9BV36 | +S | +339 | +Phosphorylation | +RASSEsQDL*A | +RASSEsQGSRC | +LIG_14-3-3_CanoR_1 | +NaN | +NaN | +NaN | +
50 | +CEACAM1 | +P13688 | +T | +457 | +Phosphorylation | +LHFGKtGRGKR | +LHFGKtGRLRT | +NaN | +LIG_14-3-3_CterR_2 | +[3.0, 4.0, 5.0] | +[G->L, K->R, R->T] | +
67 | +ENAH | +Q8N8S7 | +S | +512 | +Phosphorylation | +KSPVIsRTGFS | +KSPVIsRTKIH | +LIG_14-3-3_CterR_2 | +NaN | +[3.0, 4.0, 5.0] | +[G->K, F->I, S->H] | +
93 | +LMO7 | +Q8WWI1-3 | +S | +356 | +Phosphorylation | +ADGTFsRTLSK | +ADGTFsRE*VH | +LIG_14-3-3_CterR_2 | +NaN | +NaN | +NaN | +
129 | +MAP3K7 | +O43318 | +T | +403 | +Phosphorylation | +RIAATtGLFQA | +RIAATtGQRTA | +LIG_14-3-3_CanoR_1 | +NaN | +[2.0, 3.0, 4.0] | +[L->Q, F->R, Q->T] | +
141 | +LMO7 | +Q8WWI1-3 | +T | +354 | +Phosphorylation | +TEADGtFSR*S | +TEADGtFSRE* | +LIG_14-3-3_CterR_2 | +NaN | +NaN | +NaN | +
[128]:
+
pose_plots.alterations_matrix(sh2_motif_changes)
+
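Conceptually, a motif is gained or lost when a linear-motif regular expression matches one flanking sequence but not the other. The sketch below illustrates this with a pattern written in the style of ELM's canonical 14-3-3 class. Treat the exact expression as an assumption and take the authoritative one from ELM. The `motif_change` helper is hypothetical, and stop codons ('*') are not handled specially here.

```python
import re

# Assumed pattern in the style of ELM's LIG_14-3-3_CanoR_1 class (verify against ELM)
MOTIF_14_3_3 = re.compile(
    r"R[^DE]{0,2}[^DEPG]([ST])(([FWYLMV].)|([^PRIKGN]P)|([^PRIKGN].{2,4}[VILMFWYP]))"
)

def motif_change(inclusion, exclusion, pattern=MOTIF_14_3_3):
    """Hypothetical helper: report where a motif is found between two flanks."""
    # Flanking sequences mark the modified residue in lowercase, so normalize case
    in_inclusion = bool(pattern.search(inclusion.upper()))
    in_exclusion = bool(pattern.search(exclusion.upper()))
    if in_inclusion and not in_exclusion:
        return "Motif only in Inclusion"
    if in_exclusion and not in_inclusion:
        return "Motif only in Exclusion"
    return "Motif in both" if in_inclusion else "Motif in neither"

# MLPH S337 from the table above
change = motif_change("RGRASsESQDL", "RGRASsESQGS")
print(change)
```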
PTM-POSE is an easily implementable tool to project PTM sites onto splice event data generated from RNA sequencing data and is compatible with any splice event quantification tool that outputs genomic coordinates of different splice events (MATS, SpliceSeq, etc.). PTM-POSE harnesses PTMs that have been mapped to their genomic location by a sister package, [ExonPTMapper](NaegleLab/ExonPTMapper). It also contains functions for annotating these PTMs with information from various databases, like PhosphoSitePlus and ELM.
+To run PTM-POSE, you first need to process your data such that each row corresponds to a unique splice event with the genomic location of that splice event (chromosome, strand, and the bounds of the spliced region). Strand can be indicated using either ‘+’/’-’ or 1/-1. If desired, you can also provide a delta PSI and significance value which will be included in the final PTM dataframe. Any additional columns will be kept. At a minimum, the dataframe should look something like this (optional but recommended parameters indicated):
+| event id (optional) | Gene name (recommended) | chromosome | strand | region start | region end | dPSI (optional) | significance (optional) |
+|---|---|---|---|---|---|---|---|
+| first_event | CSTN1 | 1 | -1 | 9797555 | 9797612 | 0.362 | 0.032 |
+
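As a rough illustration of the input format described above, the sketch below builds a single splice-event row and normalizes its strand to the 1/-1 convention. The helper `normalize_strand` is hypothetical and not part of PTM-POSE (which, per its documentation, converts string strands automatically); a plain dictionary stands in for a dataframe row so the example runs without extra dependencies.

```python
def normalize_strand(strand):
    """Convert '+'/'-' or 1/-1 (as int or str) to the integer convention 1/-1."""
    mapping = {"+": 1, "-": -1, "1": 1, "-1": -1}
    key = str(strand).strip()
    if key not in mapping:
        raise ValueError(f"Unrecognized strand value: {strand!r}")
    return mapping[key]


# One row per unique splice event, mirroring the example table.
event = {
    "event id": "first_event",
    "Gene name": "CSTN1",
    "chromosome": "1",
    "strand": "-",            # given here as a string to show the conversion
    "region start": 9797555,
    "region end": 9797612,
    "dPSI": 0.362,
    "significance": 0.032,
}
event["strand"] = normalize_strand(event["strand"])
```

In practice the same column names would be passed to the projection function through its `*_col` arguments.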
PTM-POSE allows you to assess two potential impacts of splicing on PTMs:
+Differential inclusion: the PTM site is lost or gained from the isoform as a result of a splice event
+Altered flanking sequence: the PTM site is present in both isoforms, but the residues adjacent to the PTM are changed in one isoform, so the linear motif around the site (which drives many protein interactions) is unique to that isoform
+Once the data is in the correct format, simply run the project_ptms_onto_splice_events() function, indicating the column names corresponding to each data element. By default, PTM-POSE assumes the provided coordinates are in hg38 coordinates, but you can use older coordinate systems (e.g., hg19) with the coordinate_type parameter. If you have saved ptm_coordinates locally, you can set the ptm_coordinates parameter to None.
+from ptm_pose import project
+
+my_splice_data_annotated, spliced_ptms = project.project_ptms_onto_splice_events(my_splice_data,
+ ptm_coordinates,
+ chromosome_col = 'chromosome',
+ strand_col = 'strand',
+ region_start_col = 'region start',
+ region_end_col = 'region end',
+ event_id_col = 'event id',
+ gene_col = 'Gene name',
+ dPSI_col='dPSI',
+ coordinate_type = 'hg19')
+
In addition to the previously mentioned columns, we will need to know the location of the flanking exonic regions next to the spliced region. Make sure your dataframe contains the following information prior to running flanking sequence analysis:
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+++ ++ |
+
Then, as with differentially included PTMs, you only need to run the get_flanking_changes_from_splice_data() function:
+from ptm_pose import project
+
+altered_flanks = project.get_flanking_changes_from_splice_data(my_splice_data,
+ ptm_coordinates,
+ chromosome_col = 'chromosome',
+ strand_col = 'strand',
+ region_start_col = 'region_start',
+ region_end_col = 'region_end',
+ first_flank_start_col = 'first_flank_start',
+ first_flank_end_col = 'first_flank_end',
+ second_flank_start_col = 'second_flank_start',
+ second_flank_end_col = 'second_flank_end',
+ event_id_col = 'event_id',
+ gene_col = 'Gene name',
+ dPSI_col='dPSI',
+ coordinate_type = 'hg19')
+
In some cases you may wish to work with a combined file that indicates both differential inclusion and altered flanking sequence events. This can be done quickly by running:
+from ptm_pose import analyze
+combined_output = analyze.combine_outputs(spliced_ptms, altered_flanks)
+
Beyond projecting PTMs onto your data, we have also provided additional functions for appending information on the function, relationships, and interactions of each post-translational modification that have been recorded in various databases. These annotations include information from:
+Database |
+Annotation types |
+PTM-POSE function |
+
---|---|---|
+ |
|
+annotate.add_PSP_regulatory_site_data(spliced_ptms, file = "/path/to/file/Regulatory_sites.gz")
+ |
+
|
+annotate.add_PSP_kinase_substrate_data(spliced_ptms, file = "/path/to/file/Kinase_Substrate_Dataset.gz")
+ |
+|
+ |
|
+annotate.add_DEPOD_data(spliced_ptms, file = "/path/to/file/")
+ |
+
+ |
|
+annotate.add_RegPhos_data(spliced_ptms, file = "/path/to/file/")
+ |
+
+ |
|
+annotate.add_PTMcode_interprotein(spliced_ptms, file = "/path/to/file/")
+ |
+
|
+annotate.add_PTMcode_intraprotein(spliced_ptms, file = "/path/to/file/")
+ |
+|
+ |
|
+annotate.add_PTMcode_interprotein(spliced_ptms, file = "/path/to/file/")
+ |
+
|
+annotate.add_PTMcode_intraprotein(spliced_ptms, file = "/path/to/file/")
+ |
+
Rather than running each function individually, you can also use the master function annotate_ptms() to annotate with all desired information at once.
+We are continuing to work on adding functions to append more contextual information for individual PTMs. If you have suggestions for what information you would like to be added, please let us know!
+PTM-POSE also provides functions in the annotate module for annotating the above outputs with functional information from various databases: PhosphoSitePlus, RegPhos, PTMcode, PTMInt, ELM, DEPOD. You can then identify PTMs with specific functions, interactions, etc. with the analyze module. See an example on a real dataset [here](Examples/ESRP1_knockdown).
+PTM-POSE is an easily implementable tool to project PTM sites onto splice event data generated from RNA sequencing data and is compatible with any splice event quantification tool that outputs genomic coordinates of different splice events (MATS, SpliceSeq, etc.). PTM-POSE harnesses PTMs that have been mapped to their genomic location by a sister package, ExonPTMapper. It also contains functions for annotating these PTMs with information from various databases, like PhosphoSitePlus and ELM.
+For more details about PTM projection and how it can be used to understand the impacts of splicing on cell signaling and other processes, see our pre-print: https://www.biorxiv.org/content/10.1101/2024.01.10.575062v2
+Download the ptm_coordinates dataframe from GitHub Large File Storage (LFS). By default, the file is not saved locally because of its large size (we do not want to force users to download it, but highly encourage doing so); an option to save the file is provided if desired.
+Whether to save the file locally into Resource Files directory. The default is False.
+Number of times to attempt to download the file. The default is 5.
+Time to wait between download attempts. The default is 10.
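The retry behavior described by these parameters can be sketched generically as follows. This is not the package's implementation: `download_with_retries` and the `fetch` callable are hypothetical names, and the delay is assumed to be in seconds (the documentation does not state the unit).

```python
import time


def download_with_retries(fetch, max_retries=5, delay=10):
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            if attempt < max_retries:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"Download failed after {max_retries} attempts") from last_error
```

Any zero-argument function that performs the download (and raises on failure) could be passed as `fetch`.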
+Given splice quantification from the MATS algorithm, annotate with PTMs that are found in the differentially included regions.
+dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
+dataframe containing skipped exon event information from MATS
+dataframe containing 5’ alternative splice site event information from MATS
+dataframe containing 3’ alternative splice site event information from MATS
+dataframe containing retained intron event information from MATS
+dataframe containing mutually exclusive exon event information from MATS
+indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
+Indicate whether to look for altered flanking sequences from spliced events, in addition to those directly in the spliced region. Default is False. (not yet active)
+Number of processes to use for multiprocessing. Default is 1.
+Given splice event quantification data, project PTMs onto the regions impacted by the splice events. Assumes that the splice event data will have chromosome, strand, and genomic start/end positions for the regions of interest, and each row of the splice_event_data corresponds to a unique region.
+Parameters
+dataframe containing splice event information, including chromosome, strand, and genomic location of regions of interest
+dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If none, it will pull from the config file.
+column name in splice_data that contains chromosome information. Default is ‘chr’. Expects it to be a str with only the chromosome number: ‘Y’, ‘1’, ‘2’, etc.
+column name in splice_data that contains strand information. Default is ‘strand’. Expects it to be a str with ‘+’ or ‘-’, or integers as 1 or -1. Will convert to integers automatically if string format is provided.
+column name in splice_data that contains the start position of the region of interest. Default is ‘exonStart_0base’.
+column name in splice_data that contains the end position of the region of interest. Default is ‘exonEnd’.
+column name in splice_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.
+column name in splice_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (in some cases, genomic coordinates overlap between distinct genes). Default is None.
+column name in splice_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output
+column name in splice_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
+list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
+indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
+Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is True.
+Label to display in the tqdm progress bar. Default is None, which will automatically state “Projecting PTMs onto regions using —– coordinates”.
+Number of processes to use for multiprocessing. Default is 1 (single processing)
+Contains the PTMs identified across the different splice events
+dataframe containing the original splice data with an additional column ‘PTMs’ that contains the PTMs found in the region of interest, in the format of ‘SiteNumber(ModificationType)’. If no PTMs are found, the value will be np.nan.
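Conceptually, projection amounts to finding PTMs whose genomic coordinate falls inside a spliced region on the same chromosome and strand. The sketch below illustrates that idea only; it is not the library's code. The function `project_ptms`, the dictionary keys, the inclusive bounds, and the example PTM coordinates are all assumptions made for illustration, and `None` is used where the real output would hold `np.nan`.

```python
def project_ptms(regions, ptms):
    """Return a copy of each region with a 'PTMs' entry listing overlapping sites."""
    annotated = []
    for region in regions:
        lo, hi = sorted((region["start"], region["end"]))
        hits = [
            f"{ptm['site']}({ptm['modification']})"
            for ptm in ptms
            if ptm["chromosome"] == region["chromosome"]
            and ptm["strand"] == region["strand"]
            and lo <= ptm["coordinate"] <= hi  # inclusive bounds are an assumption
        ]
        annotated.append({**region, "PTMs": ";".join(hits) if hits else None})
    return annotated


# Made-up PTM coordinates, for illustration only.
ptms = [
    {"site": "S100", "modification": "Phosphorylation",
     "chromosome": "1", "strand": -1, "coordinate": 9797580},
    {"site": "K200", "modification": "Acetylation",
     "chromosome": "1", "strand": 1, "coordinate": 9797580},
]
regions = [
    {"event id": "first_event", "chromosome": "1", "strand": -1,
     "start": 9797555, "end": 9797612},
    {"event id": "second_event", "chromosome": "2", "strand": 1,
     "start": 100, "end": 200},
]
annotated = project_ptms(regions, ptms)
```

Only the first PTM is reported for `first_event`, because the second lies on the opposite strand.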
+Given a region id and the splicegraph from SpliceSeq, extract the chromosome, strand, and start and stop locations of that exon. Start and stop are forced to be in ascending order, which is not necessarily true from the splice graph (i.e. start > stop for negative strand exons). This is done to make the region extraction consistent with the rest of the codebase.
+SpliceSeq splicegraph dataframe, with region_id as index
+Region ID to extract information from, in the format of ‘GeneName_ExonNumber’
+List containing the chromosome, strand (1 for forward, -1 for negative), start, and stop locations of the region
+Currently has been tested with MATS splicing events.
+Given flanking and spliced regions associated with a splice event, identify PTMs that have potential to have an altered flanking sequence depending on whether spliced region is included or excluded (if PTM is close to splice boundary). For these PTMs, extract the flanking sequences associated with the inclusion and exclusion cases and translate into amino acid sequences. If the PTM is not associated with a codon that codes for the expected amino acid, the PTM will be excluded from the results.
+DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions
+Chromosome associated with the splice event
+Strand associated with the splice event (1 for forward, -1 for negative)
+List containing the start and stop locations of the first flanking region (first is currently defined based on location in the genome, not the coding sequence)
+List containing the start and stop locations of the spliced region
+List containing the start and stop locations of the second flanking region (second is currently defined based on location in the genome, not the coding sequence)
+Event ID associated with the splice event, by default None
+Number of amino acids to include flanking the PTM, by default 7
+Coordinate system used for the regions, by default ‘hg38’. The other option is hg19.
+Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
+Whether the first, spliced and second regions are defined by their genomic coordinates (first has smallest coordinate, spliced next, then second), or if they are defined by their translation (first is the first when translated, etc.)
+DataFrame containing the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
+Given a DataFrame containing information about splice events, extract the flanking sequences associated with the PTMs in the flanking regions if there is potential for this to be altered. The DataFrame should contain columns for the chromosome, strand, start and stop locations of the first flanking region, spliced region, and second flanking region. The DataFrame should also contain a column for the event ID associated with the splice event. If the DataFrame does not contain the necessary columns, the function will raise an error.
+DataFrame containing information about splice events
+DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions
+Column name indicating chromosome, by default None
+Column name indicating strand, by default None
+Column name indicating start location of the first flanking region, by default None
+Column name indicating end location of the first flanking region, by default None
+Column name indicating start location of the spliced region, by default None
+Column name indicating end location of the spliced region, by default None
+Column name indicating start location of the second flanking region, by default None
+Column name indicating end location of the second flanking region, by default None
+Column name indicating event ID, by default None
+Number of amino acids to include flanking the PTM, by default 7
+Coordinate system used for the regions, by default ‘hg38’. The other option is hg19.
+Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
+List containing DataFrames with the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
+Given a DataFrame containing information about splice events obtained from SpliceSeq and the corresponding splicegraph, extract the flanking sequences of PTMs that are near the splice boundary (potential for flanking sequence to be altered). Coordinate information of individual exons should be found in the splicegraph. You can also provide columns with specific PSI or significance information. Extra columns not in these categories can be provided with the extra_cols parameter.
+DataFrame containing information about splice events obtained from SpliceSeq
+DataFrame containing information about individual exons and their coordinates
+DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions
+Column name indicating delta PSI value, by default None
+Column name indicating significance of the event, by default None
+Column name indicating event ID, by default None
+List of column names for additional information to add to the results, by default None
+Column name indicating gene symbol of spliced gene, by default ‘symbol’
+Number of amino acids to include flanking the PTM, by default 5
+Coordinate system used for the regions, by default ‘hg19’. The other option is hg38.
+DataFrame containing the PTMs associated with the flanking regions that are altered, and the flanking sequences that arise depending on whether the flanking sequence is included or not
+Given a PTM location in a DNA sequence, extract the flanking sequence around the PTM and translate it into an amino acid sequence. If the sequence is not the correct length and full_flanking_seq is not True, the function will attempt to extract the flanking sequence, padding with spaces to account for missing parts. If the sequence is still not the correct length, the function will raise an error. Stop codons are translated as ‘*’, and any codons not found in the standard codon table are translated as ‘X’ (unknown).
+Location of the first base pair associated with PTM in the DNA sequence
+DNA sequence containing the PTM
+Amino acid residue associated with the PTM
+Number of amino acids to include flanking the PTM, by default 5
+Whether to lowercase the amino acid associated with the PTM, by default True
+Whether to require the flanking sequence to be the correct length, by default False
+Amino acid sequence of the flanking sequence around the PTM if translation was successful, otherwise np.nan
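A minimal sketch of this extract-and-translate step, using only the standard genetic code, is shown below. It is an illustration rather than the package's implementation: `translate_flank` is a hypothetical name, and for simplicity it raises an error when the window is incomplete instead of padding with spaces.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_flank(dna, ptm_loc, flank_size=5, lowercase_mod=True):
    """Translate the PTM codon (starting at ptm_loc) plus flank_size codons on each side."""
    start = ptm_loc - 3 * flank_size
    end = ptm_loc + 3 * (flank_size + 1)
    if start < 0 or end > len(dna):
        raise ValueError("Sequence too short for the requested flank size")
    window = dna[start:end].upper()
    # Unknown codons become 'X'; stop codons are already '*' in the table.
    residues = [
        CODON_TABLE.get(window[i:i + 3], "X") for i in range(0, len(window), 3)
    ]
    if lowercase_mod:
        residues[flank_size] = residues[flank_size].lower()
    return "".join(residues)
```

For example, with `flank_size=1` the sequence `GCTTCTGAT` and a PTM codon starting at index 3 translate to alanine, a lowercase serine, and aspartate.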
+Given the location of a PTM in a flanking sequence, extract the location of the PTM in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence associated with a given splice event. Inclusion Flanking Sequence will include the skipped exon region, retained intron, or longer alternative splice site depending on event type. The PTM location should be associated with where the PTM is located relative to spliced region (before = ‘First’, after = ‘Second’).
+Location of the PTM in the flanking sequence it is found (either first or second)
+Flanking exon sequence before the spliced region
+Spliced region sequence
+Flanking exon sequence after the spliced region
+Which flank the PTM is associated with, by default ‘First’
+Whether the first, spliced and second regions are defined by their genomic coordinates (first has smallest coordinate, spliced next, then second), or if they are defined by their translation (first is the first when translated, etc.)
+Tuple containing the PTM location in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence
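The bookkeeping described here can be sketched as follows, assuming the three sequences are already ordered by translation (first flank, spliced region, second flank). The function name `get_ptm_locs` is hypothetical, and this is a simplified illustration rather than the package's code: a PTM in the first flank keeps its index in both cases, while a PTM in the second flank is shifted by the preceding sequence.

```python
def get_ptm_locs(ptm_loc, first_seq, spliced_seq, second_seq, which_flank="First"):
    """Return the PTM index in the inclusion and exclusion sequences."""
    if which_flank == "First":
        inclusion_loc = exclusion_loc = ptm_loc
    elif which_flank == "Second":
        # Inclusion sequence is first + spliced + second; exclusion is first + second.
        inclusion_loc = len(first_seq) + len(spliced_seq) + ptm_loc
        exclusion_loc = len(first_seq) + ptm_loc
    else:
        raise ValueError("which_flank must be 'First' or 'Second'")
    return inclusion_loc, exclusion_loc


first, spliced, second = "AAACCC", "GGG", "TTTAAA"
incl_loc, excl_loc = get_ptm_locs(3, first, spliced, second, which_flank="Second")
inclusion_seq = first + spliced + second
exclusion_seq = first + second
```

Slicing either sequence at the returned index recovers the same codon that sits at index 3 of the second flank.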
+Given all exons associated with a splicegraph event, obtain the coordinates associated with the flanking exons and the spliced region. The spliced region is defined as the exons that are associated with psi values, while flanking regions include the “from” and “to” exons that indicate the adjacent, unspliced exons.
+Gene name associated with the splice event
+Exon number associated with the first flanking exon
+Exon numbers associated with the spliced region, separated by colons for each unique exon
+Exon number associated with the second flanking exon
+DataFrame containing information about individual exons and their coordinates
+Tuple containing the genomic coordinates of the first flanking region, spliced regions, and second flanking region
+Given ptm information for identifying flanking sequences from splicegraph information, extract the relative location of the ptm in the flanking region (where it is located in translation of the flanking region).
+Series containing PTM information
+Strand associated with the splice event (1 for forward, -1 for negative)
+List containing the chromosome, strand, start, and stop locations of the first flanking region
+List containing the chromosome, strand, start, and stop locations of the second flanking region
+Relative location of the PTM in the flanking region
+Given a DNA sequence, translate the sequence into an amino acid sequence. If the sequence is not the correct length and full_flanking_seq is not True, the function will attempt to extract the flanking sequence, padding with spaces to account for missing parts. If the sequence is still not the correct length, the function will raise an error. Stop codons are translated as ‘*’, and any codons not found in the standard codon table are translated as ‘X’ (unknown).
+DNA sequence to translate
+Number of amino acids to include flanking the PTM, by default 7
+Whether to require the flanking sequence to be the correct length, by default True
+Whether to lowercase the amino acid associated with the PTM, by default True
+Length of the flanking sequence in front of the PTM, by default None. If full_flanking_seq is False and sequence is not the correct length, this is required.
+Symbol to use for stop codons, by default ‘*’
+Symbol to use for unknown codons, by default ‘X’
+Amino acid sequence of the flanking sequence if translation was successful, otherwise np.nan
+Given a spliced ptms dataframe from the project module, add ELM interaction data to the dataframe
+Process disease association data from PhosphoSitePlus (Disease-associated_sites.gz), and add to spliced_ptms dataframe from project_ptms_onto_splice_events() function
+Path to the PhosphoSitePlus Disease-associated_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
+Contains the PTMs identified across the different splice events with an additional column indicating the diseases associated with that site
+Add kinase substrate data from PhosphoSitePlus (Kinase_Substrate_Dataset.gz) to spliced_ptms dataframe from project_ptms_onto_splice_events() function
+Path to the PhosphoSitePlus Kinase_Substrate_Dataset.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
+Contains the PTMs identified across the different splice events with an additional column indicating the kinases known to phosphorylate that site (not relevant to non-phosphorylation PTMs)
+Add functional information from PhosphoSitePlus (Regulatory_sites.gz) to spliced_ptms dataframe from project_ptms_onto_splice_events() function
+Path to the PhosphoSitePlus Regulatory_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
+Contains the PTMs identified across the different splice events with additional columns for regulatory site information, including domains, biological process, functions, and protein interactions associated with the PTMs
+Given spliced_ptms data from the project module, add PTMInt interaction data, which will include the protein that is being interacted with, whether the PTM enhances or inhibits binding, and the localization of the interaction. This will be added as a new column labeled PTMInt:Interactions, and each entry will be formatted like ‘Protein->Effect|Localization’. Multiple interactions are separated by semicolons.
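Entries in this ‘Protein->Effect|Localization’ format can be split back into their parts with a few string operations. The sketch below is one way to do so; `parse_ptmint_entry` is a hypothetical helper, and the example entry uses made-up protein names and effect labels.

```python
def parse_ptmint_entry(entry):
    """Split 'Protein->Effect|Localization;...' into a list of dictionaries."""
    interactions = []
    for item in entry.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing semicolons
        protein, _, rest = item.partition("->")
        effect, _, localization = rest.partition("|")
        interactions.append(
            {"protein": protein, "effect": effect, "localization": localization}
        )
    return interactions


# Made-up entry for illustration only.
parsed = parse_ptmint_entry("ProteinA->Enhance|Nucleus;ProteinB->Inhibit|Cytoplasm")
```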
+Given a desired database and annotation type, add the corresponding annotation data to the spliced ptm dataframe
+Dataframe containing PTM data
+Database to extract annotation data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’
+Type of annotation to extract. Options include ‘Function’, ‘Process’, ‘Interactions’, ‘Disease’, ‘Kinase’, ‘Phosphatase’, but depend on the specific database (run analyze.get_annotation_categories())
+File path to annotation data. If None, will download from online source, except for PhosphoSitePlus (due to licensing restrictions)
+Add custom annotation data to spliced_ptms or altered flanking sequence dataframes
+Dataframe containing the annotation data to be added to the spliced_ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Canonical Isoform, and the annotation data to be added
+Name of the source of the annotation data, will be used to label the columns in the spliced_ptms dataframe
+Type of annotation data being added, will be used to label the columns in the spliced_ptms dataframe
+Column name in the annotation data that contains the annotation data to be added to the spliced_ptms dataframe
+Contains the PTMs identified across the different splice events with an additional column for the custom annotation data
+Given spliced PTM data, add annotations from various databases. The following annotations can be added:
+- PhosphoSitePlus: regulatory site data (file must be provided), kinase-substrate data (file must be provided), disease association data (file must be provided)
+- ELM: interaction data (can be downloaded automatically or provided as a file), motif matches (ELM class data can be downloaded automatically or provided as a file)
+- PTMInt: interaction data (will be downloaded automatically)
+- PTMcode: intraprotein interactions (can be downloaded automatically or provided as a file), interprotein interactions (can be downloaded automatically or provided as a file)
+- DEPOD: phosphatase-substrate data (will be downloaded automatically)
+- RegPhos: kinase-substrate data (will be downloaded automatically)
Spliced PTM data from project module
+File path to PhosphoSitePlus regulatory site data
+File path to PhosphoSitePlus kinase-substrate data
+File path to PhosphoSitePlus disease association data
+If True, download ELM interaction data automatically. If str, provide file path to ELM interaction data
+If True, download ELM motif data automatically. If str, provide file path to ELM motif data
+If True, download PTMInt data automatically
+If True, download PTMcode intraprotein data automatically. If str, provide file path to PTMcode intraprotein data
+If True, download PTMcode interprotein data automatically. If str, provide file path to PTMcode interprotein data
+If True, download DEPOD data automatically
+If True, download RegPhos data automatically
+File path to PTMsigDB data
+List of databases to combine interaction data from. Default is [‘PTMcode’, ‘PhosphoSitePlus’, ‘RegPhos’, ‘PTMInt’]
+List of databases to combine kinase-substrate data from. Default is [‘PhosphoSitePlus’, ‘RegPhos’]
+Whether to combine annotations of similar information (kinase, interactions, etc) from multiple databases into another column labeled as ‘Combined’. Default is True
+Given a file name, check if the file exists and has the expected extension. If the file does not exist or has the wrong extension, raise an error.
+File name to check
+Expected file extension. Default is ‘.tsv’
+Given spliced ptm information, combine kinase-substrate data from multiple databases (currently support PhosphoSitePlus and RegPhos), assuming that the kinase data from these resources has already been added to the spliced ptm data. The combined kinase data will be added as a new column labeled ‘Combined:Kinase’
+Spliced PTM data from project module
+List of databases to combine kinase data from. Currently support PhosphoSitePlus and RegPhos
+Allows conversion of RegPhos names to matching names in PhosphoSitePlus.
+Spliced PTM data with combined kinase data added
+Given annotated spliced ptm data, extract interaction data from various databases and combine into a single dataframe. This will include the interacting protein, the type of interaction, and the source of the interaction data
+Dataframe containing PTM data and associated interaction annotations from various databases
+List of databases to extract interaction data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’. These should already have annotation columns in the spliced_ptms dataframe, otherwise they will be ignored. For kinase-substrate interactions, if combined column is present, will use that instead of individual databases
+If True, will include kinase-substrate and phosphatase interactions in the output dataframe
+List of dataframes containing PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES), and the source of the interaction data
+Given a label for an interacting protein from PhosphoSitePlus, convert to UniProtKB accession. Required as PhosphoSitePlus interactions are recorded in various ways that aren’t necessarily consistent with other databases (i.e. not always gene name)
+Label for interacting protein from PhosphoSitePlus
+Given string object consisting of multiple modifications in the form of ‘Residue-Position’ separated by ‘, ‘, extract the residue and position. Ignore any excess details in the string.
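A small regular-expression sketch of this parsing step is shown below. The helper name `extract_sites`, the returned tuple format, and the example string are assumptions for illustration; the package's own function may return a different structure.

```python
import re

# One uppercase residue letter, a hyphen, then the position digits.
SITE_PATTERN = re.compile(r"([A-Z])-(\d+)")


def extract_sites(mod_string):
    """Return (residue, position) tuples from 'Residue-Position' items separated by ', '."""
    sites = []
    for item in mod_string.split(", "):
        match = SITE_PATTERN.search(item)
        if match:  # anything after the position is ignored
            sites.append((match.group(1), int(match.group(2))))
    return sites


sites = extract_sites("S-15, T-20 (extra detail), no site here")
```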
+Given spliced ptm data and a column containing interaction data, extract the interacting protein, type of interaction, and convert to UniProtKB accession. This will be added as a new column labeled ‘Interacting ID’
+Dataframe containing PTM data
+column containing interaction information from a specific database
+dictionary to convert names within given database to UniProt IDs. For cases when name is not necessarily one of the gene names listed in UniProt
+Contains PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES)
+In progress, needs to be tested
+Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:
+Use preconstructed background data for the annotation of interest, which considers the entire proteome present in the ptm_coordinates dataframe. While this is the default, it may not be the most accurate representation of your data, so you may alternatively decide to use the other options, which will be more specific to your context.
Use the alpha and min_dPSI parameter to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This will allow you to compare the enrichment of specific annotations in the significantly spliced PTMs compared to the entire dataset. Will do this automatically if alpha or min_dPSI is provided.
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
+database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, ‘RegPhos’, ‘PTMsigDB’. Default is ‘PhosphoSitePlus’.
+Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
+how to construct the background data. Options include ‘pregenerated’ (default) and ‘significance’. If ‘significance’ is selected, the alpha and min_dPSI parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.
+Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
+modification class to subset, if any
+significance threshold to use to subset foreground PTMs. Default is None.
+minimum delta PSI value to use to subset foreground PTMs. Default is None.
+file to use to annotate custom background data. Default is None.
+Whether to save the background data constructed from the ptm_coordinates dataframe into Resource_Files within package. Default is False.
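For reference, the hypergeometric enrichment test mentioned above asks how likely it is to see at least the observed number of annotated PTMs in the foreground, given how common the annotation is in the background. The sketch below computes that upper-tail probability with the standard library only; the function and argument names are hypothetical, and the package may compute it differently (for example with a statistics library).

```python
from math import comb


def hypergeom_enrichment_pvalue(background_size, background_hits,
                                foreground_size, foreground_hits):
    """P(X >= foreground_hits) when drawing foreground_size PTMs without replacement."""
    total = comb(background_size, foreground_size)
    upper = min(background_hits, foreground_size)
    favorable = sum(
        comb(background_hits, i)
        * comb(background_size - background_hits, foreground_size - i)
        for i in range(foreground_hits, upper + 1)
    )
    return favorable / total


# Toy numbers: 10 background PTMs, 5 carry the annotation, and all 5
# foreground PTMs carry it as well.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
```

With these toy numbers only one of the 252 possible foreground draws contains all five annotated PTMs, so the p-value is 1/252.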
+Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site
+Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
+Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases
+modification class to subset, if any
+Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.
+Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.
+Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
+DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()
+DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise
+DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events
+Convert flanking sequence to version accepted by kinase library (modified residue denoted by asterisk)
+Given two sequences, identify the location of positions that have changed
+sequences to compare (order does not matter)
+size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length
+list of positions that have changed
+list of residues that have changed associated with that position
+indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
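The comparison described above can be sketched as a position-by-position scan of two equal-length flanking sequences. This is only an illustration: the function name `find_changed_positions`, the offset numbering relative to the central (modified) residue, and the `A+2G` residue format are assumptions, not the package's actual output format.

```python
def find_changed_positions(seq1, seq2, flank_size=5):
    """Compare two flanking sequences centered on a PTM (hypothetical sketch)."""
    expected_len = 2 * flank_size + 1
    if len(seq1) != expected_len or len(seq2) != expected_len:
        raise ValueError(f"Both sequences must have length {expected_len}")

    positions, residues = [], []
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            offset = i - flank_size  # negative offsets are N-terminal to the PTM
            positions.append(offset)
            residues.append(f"{a}{offset:+d}{b}")

    n_side = any(p < 0 for p in positions)
    c_side = any(p > 0 for p in positions)
    if n_side and c_side:
        side = "Both"
    elif n_side:
        side = "N-term"
    elif c_side:
        side = "C-term"
    else:
        side = None  # sequences are identical outside the central residue
    return positions, residues, side

positions, residues, side = find_changed_positions("AAAAASAAAAA", "AAAAASAGAAA")
print(positions, residues, side)
```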
+Given a sequence and a dataframe containing ELM class information, identify motifs that can be found in the provided sequence using the RegEx expression provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
+sequence to search for motifs
+DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)
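A minimal sketch of the regular-expression search described above follows. The motif identifiers and patterns are made up for illustration and are not real ELM classes; only the `ELMIdentifier` and `Regex` field names come from the description. A list of dictionaries stands in for the ELM dataframe, and the function name `find_motifs` is hypothetical.

```python
import re

# Hypothetical stand-in for the ELM class table (not real ELM entries)
elm_classes = [
    {"ELMIdentifier": "EXAMPLE_SP_SITE", "Regex": "[ST]P"},
    {"ELMIdentifier": "EXAMPLE_RXXS", "Regex": "R..S"},
    {"ELMIdentifier": "EXAMPLE_PXXP", "Regex": "P..P"},
]

def find_motifs(sequence, classes):
    """Return identifiers of every motif whose regex matches anywhere in the sequence."""
    matches = []
    for row in classes:
        if re.search(row["Regex"], sequence) is not None:
            matches.append(row["ELMIdentifier"])
    return matches

# Compare motifs between an inclusion and an exclusion flanking sequence
inclusion_motifs = find_motifs("AARKLSPEEK", elm_classes)
exclusion_motifs = find_motifs("AARKLSAEEK", elm_classes)
unique_to_inclusion = sorted(set(inclusion_motifs) - set(exclusion_motifs))
print(inclusion_motifs, exclusion_motifs, unique_to_inclusion)
```

As the description notes, a match of this kind says nothing about where the motif sits relative to the PTM or whether it is structurally accessible.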
+Given spliced_ptms and/or altered_flanks dataframes (or the dataframes combined from combine_outputs()), perform gene set enrichment analysis using the enrichr API
+Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.
+Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.
+Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.
+List of gene sets to use in enrichment analysis. Default is [‘KEGG_2021_Human’, ‘GO_Biological_Process_2023’, ‘GO_Cellular_Component_2023’, ‘GO_Molecular_Function_2023’, ‘Reactome_2022’]. Look at gseapy and enrichr documentation for other available gene sets
+List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).
+Whether to return only significantly enriched gene sets. Default is True.
+Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.
+Number of seconds to wait between retries. Default is 10.
+Dataframe with gene set enrichment results from enrichr API
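The retry parameters above (number of retries and seconds between them) suggest a simple retry loop around the request. The sketch below shows one way such a loop could look; `call_with_retries` and the `fetch` callable are hypothetical names, and this is not necessarily how the package wraps the enrichr request.

```python
import time

def call_with_retries(fetch, max_retries=5, delay=10):
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as err:
            last_error = err
            if attempt < max_retries:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"Request failed after {max_retries} attempts") from last_error

# Demonstration with a stand-in function that fails twice before succeeding
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "enrichment results"

result = call_with_retries(flaky_request, max_retries=5, delay=0)
print(result, calls["count"])
```

In practice `fetch` would be a small function that issues the actual enrichment request; the demonstration uses `delay=0` only so that it runs instantly.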
+Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters defined by Pillman et al. BMC Bioinformatics 2011
+flanking sequence
+normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)
+Given spliced ptm information, return the available annotation categories that have been appended to dataframe
+PTMs projected onto splicing events and with annotations appended from various databases
+Dataframe that indicates the available databases, annotations from each database, and column associated with that annotation
+Given the database of interest and annotation type, return the annotation column that will be found in an annotated spliced_ptm dataframe
+Dataframe with PTM annotations added from annotate module
+Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
+database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, and ‘RegPhos’. Default is ‘PhosphoSitePlus’.
+Column name in spliced_ptms dataframe that contains the requested annotation
+Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: pregenerated (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)
+Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class
+Dataframe with PTMs projected onto splicing events or with altered flanking sequences
+Series with the counts of each modification class
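To make the counting concrete: a single PTM entry can list several modification classes separated by `;`, so each entry is split before counting. The sketch below does this with plain Python rather than pandas; the function name `count_modification_classes` and the example entries are illustrative assumptions.

```python
from collections import Counter

def count_modification_classes(mod_class_entries):
    """Count modification classes, splitting entries such as 'A;B' into 'A' and 'B'."""
    counts = Counter()
    for entry in mod_class_entries:
        if not isinstance(entry, str):
            continue  # skip missing values
        for mod in entry.split(";"):
            counts[mod.strip()] += 1
    # sort from least to most frequent
    return sorted(counts.items(), key=lambda kv: kv[1])

entries = ["Phosphorylation", "Phosphorylation;Acetylation", "Ubiquitination", None, "Phosphorylation"]
mod_counts = count_modification_classes(entries)
print(mod_counts)
```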
+Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules
+PTMs projected onto splicing events and with annotations appended from various databases
+Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
+database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, or ‘PTMInt’. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is ‘PhosphoSitePlus’.
+modification class to subset
+Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, ‘cell growth, induced’ would be simplified to ‘cell growth’
+Annotation to simplify
+Separator that splits the core annotation from additional detail. Default is ‘,’. Assumes the first element is the core annotation.
+Simplified annotation
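The simplification described above amounts to keeping the text before the first separator. A minimal standalone sketch, with the hypothetical name `simplify` and a made-up `->` example, could look like this:

```python
def simplify(annotation, sep=","):
    """Keep only the core annotation, i.e. the text before the first separator."""
    if not isinstance(annotation, str):
        return annotation  # leave missing values (e.g. NaN) untouched
    return annotation.split(sep)[0].strip()

print(simplify("cell growth, induced"))
print(simplify("EXAMPLE_TARGET->increase", sep="->"))
```

Because only the first element is kept, the choice of separator matters: it has to match how the source database delimits the extra detail.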
+In progress, needs to be tested
+Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:
+Use preconstructed background data for the annotation of interest, which considers the entire proteome present in the ptm_coordinates dataframe. While this is the default, it may not be the most accurate representation of your data, so you may alternatively decide to use the other option, which is more specific to your context.
+Use the alpha and min_dPSI parameters to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This allows you to compare the enrichment of specific annotations in the significantly spliced PTMs against the entire dataset. This is done automatically if alpha or min_dPSI is provided.
+Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
+database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, ‘RegPhos’, ‘PTMsigDB’. Default is ‘PhosphoSitePlus’.
+Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
+how to construct the background data. Options include ‘pregenerated’ (default) and ‘significance’. If ‘significance’ is selected, the alpha and min_dPSI parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.
+Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
+modification class to subset, if any
+significance threshold to use to subset foreground PTMs. Default is None.
+minimum delta PSI value to use to subset foreground PTMs. Default is None.
+file to use to annotate custom background data. Default is None.
+Whether to save the background data constructed from the ptm_coordinates dataframe into Resource_Files within package. Default is False.
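For intuition about the hypergeometric test mentioned above, the sketch below computes the probability of seeing at least `k` annotated PTMs in the foreground by chance. It uses only the standard library; the function name `hypergeom_enrichment_p` and the argument names are assumptions, and the package's own statistics helpers may compute or correct the p-value differently.

```python
from math import comb

def hypergeom_enrichment_p(k, M, n, N):
    """P(X >= k) for a hypergeometric draw.

    M: PTMs in the background
    n: background PTMs carrying the annotation
    N: PTMs in the foreground (impacted by splicing)
    k: foreground PTMs carrying the annotation
    """
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

# 10 background PTMs, 4 annotated; 3 foreground PTMs, 2 of them annotated
p_value = hypergeom_enrichment_p(k=2, M=10, n=4, N=3)
print(round(p_value, 4))
```

A small p-value indicates that the annotation appears among spliced PTMs more often than expected from the background, which is why the choice of background (whole proteome versus the full dataset) changes the interpretation.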
+Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site
+Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
+Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases
+modification class to subset, if any
+Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.
+Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.
+Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
+DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()
+DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise
+DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events
+Convert flanking sequence to version accepted by kinase library (modified residue denoted by asterisk)
+Given two sequences, identify the location of positions that have changed
+sequences to compare (order does not matter)
+size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length
+list of positions that have changed
+list of residues that have changed associated with that position
+indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
+Given a sequence and a dataframe containing ELM class information, identify motifs that can be found in the provided sequence using the RegEx expression provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
+sequence to search for motifs
+DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)
+Given spliced_ptms and/or altered_flanks dataframes (or the dataframes combined from combine_outputs()), perform gene set enrichment analysis using the enrichr API
+Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.
+Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.
+Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.
+List of gene sets to use in enrichment analysis. Default is [‘KEGG_2021_Human’, ‘GO_Biological_Process_2023’, ‘GO_Cellular_Component_2023’, ‘GO_Molecular_Function_2023’, ‘Reactome_2022’]. Look at gseapy and enrichr documentation for other available gene sets
+List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).
+Whether to return only significantly enriched gene sets. Default is True.
+Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.
+Number of seconds to wait between retries. Default is 10.
+Dataframe with gene set enrichment results from enrichr API
+Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters defined by Pillman et al. BMC Bioinformatics 2011
+flanking sequence
+normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)
+Given spliced ptm information, return the available annotation categories that have been appended to dataframe
+PTMs projected onto splicing events and with annotations appended from various databases
+Dataframe that indicates the available databases, annotations from each database, and column associated with that annotation
+Given the database of interest and annotation type, return the annotation column that will be found in an annotated spliced_ptm dataframe
+Dataframe with PTM annotations added from annotate module
+Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
+database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, and ‘RegPhos’. Default is ‘PhosphoSitePlus’.
+Column name in spliced_ptms dataframe that contains the requested annotation
+Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: pregenerated (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)
+Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class
+Dataframe with PTMs projected onto splicing events or with altered flanking sequences
+Series with the counts of each modification class
+Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules
+PTMs projected onto splicing events and with annotations appended from various databases
+Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
+database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, or ‘PTMInt’. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is ‘PhosphoSitePlus’.
+modification class to subset
+Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, ‘cell growth, induced’ would be simplified to ‘cell growth’
+Annotation to simplify
+Separator that splits the core annotation from additional detail. Default is ‘,’. Assumes the first element is the core annotation.
+Simplified annotation
+
+import numpy as np
+import pandas as pd
+import pickle
+
+import os
+import time
+
+#plotting
+import matplotlib.pyplot as plt
+import matplotlib.lines as mlines
+import seaborn as sns
+from ptm_pose import plots as pose_plots
+
+#analysis packages
+from Bio.Align import PairwiseAligner
+import gseapy as gp
+import networkx as nx
+import re
+
+
+#custom stat functions
+from ptm_pose import stat_utils, pose_config, annotate, helpers
+
+package_dir = os.path.dirname(os.path.abspath(__file__))
+
+def get_modification_counts(ptms):
+ """
+ Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class
+
+ Parameters
+ ----------
+ ptms: pd.DataFrame
+ Dataframe with PTMs projected onto splicing events or with altered flanking sequences
+
+ Returns
+ -------
+ modification_counts: pd.Series
+ Series with the counts of each modification class
+ """
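+ #work on a copy so the caller's dataframe is not modified in place
+ ptms = ptms.copy()
+ ptms['Modification Class'] = ptms['Modification Class'].apply(lambda x: x.split(';') if isinstance(x, str) else x)
+ ptms = ptms.explode('Modification Class')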
+ modification_counts = ptms.groupby('Modification Class').size()
+ modification_counts = modification_counts.sort_values(ascending = True)
+ return modification_counts
+
+def get_annotation_col(spliced_ptms, annotation_type = 'Function', database = 'PhosphoSitePlus'):
+ """
+ Given the database of interest and annotation type, return the annotation column that will be found in an annotated spliced_ptm dataframe
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe with PTM annotations added from annotate module
+ annotation_type: str
+ Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is 'Function'.
+ database: str
+ database from which PTMs are pulled. Options include 'PhosphoSitePlus', 'ELM', 'PTMInt', 'PTMcode', 'DEPOD', and 'RegPhos'. Default is 'PhosphoSitePlus'.
+
+ Returns
+ -------
+ annotation_col: str
+ Column name in spliced_ptms dataframe that contains the requested annotation
+ """
+ if database == 'Combined':
+ if f'Combined:{annotation_type}' not in spliced_ptms.columns:
+ raise ValueError(f'Requested annotation data has not yet been added to spliced_ptms dataframe. Please run the annotate.{pose_config.annotation_function_dict[database]} function to append this information.')
+ return f'Combined:{annotation_type}'
+ elif annotation_type in pose_config.annotation_col_dict[database].keys():
+ annotation_col = pose_config.annotation_col_dict[database][annotation_type]
+ if annotation_col not in spliced_ptms.columns:
+ raise ValueError(f'Requested annotation data has not yet been added to spliced_ptms dataframe. Please run the annotate.{pose_config.annotation_function_dict[database][annotation_type]} function to append this information.')
+ return annotation_col
+ else:
+ raise ValueError(f"Invalid annotation type for {database}. Available annotation data for {database} includes: {', '.join(pose_config.annotation_col_dict[database].keys())}")
+
+
+def combine_outputs(spliced_ptms, altered_flanks, mod_class = None, include_stop_codon_introduction = False, remove_conflicting = True):
+ """
+ Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
+ altered_flanks: pd.DataFrame
+ Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases
+ mod_class: str
+ modification class to subset, if any
+ include_stop_codon_introduction: bool
+ Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.
+ remove_conflicting: bool
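+ Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.
+
+ Returns
+ -------
+ combined: pd.DataFrame
+ Dataframe with one row per PTM and an 'Impact' column describing how splicing affects that PTM ('Included', 'Excluded', 'Altered Flank', and optionally 'Stop Codon Introduced'), joined by ';' when more than one impact applies
+ """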
+ #process differentially included PTMs and altered flanking sequences
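+ if mod_class is not None:
+ spliced_ptms = get_modification_class_data(spliced_ptms, mod_class)
+ altered_flanks = get_modification_class_data(altered_flanks, mod_class)
+ else:
+ #work on copies so the caller's dataframes are not modified in place
+ spliced_ptms = spliced_ptms.copy()
+ altered_flanks = altered_flanks.copy()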
+
+ #extract specific direction of splicing change and add to dataframe
+ spliced_ptms['Impact'] = spliced_ptms['dPSI'].apply(lambda x: 'Included' if x > 0 else 'Excluded')
+
+ #restrict altered flanks to those that are changed and are not disrupted by stop codons
+ if altered_flanks['Stop Codon Introduced'].dtypes != bool:
+ altered_flanks['Stop Codon Introduced'] = altered_flanks['Stop Codon Introduced'].astype(bool)
+ if include_stop_codon_introduction:
+ altered_flanks['Impact'] = altered_flanks['Stop Codon Introduced'].apply(lambda x: 'Stop Codon Introduced' if x else 'Altered Flank')
+ else:
+ altered_flanks = altered_flanks[~altered_flanks['Stop Codon Introduced']].copy()
+ altered_flanks['Impact'] = 'Altered Flank'
+
+ #identify annotations that are found in both datasets
+ annotation_columns_in_spliced_ptms = [col for col in spliced_ptms.columns if ':' in col]
+ annotation_columns_in_altered_flanks = [col for col in altered_flanks.columns if ':' in col]
+ annotation_columns = list(set(annotation_columns_in_spliced_ptms).intersection(annotation_columns_in_altered_flanks))
+ if len(annotation_columns) != len(annotation_columns_in_spliced_ptms) or len(annotation_columns) != len(annotation_columns_in_altered_flanks):
+ annotation_columns_only_in_spliced = list(set(annotation_columns_in_spliced_ptms) - set(annotation_columns_in_altered_flanks))
+ annotation_columns_only_in_altered = list(set(annotation_columns_in_altered_flanks) - set(annotation_columns_in_spliced_ptms))
+ if len(annotation_columns_only_in_spliced) > 0:
+ print(f'Warning: some annotations in spliced ptms dataframe not found in altered flanks dataframe: {", ".join(annotation_columns_only_in_spliced)}. These annotations will be ignored. To avoid this, make sure to add annotations to both dataframes, or annotate the combined dataframe.')
+ if len(annotation_columns_only_in_altered) > 0:
+ print(f'Warning: some annotations in altered flanks dataframe not found in spliced ptms dataframe: {", ".join(annotation_columns_only_in_altered)}. These annotations will be ignored. To avoid this, make sure to add annotations to both dataframes, or annotate the combined dataframe.')
+
+ #check if dPSI or sig columns are in both dataframes
+ sig_cols = []
+ if 'dPSI' in spliced_ptms.columns and 'dPSI' in altered_flanks.columns:
+ sig_cols.append('dPSI')
+ if 'Significance' in spliced_ptms.columns and 'Significance' in altered_flanks.columns:
+ sig_cols.append('Significance')
+
+ shared_columns = ['Impact', 'Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'] + sig_cols + annotation_columns
+ combined = pd.concat([spliced_ptms[shared_columns], altered_flanks[shared_columns]])
+ combined = combined.groupby([col for col in combined.columns if col != 'Impact'], as_index = False, dropna = False)['Impact'].apply(lambda x: ';'.join(set(x)))
+
+ #remove ptms that are both included and excluded across different events
+ if remove_conflicting:
+ combined = combined[~((combined['Impact'].str.contains('Included')) & (combined['Impact'].str.contains('Excluded')))]
+
+ return combined
+
+def simplify_annotation(annotation, sep = ','):
+ """
+ Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, 'cell growth, induced' would be simplified to 'cell growth'
+
+ Parameters
+ ----------
+ annotation: str
+ Annotation to simplify
+ sep: str
+ Separator that splits the core annotation from additional detail. Default is ','. Assumes the first element is the core annotation.
+
+ Returns
+ -------
+ annotation: str
+ Simplified annotation
+ """
+ annotation = annotation.split(sep)[0].strip(' ') if annotation == annotation else annotation
+ return annotation
+
+def collapse_annotations(annotations, database = 'PhosphoSitePlus', annotation_type = 'Function'):
+ sep_dict = {'PhosphoSitePlus':{'Function':',', 'Process':',','Interactions':'(', 'Disease':'->', 'Perturbation':'->'}, 'ELM': {'Interactions': ' ', 'Motif Match': ' '}, 'PTMInt':{'Interactions':'->'}, 'PTMcode':{'Interactions':'_', 'Intraprotein':' '}, 'RegPhos':{'Kinase':' '}, 'DEPOD':{'Phosphatase':' '}, 'Combined':{'Kinase':' ', 'Interactions':'->'}, 'PTMsigDB': {'WikiPathway':'->', 'NetPath':'->','mSigDB':'->', 'Perturbation (DIA2)':'->', 'Perturbation (DIA)': '->', 'Perturbation (PRM)':'->','Kinase':'->'}}
+
+ if annotation_type == 'Kinase' and database != 'PTMsigDB':
+ collapsed = annotations
+ else:
+ sep = sep_dict[database][annotation_type]
+ collapsed = []
+ for annot in annotations:
+ if annot == annot:
+ collapsed.append(simplify_annotation(annot, sep = sep))
+ else:
+ collapsed.append(annot)
+ return collapsed
+
+
+def get_modification_class_data(spliced_ptms, mod_class):
+ #check if specific modification class was provided and subset data by modification if so
+ if mod_class in spliced_ptms['Modification Class'].values:
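+ ptms_of_interest = spliced_ptms[spliced_ptms['Modification Class'].str.contains(mod_class, na = False)].copy()
+ else:
+ #requested class not found: collect the available modification classes for the error message
+ ptms_of_interest = spliced_ptms.copy()
+ ptms_of_interest['Modification Class'] = ptms_of_interest['Modification Class'].apply(lambda x: x.split(';') if x == x else np.nan)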
+ ptms_of_interest = ptms_of_interest.explode('Modification Class').dropna(subset = 'Modification Class')
+ available_ptms = ptms_of_interest['Modification Class'].unique()
+ raise ValueError(f"Requested modification class not present in the data. The available modifications include {', '.join(available_ptms)}")
+
+ return ptms_of_interest
+
+def get_ptm_annotations(spliced_ptms, annotation_type = 'Function', database = 'PhosphoSitePlus', mod_class = None, collapse_on_similar = False, dPSI_col = None, sig_col = None):
+ """
+ Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ PTMs projected onto splicing events and with annotations appended from various databases
+ annotation_type: str
+ Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is 'Function'.
+ database: str
+ database from which PTMs are pulled. Options include 'PhosphoSitePlus', 'ELM', or 'PTMInt'. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is 'PhosphoSitePlus'.
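+ mod_class: str
+ modification class to subset, if any. Default is None.
+ collapse_on_similar: bool
+ Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
+ dPSI_col: str
+ Name of an additional delta PSI column to keep in the output, if present. Default is None.
+ sig_col: str
+ Name of an additional significance column to keep in the output, if present. Default is None.
+
+ Returns
+ -------
+ annotations: pd.DataFrame
+ PTMs that carry the requested annotation, with one row per PTM. None if no PTMs have the annotation.
+ annotation_counts: pd.Series or pd.DataFrame
+ Number of PTMs associated with each annotation, with per-impact counts when available. None if no PTMs have the annotation.
+ """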
+ #check to make sure requested annotation is available
+ if database != 'Combined':
+ annotation_col = get_annotation_col(spliced_ptms, database = database, annotation_type = annotation_type)
+ else:
+ annotation_col = f'Combined:{annotation_type}'
+
+
+ #check if specific modification class was provided and subset data by modification if so
+ if mod_class is not None:
+ ptms_of_interest = get_modification_class_data(spliced_ptms, mod_class)
+ else:
+ ptms_of_interest = spliced_ptms.copy()
+
+ #extract relevant annotation and remove PTMs without an annotation
+ optional_cols = [col for col in ptms_of_interest.columns if col in ['Impact', 'dPSI', 'Significance'] or col == dPSI_col or col == sig_col ]
+ annotations = ptms_of_interest[['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'] + [annotation_col] + optional_cols].copy()
+ annotations = annotations.dropna(subset = annotation_col).drop_duplicates()
+
+ if annotations.empty:
+ print("No PTMs with associated annotation")
+ return None, None
+
+ #combine repeat entries for same PTM (with multiple impacts)
+ annotations = annotations.groupby(['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'], as_index = False).agg(lambda x: ';'.join(set([str(i) for i in x if i == i])))
+
+ #separate distinct modification annotations in unique rows
+ annotations_exploded = annotations.copy()
+ annotations_exploded[annotation_col] = annotations_exploded[annotation_col].apply(lambda x: x.split(';') if isinstance(x, str) else np.nan)
+ annotations_exploded = annotations_exploded.explode(annotation_col)
+ annotations_exploded[annotation_col] = annotations_exploded[annotation_col].apply(lambda x: x.strip() if isinstance(x, str) else np.nan)
+
+ #if desired collapse similar annotations (for example, same function but increasing or decreasing)
+ if collapse_on_similar:
+ annotations_exploded[annotation_col] = collapse_annotations(annotations_exploded[annotation_col].values, database = database, annotation_type = annotation_type)
+ annotations_exploded.drop_duplicates(inplace = True)
+ annotations = annotations_exploded.groupby([col for col in annotations_exploded.columns if col != annotation_col], as_index = False, dropna = False)[annotation_col].apply(lambda x: ';'.join(set(x)))
+
+ #get the number of annotations found
+ annotation_counts = annotations_exploded.drop_duplicates(subset = ['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'] + [annotation_col])[annotation_col].value_counts()
+
+ #additional_counts
+ sub_counts = []
+ if 'Impact' in annotations_exploded.columns:
+ for imp in ['Included', 'Excluded', 'Altered Flank']:
+ tmp_annotations = annotations_exploded[annotations_exploded['Impact'] == imp].copy()
+ tmp_annotations = tmp_annotations.drop_duplicates(subset = ['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'] + [annotation_col])
+ sub_counts.append(tmp_annotations[annotation_col].value_counts())
+
+ annotation_counts = pd.concat([annotation_counts] + sub_counts, axis = 1)
+ annotation_counts.columns = ['All Impacted', 'Included', 'Excluded', 'Altered Flank']
+ annotation_counts = annotation_counts.replace(np.nan, 0)
+
+ #combine repeat entries for same PTM (with multiple impacts)
+ annotations = annotations.groupby(['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'], as_index = False).agg(lambda x: ';'.join(set([str(i) for i in x if i == i])))
+
+ return annotations, annotation_counts
+
+def get_annotation_categories(spliced_ptms):
+ """
+ Given spliced ptm information, return the available annotation categories that have been appended to dataframe
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ PTMs projected onto splicing events and with annotations appended from various databases
+
+ Returns
+ -------
+ annot_categories: pd.DataFrame
+ Dataframe that indicates the available databases, annotations from each database, and column associated with that annotation
+ """
+ database_list = []
+ type_list = []
+ column_list = []
+ #get available phosphositeplus annotations
+ for col in spliced_ptms.columns:
+ if ':' in col:
+ database = col.split(':')[0] if 'PSP' not in col else 'PhosphoSitePlus'
+ if database != 'Combined' and database != 'Unnamed':
+ col_dict = pose_config.annotation_col_dict[database]
+
+ #flip through annotation types in col_dict and add the one that matches the column
+ for key, value in col_dict.items():
+ if value == col:
+ type_list.append(key)
+ database_list.append(database)
+ column_list.append(col)
+ elif database == 'Combined':
+ type_list.append(col.split(':')[1])
+ database_list.append('Combined')
+ column_list.append(col)
+ else:
+ continue
+
+ if len(type_list) > 0:
+ annot_categories = pd.DataFrame({'database':database_list, 'annotation_type':type_list, 'column': column_list}).sort_values(by = 'database')
+ return annot_categories
+ else:
+ print('No annotation information found. Please run functions from annotate module to append annotation information')
+ return None
+
+
+def construct_background(file = None, annotation_type = 'Function', database = 'PhosphoSitePlus', modification = None, collapse_on_similar = False, save = False):
+ ptm_coordinates = pose_config.ptm_coordinates.copy()
+ ptm_coordinates = ptm_coordinates.rename({'Gene name':'Gene'}, axis = 1)
+ if modification is not None:
+ ptm_coordinates = ptm_coordinates[ptm_coordinates['Modification Class'].str.contains(modification)].copy()
+ if ptm_coordinates.empty:
+ raise ValueError(f'No PTMs found with modification class {modification}. Please provide a valid modification class. Examples include Phosphorylation, Glycosylation, Ubiquitination, etc.')
+
+
+ if database == 'PhosphoSitePlus':
+ if file is None:
+ raise ValueError('Please provide PhosphoSitePlus source file to construct the background dataframe')
+ elif annotation_type in ['Function', 'Process', 'Interactions']:
+ ptm_coordinates = annotate.add_PSP_regulatory_site_data(ptm_coordinates, file = file, report_success=False)
+ elif annotation_type == 'Kinase':
+ ptm_coordinates = annotate.add_PSP_kinase_substrate_data(ptm_coordinates, file = file, report_success=False)
+ elif annotation_type == 'Disease':
+ ptm_coordinates = annotate.add_PSP_disease_association(ptm_coordinates, file = file, report_success=False)
+ elif annotation_type == 'Perturbation':
+ ptm_coordinates = annotate.add_PTMsigDB_data(ptm_coordinates, file = file, report_success=False)
+ if database == 'ELM':
+ if annotation_type == 'Interactions':
+ ptm_coordinates = annotate.add_ELM_interactions(ptm_coordinates, file = file, report_success = False)
+ elif annotation_type == 'Motif Match':
+ ptm_coordinates = annotate.add_ELM_matched_motifs(ptm_coordinates, file = file, report_success = False)
+ if database == 'PTMInt':
+ ptm_coordinates = annotate.add_PTMInt_data(ptm_coordinates, file = file, report_success=False)
+ if database == 'PTMcode':
+ if annotation_type == 'Intraprotein':
+ ptm_coordinates = annotate.add_PTMcode_intraprotein(ptm_coordinates, file = file, report_success=False)
+ elif annotation_type == 'Interactions':
+ ptm_coordinates = annotate.add_PTMcode_interprotein(ptm_coordinates, file = file, report_success=False)
+ if database == 'RegPhos':
+ ptm_coordinates = annotate.add_RegPhos_data(ptm_coordinates, file = file, report_success=False)
+ if database == 'DEPOD':
+ ptm_coordinates = annotate.add_DEPOD_phosphatase_data(ptm_coordinates, report_success=False)
+ if database == 'PTMsigDB':
+ ptm_coordinates = annotate.add_PTMsigDB_data(ptm_coordinates, file = file, report_success=False)
+ if database == 'Combined':
+ raise ValueError('Combined information is not supported for constructing background data from entire proteome at this time. Please provide a specific database to construct background data.')
+
+
+ _, annotation_counts = get_ptm_annotations(ptm_coordinates, annotation_type = annotation_type, database = database, collapse_on_similar = collapse_on_similar)
+ if save:
+ package_dir = os.path.dirname(os.path.abspath(__file__))
+ if collapse_on_similar and modification is not None:
+ annotation_counts.to_csv(package_dir + f'/Resource_Files/background_annotations/{database}_{annotation_type}_{modification}_collapsed.csv')
+ elif collapse_on_similar:
+ annotation_counts.to_csv(package_dir + f'/Resource_Files/background_annotations/{database}_{annotation_type}_collapsed.csv')
+ elif modification is not None:
+ annotation_counts.to_csv(package_dir + f'/Resource_Files/background_annotations/{database}_{annotation_type}_{modification}.csv')
+ else:
+ annotation_counts.to_csv(package_dir + f'/Resource_Files/background_annotations/{database}_{annotation_type}.csv')
+
+ return annotation_counts
+
+
+
+
+def get_enrichment_inputs(spliced_ptms, annotation_type = 'Function', database = 'PhosphoSitePlus', background_type = 'pregenerated', background = None, collapse_on_similar = False, mod_class = None, alpha = 0.05, min_dPSI = 0, annotation_file = None, save_background = False):
+ """
+ Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: pregenerated (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+
+ """
+ if background_type == 'pregenerated':
+ print('Using pregenerated background information on all PTMs in the proteome.')
+ #first look for pregenerated background data
+ try:
+ background_annotation_count = pose_config.download_background(annotation_type = annotation_type, database = database, mod_class = mod_class, collapsed=collapse_on_similar)
+ except Exception:
+ if annotation_file is None:
+ print('Note: To avoid having to construct the background each time (which is slower), you can set save_background = True to save the background data to Resource Files in the package directory.')
+ background_annotation_count = construct_background(file = annotation_file, annotation_type = annotation_type, database = database, modification = mod_class, collapse_on_similar = collapse_on_similar, save = save_background)
+
+ if mod_class is None:
+ background_size = pose_config.ptm_coordinates.drop_duplicates(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).shape[0]
+ else:
+ background_size = pose_config.ptm_coordinates[pose_config.ptm_coordinates['Modification Class'].str.contains(mod_class)].drop_duplicates(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).shape[0]
+
+ elif background_type == 'significance':
+ if 'Significance' not in spliced_ptms.columns or 'dPSI' not in spliced_ptms.columns:
+ raise ValueError('Significance and dPSI columns must be present in spliced_ptms dataframe to construct a background based on significance (these columns must be provided during projection).')
+
+ background = spliced_ptms.copy()
+ #restrict sample to significantly spliced ptms
+ spliced_ptms = spliced_ptms[(spliced_ptms['Significance'] <= alpha) & (spliced_ptms['dPSI'].abs() >= min_dPSI)].copy()
+
+
+ #check to make sure there are significant PTMs in the data and that there is a difference in the number of significant and background PTMs
+ if spliced_ptms.shape[0] == 0:
+ raise ValueError('No significantly spliced PTMs found in the data')
+ elif spliced_ptms.shape[0] == background.shape[0]:
+ raise ValueError('The foreground and background PTM sets are the same size when considering significance. Please make sure spliced_ptms also includes non-significant PTMs, or set background_type = "pregenerated" to use the pregenerated background of the whole proteome.')
+ else:
+ if mod_class is not None:
+ background = get_modification_class_data(background, mod_class)
+
+ #get background counts
+ background_size = background.shape[0]
+ _, background_annotation_count = get_ptm_annotations(background, annotation_type = annotation_type, database = database, collapse_on_similar = collapse_on_similar)
+ #elif background is not None: #if custom background is provided
+ # print('Using the provided custom background')
+ # if isinstance(background, list) or isinstance(background, np.ndarray):
+ # #from list of PTM strings, separate into uniprot id, residue, and position
+ # uniprot_id = [ptm.split('_')[0] for ptm in background]
+ # residue = [ptm.split('_')[1][0] for ptm in background]
+ # position = [int(ptm.split('_')[1][1:]) for ptm in background]
+ # background = pd.DataFrame({'UniProtKB Accession':uniprot_id, 'Residue':residue, 'PTM Position in Canonical Isoform':position, 'Modification Class':mod_class})
+ # if isinstance(background, pd.DataFrame):
+ # #check to make sure ptm data has key columns to identify ptms
+ # if 'UniProtKB Accession' not in background.columns or 'Residue' not in background.columns or 'PTM Position in Canonical Isoform' not in background.columns or #'Modification Class' not in background.columns:
+ # raise ValueError('Background dataframe must have UniProtKB Accession, Residue, PTM Position in Canonical Isoform, and Modification Class columns to identify PTMs')
+
+ # #restrict to specific modification class
+ # if mod_class is not None and 'Modification Class' in background.columns:
+ # background = get_modification_class_data(background, mod_class)
+ # elif mod_class is not None:
+ # raise ValueError('Custom background dataframe must have a Modification Class column to subset by modification class.')
+ # else:
+ # raise ValueError('Custom backgrounds must be provided as a list/array of PTMs in the form of "P00533_Y1068" (Uniprot ID followed by site number) or as a custom background dataframe with UniProtKB Accession, Residue, PTM Position in Canonical Isoform, and Modification Class columns.')
+
+ # background = annotate.add_annotation(background, annotation_type = annotation_type, database = database, check_existing = True, file = annotation_file)
+ # background_size = background.drop_duplicates(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).shape[0]
+
+ #get background counts
+ # _, background_annotation_count = get_ptm_annotations(background, annotation_type = annotation_type, database = database, collapse_on_similar = collapse_on_similar)
+ #elif background_type == 'custom':
+ # raise ValueError('Please provide a custom background dataframe or list of PTMs to use as the background if wanting to use custom background data.')
+ else:
+ raise ValueError('Invalid background type. Must be pregenerated (default) or significance')
+
+ #get counts
+ foreground_size = spliced_ptms.shape[0]
+ annotation_details, foreground_annotation_count = get_ptm_annotations(spliced_ptms, annotation_type = annotation_type, database = database, collapse_on_similar=collapse_on_similar)
+
+ #process annotation details into usable format
+ if annotation_details is None:
+ print('No PTMs with requested annotation type, so could not perform enrichment analysis')
+ return np.repeat(None, 5)
+ else:
+ annotation_col = get_annotation_col(spliced_ptms, database = database, annotation_type = annotation_type)
+ annotation_details[annotation_col] = annotation_details[annotation_col].str.split(';')
+ annotation_details = annotation_details.explode(annotation_col)
+ annotation_details[annotation_col] = annotation_details[annotation_col].str.strip()
+ annotation_details['PTM'] = annotation_details['Gene'] + '_' + annotation_details['Residue'] + annotation_details['PTM Position in Canonical Isoform'].astype(int).astype(str)
+ annotation_details = annotation_details.groupby(annotation_col)['PTM'].agg(';'.join)
+
+ return foreground_annotation_count, foreground_size, background_annotation_count, background_size, annotation_details
+
+
+def annotation_enrichment(spliced_ptms, database = 'PhosphoSitePlus', annotation_type = 'Function', background_type = 'pregenerated', collapse_on_similar = False, mod_class = None, alpha = None, min_dPSI = None, annotation_file = None, save_background = False):
+ """
+ In progress, needs to be tested
+
+ Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:
+
+ 1. Use preconstructed background data for the annotation of interest, which considers the entire proteome present in the ptm_coordinates dataframe. While this is the default, it may not be the most accurate representation of your data, so you may alternatively decide to use the other option, which is more specific to your context.
+ 2. Use the alpha and min_dPSI parameter to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This will allow you to compare the enrichment of specific annotations in the significantly spliced PTMs compared to the entire dataset. Will do this automatically if alpha or min_dPSI is provided.
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
+ database: str
+ database from which PTMs are pulled. Options include 'PhosphoSitePlus', 'ELM', 'PTMInt', 'PTMcode', 'DEPOD', 'RegPhos', 'PTMsigDB'. Default is 'PhosphoSitePlus'.
+ annotation_type: str
+ Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is 'Function'.
+ background_type: str
+ how to construct the background data. Options include 'pregenerated' (default) and 'significance'. If 'significance' is selected, the alpha and min_dPSI parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.
+ collapse_on_similar: bool
+ Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
+ mod_class: str
+ modification class to subset, if any
+ alpha: float
+ significance threshold to use to subset foreground PTMs. Default is None.
+ min_dPSI: float
+ minimum delta PSI value to use to subset foreground PTMs. Default is None.
+ annotation_file: str
+ file to use to annotate custom background data. Default is None.
+ save_background: bool
+ Whether to save the background data constructed from the ptm_coordinates dataframe into Resource_Files within package. Default is False.
+ """
+ foreground_annotation_count, foreground_size, background_annotations, background_size, annotation_details = get_enrichment_inputs(spliced_ptms, background_type = background_type, annotation_type = annotation_type, database = database, collapse_on_similar = collapse_on_similar, mod_class = mod_class, alpha = alpha, min_dPSI = min_dPSI, annotation_file = annotation_file, save_background = save_background)
+
+
+ if foreground_annotation_count is not None:
+ #iterate through all annotations and calculate enrichment with a hypergeometric test
+ results = pd.DataFrame(columns = ['Fraction Impacted', 'p-value'], index = foreground_annotation_count.index)
+ for i, n in background_annotations.items():
+ #number of PTMs in the foreground with the annotation
+ if i in foreground_annotation_count.index.values:
+ if foreground_annotation_count.shape[1] == 1:
+ k = foreground_annotation_count.loc[i, 'count']
+ elif foreground_annotation_count.shape[1] > 1:
+ k = foreground_annotation_count.loc[i, 'All Impacted']
+
+ p = stat_utils.getEnrichment(background_size, n, foreground_size, k, fishers = False)
+ results.loc[i, 'Fraction Impacted'] = f"{k}/{n}"
+ results.loc[i, 'p-value'] = p
+
+ results = results.sort_values('p-value')
+ results['Adjusted p-value'] = stat_utils.adjustP(results['p-value'].values)
+ results = pd.concat([results, annotation_details], axis = 1)
+ else:
+ results = None
+
+ return results
+
+
+def gene_set_enrichment(spliced_ptms = None, altered_flanks = None, combined = None, alpha = 0.05, min_dPSI = None, gene_sets = ['KEGG_2021_Human', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2023','Reactome_2022'], background = None, return_sig_only = True, max_retries = 5, delay = 10):
+ """
+ Given spliced_ptms and/or altered_flanks dataframes (or the dataframes combined from combine_outputs()), perform gene set enrichment analysis using the enrichr API
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.
+ altered_flanks: pd.DataFrame
+ Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.
+ combined: pd.DataFrame
+ Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.
+ gene_sets: list
+ List of gene sets to use in enrichment analysis. Default is ['KEGG_2021_Human', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2023','Reactome_2022']. Look at gseapy and enrichr documentation for other available gene sets
+ background: list
+ List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).
+ return_sig_only: bool
+ Whether to return only significantly enriched gene sets. Default is True.
+ max_retries: int
+ Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.
+ delay: int
+ Number of seconds to wait between retries. Default is 10.
+
+ Returns
+ -------
+ results: pd.DataFrame
+ Dataframe with gene set enrichment results from enrichr API
+
+ """
+ if combined is not None:
+ if spliced_ptms is not None or altered_flanks is not None:
+ print('If combined dataframe is provided, you do not need to include spliced_ptms or altered_flanks dataframes. Ignoring these inputs.')
+
+ foreground = combined.copy()
+ type = 'Differentially Included + Altered Flanking Sequences'
+
+ #isolate the type of impact on the gene
+ combined_on_gene = combined.groupby('Gene')['Impact'].apply(lambda x: ';'.join(set(x)))
+ included = combined_on_gene.str.contains('Included')
+ excluded = combined_on_gene.str.contains('Excluded')
+ differential = included | excluded
+ altered_flank = combined_on_gene.str.contains('Altered Flank')
+
+ altered_flank_only = altered_flank & ~differential
+ differential_only = differential & ~altered_flank
+ both = differential & altered_flank
+
+ altered_flank_only = combined_on_gene[altered_flank_only].index.tolist()
+ differential_only = combined_on_gene[differential_only].index.tolist()
+ both = combined_on_gene[both].index.tolist()
+ elif spliced_ptms is not None and altered_flanks is not None:
+ #gene information (total and spliced genes)
+ combined = combine_outputs(spliced_ptms, altered_flanks)
+ foreground = combined.copy()
+ type = 'Differentially Included + Altered Flanking Sequences'
+
+ #isolate the type of impact on the gene
+ combined_on_gene = combined.groupby('Gene')['Impact'].apply(lambda x: ';'.join(set(x)))
+ included = combined_on_gene.str.contains('Included')
+ excluded = combined_on_gene.str.contains('Excluded')
+ differential = included | excluded
+ altered_flank = combined_on_gene.str.contains('Altered Flank')
+
+ altered_flank_only = altered_flank & ~differential
+ differential_only = differential & ~altered_flank
+ both = differential & altered_flank
+
+ altered_flank_only = combined_on_gene[altered_flank_only].index.tolist()
+ differential_only = combined_on_gene[differential_only].index.tolist()
+ both = combined_on_gene[both].index.tolist()
+ elif spliced_ptms is not None:
+ foreground = spliced_ptms.copy()
+ type = 'Differentially Included'
+
+ #isolate the type of impact on the gene
+ altered_flank_only = []
+ differential_only = spliced_ptms['Gene'].unique().tolist()
+ both = []
+ elif altered_flanks is not None:
+ foreground = altered_flanks.copy()
+ type = 'Altered Flanking Sequences'
+
+ #isolate the type of impact on the gene
+ altered_flank_only = altered_flanks['Gene'].unique().tolist()
+ differential_only = []
+ both = []
+ else:
+ raise ValueError('No dataframes provided. Please provide spliced_ptms, altered_flanks, or the combined dataframe.')
+
+ #keep a copy of all provided PTMs, in case they are needed to construct the background
+ all_ptms = foreground.copy()
+
+ #restrict to significant ptms, if available (use foreground rather than combined, which is None when only one dataframe is provided)
+ if 'Significance' in foreground.columns and (min_dPSI is not None and 'dPSI' in foreground.columns):
+ foreground = foreground[foreground['Significance'] <= alpha].copy()
+ foreground = foreground[foreground['dPSI'].abs() >= min_dPSI]
+ elif 'Significance' in foreground.columns:
+ foreground = foreground[foreground['Significance'] <= alpha].copy()
+ elif min_dPSI is not None and 'dPSI' in foreground.columns:
+ foreground = foreground[foreground['dPSI'].abs() >= min_dPSI].copy()
+ else:
+ print('Significance column not found and min_dPSI not provided. All PTMs in dataframe will be considered as the foreground')
+
+ foreground = foreground['Gene'].unique().tolist()
+
+ #construct background
+ if isinstance(background, list):
+ pass
+ elif isinstance(background, np.ndarray):
+ background = list(background)
+ elif isinstance(background, str) and background == 'Significance' and 'Significance' in all_ptms.columns:
+ #use all genes in the provided data (significant or not) as the background
+ background = all_ptms['Gene'].unique().tolist()
+
+
+
+ #perform gene set enrichment analysis and save data
+ for i in range(max_retries):
+ try:
+ enr = gp.enrichr(foreground, background = background, gene_sets = gene_sets, organism='human')
+ break
+ except Exception:
+ time.sleep(delay)
+ else:
+ raise Exception('Failed to run enrichr analysis after ' + str(max_retries) + ' attempts. Please try again later.')
+
+ results = enr.results.copy()
+ results['Type'] = type
+
+ #indicate the genes in each gene set associated with each type of impact
+ results['Genes with Differentially Included PTMs only'] = results['Genes'].apply(lambda x: ';'.join(set(x.split(';')) & (set(differential_only))))
+ results['Genes with PTM with Altered Flanking Sequence only'] = results['Genes'].apply(lambda x: ';'.join(set(x.split(';')) & (set(altered_flank_only))))
+ results['Genes with Both'] = results['Genes'].apply(lambda x: ';'.join(set(x.split(';')) & (set(both))))
+
+ if return_sig_only:
+ return results[results['Adjusted P-value'] <= 0.05]
+ else:
+ return results
+
+def compare_flanking_sequences(altered_flanks, flank_size = 5):
+ sequence_identity_list = []
+ altered_positions_list = []
+ residue_change_list = []
+ flank_side_list = []
+ for i, row in altered_flanks.iterrows():
+ #if there is sequence info for both and does not introduce stop codons, compare sequence identity
+ if not row['Stop Codon Introduced'] and row['Inclusion Flanking Sequence'] == row['Inclusion Flanking Sequence'] and row['Exclusion Flanking Sequence'] == row['Exclusion Flanking Sequence']:
+ #compare sequence identity
+ sequence_identity = getSequenceIdentity(row['Inclusion Flanking Sequence'], row['Exclusion Flanking Sequence'])
+ #identify where flanking sequence changes
+ altered_positions, residue_change, flank_side = findAlteredPositions(row['Inclusion Flanking Sequence'], row['Exclusion Flanking Sequence'], flank_size = flank_size)
+ else:
+ sequence_identity = np.nan
+ altered_positions = np.nan
+ residue_change = np.nan
+ flank_side = np.nan
+
+
+
+ #add to lists
+ sequence_identity_list.append(sequence_identity)
+ altered_positions_list.append(altered_positions)
+ residue_change_list.append(residue_change)
+ flank_side_list.append(flank_side)
+
+ altered_flanks['Sequence Identity'] = sequence_identity_list
+ altered_flanks['Altered Positions'] = altered_positions_list
+ altered_flanks['Residue Change'] = residue_change_list
+ altered_flanks['Altered Flank Side'] = flank_side_list
+ return altered_flanks
+
+
+
+def compare_inclusion_motifs(flanking_sequences, elm_classes = None):
+ """
+ Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
+
+ Parameters
+ ----------
+ flanking_sequences: pandas.DataFrame
+ DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()
+ elm_classes: pandas.DataFrame
+ DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise
+
+ Returns
+ -------
+ flanking_sequences: pandas.DataFrame
+ DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events
+
+ """
+ if elm_classes is None:
+ elm_classes = pd.read_csv('http://elm.eu.org/elms/elms_index.tsv', sep = '\t', header = 5)
+
+
+
+ only_in_inclusion = []
+ only_in_exclusion = []
+
+ for _, row in flanking_sequences.iterrows():
+ #check that no stop codon is introduced and that both flanking sequences are present (x == x is False for NaN)
+ if not row['Stop Codon Introduced'] and row['Inclusion Flanking Sequence'] == row['Inclusion Flanking Sequence'] and row['Exclusion Flanking Sequence'] == row['Exclusion Flanking Sequence']:
+ #get elm motifs that match inclusion or Exclusion Flanking Sequences
+ inclusion_matches = find_motifs(row['Inclusion Flanking Sequence'], elm_classes)
+ exclusion_matches = find_motifs(row['Exclusion Flanking Sequence'], elm_classes)
+
+ #get motifs that are unique to each case
+ only_in_inclusion.append(';'.join(set(inclusion_matches) - set(exclusion_matches)))
+ only_in_exclusion.append(';'.join(set(exclusion_matches) - set(inclusion_matches)))
+ else:
+ only_in_inclusion.append(np.nan)
+ only_in_exclusion.append(np.nan)
+
+ #save data
+ flanking_sequences["Motif only in Inclusion"] = only_in_inclusion
+ flanking_sequences["Motif only in Exclusion"] = only_in_exclusion
+ return flanking_sequences
+
+def identify_change_to_specific_motif(altered_flanks, elm_motif_name, elm_classes = None, modification_class = None, residues = None, dPSI_col = None):
+ if 'Altered Positions' not in altered_flanks.columns:
+ altered_flanks = compare_flanking_sequences(altered_flanks)
+
+ #grab elm motifs that match inclusion or Exclusion Flanking Sequences
+ if 'Motif only in Inclusion' not in altered_flanks.columns:
+ altered_flanks = compare_inclusion_motifs(altered_flanks, elm_classes = elm_classes)
+
+ #grab only needed info
+ motif_data = altered_flanks.dropna(subset = ['Inclusion Flanking Sequence', 'Exclusion Flanking Sequence'], how = 'all').copy()
+ cols_to_keep = ['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'Inclusion Flanking Sequence', 'Exclusion Flanking Sequence', 'Motif only in Inclusion', 'Motif only in Exclusion', 'Altered Positions', 'Residue Change']
+ if dPSI_col is not None:
+ cols_to_keep.append(dPSI_col)
+
+ #go through motif data and identify motifs matching elm motif of interest
+ motif_data = motif_data[cols_to_keep]
+ for i, row in motif_data.iterrows():
+ if row['Motif only in Inclusion'] == row['Motif only in Inclusion']:
+ if elm_motif_name in row['Motif only in Inclusion']:
+ motif_data.loc[i, 'Motif only in Inclusion'] = ';'.join([motif for motif in row['Motif only in Inclusion'].split(';') if elm_motif_name in motif])
+ else:
+ motif_data.loc[i, 'Motif only in Inclusion'] = np.nan
+
+ if row['Motif only in Exclusion'] == row['Motif only in Exclusion']:
+ if elm_motif_name in row['Motif only in Exclusion']:
+ motif_data.loc[i, 'Motif only in Exclusion'] = ';'.join([motif for motif in row['Motif only in Exclusion'].split(';') if elm_motif_name in motif])
+ else:
+ motif_data.loc[i, 'Motif only in Exclusion'] = np.nan
+
+ #restrict to events that are specific modification types or residues (for example, SH2 domain motifs should be phosphotyrosine)
+ motif_data = motif_data.dropna(subset = ['Motif only in Inclusion', 'Motif only in Exclusion'], how = 'all')
+ if modification_class is not None:
+ motif_data = motif_data[motif_data['Modification Class'].str.contains(modification_class)]
+
+ if residues is not None and isinstance(residues, str):
+ motif_data = motif_data[motif_data['Residue'] == residues]
+ elif residues is not None and isinstance(residues, list):
+ motif_data = motif_data[motif_data['Residue'].isin(residues)]
+ elif residues is not None:
+ raise ValueError('residues parameter must be a string or list of strings')
+
+ return motif_data
+
+
+
+
+
+def findAlteredPositions(seq1, seq2, flank_size = 5):
+ """
+ Given two sequences, identify the location of positions that have changed
+
+ Parameters
+ ----------
+ seq1, seq2: str
+ sequences to compare (altered positions do not depend on order, but residue changes are reported as seq1->seq2)
+ flank_size: int
+ size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length
+
+ Returns
+ -------
+ altered_positions: list
+ list of positions that have changed
+ residue_change: list
+ list of residues that have changed associated with that position
+ flank_side: str
+ indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
+ """
+ desired_seq_size = flank_size*2+1
+ altered_positions = []
+ residue_change = []
+ flank_side = []
+ seq_size = len(seq1)
+ flank_size = (seq_size -1)/2
+ if seq_size == len(seq2) and seq_size == desired_seq_size:
+ for i in range(seq_size):
+ if seq1[i] != seq2[i]:
+ altered_positions.append(i-(flank_size))
+ residue_change.append(f'{seq1[i]}->{seq2[i]}')
+ #check to see which side flanking sequence
+ altered_positions = np.array(altered_positions)
+ n_term = any(altered_positions < 0)
+ c_term = any(altered_positions > 0)
+ if n_term and c_term:
+ flank_side = 'Both'
+ elif n_term:
+ flank_side = 'N-term only'
+ elif c_term:
+ flank_side = 'C-term only'
+ else:
+ flank_side = 'Unclear'
+ return altered_positions, residue_change, flank_side
+ else:
+ return np.nan, np.nan, np.nan
+
+def getSequenceIdentity(seq1, seq2):
+ """
+ Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters defined by Pillman et al. BMC Bioinformatics 2011
+
+ Parameters
+ ----------
+ seq1, seq2: str
+ flanking sequence
+
+ Returns
+ -------
+ normalized_score: float
+ normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)
+ """
+ #make pairwise aligner object
+ aligner = PairwiseAligner()
+ #set parameters, with match score of 10 and mismatch score of -2
+ aligner.mode = 'global'
+ aligner.match_score = 10
+ aligner.mismatch_score = -2
+ #calculate sequence alignment score between two sequences
+ actual_similarity = aligner.align(seq1, seq2)[0].score
+ #calculate sequence alignment score between the same sequence
+ control_similarity = aligner.align(seq1, seq1)[0].score
+ #normalize score
+ normalized_score = actual_similarity/control_similarity
+ return normalized_score
+
+def find_motifs(seq, elm_classes):
+ """
+ Given a sequence and a dataframe containing ELM class information, identify motifs that can be found in the provided sequence using the regular expression provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of ELM (http://elm.eu.org/elms/elms_index.tsv).
+
+ Parameters
+ ----------
+ seq: str
+ sequence to search for motifs
+ elm_classes: pandas.DataFrame
+ DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)
+ """
+ matches = []
+ for j, elm_row in elm_classes.iterrows():
+ reg_ex = elm_row['Regex']
+ if re.search(reg_ex, seq) is not None:
+ matches.append(elm_row['ELMIdentifier'])
+
+ return matches
+
+
+class protein_interactions:
+ def __init__(self, spliced_ptms):
+ self.spliced_ptms = spliced_ptms
+
+
+ def get_interaction_network(self, node_type = 'Gene'):
+ if node_type not in ['Gene', 'PTM']:
+ raise ValueError("node_type parameter (which dictates whether to consider interactions at PTM or gene level) can be either Gene or PTM")
+
+ #extract interaction information in provided data
+ interactions = annotate.combine_interaction_data(self.spliced_ptms)
+ interactions['Residue'] = interactions['Residue'] + interactions['PTM Position in Canonical Isoform'].astype(int).astype(str)
+ interactions = interactions.drop(columns = ['PTM Position in Canonical Isoform'])
+
+ #add regulation change information
+ if 'dPSI' in self.spliced_ptms.columns:
+ interactions['Regulation Change'] = interactions.apply(lambda x: '+' if x['Type'] != 'DISRUPTS' and x['dPSI'] > 0 else '+' if x['Type'] == 'DISRUPTS' and x['dPSI'] < 0 else '-', axis = 1)
+ grouping_cols = ['Residue', 'Type', 'Source', 'dPSI', 'Regulation Change']
+ interactions['dPSI'] = interactions['dPSI'].apply(str)
+ else:
+ grouping_cols = ['Residue', 'Type', 'Source']
+
+ #extract gene_specific network information
+ if node_type == 'Gene':
+ network_data = interactions.groupby(['Modified Gene', 'Interacting Gene'], as_index = False)[grouping_cols].agg(helpers.join_unique_entries)
+ #generate network with all possible PTM-associated interactions
+ interaction_graph = nx.from_pandas_edgelist(network_data, source = 'Modified Gene', target = 'Interacting Gene')
+ else:
+ interactions['Spliced PTM'] = interactions['Modified Gene'] + '_' + interactions['Residue']
+ network_data = interactions.groupby(['Spliced PTM', 'Interacting Gene'], as_index = False)[grouping_cols].agg(helpers.join_unique_entries)
+ network_data = network_data.drop(columns = ['Residue'])
+
+ #generate network with all possible PTM-associated interactions
+ interaction_graph = nx.from_pandas_edgelist(network_data, source = 'Spliced PTM', target = 'Interacting Gene')
+
+ self.network_data = network_data
+ self.interaction_graph = interaction_graph
+
+
+ def get_interaction_stats(self):
+ """
+ Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network
+ """
+ #calculate network centrality measures
+ degree_centrality = nx.degree_centrality(self.interaction_graph)
+ closeness_centrality = nx.closeness_centrality(self.interaction_graph)
+ betweenness_centrality = nx.betweenness_centrality(self.interaction_graph)
+ network_stats = pd.DataFrame({'Degree': dict(self.interaction_graph.degree()), 'Degree Centrality':degree_centrality, 'Closeness':closeness_centrality,'Betweenness':betweenness_centrality})
+ self.network_stats = network_stats
+
+ def get_protein_interaction_network(self, protein):
+ """
+ Given a specific protein, return the network data for that protein
+
+ Parameters
+ ----------
+ protein: str
+ Gene name of the protein of interest
+
+ Returns
+ -------
+ protein_network: pd.DataFrame
+ Dataframe containing network data for the protein of interest
+ """
+ if not hasattr(self, 'network_data'):
+ self.get_interaction_network()
+
+ if protein not in self.network_data['Modified Gene'].unique():
+ print(f'{protein} is not found in the network data. Please provide a valid gene name.')
+ return None
+
+ protein_network = self.network_data[self.network_data['Modified Gene'] == protein]
+ protein_network = protein_network.drop(columns = ['Modified Gene'])
+ protein_network = protein_network.rename(columns = {'Residue': 'Spliced PTMs facilitating Interacting'})
+ return protein_network
+
+ def summarize_protein_network(self, protein):
+ """
+ Given a protein of interest, summarize the network data for that protein
+ """
+ if not hasattr(self, 'network_data'):
+ self.get_interaction_network()
+
+ if not hasattr(self, 'network_stats'):
+ self.get_interaction_stats()
+
+ protein_network = self.network_data[self.network_data['Modified Gene'] == protein]
+ increased_interactions = protein_network.loc[protein_network['Regulation Change'] == '+', 'Interacting Gene'].values
+ decreased_interactions = protein_network.loc[protein_network['Regulation Change'] == '-', 'Interacting Gene'].values
+ ambiguous_interactions = protein_network.loc[protein_network['Regulation Change'].str.contains(';'), 'Interacting Gene'].values
+
+ #print interactions
+ if len(increased_interactions) > 0:
+ print(f"Increased interaction likelihoods: {', '.join(increased_interactions)}")
+ if len(decreased_interactions) > 0:
+ print(f"Decreased interaction likelihoods: {', '.join(decreased_interactions)}")
+ if len(ambiguous_interactions) > 0:
+ print(f"Ambiguous interaction impact: {', '.join(ambiguous_interactions)}")
+
+ network_ranks = self.network_stats.rank(ascending = False).astype(int)
+ print(f'Number of interactions: {self.network_stats.loc[protein, "Degree"]} (Rank: {network_ranks.loc[protein, "Degree"]})')
+ print(f'Centrality measures - \t Degree = {self.network_stats.loc[protein, "Degree Centrality"]} (Rank: {network_ranks.loc[protein, "Degree Centrality"]})')
+ print(f' \t Betweenness = {self.network_stats.loc[protein, "Betweenness"]} (Rank: {network_ranks.loc[protein, "Betweenness"]})')
+ print(f' \t Closeness = {self.network_stats.loc[protein, "Closeness"]} (Rank: {network_ranks.loc[protein, "Closeness"]})')
+
+ def plot_interaction_network(self, modified_color = 'red', modified_node_size = 10, interacting_color = 'lightblue', interacting_node_size = 1, edgecolor = 'gray', seed = 200, ax = None, proteins_to_label = None, labelcolor = 'black'):
+ """
+ Given the interaction graph and network data outputted from analyze.get_interaction_network, plot the interaction network, signifying which proteins or PTMs are altered by splicing and the specific regulation change that occurs. By default, will only label proteins
+
+ Parameters
+ ----------
+ interaction_graph: nx.Graph
+ NetworkX graph object representing the interaction network, created from analyze.get_interaction_network
+ network_data: pd.DataFrame
+ Dataframe containing details about specific protein interactions (including which protein contains the spliced PTMs)
+ network_stats: pd.DataFrame
+ Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats(). Default is None, which will not label any proteins in the network.
+ """
+ if not hasattr(self, 'interaction_graph'):
+ self.get_interaction_network()
+
+ if not hasattr(self, 'network_stats'):
+ self.get_interaction_stats()
+
+ pose_plots.plot_interaction_network(self.interaction_graph, self.network_data, self.network_stats, modified_color = modified_color, modified_node_size = modified_node_size, interacting_color = interacting_color, interacting_node_size = interacting_node_size, edgecolor = edgecolor, seed = seed, ax = ax, proteins_to_label = proteins_to_label, labelcolor = labelcolor)
+
+ def plot_network_centrality(self, centrality_measure = 'Degree', top_N = 10, modified_color = 'red', interacting_color = 'black', ax = None):
+ if not hasattr(self, 'interaction_graph'):
+ self.get_interaction_network()
+ if not hasattr(self, 'network_stats'):
+ self.get_interaction_stats()
+
+ pose_plots.plot_network_centrality(self.network_stats, self.network_data, centrality_measure=centrality_measure,top_N = top_N, modified_color = modified_color, interacting_color = interacting_color, ax = ax)
+
+def edit_sequence_for_kinase_library(seq):
+ """
+ Convert flanking sequence to version accepted by kinase library (modified residue denoted by asterisk)
+ """
+ if seq == seq:
+ seq = seq.replace('t','t*')
+ seq = seq.replace('s','s*')
+ seq = seq.replace('y','y*')
+ else:
+ return np.nan
+ return seq
+
+
+class KL_flank_analysis:
+ def __init__(self, altered_flanks, odir):
+ self.altered_flanks = altered_flanks
+ self.odir = odir
+
+ def identify_sequences_of_interest(self):
+ self.sequences_of_interest = self.altered_flanks[(~self.altered_flanks['Matched']) & (~self.altered_flanks['Stop Codon Introduced']) & (self.altered_flanks['Modification Class'].str.contains('Phosphorylation'))].copy()
+
+
+ def process_data_for_kinase_library(self):
+ """
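+ Extract flanking sequence information for the sequences of interest and write the inclusion and exclusion flanking sequences to text files that can be uploaded to Kinase Library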
+ """
+ #restrict to events with changed flanking sequences, no introduced stop codons, and phosphorylation modifications
+ if not hasattr(self, 'sequences_of_interest'):
+ self.identify_sequences_of_interest()
+
+ #generate files to input into Kinase Library (inclusion first then exclusion)
+ inclusion_sequences = self.sequences_of_interest[['PTM', 'Inclusion Flanking Sequence']].drop_duplicates()
+ inclusion_sequences['Inclusion Flanking Sequence'] = inclusion_sequences['Inclusion Flanking Sequence'].apply(edit_sequence_for_kinase_library)
+ inclusion_sequences = inclusion_sequences.dropna(subset = 'Inclusion Flanking Sequence')
+ #write sequences to text file
+ with open(self.odir + 'inclusion_sequences_input.txt', 'w') as f:
+ for _, row in inclusion_sequences.iterrows():
+ f.write(row['Inclusion Flanking Sequence']+'\n')
+
+ exclusion_sequences = self.sequences_of_interest[['PTM', 'Exclusion Flanking Sequence']].drop_duplicates()
+ exclusion_sequences['Exclusion Flanking Sequence'] = exclusion_sequences['Exclusion Flanking Sequence'].apply(edit_sequence_for_kinase_library)
+ exclusion_sequences = exclusion_sequences.dropna(subset = 'Exclusion Flanking Sequence')
+ #write sequences to text file
+ with open(self.odir + 'exclusion_sequences_input.txt', 'w') as f:
+ for _, row in exclusion_sequences.iterrows():
+ f.write(row['Exclusion Flanking Sequence']+'\n')
+
+ print('Input files for Kinase Library generated. Please upload each file to the "score sites" tab of Kinase Library (https://kinase-library.mit.edu/sites) and download the full results.')
+
+ def format_sequences_to_match_output(self, sequence_type = 'Inclusion'):
+ if not hasattr(self, 'sequences_of_interest'):
+ self.identify_sequences_of_interest()
+
+ sequences = self.sequences_of_interest[['Region ID','PTM', f'{sequence_type} Flanking Sequence']].drop_duplicates().copy()
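+ #drop rows missing the requested flanking sequence (the column name depends on sequence_type)
+ sequences = sequences.dropna(subset = f'{sequence_type} Flanking Sequence')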
+ sequences['Label'] = sequences['Region ID'] + ';' + sequences['PTM']
+ sequences[f'{sequence_type} Flanking Sequence'] = sequences[f'{sequence_type} Flanking Sequence'].apply(lambda x: x.upper().replace(' ', '_')+'_')
+ return sequences
+
+ def process_kinase_library_output(self, scores, sequence_type = 'Inclusion'):
+ """
+ Process output from Kinase Library to connect kinase library scores back to the PTMs in the altered flanks dataframe
+
+ Parameters
+ ----------
+ scores: pd.DataFrame
+ Dataframe with kinase library scores for flanking sequences (loaded from downloaded .tsv outputs from kinase library)
+ sequence_type: str
+ Type of flanking sequence to match with the kinase library scores. Default is 'Inclusion'. Can also be 'Exclusion'
+
+ Returns
+ -------
+ percentiles_y: pd.DataFrame
+ Dataframe with kinase library scores for tyrosine sites
+ percentiles_st: pd.DataFrame
+ Dataframe with kinase library scores for serine/threonine sites
+
+ """
+ #restrict to events with changed flanking sequences, no introduced stop codons, and phosphorylation modifications
+ if not hasattr(self, 'sequences_of_interest'):
+ self.identify_sequences_of_interest()
+
+ sequences = self.format_sequences_to_match_output(sequence_type = sequence_type)
+
+
+ sequences = sequences.merge(scores, left_on = f'{sequence_type} Flanking Sequence', right_on = 'sequence', how = 'left')
+ #split info into tyrosine vs. serine/threonine
+ sequences_y = sequences[sequences['Label'].str.contains('_Y')]
+ sequences_st = sequences[(sequences['Label'].str.contains('_S')) | (sequences['Label'].str.contains('_T'))]
+
+ #pivot table to get scores for each kinase
+ percentiles_y = sequences_y.pivot_table(index = 'Label', columns = 'kinase', values = 'site_percentile')
+ percentiles_st = sequences_st.pivot_table(index = 'Label', columns = 'kinase', values = 'site_percentile')
+
+ return percentiles_y, percentiles_st
+
+ def get_kinase_library_differences(self, inclusion_scores_file, exclusion_scores_file):
+ """
+ Given altered flanking sequences and kinase library scores for inclusion and exclusion flanking sequences, calculate the difference in kinase library site percentiles between the two (inclusion minus exclusion). Results are not returned; they are stored in the inclusion_percentiles, exclusion_percentiles, and percentile_difference attributes, each a dictionary keyed by 'Y' (tyrosine sites) and 'ST' (serine/threonine sites)
+
+ Parameters
+ ----------
+ inclusion_scores_file: str
+ Path to the tab-separated file with kinase library scores for inclusion flanking sequences (downloaded .tsv output from kinase library)
+ exclusion_scores_file: str
+ Path to the tab-separated file with kinase library scores for exclusion flanking sequences (downloaded .tsv output from kinase library)
+ """
+ inclusion_scores = pd.read_csv(inclusion_scores_file, sep = '\t')
+ inclusion_percentiles_y, inclusion_percentiles_st = self.process_kinase_library_output(inclusion_scores, sequence_type = 'Inclusion')
+ exclusion_scores = pd.read_csv(exclusion_scores_file, sep = '\t')
+ exclusion_percentiles_y, exclusion_percentiles_st = self.process_kinase_library_output(exclusion_scores, sequence_type = 'Exclusion')
+
+ #calculate the difference in percentiles
+ labels= list(set(inclusion_percentiles_y.index).intersection(exclusion_percentiles_y.index))
+ percentiles_diff_y = inclusion_percentiles_y.loc[labels].copy()
+ percentiles_diff_y = percentiles_diff_y[exclusion_percentiles_y.columns]
+ for i, row in percentiles_diff_y.iterrows():
+ percentiles_diff_y.loc[i] = row - exclusion_percentiles_y.loc[i]
+
+ labels= list(set(inclusion_percentiles_st.index).intersection(exclusion_percentiles_st.index))
+ percentiles_diff_st = inclusion_percentiles_st.loc[labels].copy()
+ percentiles_diff_st = percentiles_diff_st[exclusion_percentiles_st.columns]
+ for i, row in percentiles_diff_st.iterrows():
+ percentiles_diff_st.loc[i] = row - exclusion_percentiles_st.loc[i]
+
+ #save all data
+ self.inclusion_percentiles = {}
+ self.inclusion_percentiles['Y'] = inclusion_percentiles_y
+ self.inclusion_percentiles['ST'] = inclusion_percentiles_st
+
+ self.exclusion_percentiles = {}
+ self.exclusion_percentiles['Y'] = exclusion_percentiles_y
+ self.exclusion_percentiles['ST'] = exclusion_percentiles_st
+
+ self.percentile_difference = {}
+ self.percentile_difference['Y'] = percentiles_diff_y
+ self.percentile_difference['ST'] = percentiles_diff_st
+
+
+#def process_data_for_exon_ontology(odir, spliced_ptms = None, altered_flanks = None):
+# pass
+
+
+
+
+
+class kstar_enrichment:
+ def __init__(self, significant_ptms, network_dir, background_ptms = None, phospho_type = 'Y'):
+ """
+ Given spliced PTMs or PTMs with altered flanks and a single KSTAR network, get enrichment for each kinase in the network using a hypergeometric test. Assumes the data has already been reduced to the modification of interest (phosphotyrosine or phosphoserine/threonine)
+
+ Parameters
+ ----------
+ significant_ptms : pandas dataframe
+ all PTMs of interest
+ network_dir : str
+ path to the directory containing the networks with kinase-substrate information
+ background_ptms: pd.DataFrame
+ PTMs to consider as the background for enrichment purposes, which should overlap with the spliced ptms information provided (an example might be all identified events, whether or not they are significant). If not provided, will use all ptms in the phosphoproteome.
+ phospho_type : str
+ type of phosphorylation event to extract. Can be either phosphotyrosine ('Y') or phosphoserine/threonine ('ST'). Default is 'Y'.
+
+ """
+ #process ptms to only include specific phosphorylation data needed
+ self.significant_ptms = self.process_ptms(significant_ptms, phospho_type = phospho_type)
+ if background_ptms is not None:
+ self.background_ptms = self.process_ptms(background_ptms, phospho_type=phospho_type)
+ else:
+ background_ptms = pose_config.ptm_coordinates.copy()
+ self.background_ptms = self.process_ptms(background_ptms, phospho_type = phospho_type)
+
+ #check if file exists and whether a pickle has been generated: if not, load each network file individually
+ if not os.path.exists(network_dir):
+ raise ValueError('Network directory not found')
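+ elif os.path.exists(f"{network_dir}/network_{phospho_type}.p"):
+ #os.path.exists does not expand wildcards, so check for the specific pickle that is loaded
+ with open(f"{network_dir}/network_{phospho_type}.p", "rb") as f:
+ networks = pickle.load(f)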
+ else:
+ network_directory = network_dir + f'/{phospho_type}/INDIVIDUAL_NETWORKS/'
+ networks = {}
+ for file in os.listdir(network_directory):
+ if file.endswith('.tsv'):
+ #get the value of the network number
+ #remove the extension with splitext (str.strip removes a set of characters, not a suffix)
+ file_noext = os.path.splitext(file)[0].split('_')
+ key_name = 'nkin'+str(file_noext[1])
+ #print("Debug: key name is %s"%(key_name))
+ networks[key_name] = pd.read_csv(f"{network_directory}{file}", sep='\t')
+
+ #save info
+ self.networks = networks
+ self.phospho_type = phospho_type
+ self.median_enrichment = None
+
+ def process_ptms(self, ptms, phospho_type = 'Y'):
+ """
+ Given ptm information, restrict data to include only the phosphorylation type of interest and add a PTM column for matching information from KSTAR
+
+ Parameters
+ ----------
+ ptms: pd.DataFrame
+ ptm information containing modification type and ptm location information, such as the output from projection or altered flanking sequence analysis
+ phospho_type: str
+ type of phosphorylation event to extract. Can be either phosphotyrosine ('Y') or phosphoserine/threonine ('ST')
+
+ Returns
+ -------
+ ptms: pd.DataFrame
+ trimmed dataframe containing only modifications of interest and new 'PTM' column
+ """
+
+ #restrict ptms to phosphorylation type of interest
+ if phospho_type == 'Y':
+ ptms = ptms[ptms['Modification'].str.contains('Phosphotyrosine')].copy()
+ elif phospho_type == 'ST':
+ ptms = ptms[(ptms["Modification"].str.contains('Phosphoserine')) | (ptms['Modification'].str.contains('Phosphothreonine'))].copy()
+
+ #construct PTM column that matches KSTAR information
+ ptms['PTM'] = ptms['UniProtKB Accession'] + '_' + ptms['Residue'] + ptms['PTM Position in Canonical Isoform'].astype(int).astype(str)
+
+ #filter out any PTMs that come from alternative isoforms
+ ptms = ptms[~ptms['UniProtKB Accession'].str.contains('-')]
+ return ptms
+
+
+ def get_enrichment_single_network(self, network_key):
+ """
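+ Calculate enrichment of each kinase's substrates among the significant PTMs for a single network, using a hypergeometric test against the background PTMs (in progress)
+
+ Parameters
+ ----------
+ network_key : str
+ key of the network in self.networks to test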
+ """
+ network = self.networks[network_key]
+ network['PTM'] = network['KSTAR_ACCESSION'] + '_' + network['KSTAR_SITE']
+
+ #add network information to all significant data
+ sig_ptms = self.significant_ptms[['PTM']].drop_duplicates()
+ sig_ptms_kstar = sig_ptms.merge(network[['KSTAR_KINASE','PTM']], on = 'PTM')
+
+ #repeat for background data
+ bg_ptms = self.background_ptms[['PTM']].drop_duplicates()
+ bg_ptms_kstar = bg_ptms.merge(network[['KSTAR_KINASE','PTM']], on = 'PTM')
+
+ results = pd.DataFrame(np.nan, index = sig_ptms_kstar['KSTAR_KINASE'].unique(), columns = ['k','n','M','N','p'])
+ for kinase in sig_ptms_kstar['KSTAR_KINASE'].unique():
+ #get numbers for a hypergeometric test to look for enrichment of kinase substrates
+ k = sig_ptms_kstar.loc[sig_ptms_kstar['KSTAR_KINASE'] == kinase, 'PTM'].nunique()
+ n = bg_ptms_kstar.loc[bg_ptms_kstar['KSTAR_KINASE'] == kinase, 'PTM'].nunique()
+ M = bg_ptms['PTM'].nunique()
+ N = sig_ptms_kstar['PTM'].nunique()
+
+ #run hypergeometric test
+ results.loc[kinase,'p'] = stat_utils.hypergeom(M,n,N,k)
+ results.loc[kinase, 'M'] = M
+ results.loc[kinase, 'N'] = N
+ results.loc[kinase, 'k'] = k
+ results.loc[kinase, 'n'] = n
+
+ return results
+
+ def get_enrichment_all_networks(self):
+ """
+ Get enrichment for each kinase in each of the loaded kstar networks. Assumes the data has already been reduced to the modification of interest (phosphotyrosine or phosphoserine/threonine)
+
+ Returns
+ -------
+ results : dict
+ dictionary mapping each network key to the enrichment results from get_enrichment_single_network
+ """
+ results = {}
+ for network in self.networks:
+ results[network] = self.get_enrichment_single_network(network_key=network)
+ return results
+
+ def extract_enrichment(self, results):
+ """
+ Given a dictionary of results from get_enrichment_all_networks, extract the p-values for each network and kinase, and then calculate the median p-value across all networks for each kinase
+
+ Parameters
+ ----------
+ results : dict
+ dictionary of results from get_enrichment_all_networks
+ """
+ enrichment = pd.DataFrame(index = results['nkin0'].index, columns = results.keys())
+ for network in results:
+ enrichment[network] = results[network]['p']
+ enrichment['median'] = enrichment.median(axis = 1)
+ return enrichment
+
+ def run_kstar_enrichment(self):
+ """
+ Run full kstar analysis to generate substrate enrichment across each of the 50 KSTAR networks and calculate the median p-value for each kinase across all networks
+ """
+ results = self.get_enrichment_all_networks()
+ enrichment = self.extract_enrichment(results)
+ self.enrichment_all = enrichment
+ self.median_enrichment = enrichment['median']
+
+ def return_enriched_kinases(self, alpha = 0.05):
+ """
+ Return kinases with a median p-value less than the provided alpha value (substrates are enriched among the significant PTMs)
+
+ Parameters
+ ----------
+ alpha : float
+ significance threshold to use to subset kinases. Default is 0.05.
+ """
+ if self.median_enrichment is None:
+ self.run_kstar_enrichment()
+ return self.median_enrichment[self.median_enrichment < alpha].index.values
+
+
+
+
+import pandas as pd
+import numpy as np
+import re
+import os
+
+from ptm_pose import pose_config, helpers
+
+
+#dictionaries for converting modification codes to modification names in PhosphoSitePlus data
+mod_shorthand_dict = {'p': 'Phosphorylation', 'ca':'Caspase Cleavage', 'hy':'Hydroxylation', 'sn':'S-Nitrosylation', 'ng':'Glycosylation', 'ub': 'Ubiquitination', 'pa': "Palmitoylation",'ne':'Neddylation','sc':'Succinylation', 'sm': 'Sumoylation', 'ga': 'Glycosylation', 'gl': 'Glycosylation', 'ac': 'Acetylation', 'me':'Methylation', 'm1':'Methylation', 'm2': 'Dimethylation', 'm3':'Trimethylation'}
+residue_dict = {'P': 'proline', 'Y':'tyrosine', 'S':'serine', 'T':'threonine', 'H':'histidine', 'D':'aspartic acid', 'I':'isoleucine', 'K':'lysine', 'R':'arginine', 'G':'glycine', 'N':'asparagine', 'M':'methionine'}
+annotation_col_dict = {'PhosphoSitePlus':{'Function':'PSP:ON_FUNCTION', 'Process':'PSP:ON_PROCESS', 'Interactions':'PSP:ON_PROT_INTERACT', 'Disease':'PSP:Disease_Association', 'Kinase':'PSP:Kinase','Perturbation':'PTMsigDB:PSP-PERT'},
+ 'ELM':{'Interactions':'ELM:Interactions', 'Motif Match':'ELM:Motif Matches'},
+ 'PTMcode':{'Intraprotein':'PTMcode:Intraprotein_Interactions', 'Interactions':'PTMcode:Interprotein_Interactions'},
+ 'PTMInt':{'Interactions':'PTMInt:Interactions'},
+ 'RegPhos':{'Kinase':'RegPhos:Kinase'},
+ 'DEPOD':{'Phosphatase':'DEPOD:Phosphatase'},
+ 'PTMsigDB': {'WikiPathway':'PTMsigDB:PATH-WP', 'NetPath':'PTMsigDB:PATH-NP','mSigDB':'PTMsigDB:PATH-BI', 'Pertubation (DIA2)':'PTMsigDB:PERT-P100-DIA2', 'Perturbation (DIA)': 'PTMsigDB:PERT-P100-DIA', 'Perturbation (PRM)':'PTMsigDB:PERT-P100-PRM', 'Kinase':'PTMsigDB:Kinase-iKiP'}}
+
+
+
+def add_custom_annotation(spliced_ptms, annotation_data, source_name, annotation_type, annotation_col, accession_col = 'UniProtKB Accession', residue_col = 'Residue', position_col = 'PTM Position in Canonical Isoform'):
+ """
+ Add custom annotation data to spliced_ptms or altered flanking sequence dataframes
+
+ Parameters
+ ----------
+ annotation_data: pandas.DataFrame
+ Dataframe containing the annotation data to be added to the spliced_ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Canonical Isoform, and the annotation data to be added
+ source_name: str
+ Name of the source of the annotation data, will be used to label the columns in the spliced_ptms dataframe
+ annotation_type: str
+ Type of annotation data being added, will be used to label the columns in the spliced_ptms dataframe
+ annotation_col: str
+ Column name in the annotation data that contains the annotation data to be added to the spliced_ptms dataframe
+
+
+ Returns
+ -------
+ spliced_ptms: pandas.DataFrame
+ Contains the PTMs identified across the different splice events with an additional column for the custom annotation data
+ """
+ #check if annotation data contains the annotation col
+ if isinstance(annotation_col, str):
+ if annotation_col not in annotation_data.columns:
+ raise ValueError(f'Could not find column indicated to contain {annotation_col} in annotation data. Please either change the name of your annotation data column with this information or indicate the correct column name with the annotation_col parameter')
+ else:
+ #make annotation col name based on source and annotation type
+ annotation_col_name = source_name + ':' + annotation_type
+ annotation_data = annotation_data.rename(columns = {annotation_col: annotation_col_name})
+ else:
+ raise ValueError('annotation_col must be a string indicating column with annotation data to be added to the spliced_ptms dataframe')
+
+ #check to make sure annotation data has the necessary columns
+ if not all([x in annotation_data.columns for x in [accession_col, residue_col, position_col]]):
+ raise ValueError(f'Could not find columns containing ptm information: {accession_col}, {residue_col}, and {position_col}. Please either change the name of your annotation data columns containing this information or indicate the correct column names with the accession_col, residue_col, and position_col parameters')
+
+ #if splice data already has the annotation columns, remove them
+ if annotation_col_name in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = [annotation_col_name])
+
+ #add to splice data
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(annotation_data, how = 'left', left_on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'], right_on = [accession_col, residue_col, position_col])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataframe size has changed, check for duplicates in spliced ptms or annotation dataframe')
+
+ #report the number of PTMs identified
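+ #annotation_col was renamed to annotation_col_name above, so count on the renamed column; include position so distinct sites on the same residue type are counted separately
+ num_ptms_with_custom_data = spliced_ptms.dropna(subset = annotation_col_name).groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]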
+ print(f"{source_name} {annotation_type} data added: {num_ptms_with_custom_data} PTMs in dataset found with {source_name} {annotation_type} information")
+
+ return spliced_ptms
+
+def add_PSP_regulatory_site_data(spliced_ptms, file = 'Regulatory_sites.gz', report_success = True):
+ """
+ Add functional information from PhosphoSitePlus (Regulatory_sites.gz) to spliced_ptms dataframe from project_ptms_onto_splice_events() function
+
+ Parameters
+ ----------
+ file: str
+ Path to the PhosphoSitePlus Regulatory_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
+
+ Returns
+ -------
+ spliced_ptms: pandas.DataFrame
+ Contains the PTMs identified across the different splice events with additional columns for regulatory site information, including domains, biological process, functions, and protein interactions associated with the PTMs
+ """
+ #check to make sure file exists
+ check_file(file, expected_extension='.gz')
+
+ #read in the regulatory site data and add to spliced ptm info
+ regulatory_site_data = pd.read_csv(file, sep = '\t', header = 2, on_bad_lines='skip',compression = 'gzip')
+ regulatory_site_data = regulatory_site_data.rename(columns = {'ACC_ID':'UniProtKB Accession'})
+ #drop extra modification information that is not needed
+ regulatory_site_data['Residue'] = regulatory_site_data['MOD_RSD'].apply(lambda x: x.split('-')[0][0])
+ regulatory_site_data['PTM Position in Canonical Isoform'] = regulatory_site_data['MOD_RSD'].apply(lambda x: int(x.split('-')[0][1:]))
+ #add modification type
+ regulatory_site_data['Modification Class'] = regulatory_site_data['MOD_RSD'].apply(lambda x: mod_shorthand_dict[x.split('-')[1]])
+
+ #restrict to human data
+ regulatory_site_data = regulatory_site_data[regulatory_site_data['ORGANISM'] == 'human']
+ regulatory_site_data = regulatory_site_data[['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'ON_PROCESS', 'ON_PROT_INTERACT', 'ON_OTHER_INTERACT', 'ON_FUNCTION']].drop_duplicates()
+
+ #group like modifications into a single column
+ regulatory_site_data = regulatory_site_data.groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).agg(lambda x: '; '.join([y for y in x if y == y])).reset_index()
+ regulatory_site_data = regulatory_site_data.replace('', np.nan)
+
+ #add 'PSP:' in front of each column
+ regulatory_site_data.columns = ['PSP:' + x if x not in ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'] else x for x in regulatory_site_data.columns]
+
+ #if splice data already has the annotation columns, remove them
+ if 'PSP:ON_FUNCTION' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['PSP:ON_FUNCTION', 'PSP:ON_PROCESS', 'PSP:ON_PROT_INTERACT', 'PSP:ON_OTHER_INTERACT'])
+
+ #explode dataframe on modifications
+ if spliced_ptms['Modification Class'].str.contains(';').any():
+ spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+ spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+
+ #merge with spliced_ptm info
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(regulatory_site_data, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataset size changed upon merge, please make sure there are no duplicates in spliced ptms data')
+
+
+ #report the number of ptms with regulatory site data
+ if report_success:
+ num_ptms_with_known_function = spliced_ptms.dropna(subset = 'PSP:ON_FUNCTION').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_known_process = spliced_ptms.dropna(subset = 'PSP:ON_PROCESS').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_known_interaction = spliced_ptms.dropna(subset = 'PSP:ON_PROT_INTERACT').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ print(f"PhosphoSitePlus regulatory_site information added:\n\t ->{num_ptms_with_known_function} PTMs in dataset found associated with a molecular function \n\t ->{num_ptms_with_known_process} PTMs in dataset found associated with a biological process\n\t ->{num_ptms_with_known_interaction} PTMs in dataset found associated with a protein interaction")
+ return spliced_ptms
+
+def add_PSP_kinase_substrate_data(spliced_ptms, file = 'Kinase_Substrate_Dataset.gz', report_success = True):
+ """
+ Add kinase substrate data from PhosphoSitePlus (Kinase_Substrate_Dataset.gz) to spliced_ptms dataframe from project_ptms_onto_splice_events() function
+
+ Parameters
+ ----------
+ file: str
+ Path to the PhosphoSitePlus Kinase_Substrate_Dataset.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
+
+ Returns
+ -------
+ spliced_ptms: pandas.DataFrame
+ Contains the PTMs identified across the different splice events with an additional column indicating the kinases known to phosphorylate that site (not relevant to non-phosphorylation PTMs)
+
+ """
+ #check to make sure provided file exists
+ check_file(file, expected_extension='.gz')
+
+ #load data
+ ks_dataset = pd.read_csv(file, sep = '\t', header = 2, on_bad_lines='skip',compression = 'gzip', encoding = "cp1252")
+ #restrict to human data
+ ks_dataset = ks_dataset[ks_dataset['KIN_ORGANISM'] == 'human']
+ ks_dataset = ks_dataset[ks_dataset['SUB_ORGANISM'] == 'human']
+
+ ks_dataset = ks_dataset[['GENE', 'SUB_ACC_ID', 'SUB_MOD_RSD']].groupby(['SUB_ACC_ID', 'SUB_MOD_RSD']).agg(';'.join).reset_index()
+ ks_dataset.columns = ['UniProtKB Accession', 'Residue', 'PSP:Kinase']
+
+ #separate residue and position
+ ks_dataset['PTM Position in Canonical Isoform'] = ks_dataset['Residue'].apply(lambda x: int(x[1:]))
+ ks_dataset['Residue'] = ks_dataset['Residue'].apply(lambda x: x[0])
+
+
+ #if splice data already has the annotation columns, remove them
+ if 'PSP:Kinase' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['PSP:Kinase'])
+
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(ks_dataset, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataset size changed upon merge, please make sure there are no duplicates in spliced ptms data')
+
+
+ #report the number of ptms with kinase substrate information
+ if report_success:
+ num_ptms_with_KS = spliced_ptms.dropna(subset = 'PSP:Kinase').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ print(f"PhosphoSitePlus kinase-substrate interactions added: {num_ptms_with_KS} phosphorylation sites in dataset found associated with a kinase in PhosphoSitePlus")
+ return spliced_ptms
+
+def add_PSP_disease_association(spliced_ptms, file = 'Disease-associated_sites.gz', report_success = True):
+ """
+ Process disease association data from PhosphoSitePlus (Disease-associated_sites.gz) and add it to the spliced_ptms dataframe from the project_ptms_onto_splice_events() function
+
+ Parameters
+ ----------
+ file: str
+ Path to the PhosphoSitePlus Disease-associated_sites.gz file. Should be downloaded from PhosphoSitePlus in the gzipped format
+
+ Returns
+ -------
+ spliced_ptms: pandas.DataFrame
+ Contains the PTMs identified across the different splice events with an additional column (PSP:Disease_Association) listing the diseases associated with that site, formatted as 'Disease->Alteration' when an alteration is reported and separated by semicolons
+
+ """
+ #check to make sure provided file exists
+ check_file(file, expected_extension='.gz')
+
+ #load data
+ disease_associated_sites = pd.read_csv(file, sep = '\t', header = 2, on_bad_lines='skip',compression = 'gzip')
+ disease_associated_sites = disease_associated_sites[disease_associated_sites['ORGANISM'] == 'human']
+
+ #removes sites without a specific disease annotation
+ disease_associated_sites = disease_associated_sites.dropna(subset = ['DISEASE'])
+
+ #drop extra modification information that is not needed
+ disease_associated_sites['Residue'] = disease_associated_sites['MOD_RSD'].apply(lambda x: x.split('-')[0][0])
+ disease_associated_sites['PTM Position in Canonical Isoform'] = disease_associated_sites['MOD_RSD'].apply(lambda x: int(x.split('-')[0][1:]))
+ #add modification type
+ disease_associated_sites['Modification Class'] = disease_associated_sites['MOD_RSD'].apply(lambda x: mod_shorthand_dict[x.split('-')[1]])
+ #if phosphorylation, add specific residue
+ disease_associated_sites['Modification Class'] = disease_associated_sites.apply(lambda x: x['Modification Class'] + residue_dict[x['Residue'][0]] if x['Modification Class'] == 'Phospho' else x['Modification Class'], axis = 1)
+ #change O-GalNac occurring on N to N-glycosylation
+ disease_associated_sites['Modification Class'] = disease_associated_sites.apply(lambda x: 'N-Glycosylation' if x['Modification Class'] == 'O-Glycosylation' and x['Residue'][0] == 'N' else x['Modification Class'], axis = 1)
+
+
+ #combine disease and alteration
+ disease_associated_sites['ALTERATION'] = disease_associated_sites.apply(lambda x: x['DISEASE']+'->'+x['ALTERATION'] if x['ALTERATION'] == x['ALTERATION'] else x['DISEASE'], axis = 1)
+ #grab only necessary columns and rename
+ disease_associated_sites = disease_associated_sites[['ACC_ID', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'ALTERATION']]
+ disease_associated_sites.columns = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'PSP:Disease_Association']
+
+ #aggregate multiple disease associations
+ disease_associated_sites = disease_associated_sites.groupby(['UniProtKB Accession', 'Residue','PTM Position in Canonical Isoform', 'Modification Class']).agg(';'.join).reset_index()
+
+ #if splice data already has the annotation columns, remove them
+ if 'PSP:Disease_Association' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['PSP:Disease_Association'])
+
+ #explode dataframe on modifications
+ if spliced_ptms['Modification Class'].str.contains(';').any():
+ spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+ spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+
+
+ #merge with spliced_ptm info
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(disease_associated_sites, how = 'left', on = ['UniProtKB Accession', 'Residue','PTM Position in Canonical Isoform', 'Modification Class'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataset size changed upon merge, please make sure there are no duplicates in spliced ptms data')
+
+ #report the number of ptms with disease association data
+ if report_success:
+ num_ptms_with_disease = spliced_ptms.dropna(subset = 'PSP:Disease_Association').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ print(f"PhosphoSitePlus disease associations added: {num_ptms_with_disease} PTM sites in dataset found associated with a disease in PhosphoSitePlus")
+
+
+ return spliced_ptms
+
+
+def add_ELM_interactions(spliced_ptms, file = None, report_success =True):
+ """
+ Given a spliced ptms dataframe from the project module, add ELM interaction data to the dataframe
+ """
+ #load data
+ if file is None:
+ elm_interactions = pd.read_csv('http://elm.eu.org/interactions/as_tsv', sep = '\t', header = 0)
+ else:
+ check_file(file, expected_extension='.tsv')
+ elm_interactions = pd.read_csv(file, sep = '\t', header = 0)
+
+ elm_interactions = elm_interactions[(elm_interactions['taxonomyElm'] == '9606(Homo sapiens)') & (elm_interactions['taxonomyDomain'] == '9606(Homo sapiens)')]
+
+ elm_list = []
+ elm_type = []
+ elm_interactor = []
+ for i, row in spliced_ptms.iterrows():
+ #grab ptm position from the position column and convert to int (None if the position is missing or recorded as 'none')
+ ptm_loc = int(row['PTM Position in Canonical Isoform']) if row['PTM Position in Canonical Isoform'] == row['PTM Position in Canonical Isoform'] and row['PTM Position in Canonical Isoform'] != 'none' else None
+
+ #if data does not have position information, move to the next
+ if ptm_loc is None:
+ elm_list.append(np.nan)
+ elm_type.append(np.nan)
+ elm_interactor.append(np.nan)
+ continue
+
+ #find if any of the linear motifs match ptm loc
+ protein_match = row['UniProtKB Accession'] == elm_interactions['interactorElm']
+ region_match = (ptm_loc >= elm_interactions['StartElm']) & (ptm_loc <=elm_interactions['StopElm'])
+ elm_subset_motif = elm_interactions[protein_match & region_match]
+ #if any interactions were found, record and continue to the next (assumes a single ptm won't be found as both a SLiM and domain)
+ if elm_subset_motif.shape[0] > 0:
+ elm_list.append(';'.join(elm_subset_motif['Elm'].values))
+ elm_type.append('SLiM')
+ elm_interactor.append(';'.join(elm_subset_motif['interactorDomain'].values))
+ continue
+
+
+ #domain
+ protein_match = row['UniProtKB Accession'] == elm_interactions['interactorDomain']
+ region_match = (ptm_loc >= elm_interactions['StartDomain']) & (ptm_loc <=elm_interactions['StopDomain'])
+ elm_subset_domain = elm_interactions[protein_match & region_match]
+ #if any interactions were found, record and continue to the next (assumes a single ptm won't be found as both a SLiM and domain)
+ if elm_subset_domain.shape[0] > 0:
+ elm_list.append(';'.join(elm_subset_domain['Elm'].values))
+ elm_type.append('Domain')
+ elm_interactor.append(';'.join(elm_subset_domain['interactorElm'].values))
+ continue
+
+ #if no interactions were found, record as np.nan
+ elm_list.append(np.nan)
+ elm_type.append(np.nan)
+ elm_interactor.append(np.nan)
+
+ spliced_ptms['ELM:Interactions'] = elm_interactor
+ spliced_ptms['ELM:Location of PTM for Interaction'] = elm_type
+ spliced_ptms['ELM:Motifs Associated with Interactions'] = elm_list
+
+ #report the number of ptms with motif data
+ if report_success:
+ num_ptms_with_ELM_instance = spliced_ptms.dropna(subset = 'ELM:Interactions').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]
+ print(f"ELM interaction instances added: {num_ptms_with_ELM_instance} PTMs in dataset found associated with at least one known ELM instance")
+ return spliced_ptms
+
+
+def add_ELM_matched_motifs(spliced_ptms, flank_size = 7, file = None, report_success = True):
+ if file is None:
+ elm_classes = pd.read_csv('http://elm.eu.org/elms/elms_index.tsv', sep = '\t', header = 5)
+ else:
+ check_file(file, expected_extension='.tsv')
+ elm_classes = pd.read_csv(file, sep = '\t', header = 5)
+
+ ptm_coordinates = pose_config.ptm_coordinates.copy()
+ #create corresponding label for ptm_coordinate data
+ ptm_coordinates['PTM Label'] = ptm_coordinates['UniProtKB Accession'] + '_' + ptm_coordinates['Residue'] + ptm_coordinates['PTM Position in Canonical Isoform'].apply(lambda x: int(float(x)) if x == x else np.nan).astype(str)
+
+ match_list = []
+ for i, row in spliced_ptms.iterrows():
+ matches = []
+ #grab ptm information
+ #grab flanking sequence for the ptm
+ loc = int(row["PTM Position in Canonical Isoform"]) if row['PTM Position in Canonical Isoform'] == row['PTM Position in Canonical Isoform'] else np.nan
+ ptm = row['UniProtKB Accession'] + '_' + row['Residue'] + str(loc)
+
+
+ if ptm in ptm_coordinates['PTM Label'].values:
+ ptm_flanking_seq = ptm_coordinates.loc[ptm_coordinates['PTM Label'] == ptm, 'Expected Flanking Sequence'].values[0]
+ #make sure flanking sequence is present
+ if isinstance(ptm_flanking_seq, str):
+
+ #flanking sequences are stored with a flank size of 10; trim if a smaller flank size is requested
+ if flank_size > 10:
+ raise ValueError('Flanking size must be equal to or less than 10')
+ elif flank_size < 10:
+ ptm_flanking_seq = ptm_flanking_seq[10-flank_size:10+flank_size]
+
+ for j, elm_row in elm_classes.iterrows():
+ reg_ex = elm_row['Regex']
+ if re.search(reg_ex, ptm_flanking_seq) is not None:
+ matches.append(elm_row['ELMIdentifier'])
+
+ match_list.append(';'.join(matches))
+ else:
+ match_list.append(np.nan)
+ else:
+ #print(f'PTM {ptm} not found in PTM info file')
+ match_list.append(np.nan)
+
+ spliced_ptms['ELM:Motif Matches'] = match_list
+
+ #report the number of ptms with motif data
+ if report_success:
+ num_ptms_with_matched_motif = spliced_ptms.dropna(subset = 'ELM:Motif Matches').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]
+ print(f"ELM Class motif matches found: {num_ptms_with_matched_motif} PTMs in dataset found with at least one matched motif")
+ return spliced_ptms
+
+def add_PTMInt_data(spliced_ptms, file = None, report_success = True):
+ """
+ Given spliced_ptms data from the project module, add PTMInt interaction data, which includes the interacting protein and whether the PTM enhances or inhibits binding. This is added as a new column labeled 'PTMInt:Interaction', with each entry formatted as 'Protein->Effect'. Multiple interactions are separated by semicolons
+ """
+ #load file
+ if file is None:
+ PTMint = pd.read_csv('https://ptmint.sjtu.edu.cn/data/PTM%20experimental%20evidence.csv')
+ else:
+ check_file(file, expected_extension='.csv')
+ PTMint = pd.read_csv(file)
+
+ PTMint = PTMint.rename(columns={'Uniprot':'UniProtKB Accession', 'AA':'Residue', 'Site':'PTM Position in Canonical Isoform'})
+ #PTMint['Site'] = PTMint['AA'] + PTMint['Site'].astype(str)
+ PTMint['PTMInt:Interaction'] = PTMint['Int_gene']+'->'+PTMint['Effect']
+ PTMint = PTMint[['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'PTMInt:Interaction']]
+ #PTMint['PTM Position in Canonical Isoform'] = PTMint['PTM Position in Canonical Isoform'].astype(str)
+
+ #aggregate PTMint data on the same PTMs
+ PTMint = PTMint.groupby(['UniProtKB Accession','Residue','PTM Position in Canonical Isoform'], as_index = False).agg(';'.join)
+
+ #if splice data already has the annotation columns, remove them
+ if 'PTMInt:Interaction' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['PTMInt:Interaction'])
+
+ #add to splice data
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(PTMint[['UniProtKB Accession','Residue','PTM Position in Canonical Isoform', 'PTMInt:Interaction']], on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'], how = 'left')
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataframe size has changed, check for duplicates in spliced ptms dataframe')
+
+ #report the number of PTMs identified
+ if report_success:
+ num_ptms_with_PTMInt_data = spliced_ptms.dropna(subset = 'PTMInt:Interaction').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]
+ print(f"PTMInt data added: {num_ptms_with_PTMInt_data} PTMs in dataset found with PTMInt interaction information")
+
+ return spliced_ptms
+ #delete source PTMint data
+ #os.remove(pdir + './Data/PTM_experimental_evidence.csv')
+
+#def add_PTMcode_intraprotein(spliced_ptms, fname = None, report_success = True):
+# #load ptmcode info
+# if fname is None:
+# ptmcode = pd.read_csv('https://ptmcode.embl.de/data/PTMcode2_associations_within_proteins.txt.gz', sep = '\t', header = 2, compression='gzip')
+# else:
+# check_file(fname, expected_extension = '.gz')
+# ptmcode = pd.read_csv(fname, sep = '\t', header = 2, compression = 'gzip')
+#
+# #grab human data
+# ptmcode = ptmcode[ptmcode['Species'] == 'Homo sapiens']
+#
+# #add gene name to data
+# translator = pd.DataFrame(pose_config.uniprot_to_genename, index = ['Gene']).T
+# translator['Gene'] = translator['Gene'].apply(lambda x: x.split(' '))
+# translator = translator.explode('Gene')
+# translator = translator.reset_index()
+# translator.columns = ['UniProtKB/Swiss-Prot ID', 'Gene name']
+#
+# #add uniprot ID information
+# ptmcode = ptmcode.merge(translator.dropna().drop_duplicates(), left_on = '## Protein', right_on = 'Gene name', how = 'left')
+#
+# #convert modification names to match annotation data
+# convert_dict = {'Adp ribosylation': 'ADP Ribosylation', 'Glutamine deamidation':'Deamidation'}
+# new_mod_names = []
+# failed_mod = []
+# mod_list = ptmcode['PTM1'].unique()
+# for mod in mod_list:
+# mod = mod.capitalize()
+# if 'glycosylation' in mod: #if glycosylation, group into one group
+# new_mod_names.append('Glycosylation')
+# elif mod in pose_config.modification_conversion['Modification Class'].values: #if already in modification class data, keep
+# new_mod_names.append(mod)
+# elif mod in convert_dict.keys():
+# new_mod_names.append(convert_dict[mod])
+# else:
+# try:
+# new_mod = pose_config.modification_conversion[pose_config.modification_conversion['Modification'] == mod].values[0][0]
+# new_mod_names.append(new_mod)
+# except:
+# failed_mod.append(mod)
+# new_mod_names.append(mod)
+# conversion_df = pd.DataFrame({'PTM1':mod_list, 'Modification Class':new_mod_names})
+#
+# #add new modification labels to data
+# ptmcode = ptmcode.merge(conversion_df, on = 'PTM1', how = 'left')
+#
+# #groupby by PTM1 and rename to match column names in annotation data
+# ptmcode = ptmcode[['UniProtKB/Swiss-Prot ID', 'Modification Class', 'Residue1', 'Residue2']].dropna(subset = 'UniProtKB/Swiss-Prot ID')
+# ptmcode = ptmcode.groupby(['UniProtKB/Swiss-Prot ID', 'Modification Class', 'Residue1'])['Residue2'].agg(';'.join).reset_index()
+# ptmcode = ptmcode.rename(columns = {'UniProtKB/Swiss-Prot ID':'UniProtKB Accession', 'Residue1':'Residue', 'Residue2':'PTMcode:Intraprotein_Interactions'})
+#
+# #separate residue information into separate columns, one for amino acid and one for position
+# ptmcode['PTM Position in Canonical Isoform'] = ptmcode['Residue'].apply(lambda x: int(x[1:]))
+# ptmcode['Residue'] = ptmcode['Residue'].apply(lambda x: x[0])
+#
+# #if splice data already has the annotation columns, remove them
+# if 'PTMcode:Intraprotein_Interactions' in spliced_ptms.columns:
+# spliced_ptms = spliced_ptms.drop(columns = ['PTMcode:Intraprotein_Interactions'])
+#
+# #explode dataframe on modifications
+# if spliced_ptms['Modification Class'].str.contains(';').any():
+# spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+# spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+#
+# #add to splice data
+# original_data_size = spliced_ptms.shape[0]
+# spliced_ptms = spliced_ptms.merge(ptmcode, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'])
+# if spliced_ptms.shape[0] != original_data_size:
+# raise RuntimeError('Dataframe size has changed, check for duplicates in spliced ptms dataframe')
+#
+# #report the number of PTMs identified
+# if report_success:
+# num_ptms_with_PTMcode_data = spliced_ptms.dropna(subset = 'PTMcode:Intraprotein_Interactions').groupby(['UniProtKB Accession', 'Residue']).size().shape[0]
+# print(f"PTMcode intraprotein interactions added: {num_ptms_with_PTMcode_data} PTMs in dataset found with PTMcode intraprotein interaction information")
+#
+# return spliced_ptms
+
+def extract_ids_PTMcode(df, col = '## Protein1'):
+
+ #add gene name to data
+ name_to_uniprot = pd.DataFrame(pose_config.uniprot_to_genename, index = ['Gene']).T
+ name_to_uniprot['Gene'] = name_to_uniprot['Gene'].apply(lambda x: x.split(' ') if x == x else np.nan)
+ name_to_uniprot = name_to_uniprot.explode('Gene')
+ name_to_uniprot = name_to_uniprot.reset_index()
+ name_to_uniprot.columns = ['UniProtKB/Swiss-Prot ID', 'Gene name']
+ name_to_uniprot = name_to_uniprot.drop_duplicates(subset = 'Gene name', keep = False)
+
+ #protein name is provided as either an Ensembl gene ID or a gene name, so check for both
+ df = df.merge(pose_config.translator[['Gene stable ID']].reset_index().dropna().drop_duplicates(), left_on = col, right_on = 'Gene stable ID', how = 'left')
+ df = df.rename(columns = {'index': 'From_ID'})
+ df = df.merge(name_to_uniprot, left_on = col, right_on = 'Gene name', how = 'left')
+ df = df.rename(columns = {'UniProtKB/Swiss-Prot ID': 'From_Name'})
+
+ #grab unique id from 'From_ID' and 'From_Name' column, if available
+ uniprot_ids = df['From_Name'].combine_first(df['From_ID'])
+ return uniprot_ids.values
+
+def add_PTMcode_interprotein(spliced_ptms, fname = None, report_success = True):
+ if fname is None:
+ ptmcode = pd.read_csv('https://ptmcode.embl.de/data/PTMcode2_associations_between_proteins.txt.gz', sep = '\t', header = 2, compression = 'gzip')
+ else:
+ check_file(fname, expected_extension = '.gz')
+ ptmcode = pd.read_csv(fname, sep = '\t', header = 2, compression='gzip')
+
+ #grab human interactions
+ ptmcode = ptmcode[ptmcode['Species'] == 'Homo sapiens']
+ #ignore intraprotein interactions
+ ptmcode = ptmcode[ptmcode['## Protein1'] != ptmcode['Protein2']]
+
+ #get uniprot id for primary protein and interacting protein
+ ptmcode['UniProtKB Accession'] = extract_ids_PTMcode(ptmcode, '## Protein1')
+ ptmcode['Interacting Protein'] = extract_ids_PTMcode(ptmcode, 'Protein2')
+
+ ptmcode = ptmcode.dropna(subset = ['UniProtKB Accession', 'Interacting Protein'])
+ #remove duplicate proteins (some entries have different ids but are actually the same protein)
+ ptmcode = ptmcode[ptmcode['UniProtKB Accession'] != ptmcode['Interacting Protein']]
+
+ #aggregate interactions
+ ptmcode['Interacting Residue'] = ptmcode['Interacting Protein'] + '_' + ptmcode['Residue2']
+
+
+ #convert modification names
+ convert_dict = {'Adp ribosylation': 'ADP Ribosylation', 'Glutamine deamidation':'Deamidation'}
+ new_mod_names = []
+ failed_mod = []
+ mod_list = ptmcode['PTM1'].unique()
+ for mod in mod_list:
+ mod = mod.capitalize()
+ if 'glycosylation' in mod:
+ new_mod_names.append('Glycosylation')
+ elif mod in pose_config.modification_conversion['Modification Class'].values:
+ new_mod_names.append(mod)
+ elif mod in convert_dict.keys():
+ new_mod_names.append(convert_dict[mod])
+ else:
+ try:
+ new_mod = pose_config.modification_conversion[pose_config.modification_conversion['Modification'] == mod].values[0][0]
+ new_mod_names.append(new_mod)
+ except IndexError: #no matching entry in the modification conversion table
+ failed_mod.append(mod)
+ new_mod_names.append(mod)
+ conversion_df = pd.DataFrame({'PTM1':mod_list, 'Modification Class':new_mod_names})
+
+ ptmcode = ptmcode.merge(conversion_df, on = 'PTM1', how = 'left')
+
+
+ ptmcode = ptmcode.rename(columns = {'Residue1':'Residue'})
+ ptmcode = ptmcode.groupby(['UniProtKB Accession', 'Residue', 'Modification Class'])['Interacting Residue'].agg(';'.join).reset_index()
+ ptmcode = ptmcode.rename(columns = {'UniProtKB/Swiss-Prot ID':'UniProtKB Accession', 'Residue1':'Residue', 'Interacting Residue':'PTMcode:Interprotein_Interactions'})
+
+ #separate residue information into separate columns, one for amino acid and one for position
+ ptmcode['PTM Position in Canonical Isoform'] = ptmcode['Residue'].apply(lambda x: float(x[1:]))
+ ptmcode['Residue'] = ptmcode['Residue'].apply(lambda x: x[0])
+
+ #if splice data already has the annotation columns, remove them
+ if 'PTMcode:Interprotein_Interactions' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['PTMcode:Interprotein_Interactions'])
+
+ #explode dataframe on modifications
+ if spliced_ptms['Modification Class'].str.contains(';').any():
+ spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+ spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+
+ #add to splice data
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(ptmcode, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataframe size has changed, check for duplicates in spliced ptms dataframe')
+
+ #report the number of PTMs identified
+ if report_success:
+ num_ptms_with_PTMcode_data = spliced_ptms.dropna(subset = 'PTMcode:Interprotein_Interactions').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]
+ print(f"PTMcode interprotein interactions added: {num_ptms_with_PTMcode_data} PTMs in dataset found with PTMcode interprotein interaction information")
+
+ return spliced_ptms
+
+def extract_positions_from_DEPOD(x):
+ """
+ Given string object consisting of multiple modifications in the form of 'Residue-Position' separated by ', ', extract the residue and position. Ignore any excess details in the string.
+ """
+ x = x.split('[')[0].split(', ')
+ #for each residue in list, find location of 'Ser', 'Thr' and 'Tyr' in the string (should either have '-' or a number immediately after it)
+ new_x = []
+ for item in x:
+ #determine type of modification
+ if 'Ser' in item:
+ loc = [match.start() for match in re.finditer('Ser', item)]
+ res = 'S'
+ elif 'Thr' in item:
+ loc = [match.start() for match in re.finditer('Thr', item)]
+ res = 'T'
+ elif 'Tyr' in item:
+ loc = [match.start() for match in re.finditer('Tyr', item)]
+ res = 'Y'
+ elif 'His' in item:
+ loc = [match.start() for match in re.finditer('His', item)]
+ res = 'H'
+ else:
+ loc = -1
+
+ #check if multiple locations were found, if so grab last entry
+ if loc == -1:
+ item = np.nan
+ make_string = False
+ elif len(loc) > 1:
+ make_string = True
+ loc = loc[-1]
+ else:
+ loc = loc[0]
+ make_string = True
+
+ #find integer
+ if make_string:
+ if '-' in item[loc:]:
+ item = item.split('-')
+ item = res + item[1].strip()
+ else:
+ item = item[loc+3:]
+ item = res + item
+
+ new_x.append(item)
+
+ return new_x
+
+def add_DEPOD_phosphatase_data(spliced_ptms, report_success = True):
+
+ #download data
+ depod1 = pd.read_excel('https://depod.bioss.uni-freiburg.de/download/PPase_protSubtrates_201903.xls', sheet_name='PSprots')
+ depod2 = pd.read_excel('https://depod.bioss.uni-freiburg.de/download/PPase_protSubtrates_newPairs_201903.xls', sheet_name = 'newPSprots')
+ depod = pd.concat([depod1, depod2])
+
+ #remove any rows with missing site information
+ depod = depod.dropna(subset = 'Dephosphosites')
+
+ #remove excess annotations that make parsing difficult
+ depod['Dephosphosites'] = depod['Dephosphosites'].apply(lambda x: x.split('[')[0])
+ depod['Dephosphosites'] = depod['Dephosphosites'].apply(lambda x: x.split('(')[0])
+ depod['Dephosphosites'] = depod['Dephosphosites'].apply(lambda x: x.split(';')[0])
+ depod['Dephosphosites'] = depod['Dephosphosites'].apply(lambda x: x.split('in')[0])
+ depod['Dephosphosites'] = depod['Dephosphosites'].str.replace('in ref.', '')
+
+ #separate individual sites
+ depod['Dephosphosites'] = depod['Dephosphosites'].str.split(',')
+ depod = depod.explode('Dephosphosites')
+ depod = depod[(~depod['Dephosphosites'].str.contains('Isoform')) & (~depod['Dephosphosites'].str.contains('isoform'))]
+
+ #process dephosphosite strings to extract residue and position and explode so that each phosphosite is its own row
+ depod['Dephosphosites'] = depod['Dephosphosites'].apply(extract_positions_from_DEPOD)
+ depod = depod.explode('Dephosphosites')
+
+ #separate multiple substrate accessions into their own rows (many of these link back to the same ID, but will keep just in case)
+ depod['Substrate accession numbers'] = depod['Substrate accession numbers'].str.split(' ')
+ depod = depod.explode('Substrate accession numbers')
+ depod = depod.dropna(subset = ['Substrate accession numbers'])
+
+ #extract only needed information and add phosphorylation as modification type
+ depod['Residue'] = depod['Dephosphosites'].apply(lambda x: x[0] if x == x else np.nan)
+ depod['PTM Position in Canonical Isoform'] = depod['Dephosphosites'].apply(lambda x: int(x[1:]) if x == x else np.nan)
+ depod = depod.rename({'Substrate accession numbers': 'UniProtKB Accession', 'Phosphatase entry names':'DEPOD:Phosphatase'}, axis = 1)
+ depod = depod[['DEPOD:Phosphatase', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']]
+ depod['Modification Class'] = 'Phosphorylation'
+
+ #combine on the same PTM
+ depod = depod.drop_duplicates()
+ depod = depod.groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'], as_index = False)['DEPOD:Phosphatase'].agg(';'.join)
+
+ #if splice data already has the annotation columns, remove them
+ if 'DEPOD:Phosphatase' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['DEPOD:Phosphatase'])
+
+ #explode dataframe on modifications
+ if spliced_ptms['Modification Class'].str.contains(';').any():
+ spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+ spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+
+ #add to splice data
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(depod, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataframe size has changed, check for duplicates in spliced ptms dataframe')
+
+ #report the number of PTMs identified
+ if report_success:
+ num_ptms_with_DEPOD_data = spliced_ptms.dropna(subset = 'DEPOD:Phosphatase').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]
+ print(f"DEPOD Phosphatase substrates added: {num_ptms_with_DEPOD_data} PTMs in dataset found with Phosphatase substrate information")
+
+ return spliced_ptms
+
+def add_RegPhos_data(spliced_ptms, file = None, report_success = True):
+ if file is None:
+ regphos = pd.read_csv('http://140.138.144.141/~RegPhos/download/RegPhos_Phos_human.txt', sep = '\t', dtype = {'position':int, 'description':str,'catalytic kinase':str, 'reference':'str'})
+ else:
+ check_file(file, expected_extension = '.txt')
+ regphos = pd.read_csv(file, sep = '\t')
+
+ regphos = regphos.dropna(subset = 'catalytic kinase')
+ #regphos['Residue'] = regphos['code'] + regphos['position'].astype(str)
+ regphos = regphos.rename(columns = {'code': 'Residue', 'position':'PTM Position in Canonical Isoform', 'AC': 'UniProtKB Accession', 'catalytic kinase': 'RegPhos:Kinase'})
+ regphos['Modification Class'] = 'Phosphorylation'
+ regphos = regphos[['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'RegPhos:Kinase']].dropna()
+ regphos = regphos.groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).agg(';'.join).reset_index()
+
+ #if splice data already has the annotation columns, remove them
+ if 'RegPhos:Kinase' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['RegPhos:Kinase'])
+
+ #explode dataframe on modifications
+ if spliced_ptms['Modification Class'].str.contains(';').any():
+ spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+ spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+
+ #add to splice data
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(regphos, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataframe size has changed, check for duplicates in spliced ptms dataframe')
+
+ #report the number of PTMs identified
+ if report_success:
+ num_ptms_with_regphos_data = spliced_ptms.dropna(subset = 'RegPhos:Kinase').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0]
+ print(f"RegPhos kinase-substrate data added: {num_ptms_with_regphos_data} PTMs in dataset found with kinase-substrate information")
+
+ return spliced_ptms
+
+
+def add_PTMsigDB_data(spliced_ptms, file = None, report_success = True):
+ #if file is None:
+ # ptmsigdb = pd.read_excel('https://proteomics.broadapps.org/ptmsigdb/_w_8b062d9e/appff37efd164a676afcc8e6e42e6058e01/session/a2b28c4ed29deadd6779fdd26aec33c1/download/download.xlsx?w=8b062d9e', sheet_name = 'human')
+ #else:
+ check_file(file, expected_extension = '.xlsx')
+ ptmsigdb = pd.read_excel(file, sheet_name = 'human')
+
+
+ ptmsigdb['UniProtKB Accession'] = ptmsigdb['site.uniprot'].str.split(';').str[0]
+ ptmsigdb['Residue'] = ptmsigdb['site.uniprot'].str.split(';').str[1].str[0]
+ ptmsigdb['PTM Position in Canonical Isoform'] = ptmsigdb['site.uniprot'].apply(lambda x: int(x.split(';')[1].split('-')[0][1:]))
+
+ #filter out excess information in some of the site.ptm column, then convert to modification class details
+ ptmsigdb['site.ptm'] = ptmsigdb['site.ptm'].apply(lambda x: x.split(';')[1].split('-')[1] if ';' in x else x)
+ ptmsigdb['Modification Class'] = ptmsigdb['site.ptm'].map(mod_shorthand_dict)
+
+ #combine signature and direction for annotation column
+ ptmsigdb['Signature'] = ptmsigdb['signature'] +'->'+ ptmsigdb['site.direction']
+
+ #drop unneeded columns
+ ptmsigdb = ptmsigdb[['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class', 'Signature', 'category']].copy() #copy so later column assignments do not act on a slice
+ ptmsigdb['Signature'] = ptmsigdb.apply(lambda x: x['Signature'].replace(x['category'] + '_', ''), axis = 1)
+ ptmsigdb['category'] = 'PTMsigDB:' + ptmsigdb['category']
+ ptmsigdb = ptmsigdb.drop_duplicates()
+
+ #convert to pivot table with each category being a separate column
+ ptmsigdb = ptmsigdb.pivot_table(index = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'], columns = 'category', values = 'Signature', aggfunc=';'.join).reset_index()
+
+ #remove psp data if it is already in spliced ptms
+ if 'PSP:Kinase' in spliced_ptms.columns:
+ ptmsigdb = ptmsigdb.drop(columns = 'PTMsigDB:KINASE-PSP')
+
+ if 'PSP:Disease_Association' in spliced_ptms.columns:
+ ptmsigdb = ptmsigdb.drop(columns = 'PTMsigDB:DISEASE-PSP')
+
+
+ #if splice data already has the annotation columns, remove them
+ if 'PTMsigDB:PATH-BI' in spliced_ptms.columns:
+ cols_in_data = [col for col in spliced_ptms.columns if 'PTMsigDB' in col]
+ spliced_ptms = spliced_ptms.drop(columns = cols_in_data)
+
+
+ #explode dataframe on modifications
+ if spliced_ptms['Modification Class'].str.contains(';').any():
+ spliced_ptms['Modification Class'] = spliced_ptms['Modification Class'].str.split(';')
+ spliced_ptms = spliced_ptms.explode('Modification Class').reset_index(drop = True)
+
+ #merge with spliced_ptm info
+ original_data_size = spliced_ptms.shape[0]
+ spliced_ptms = spliced_ptms.merge(ptmsigdb, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class'])
+ if spliced_ptms.shape[0] != original_data_size:
+ raise RuntimeError('Dataset size changed upon merge, please make sure there are no duplicates in spliced ptms data')
+
+
+ #report the number of PTMs with PTMsigDB annotations
+ if report_success:
+ num_ptms_with_ikip = spliced_ptms.dropna(subset = 'PTMsigDB:KINASE-iKiP').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_path_bi = spliced_ptms.dropna(subset = 'PTMsigDB:PATH-BI').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_path_np= spliced_ptms.dropna(subset = 'PTMsigDB:PATH-NP').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_path_wp = spliced_ptms.dropna(subset = 'PTMsigDB:PATH-WP').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_dia_pert = spliced_ptms.dropna(subset = 'PTMsigDB:PERT-P100-DIA').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_dia2_pert = spliced_ptms.dropna(subset = 'PTMsigDB:PERT-P100-DIA2').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_prm_pert = spliced_ptms.dropna(subset = 'PTMsigDB:PERT-P100-PRM').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ num_ptms_with_psp_pert = spliced_ptms.dropna(subset = 'PTMsigDB:PERT-PSP').groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']).size().shape[0]
+ print(f"PTMsigDB added:\n\t ->{num_ptms_with_ikip} PTMs associated with kinases in iKiP\n\t ->{num_ptms_with_path_wp} PTMs associated with molecular pathway signatures from WikiPathways\n\t ->{num_ptms_with_path_np} PTMs associated with molecular pathway signatures from NetPath\n\t ->{num_ptms_with_psp_pert} PTMs with PhosphoSitePlus perturbations\n\t ->{num_ptms_with_dia_pert} PTMs with perturbations in LINCS P100 DIA dataset\n\t ->{num_ptms_with_dia2_pert} PTMs with perturbations in LINCS P100 DIA2 dataset\n\t ->{num_ptms_with_prm_pert} PTMs with perturbations in LINCS P100 PRM dataset")
+ return spliced_ptms
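The site parsing performed at the top of add_PTMsigDB_data can be sketched in isolation. This is a minimal illustration, not the package's code: the helper name `parse_site` and the example string are hypothetical, and the `ACCESSION;RESIDUE+POSITION-suffix` layout is an assumption inferred from the string operations in the function above.

```python
def parse_site(site_uniprot):
    """Split a PTMsigDB-style site string into accession, residue, and position.

    Assumes the layout 'ACCESSION;<residue><position>-<suffix>' (hypothetical example
    format inferred from the parsing code, not confirmed against PTMsigDB itself).
    """
    accession, site = site_uniprot.split(';')[:2]
    residue = site[0]                        # first character is the amino acid
    position = int(site.split('-')[0][1:])   # digits between the residue and the '-'
    return accession, residue, position


example = parse_site('P00000;S123-p')  # made-up accession and site
print(example)
```

Keeping the parsing in one small function makes it easier to test against unusual site strings before applying it to a whole column.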
+
+
+
+######### Functions for combining annotations from multiple sources ########
+
+def convert_PSP_label_to_UniProt(label):
+ """
+ Given a label for an interacting protein from PhosphoSitePlus, convert it to a UniProtKB accession. This is required because PhosphoSitePlus records interactions in various ways that are not necessarily consistent with other databases (e.g. the label is not always a gene name)
+
+ Parameters
+ ----------
+ label: str
+ Label for interacting protein from PhosphoSitePlus
+ """
+ if not hasattr(pose_config, 'genename_to_uniprot'):
+ #using uniprot to gene name dict, construct dict to go the other direction (gene name to uniprot id)
+ pose_config.genename_to_uniprot = pose_config.flip_uniprot_dict(pose_config.uniprot_to_genename)
+
+
+ #try progressively looser matches against UniProt gene names, then fall back to the PSP-specific conversion dictionary
+ if label in pose_config.genename_to_uniprot: #if PSP name is gene name found in uniprot
+ return pose_config.genename_to_uniprot[label]
+ elif label.upper() in pose_config.genename_to_uniprot:
+ return pose_config.genename_to_uniprot[label.upper()]
+ elif label.split(' ')[0].upper() in pose_config.genename_to_uniprot:
+ return pose_config.genename_to_uniprot[label.split(' ')[0].upper()]
+ elif label.replace('-', '').upper() in pose_config.genename_to_uniprot:
+ return pose_config.genename_to_uniprot[label.replace('-', '').upper()]
+ elif label in pose_config.psp_name_dict: # if PSP name is not gene name, but is in conversion dictionary
+ return pose_config.psp_name_dict[label]
+ else: #otherwise note that gene was missed
+ return np.nan
+ #missed_genes.append(gene)
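The fallback chain used by convert_PSP_label_to_UniProt can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions: `lookup_label`, `toy_genes`, and `toy_fallback` are hypothetical names with toy contents, and the sketch returns None for a miss where the function above returns np.nan.

```python
def lookup_label(label, genename_to_uniprot, fallback):
    """Try progressively looser spellings of a label, then a fallback dictionary."""
    candidates = [
        label,                          # exact match
        label.upper(),                  # case-insensitive match
        label.split(' ')[0].upper(),    # drop anything after the first space
        label.replace('-', '').upper(), # drop hyphens
    ]
    for candidate in candidates:
        if candidate in genename_to_uniprot:
            return genename_to_uniprot[candidate]
    # Fall back to a database-specific name dictionary; None when nothing matches
    return fallback.get(label)


# Toy mappings for illustration only
toy_genes = {'MAPK1': 'P28482', 'AKT1': 'P31749'}
toy_fallback = {'ERK2': 'P28482'}

print(lookup_label('Akt1', toy_genes, toy_fallback))
print(lookup_label('MAPK1 iso2', toy_genes, toy_fallback))
print(lookup_label('ERK2', toy_genes, toy_fallback))
```

Listing the candidate spellings once keeps the order of the checks explicit and avoids repeating each transformation in both the test and the lookup.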
+
+def extract_interaction_details(interaction, column = "PSP:ON_PROT_INTERACT"):
+
+ interaction_types = {'PTMcode:Interprotein_Interactions':'INDUCES', 'PSP:Kinase':'REGULATES', 'DEPOD:Phosphatase':'REGULATES', 'RegPhos:Kinase':'REGULATES', 'Combined:Kinase':'REGULATES', 'ELM:Interactions':'UNCLEAR'}
+ if column == 'PSP:ON_PROT_INTERACT':
+ type = interaction.split('(')[1].split(')')[0]
+ protein = interaction.split('(')[0].strip(' ')
+ elif column == 'PTMInt:Interaction':
+ ptmint_type_conversion = {'Inhibit':'DISRUPTS', 'Enhance':"INDUCES"}
+ type = ptmint_type_conversion[interaction.split('->')[1]]
+ protein = interaction.split('->')[0]
+ elif column == 'PTMcode:Interprotein_Interactions':
+ type = 'INDUCES'
+ protein = interaction.split('_')[0]
+ else:
+ type = interaction_types[column]
+ protein = interaction
+
+ return type, protein
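How the two string formats handled by extract_interaction_details are taken apart can be sketched as follows. This is an illustration only: `parse_interaction` is a hypothetical stand-in covering just two of the columns, and the example strings are assumed formats inferred from the splitting logic, not confirmed database records.

```python
def parse_interaction(entry, column):
    """Return (interaction_type, protein) for two of the supported column formats."""
    if column == 'PSP:ON_PROT_INTERACT':
        # Assumed format: 'PROTEIN(TYPE)'
        protein, _, rest = entry.partition('(')
        return rest.split(')')[0], protein.strip()
    if column == 'PTMInt:Interaction':
        # Assumed format: 'PROTEIN->Inhibit' or 'PROTEIN->Enhance'
        protein, effect = entry.split('->')
        return {'Inhibit': 'DISRUPTS', 'Enhance': 'INDUCES'}[effect], protein
    raise ValueError(f'Column {column} is not handled in this sketch')


print(parse_interaction('SRC(DISRUPTS)', 'PSP:ON_PROT_INTERACT'))
print(parse_interaction('GRB2->Enhance', 'PTMInt:Interaction'))
```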
+
+def unify_interaction_data(spliced_ptms, interaction_col, name_dict = {}):
+ """
+ Given spliced ptm data and a column containing interaction data, extract the interacting protein, type of interaction, and convert to UniProtKB accession. This will be added as a new column labeled 'Interacting ID'
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe containing PTM data
+ interaction_col: str
+ column containing interaction information from a specific database
+ name_dict: dict
+ dictionary to convert names within given database to UniProt IDs. For cases when name is not necessarily one of the gene names listed in UniProt
+
+ Returns
+ -------
+ interact: pd.DataFrame
+ Contains PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES)
+ """
+ if not hasattr(pose_config, 'genename_to_uniprot'):
+ #using uniprot to gene name dict, construct dict to go the other direction (gene name to uniprot id)
+ pose_config.genename_to_uniprot = pose_config.flip_uniprot_dict(pose_config.uniprot_to_genename)
+
+ #extract interaction data from annotated PTMs, separate cases in which a single PTM has multiple interactions
+ data_cols = [col for col in spliced_ptms.columns if col in ['Significance', 'dPSI']]
+ interact = spliced_ptms.dropna(subset = interaction_col)[['Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class',interaction_col] + data_cols]
+ if interact.empty:
+ print(f"No PTMs associated with {interaction_col}")
+ return interact
+
+ interact[interaction_col] = interact[interaction_col].apply(lambda x: x.split(';'))
+ interact = interact.explode(interaction_col)
+
+ #extract protein and type of interaction (currently for phosphosite plus)
+ type = []
+ protein = []
+ for i, row in interact.iterrows():
+ processed = extract_interaction_details(row[interaction_col], interaction_col)
+ type.append(processed[0])
+ protein.append(processed[1])
+ interact['Type'] = type
+ interact['Interacting Protein'] = protein
+
+
+ #convert interacting protein to uniprot id for databases that are not reported in uniprot ids
+ if interaction_col not in ['PTMcode:Interprotein_Interactions', 'ELM:Interactions']:
+ interacting_id = []
+ missed_genes = []
+ for gene in interact['Interacting Protein']:
+ #remove isoform label if present
+ if gene in pose_config.genename_to_uniprot: #if PSP name is gene name found in uniprot
+ interacting_id.append(pose_config.genename_to_uniprot[gene])
+ elif gene.upper() in pose_config.genename_to_uniprot:
+ interacting_id.append(pose_config.genename_to_uniprot[gene.upper()])
+ elif gene.split(' ')[0].upper() in pose_config.genename_to_uniprot:
+ interacting_id.append(pose_config.genename_to_uniprot[gene.split(' ')[0].upper()])
+ elif gene.replace('-', '').upper() in pose_config.genename_to_uniprot:
+ interacting_id.append(pose_config.genename_to_uniprot[gene.replace('-', '').upper()])
+ elif gene in name_dict: # if PSP name is not gene name, but is in conversion dictionary
+ interacting_id.append(name_dict[gene])
+ else: #otherwise note that gene was missed
+ interacting_id.append(np.nan)
+ missed_genes.append(gene)
+
+ #save information
+ interact['Interacting ID'] = interacting_id
+ interact = interact.dropna(subset = 'Interacting ID')
+
+
+ #check if there are multiple IDs in one row
+ if interact['Interacting ID'].str.contains(';').any():
+ interact['Interacting ID'] = interact['Interacting ID'].apply(lambda x: x.split(';'))
+ interact = interact.explode('Interacting ID')
+ else:
+ interact['Interacting ID'] = interact['Interacting Protein']
+
+
+ interact['Interacting ID'] = interact['Interacting ID'].apply(lambda x: x.split(' ')[0] if x == x else np.nan)
+ interact = interact.explode('Interacting ID')
+ interact = interact.dropna(subset = 'Interacting ID')
+ interact = interact.drop(columns = interaction_col).drop_duplicates()
+
+ return interact
+
+def add_annotation(spliced_ptms, database = 'PhosphoSitePlus', annotation_type = 'Function', file = None, check_existing = False):
+ """
+ Given a desired database and annotation type, add the corresponding annotation data to the spliced ptm dataframe
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe containing PTM data
+ database: str
+ Database to extract annotation data from. Options include 'PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'Combined'
+ annotation_type: str
+ Type of annotation to extract. Options include 'Function', 'Process', 'Interactions', 'Disease', 'Kinase', 'Phosphatase', but depend on the specific database (run analyze.get_annotation_categories())
+ file: str
+ File path to annotation data. If None, will download from online source, except for PhosphoSitePlus (due to licensing restrictions)
+ check_existing: bool
+ If True, skip the annotation when the corresponding column is already present in spliced_ptms. Default is False
+ """
+ if check_existing:
+ annot_col = annotation_col_dict[database][annotation_type]
+ if annot_col in spliced_ptms.columns:
+ print(f"Annotation data for {database} {annotation_type} already present in provided dataframe, skipping. If you would like to update annotation data, set check_existing = False")
+ return spliced_ptms
+
+ if database == "PhosphoSitePlus":
+ if annotation_type in ['Function', 'Process', 'Interactions']:
+ check_file(file, expected_extension='.gz')
+ spliced_ptms = add_PSP_regulatory_site_data(spliced_ptms, file = file)
+ elif annotation_type == 'Disease':
+ check_file(file, expected_extension='.gz')
+ spliced_ptms = add_PSP_disease_association(spliced_ptms, file = file)
+ elif annotation_type == 'Kinase':
+ check_file(file, expected_extension='.gz')
+ spliced_ptms = add_PSP_kinase_substrate_data(spliced_ptms, file = file)
+ else:
+ raise ValueError(f"Annotation type {annotation_type} not recognized for PhosphoSitePlus")
+ elif database == 'PTMcode':
+ #if annotation_type == 'Intraprotein':
+ # if file is not None:
+ # check_file(file, expected_extension='.gz')
+ # spliced_ptms = add_PTMcode_intraprotein(spliced_ptms, file = file)
+ # else:
+ # spliced_ptms = add_PTMcode_intraprotein(spliced_ptms)
+ if annotation_type == 'Interactions':
+ if file is not None:
+ check_file(file, expected_extension='.gz')
+ spliced_ptms = add_PTMcode_interprotein(spliced_ptms, file = file)
+ else:
+ spliced_ptms = add_PTMcode_interprotein(spliced_ptms)
+ else:
+ raise ValueError(f"Annotation type {annotation_type} not recognized for PTMcode")
+ elif database == 'PTMInt':
+ if annotation_type == 'Interactions':
+ if file is not None:
+ check_file(file, expected_extension='.csv')
+ spliced_ptms = add_PTMInt_data(spliced_ptms, file = file)
+ else:
+ spliced_ptms = add_PTMInt_data(spliced_ptms)
+ else:
+ raise ValueError(f"Annotation type {annotation_type} not recognized for PTMInt")
+ elif database == 'RegPhos':
+ if annotation_type == 'Kinase':
+ if file is not None:
+ check_file(file, expected_extension='.txt')
+ spliced_ptms = add_RegPhos_data(spliced_ptms, file = file)
+ else:
+ spliced_ptms = add_RegPhos_data(spliced_ptms)
+ else:
+ raise ValueError(f"Annotation type {annotation_type} not recognized for RegPhos")
+ elif database == 'DEPOD':
+ if annotation_type == 'Phosphatase':
+ spliced_ptms = add_DEPOD_phosphatase_data(spliced_ptms, file = file)
+ else:
+ raise ValueError(f"Annotation type {annotation_type} not recognized for DEPOD")
+ elif database == 'Combined':
+ if annotation_type == 'Kinase':
+ if 'PSP:Kinase' not in spliced_ptms.columns:
+ raise ValueError("PhosphoSitePlus kinase data not found in spliced PTM dataframe, please annotate with this first")
+ if 'RegPhos:Kinase' not in spliced_ptms.columns:
+ spliced_ptms = add_RegPhos_data(spliced_ptms)
+ spliced_ptms = combine_KS_data(spliced_ptms)
+ elif annotation_type == 'Interactions':
+ spliced_ptms = combine_interaction_data(spliced_ptms)
+ else:
+ raise ValueError(f"Database {database} not recognized")
+
+ return spliced_ptms
+
+
+def combine_interaction_data(spliced_ptms, interaction_databases = ['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'ELM'], include_enzyme_interactions = True):
+ """
+ Given annotated spliced ptm data, extract interaction data from various databases and combine into a single dataframe. This will include the interacting protein, the type of interaction, and the source of the interaction data
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Dataframe containing PTM data and associated interaction annotations from various databases
+ interaction_databases: list
+ List of databases to extract interaction data from. Options include 'PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'ELM'. These should already have annotation columns in the spliced_ptms dataframe; otherwise they will be ignored. For kinase-substrate interactions, if the combined column is present, it will be used instead of the individual databases
+ include_enzyme_interactions: bool
+ If True, will include kinase-substrate and phosphatase interactions in the output dataframe
+
+ Returns
+ -------
+ interact_data: pd.DataFrame
+ Dataframe containing PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES), and the source of the interaction data. An empty dataframe is returned if no interaction data is found
+
+ """
+ interact_data = []
+ combined_added = False
+ for database in interaction_databases:
+ if database == 'PhosphoSitePlus' and 'PSP:ON_PROT_INTERACT' in spliced_ptms.columns:
+ if not spliced_ptms['PSP:ON_PROT_INTERACT'].isna().all():
+ print('PhosphoSitePlus regulatory site data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'PSP:ON_PROT_INTERACT', pose_config.psp_name_dict)
+ interact['Source'] = database
+ interact_data.append(interact)
+
+
+ if database == 'PTMcode' and 'PTMcode:Interprotein_Interactions' in spliced_ptms.columns:
+ if not spliced_ptms['PTMcode:Interprotein_Interactions'].isna().all():
+ print('PTMcode data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'PTMcode:Interprotein_Interactions')
+ interact['Source'] = database
+ interact_data.append(interact)
+ if database == 'PTMInt' and 'PTMInt:Interaction' in spliced_ptms.columns:
+ if not spliced_ptms['PTMInt:Interaction'].isna().all():
+ print('PTMInt data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'PTMInt:Interaction')
+ interact['Source'] = database
+ interact_data.append(interact)
+ if database == 'ELM' and 'ELM:Interactions' in spliced_ptms.columns:
+ if not spliced_ptms['ELM:Interactions'].isna().all():
+ print('ELM data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'ELM:Interactions')
+ interact['Source'] = database
+ interact_data.append(interact)
+
+ if include_enzyme_interactions:
+ #dictionary to convert kinase names that are not UniProt gene names to UniProt IDs
+ ks_genes_to_uniprot = {'ABL1(ABL)':'P00519', 'ACK':'Q07912', 'AURC':'Q9UQB9', 'ERK1(MAPK3)':'P27361','ERK2(MAPK1)':'P28482', 'ERK5(MAPK7)':'Q13164','JNK1(MAPK8)':'P45983', 'CK1A':'P48729', 'JNK2(MAPK9)':'P45984', 'JNK3(MAPK10)':'P53779', 'P38A(MAPK14)':'Q16539','P38B(MAPK11)':'Q15759', 'P38G(MAPK12)':'P53778','P70S6K' :'Q9UBS0', 'PAK':'Q13153', 'PKCZ':'Q05513', 'CK2A':'P19784', 'ABL2':'P42684', 'AMPKA1':'Q13131', 'AMPKA2':'Q13131', 'AURB':'Q96GD4', 'CAMK1A':'Q14012', 'CDC42BP':'Q9Y5S2','CK1D':'P48730','CK1E':'P49674','CK2B':'P67870','DMPK1':'Q09013', 'DNAPK':'P78527','DSDNA KINASE':'P78527', 'EG3 KINASE':'P49840','ERK3(MAPK6)':'Q16659','GSK3':'P49840', 'MRCKA':'Q5VT25', 'P38D(MAPK13)':'O15264','P70S6KB':'Q9UBS0','PDKC':'P78527','PKCH':'P24723','PKCI':'P41743','PKCT':'Q04759','PKD3':'O94806','PKG1':'Q13976','PKG2':'Q13237','SMMLCK':'Q15746'}
+ if 'Combined:Kinase' in spliced_ptms.columns and not combined_added:
+ if not spliced_ptms['Combined:Kinase'].isna().all():
+ print('Combined kinase-substrate data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'Combined:Kinase', ks_genes_to_uniprot)
+ interact['Source'] = 'PSP/RegPhos'
+ interact_data.append(interact)
+ combined_added = True
+ elif 'Combined:Kinase' not in spliced_ptms.columns:
+ if 'RegPhos:Kinase' in spliced_ptms.columns and database == 'RegPhos':
+ if not spliced_ptms['RegPhos:Kinase'].isna().all():
+ print('RegPhos kinase-substrate data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'RegPhos:Kinase', ks_genes_to_uniprot)
+ interact['Source'] = database
+ interact_data.append(interact)
+ if 'PSP:Kinase' in spliced_ptms.columns and database == 'PhosphoSitePlus':
+ if not spliced_ptms['PSP:Kinase'].isna().all():
+ print('PhosphoSitePlus kinase-substrate data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'PSP:Kinase', ks_genes_to_uniprot)
+ interact['Source'] = database
+ interact_data.append(interact)
+
+ if database == 'DEPOD' and 'DEPOD:Phosphatase' in spliced_ptms.columns:
+ if not spliced_ptms['DEPOD:Phosphatase'].isna().all():
+ print('DEPOD phosphatase-substrate data found and added')
+ interact = unify_interaction_data(spliced_ptms, 'DEPOD:Phosphatase')
+ interact['Source'] = database
+ interact_data.append(interact)
+
+ if len(interact_data) > 0:
+ interact_data = pd.concat(interact_data)
+ extra_cols = [col for col in interact_data.columns if col in ['dPSI', 'Significance']]
+ interact_data = interact_data.groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Interacting ID', 'Type']+extra_cols, dropna = False, as_index = False)['Source'].apply(helpers.join_unique_entries)
+
+ #convert uniprot ids back to gene names for interpretability
+ ptm_gene = []
+ interacting_gene = []
+ for i, row in interact_data.iterrows():
+ ptm_gene.append(pose_config.uniprot_to_genename[row['UniProtKB Accession'].split('-')[0]].split(' ')[0]) if row['UniProtKB Accession'].split('-')[0] in pose_config.uniprot_to_genename else ptm_gene.append(row['UniProtKB Accession'])
+ interacting_gene.append(pose_config.uniprot_to_genename[row['Interacting ID'].split('-')[0]].split(' ')[0]) if row['Interacting ID'].split('-')[0] in pose_config.uniprot_to_genename else interacting_gene.append(row['Interacting ID'])
+ interact_data['Modified Gene'] = ptm_gene
+ interact_data["Interacting Gene"] = interacting_gene
+
+
+ return interact_data.drop_duplicates()
+ else:
+ return pd.DataFrame()
+
+
+
+def combine_KS_data(spliced_ptms, ks_databases = ['PhosphoSitePlus', 'RegPhos'], regphos_conversion = {'ERK1(MAPK3)':'MAPK3', 'ERK2(MAPK1)':'MAPK1', 'JNK2(MAPK9)':'MAPK9','CDC2':'CDK1', 'CK2A1':'CSNK2A1', 'PKACA':'PRKACA', 'ABL1(ABL)':'ABL1'}):
+ """
+ Given spliced ptm information, combine kinase-substrate data from multiple databases (currently supports PhosphoSitePlus and RegPhos), assuming that the kinase data from these resources has already been added to the spliced ptm data. The combined kinase data will be added as a new column labeled 'Combined:Kinase'
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Spliced PTM data from project module
+ ks_databases: list
+ List of databases to combine kinase data from. Currently supports PhosphoSitePlus and RegPhos
+ regphos_conversion: dict
+ Allows conversion of RegPhos names to matching names in PhosphoSitePlus.
+
+ Returns
+ -------
+ spliced_ptms: pd.DataFrame
+ Spliced PTM data with combined kinase data added
+
+ """
+ if not hasattr(pose_config, 'genename_to_uniprot'):
+ pose_config.genename_to_uniprot = pose_config.flip_uniprot_dict(pose_config.uniprot_to_genename)
+
+ ks_data = []
+ for i, row in spliced_ptms.iterrows():
+ combined = []
+ for db in ks_databases:
+ if db == 'PhosphoSitePlus':
+ psp = row['PSP:Kinase'].split(';') if row['PSP:Kinase'] == row['PSP:Kinase'] else []
+ #convert PSP names to a common name (first gene name provided by uniprot)
+ psp = [pose_config.uniprot_to_genename[pose_config.genename_to_uniprot[kin]].split(' ')[0] if kin in pose_config.genename_to_uniprot else kin for kin in psp]
+ combined += psp
+ elif db == 'RegPhos':
+ regphos = row['RegPhos:Kinase'].split(';') if row['RegPhos:Kinase'] == row['RegPhos:Kinase'] else []
+ for j, rp in enumerate(regphos): #use j to avoid shadowing the row index i of the outer loop
+ if rp in pose_config.genename_to_uniprot:
+ regphos[j] = pose_config.uniprot_to_genename[pose_config.genename_to_uniprot[rp]].split(' ')[0]
+ elif rp.split('(')[0] in pose_config.genename_to_uniprot:
+ regphos[j] = pose_config.uniprot_to_genename[pose_config.genename_to_uniprot[rp.split('(')[0]]].split(' ')[0]
+ elif rp.upper() in regphos_conversion:
+ regphos[j] = regphos_conversion[rp.upper()]
+ else:
+ regphos[j] = rp.upper()
+ combined += regphos
+
+
+ if len(combined) > 0:
+ ks_data.append(';'.join(set(combined)))
+ else:
+ ks_data.append(np.nan)
+
+ spliced_ptms['Combined:Kinase'] = ks_data
+ return spliced_ptms
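The name harmonization behind the 'Combined:Kinase' column can be sketched without pandas. This is a deliberately simplified illustration: `merge_kinases` is a hypothetical helper, it skips the UniProt gene-name lookup that combine_KS_data performs, and it sorts the result for a reproducible order, whereas the function above joins an unordered set.

```python
def merge_kinases(psp, regphos, conversion):
    """Merge two ';'-separated kinase fields into one de-duplicated string."""
    combined = set()
    for field, convert in ((psp, False), (regphos, True)):
        if not field:            # None or empty string: nothing recorded
            continue
        for name in field.split(';'):
            if convert:
                # Map RegPhos spellings onto the PhosphoSitePlus-style name when known
                name = conversion.get(name.upper(), name.upper())
            combined.add(name)
    return ';'.join(sorted(combined)) if combined else None


conversion = {'ERK2(MAPK1)': 'MAPK1', 'CDC2': 'CDK1'}
print(merge_kinases('MAPK1;CDK1', 'ERK2(MAPK1);CDC2', conversion))
```

Because both sources are mapped onto the same names before the set is built, a kinase reported by both databases appears only once in the combined field.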
+
+
+def check_file(fname, expected_extension = '.tsv'):
+ """
+ Given a file name, check if the file exists and has the expected extension. If the file does not exist or has the wrong extension, raise an error.
+
+ Parameters
+ ----------
+ fname: str
+ File name to check
+ expected_extension: str
+ Expected file extension. Default is '.tsv'
+ """
+ if fname is None:
+ raise ValueError('Annotation file path must be provided')
+ if not os.path.exists(fname):
+ raise ValueError(f'File {fname} not found')
+
+ if not fname.endswith(expected_extension):
+ raise ValueError(f'File {fname} does not have the expected extension ({expected_extension})')
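The three failure modes of check_file can be exercised with a temporary file. The sketch below redefines the function locally so it runs on its own; `describe` and the file name `annotations.tsv` are hypothetical and used only for illustration.

```python
import os
import tempfile


def check_file(fname, expected_extension='.tsv'):
    """Local copy of the validation logic shown above, for a standalone demo."""
    if fname is None:
        raise ValueError('Annotation file path must be provided')
    if not os.path.exists(fname):
        raise ValueError(f'File {fname} not found')
    if not fname.endswith(expected_extension):
        raise ValueError(f'File {fname} does not have the expected extension ({expected_extension})')


def describe(fname, expected_extension):
    """Return 'ok' when the file passes, otherwise the error message."""
    try:
        check_file(fname, expected_extension)
        return 'ok'
    except ValueError as err:
        return str(err)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'annotations.tsv')
    open(path, 'w').close()  # create an empty placeholder file
    results = [describe(path, '.tsv'), describe(path, '.gz'), describe(None, '.tsv')]

for message in results:
    print(message)
```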
+
+
+
+
+
+def annotate_ptms(spliced_ptms, psp_regulatory_site_file = None, psp_ks_file = None, psp_disease_file = None, elm_interactions = False, elm_motifs = False, PTMint = False, PTMcode_interprotein = False, DEPOD = False, RegPhos = False, ptmsigdb_file = None, interactions_to_combine = ['PTMcode', 'PhosphoSitePlus', 'RegPhos', 'PTMInt'], kinases_to_combine = ['PhosphoSitePlus', 'RegPhos'], combine_similar = True):
+ """
+ Given spliced ptm data, add annotations from various databases. The annotations that can be added are the following:
+ - PhosphoSitePlus
+ - regulatory site data (file must be provided)
+ - kinase-substrate data (file must be provided)
+ - disease association data (file must be provided)
+ - ELM
+ - interaction data (can be downloaded automatically or provided as a file)
+ - motif matches (elm class data can be downloaded automatically or provided as a file)
+ - PTMInt
+ - interaction data (can be downloaded automatically or provided as a file)
+ - PTMcode
+ - intraprotein interactions (currently disabled)
+ - interprotein interactions (can be downloaded automatically or provided as a file)
+ - DEPOD
+ - phosphatase-substrate data (will be downloaded automatically)
+ - RegPhos
+ - kinase-substrate data (can be downloaded automatically or provided as a file)
+ - PTMsigDB
+ - signature data (file must be provided)
+
+ Parameters
+ ----------
+ spliced_ptms: pd.DataFrame
+ Spliced PTM data from project module
+ psp_regulatory_site_file: str
+ File path to PhosphoSitePlus regulatory site data
+ psp_ks_file: str
+ File path to PhosphoSitePlus kinase-substrate data
+ psp_disease_file: str
+ File path to PhosphoSitePlus disease association data
+ elm_interactions: bool or str
+ If True, download ELM interaction data automatically. If str, provide file path to ELM interaction data
+ elm_motifs: bool or str
+ If True, download ELM motif data automatically. If str, provide file path to ELM motif data
+ PTMint: bool or str
+ If True, download PTMInt data automatically. If str, provide file path to PTMInt data (.csv file)
+ PTMcode_intraprotein: bool or str
+ Currently disabled: the intraprotein annotation step is commented out and this argument is not part of the function signature
+ PTMcode_interprotein: bool or str
+ If True, download PTMcode interprotein data automatically. If str, provide file path to PTMcode interprotein data
+ DEPOD: bool
+ If True, download DEPOD data automatically
+ RegPhos: bool or str
+ If True, download RegPhos data automatically. If str, provide file path to RegPhos data (.txt file)
+ ptmsigdb_file: str
+ File path to PTMsigDB data (.xlsx file)
+ interactions_to_combine: list
+ List of databases to combine interaction data from. Default is ['PTMcode', 'PhosphoSitePlus', 'RegPhos', 'PTMInt']
+ kinases_to_combine: list
+ List of databases to combine kinase-substrate data from. Default is ['PhosphoSitePlus', 'RegPhos']
+ combine_similar: bool
+ Whether to combine annotations of similar information (kinase, interactions, etc) from multiple databases into another column labeled as 'Combined'. Default is True
+ """
+ if psp_regulatory_site_file is not None:
+ try:
+ check_file(psp_regulatory_site_file, expected_extension='.gz')
+ spliced_ptms = add_PSP_regulatory_site_data(spliced_ptms, file = psp_regulatory_site_file)
+ except Exception as e:
+ raise RuntimeError(f'Error adding PhosphoSitePlus regulatory site data. Error message: {e}')
+ if psp_ks_file is not None:
+ try:
+ check_file(psp_ks_file, expected_extension='.gz')
+ spliced_ptms = add_PSP_kinase_substrate_data(spliced_ptms, file = psp_ks_file)
+ except Exception as e:
+ raise RuntimeError(f'Error adding PhosphoSitePlus kinase-substrate data. Error message: {e}')
+ if psp_disease_file is not None:
+ try:
+ check_file(psp_disease_file, expected_extension='.gz')
+ spliced_ptms = add_PSP_disease_association(spliced_ptms, file = psp_disease_file)
+ except Exception as e:
+ raise RuntimeError(f'Error adding PhosphoSitePlus disease association data. Error message: {e}')
+ if elm_interactions:
+ try:
+ if isinstance(elm_interactions, bool):
+ spliced_ptms = add_ELM_interactions(spliced_ptms)
+ elif isinstance(elm_interactions, str):
+ check_file(elm_interactions, expected_extension='.tsv')
+ spliced_ptms = add_ELM_interactions(spliced_ptms, file = elm_interactions)
+ else:
+ raise ValueError('elm_interactions must be either a boolean (download elm data automatically, slower) or a string (path to elm data tsv file, faster)')
+ except Exception as e:
+ raise RuntimeError(f'Error adding ELM interaction data. Error message: {e}')
+ if elm_motifs:
+ try:
+ if isinstance(elm_motifs, bool):
+ spliced_ptms = add_ELM_matched_motifs(spliced_ptms)
+ elif isinstance(elm_motifs, str):
+ check_file(elm_motifs, expected_extension='.tsv')
+ spliced_ptms = add_ELM_matched_motifs(spliced_ptms, file = elm_motifs)
+ else:
+ raise ValueError('elm_motifs must be either a boolean (download elm data automatically, slower) or a string (path to elm data tsv file, faster)')
+ except Exception as e:
+ raise RuntimeError(f'Error adding ELM motif matches. Error message: {e}')
+ if PTMint:
+ try:
+ if isinstance(PTMint, bool):
+ spliced_ptms = add_PTMInt_data(spliced_ptms)
+ elif isinstance(PTMint, str):
+ check_file(PTMint, expected_extension='.csv')
+ spliced_ptms = add_PTMInt_data(spliced_ptms, file = PTMint)
+ else:
+ raise ValueError('PTMint must be either a boolean (download PTMInt data automatically, slower) or a string (path to PTMInt data csv file, faster)')
+ except Exception as e:
+ raise RuntimeError(f'Error adding PTMInt interaction data. Error message: {e}')
+ #if PTMcode_intraprotein:
+ # try:
+ # if isinstance(PTMcode_intraprotein, bool):
+ # spliced_ptms = add_PTMcode_intraprotein(spliced_ptms)
+ # elif isinstance(PTMcode_intraprotein, str):
+ # check_file(PTMcode_intraprotein, expected_extension='.gz')
+ # spliced_ptms = add_PTMcode_intraprotein(spliced_ptms, fname = PTMcode_intraprotein)
+ # else:
+ # raise ValueError('PTMcode_intraprotein must be either a boolean (download PTMcode data automatically, slower) or a string (path to PTMcode data file, faster)')
+ # except Exception as e:
+ # print(f'Error adding PTMcode intraprotein interaction data. Error message: {e}')
+ if PTMcode_interprotein:
+ try:
+ if isinstance(PTMcode_interprotein, bool):
+ spliced_ptms = add_PTMcode_interprotein(spliced_ptms)
+ elif isinstance(PTMcode_interprotein, str):
+ check_file(PTMcode_interprotein, expected_extension='.gz')
+ spliced_ptms = add_PTMcode_interprotein(spliced_ptms, fname = PTMcode_interprotein)
+ else:
+ raise ValueError('PTMcode_interprotein must be either a boolean (download PTMcode data automatically, slower) or a string (path to PTMcode data file, faster)')
+ except Exception as e:
+ raise RuntimeError(f'Error adding PTMcode interprotein interaction data. Error message: {e}')
+ if DEPOD:
+ try:
+ spliced_ptms = add_DEPOD_phosphatase_data(spliced_ptms)
+ except Exception as e:
+ raise RuntimeError(f'Error adding DEPOD phosphatase data. Error message: {e}')
+ if RegPhos:
+ try:
+ if isinstance(RegPhos, str):
+ check_file(RegPhos, expected_extension='.txt')
+ spliced_ptms = add_RegPhos_data(spliced_ptms, file = RegPhos)
+ else:
+ spliced_ptms = add_RegPhos_data(spliced_ptms)
+ except Exception as e:
+ raise RuntimeError(f'Error adding RegPhos kinase-substrate data. Error message: {e}')
+ if ptmsigdb_file is not None:
+ try:
+ spliced_ptms = add_PTMsigDB_data(spliced_ptms, file = ptmsigdb_file)
+ except Exception as e:
+ raise RuntimeError(f'Error adding PTMsigDB data. Error message: {e}')
+
+ if combine_similar:
+ interaction_cols = ['PTMcode:Interprotein_Interactions', 'PSP:ON_PROT_INTERACT', 'PSP:Kinase', 'PTMInt:Interaction', 'RegPhos:Kinase', 'DEPOD:Phosphatase']
+ if set(interaction_cols).intersection(spliced_ptms.columns): #only proceed if at least one interaction column is present
+ print('\nCombining interaction data from multiple databases')
+ interact = combine_interaction_data(spliced_ptms, interaction_databases = interactions_to_combine)
+ if not interact.empty:
+ interact['Combined:Interactions'] = interact['Interacting Gene']+'->'+interact['Type']
+ interact = interact.groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'], dropna = False, as_index = False)['Combined:Interactions'].apply(lambda x: ';'.join(np.unique(x)))
+ if 'Combined:Interactions' in spliced_ptms.columns:
+ spliced_ptms = spliced_ptms.drop(columns = ['Combined:Interactions'])
+
+ spliced_ptms = spliced_ptms.merge(interact, how = 'left', on = ['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform'])
+ else:
+ spliced_ptms['Combined:Interactions'] = np.nan
+
+ #check for what kinase data is available
+ spliced_ptms = combine_KS_data(spliced_ptms, ks_databases=kinases_to_combine) #add combined kinase column
+
+
+ return spliced_ptms
+
+
+
+#biopython packages
+from Bio.Data import CodonTable
+
+#standard packages
+import numpy as np
+import pandas as pd
+import re
+
+import tqdm
+import warnings
+
+#PTM pose functions
+from ptm_pose import database_interfacing as di
+from ptm_pose import project
+from ptm_pose import pose_config
+
+
+
+# Get the standard codon table
+codon_table = CodonTable.unambiguous_dna_by_name["Standard"]
+
+
+def translate_flanking_sequence(seq, flank_size = 7, full_flanking_seq = True, lowercase_mod = True, first_flank_length = None, stop_codon_symbol = '*', unknown_codon_symbol = 'X'):
+ """
+ Given a DNA sequence, translate the sequence into an amino acid sequence. If the sequence is not the correct length, the function will attempt to extract the flanking sequence with spaces to account for missing parts if full_flanking_seq is not True. If the sequence is still not the correct length, the function will raise an error. Any unrecognized codons that are found in the sequence and are not in the standard codon table, including stop codons, will be translated as 'X' (unknown) or '*' (stop codon).
+
+ Parameters
+ ----------
+ seq : str
+ DNA sequence to translate
+ flank_size : int, optional
+ Number of amino acids to include flanking the PTM, by default 7
+ full_flanking_seq : bool, optional
+ Whether to require the flanking sequence to be the correct length, by default True
+ lowercase_mod : bool, optional
+ Whether to lowercase the amino acid associated with the PTM, by default True
+ first_flank_length : int, optional
+ Length of the flanking sequence in front of the PTM, by default None. If full_flanking_seq is False and sequence is not the correct length, this is required.
+ stop_codon_symbol : str, optional
+ Symbol to use for stop codons, by default '*'
+ unknown_codon_symbol : str, optional
+ Symbol to use for unknown codons, by default 'X'
+
+ Returns
+ -------
+ str
+ Amino acid sequence of the flanking sequence. A ValueError is raised if the sequence length is not a multiple of 3 or does not match the indicated flank size while full_flanking_seq is True.
+ """
+ aa_seq = ''
+ if len(seq) == flank_size*2*3+3:
+ for i in range(0, len(seq), 3):
+ if seq[i:i+3] in codon_table.forward_table.keys():
+ aa = codon_table.forward_table[seq[i:i+3]]
+ elif seq[i:i+3] in codon_table.stop_codons:
+ aa = stop_codon_symbol
+ else:
+ aa = unknown_codon_symbol
+
+ if i/3 == flank_size and lowercase_mod:
+ aa = aa.lower()
+ aa_seq += aa
+ elif len(seq) % 3 == 0 and not full_flanking_seq:
+ for i in range(0, len(seq), 3):
+ if seq[i:i+3] in codon_table.forward_table.keys():
+ aa = codon_table.forward_table[seq[i:i+3]]
+ elif seq[i:i+3] in codon_table.stop_codons:
+ aa = stop_codon_symbol
+ else:
+ aa = unknown_codon_symbol
+
+ if lowercase_mod and i/3 == first_flank_length:
+ aa = aa.lower()
+ aa_seq += aa
+ elif len(seq) % 3 == 0 and full_flanking_seq:
+ raise ValueError('Provided sequence length does not match indicated flank size. Fix sequence or set full_flanking_seq = False, which requires indicating the length of the flanking sequence in front of the PTM.')
+ elif len(seq) % 3 != 0:
+ raise ValueError('Provided sequence is not a multiple of 3')
+ else:
+ raise ValueError('Unknown error with flanking sequence')
+ return aa_seq
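The codon-by-codon translation above can be illustrated without Biopython. The following is a minimal sketch, not the package's implementation: `translate_flank_sketch` and the hand-built `STANDARD_TABLE` are hypothetical names introduced here for illustration, and the table is assembled from the standard genetic code rather than taken from `Bio.Data.CodonTable`.

```python
from itertools import product

# Standard genetic code written out by hand for illustration (assumption:
# the module itself relies on Bio.Data.CodonTable, not on this dictionary).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_flank_sketch(seq, flank_size, stop_symbol="*", unknown_symbol="X"):
    """Translate a DNA flank of 2*flank_size+1 codons and lowercase the central residue."""
    if len(seq) != 3 * (2 * flank_size + 1):
        raise ValueError("sequence length does not match flank_size")
    residues = []
    for codon_index in range(len(seq) // 3):
        codon = seq[3 * codon_index: 3 * codon_index + 3]
        # Codons missing from the table (e.g. containing N) become the unknown symbol
        aa = STANDARD_TABLE.get(codon, unknown_symbol)
        if codon in STANDARD_TABLE and STANDARD_TABLE[codon] == "*":
            aa = stop_symbol
        # The central codon corresponds to the modified residue
        if codon_index == flank_size:
            aa = aa.lower()
        residues.append(aa)
    return "".join(residues)


print(translate_flank_sketch("GCTTCTGGT", flank_size=1))  # AsG
```

With `flank_size=1` the input holds three codons (GCT, TCT, GGT), so the middle serine is the one that gets lowercased.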
+
+def get_ptm_locs_in_spliced_sequences(ptm_loc_in_flank, first_flank_seq, spliced_seq, second_flank_seq, strand, which_flank = 'First', order_by = 'Coordinates'):
+ """
+ Given the location of a PTM in a flanking sequence, extract the location of the PTM in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence associated with a given splice event. Inclusion Flanking Sequence will include the skipped exon region, retained intron, or longer alternative splice site depending on event type. The PTM location should be associated with where the PTM is located relative to spliced region (before = 'First', after = 'Second').
+
+ Parameters
+ ----------
+ ptm_loc_in_flank : int
+ Location of the PTM in the flanking sequence it is found (either first or second)
+ first_flank_seq : str
+ Flanking exon sequence before the spliced region
+ spliced_seq : str
+ Spliced region sequence
+ second_flank_seq : str
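+ Flanking exon sequence after the spliced region
+ strand : int
+ Strand associated with the splice event (1 for forward, -1 for negative). Only used when order_by is 'Coordinates'.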
+ which_flank : str, optional
+ Which flank the PTM is associated with, by default 'First'
+ order_by : str, optional
+ Whether the first, spliced, and second regions are ordered by genomic coordinates ('Coordinates': first has the smallest coordinate, then spliced, then second) or by translation order ('Translation': first is translated first, and so on), by default 'Coordinates'
+
+ Returns
+ -------
+ tuple
+ Tuple containing the PTM location in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence
+ """
+ if order_by == 'Translation':
+ if which_flank == 'First':
+ inclusion_ptm_loc, exclusion_ptm_loc = ptm_loc_in_flank, ptm_loc_in_flank
+ elif which_flank == 'Second':
+ inclusion_ptm_loc = ptm_loc_in_flank+len(spliced_seq)+len(first_flank_seq)
+ exclusion_ptm_loc = ptm_loc_in_flank+len(first_flank_seq)
+
+ elif order_by == 'Coordinates':
+ #grab codon associated with ptm in sequence
+ if (which_flank == 'First' and strand == 1) or (which_flank == 'Second' and strand == -1):
+ inclusion_ptm_loc, exclusion_ptm_loc = ptm_loc_in_flank, ptm_loc_in_flank
+ elif (strand == -1 and which_flank == 'First'):
+ inclusion_ptm_loc = ptm_loc_in_flank+len(spliced_seq)+len(second_flank_seq)
+ exclusion_ptm_loc = ptm_loc_in_flank+len(second_flank_seq)
+ elif (strand == 1 and which_flank == 'Second'):
+ inclusion_ptm_loc = ptm_loc_in_flank+len(spliced_seq)+len(first_flank_seq)
+ exclusion_ptm_loc = ptm_loc_in_flank+len(first_flank_seq)
+ else:
+ raise ValueError('Unknown order_by value, must be either Coordinates (first, spliced and second regions are determined by genomic coordinates) or Translation (first, spliced and second regions are determined by translation)')
+
+ return int(inclusion_ptm_loc), int(exclusion_ptm_loc)
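The offset arithmetic for the translation-ordered case can be checked on a toy example. This is a minimal sketch under stated assumptions: `ptm_offsets_sketch` is a hypothetical helper that mirrors only the `order_by = 'Translation'` branch, and the sequences are made up.

```python
def ptm_offsets_sketch(ptm_loc_in_flank, first_flank_seq, spliced_seq, which_flank):
    """Return (inclusion_offset, exclusion_offset) for regions ordered by translation."""
    if which_flank == "First":
        # A PTM in the first flank sits at the same offset in both isoforms
        return ptm_loc_in_flank, ptm_loc_in_flank
    if which_flank == "Second":
        # A PTM in the second flank is shifted by everything translated before it
        shift = len(first_flank_seq)
        return ptm_loc_in_flank + shift + len(spliced_seq), ptm_loc_in_flank + shift
    raise ValueError("which_flank must be 'First' or 'Second'")


# Made-up sequences: the PTM codon (TCT) is the first codon of the second flank
first, spliced, second = "AAAGGG", "CCC", "TCTAAA"
inclusion_seq = first + spliced + second
exclusion_seq = first + second

inc, exc = ptm_offsets_sketch(0, first, spliced, "Second")
# Both offsets should point at the same codon in their respective sequences
print(inclusion_seq[inc:inc + 3], exclusion_seq[exc:exc + 3])
```

Both slices recover the same codon, which is the property the downstream codon check relies on before extracting flanking sequences.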
+
+
+def get_flanking_sequence(ptm_loc, seq, ptm_residue, flank_size = 5, lowercase_mod = True, full_flanking_seq = False):
+ """
+ Given a PTM location in a sequence of DNA, extract the flanking sequence around the PTM location and translate into the amino acid sequence. If the sequence is not the correct length, the function will attempt to extract the flanking sequence with spaces to account for missing parts if full_flanking_seq is not True. If the sequence is still not the correct length, the function will raise an error. Any unrecognized codons that are found in the sequence and are not in the standard codon table, including stop codons, will be translated as 'X' (unknown) or '*' (stop codon).
+
+ Parameters
+ ----------
+ ptm_loc : int
+ Location of the first base pair associated with PTM in the DNA sequence
+ seq : str
+ DNA sequence containing the PTM
+ ptm_residue : str
+ Amino acid residue associated with the PTM
+ flank_size : int, optional
+ Number of amino acids to include flanking the PTM, by default 5
+ lowercase_mod : bool, optional
+ Whether to lowercase the amino acid associated with the PTM, by default True
+ full_flanking_seq : bool, optional
+ Whether to require the flanking sequence to be the correct length, by default False
+
+ Returns
+ -------
+ str
+ Amino acid sequence of the flanking sequence around the PTM if translation was successful, otherwise np.nan
+ """
+ ptm_codon = seq[ptm_loc:ptm_loc+3]
+ #check if ptm codon codes for amino acid and then extract flanking sequence
+ if ptm_codon in codon_table.forward_table.keys():
+ if codon_table.forward_table[ptm_codon] == ptm_residue:
+ if len(seq) != 3*(flank_size*2+1):
+ if full_flanking_seq:
+ raise ValueError('Flanking sequence is not the correct length, please fix or set full_flanking_seq to False')
+ else:
+ #check where issue is, at start or end of sequence
+ enough_at_start = ptm_loc >= flank_size*3
+ enough_at_end = len(seq) - ptm_loc >= flank_size*3+3
+ #extract length with amino acids and add cushion for missing parts
+ front_length = flank_size*3 if enough_at_start else ptm_loc
+ start_cushion = (flank_size*3 - ptm_loc)*' ' if not enough_at_start else ''
+ end_length = flank_size*3 + 3 if enough_at_end else len(seq) - ptm_loc
+ end_cushion = (flank_size*3 + 3 - (len(seq) - ptm_loc))*' ' if not enough_at_end else '' #end portion includes the PTM codon (3 bp) plus the flank
+ #reconstruct sequence with spaces to account for missing ends
+ flanking_seq_bp = start_cushion + seq[ptm_loc-front_length:ptm_loc+end_length] + end_cushion
+ else:
+ flanking_seq_bp = seq[ptm_loc-(flank_size*3):ptm_loc+(flank_size*3)+3]
+ flanking_seq_aa = translate_flanking_sequence(flanking_seq_bp, flank_size = flank_size, lowercase_mod=lowercase_mod, full_flanking_seq = full_flanking_seq)
+ else:
+ flanking_seq_aa = np.nan
+ else:
+ flanking_seq_aa = np.nan
+
+ return flanking_seq_aa
+
+def extract_region_from_splicegraph(splicegraph, region_id):
+ """
+ Given a region id and the splicegraph from SpliceSeq, extract the chromosome, strand, and start and stop locations of that exon. Start and stop are forced to be in ascending order, which is not necessarily true from the splice graph (i.e. start > stop for negative strand exons). This is done to make the region extraction consistent with the rest of the codebase.
+
+ Parameters
+ ----------
+ splicegraph : pandas.DataFrame
+ SpliceSeq splicegraph dataframe, with region_id as index
+ region_id : str
+ Region ID to extract information from, in the format of 'GeneName_ExonNumber'
+
+ Returns
+ -------
+ list
+ List containing the chromosome, strand (1 for forward, -1 for negative), start, and stop locations of the region
+ """
+ region_info = splicegraph.loc[region_id]
+
+ #check to see how many regions correspond to id, if multiple, default to first entry
+ if isinstance(region_info, pd.DataFrame):
+ region_info = region_info.iloc[0]
+ print(f'Warning: {region_id} has multiple entries in splicegraph. Defaulting to first entry.')
+
+ strand = project.convert_strand_symbol(region_info['Strand'])
+ if strand == 1:
+ return [region_info['Chromosome'], strand,region_info['Chr_Start'], region_info['Chr_Stop']]
+ else:
+ return [region_info['Chromosome'], strand,region_info['Chr_Stop'], region_info['Chr_Start']]
+
+
+def get_spliceseq_event_regions(gene_name, from_exon, spliced_exons, to_exon, splicegraph):
+ """
+ Given all exons associated with a splicegraph event, obtain the coordinates associated with the flanking exons and the spliced region. The spliced region is defined as the exons that are associated with psi values, while flanking regions include the "from" and "to" exons that indicate the adjacent, unspliced exons.
+
+ Parameters
+ ----------
+ gene_name : str
+ Gene name associated with the splice event
+ from_exon : int
+ Exon number associated with the first flanking exon
+ spliced_exons : str
+ Exon numbers associated with the spliced region, separated by colons for each unique exon
+ to_exon : int
+ Exon number associated with the second flanking exon
+ splicegraph : pandas.DataFrame
+ DataFrame containing information about individual exons and their coordinates
+
+ Returns
+ -------
+ tuple
+ Tuple containing the genomic coordinates of the first flanking region, spliced regions, and second flanking region
+ """
+ first_exon_region = extract_region_from_splicegraph(splicegraph, region_id = gene_name+'_'+str(from_exon))
+ spliced_regions = [extract_region_from_splicegraph(splicegraph, gene_name+'_'+exon) if '.' in exon else extract_region_from_splicegraph(splicegraph, gene_name+'_'+exon+'.0') for exon in spliced_exons.split(':')]
+ second_exon_region = extract_region_from_splicegraph(splicegraph, region_id = gene_name+'_'+str(to_exon))
+ return first_exon_region, spliced_regions, second_exon_region
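The region IDs looked up in the splicegraph follow a simple naming pattern, which can be sketched in isolation. This is an illustrative sketch only: `build_region_ids_sketch` is a hypothetical helper and `GENE1` is a placeholder gene name; it mirrors how the function above appends `.0` to spliced exons that lack a sub-exon suffix.

```python
def build_region_ids_sketch(gene_name, from_exon, spliced_exons, to_exon):
    """Build splicegraph region IDs for the flanking exons and the spliced exons."""
    first_id = f"{gene_name}_{from_exon}"
    # Spliced exons arrive as a colon-separated string; exons without a
    # sub-exon suffix are given '.0' so they match the splicegraph index
    spliced_ids = [
        f"{gene_name}_{exon}" if "." in exon else f"{gene_name}_{exon}.0"
        for exon in spliced_exons.split(":")
    ]
    second_id = f"{gene_name}_{to_exon}"
    return first_id, spliced_ids, second_id


ids = build_region_ids_sketch("GENE1", 2, "3:4.1", 5)
print(ids)
```

Each resulting ID would then be passed to `extract_region_from_splicegraph` to obtain its coordinates.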
+
+
+
+
+
+def get_flanking_changes(ptm_coordinates, chromosome, strand, first_flank_region, spliced_region, second_flank_region, gene = None, dPSI = None, sig = None, event_id = None, flank_size = 5, coordinate_type = 'hg38', lowercase_mod = True, order_by = 'Coordinates'):
+ """
+ Currently has been tested with MATS splicing events.
+
+ Given flanking and spliced regions associated with a splice event, identify PTMs that have potential to have an altered flanking sequence depending on whether spliced region is included or excluded (if PTM is close to splice boundary). For these PTMs, extract the flanking sequences associated with the inclusion and exclusion cases and translate into amino acid sequences. If the PTM is not associated with a codon that codes for the expected amino acid, the PTM will be excluded from the results.
+
+ Parameters
+ ----------
+ ptm_coordinates : pandas.DataFrame
+ DataFrame containing PTM coordinate information used to identify PTMs in the flanking regions
+ chromosome : str
+ Chromosome associated with the splice event
+ strand : int
+ Strand associated with the splice event (1 for forward, -1 for negative)
+ first_flank_region : list
+ List containing the start and stop locations of the first flanking region (first is currently defined based on location in the genome, not the coding sequence)
+ spliced_region : list
+ List containing the start and stop locations of the spliced region
+ second_flank_region : list
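+ List containing the start and stop locations of the second flanking region (second is currently defined based on location in the genome, not the coding sequence)
+ gene : str, optional
+ Gene name associated with the splice event, used when searching for PTMs in the flanking regions, by default None
+ dPSI : float, optional
+ Change in PSI for the splice event, added to the output if provided, by default None
+ sig : float, optional
+ Significance value for the splice event, added to the output if provided, by default None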
+ event_id : str, optional
+ Event ID associated with the splice event, by default None
+ flank_size : int, optional
+ Number of amino acids to include flanking the PTM, by default 5
+ coordinate_type : str, optional
+ Coordinate system used for the regions, by default 'hg38'. The other option is 'hg19'.
+ lowercase_mod : bool, optional
+ Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
+ order_by : str, optional
+ Whether the first, spliced, and second regions are ordered by genomic coordinates ('Coordinates': first has the smallest coordinate, then spliced, then second) or by translation order ('Translation': first is translated first, and so on), by default 'Coordinates'
+
+
+ Returns
+ -------
+ pandas.DataFrame
+ DataFrame containing the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
+ """
+ strand = project.convert_strand_symbol(strand)
+ #check first flank for ptms
+ ptms_in_region_first_flank = project.find_ptms_in_region(ptm_coordinates, chromosome, strand, first_flank_region[0], first_flank_region[1], gene = gene, coordinate_type = coordinate_type)
+ if not ptms_in_region_first_flank.empty:
+ ptms_in_region_first_flank = ptms_in_region_first_flank[ptms_in_region_first_flank['Proximity to Region End (bp)'] < flank_size*3]
+ ptms_in_region_first_flank['Region'] = 'First'
+ #check second flank for ptms
+ ptms_in_region_second_flank = project.find_ptms_in_region(ptm_coordinates, chromosome, strand, second_flank_region[0], second_flank_region[1], gene = gene, coordinate_type = coordinate_type)
+ if not ptms_in_region_second_flank.empty:
+ ptms_in_region_second_flank = ptms_in_region_second_flank[ptms_in_region_second_flank['Proximity to Region Start (bp)'] < flank_size*3]
+ ptms_in_region_second_flank['Region'] = 'Second'
+
+ #combine
+ ptms_in_region = pd.concat([ptms_in_region_first_flank, ptms_in_region_second_flank])
+
+
+ if ptms_in_region.empty:
+ return pd.DataFrame()
+ else:
+
+ #add chromosome/strand info to region info for ensembl query
+ first_flank_region_query = [chromosome, strand] + first_flank_region
+ spliced_region_query = [chromosome, strand] + spliced_region
+ second_flank_region_query = [chromosome, strand] + second_flank_region
+ regions_list = [first_flank_region_query, spliced_region_query, second_flank_region_query]
+
+ #get dna sequences associated with regions
+ first_flank_seq, spliced_seq, second_flank_seq = di.get_region_sequences_from_list(regions_list, coordinate_type = coordinate_type)
+
+ #combine sequences for inclusion and exclusion cases
+ if strand == 1:
+ inclusion_seq = first_flank_seq + spliced_seq + second_flank_seq
+ exclusion_seq = first_flank_seq + second_flank_seq
+ else:
+ inclusion_seq = second_flank_seq + spliced_seq + first_flank_seq
+ exclusion_seq = second_flank_seq + first_flank_seq
+
+ #go through all ptms in region within range of splice boundary and grab flanking sequences
+ translate_success_list = []
+ inclusion_seq_list = []
+ exclusion_seq_list = []
+ flank_region_list = []
+ for i, ptm in ptms_in_region.iterrows():
+ ptm_loc = ptm[f'Gene Location ({coordinate_type})']
+ flank_region_loc = ptm['Region']
+ flank_region = first_flank_region if flank_region_loc == 'First' else second_flank_region
+ #grab ptm loc based on which strand ptm is on
+ if strand == 1:
+ relative_ptm_loc = int(ptm_loc - flank_region[0])
+ else:
+ relative_ptm_loc = int(flank_region[1] - ptm_loc)
+
+
+ #grab where ptm is located in both the inclusion and exclusion event
+ inclusion_ptm_loc, exclusion_ptm_loc = get_ptm_locs_in_spliced_sequences(relative_ptm_loc, first_flank_seq, spliced_seq, second_flank_seq,strand = strand, which_flank = flank_region_loc, order_by = order_by)
+
+ #grab codon associated with ptm in sequence
+ ptm_codon_inclusion = inclusion_seq[inclusion_ptm_loc:inclusion_ptm_loc+3]
+ ptm_codon_exclusion = exclusion_seq[exclusion_ptm_loc:exclusion_ptm_loc+3]
+
+
+ #check if ptm codon codes for amino acid and then extract flanking sequence
+ correct_seq = False
+ if ptm_codon_inclusion in codon_table.forward_table.keys() and ptm_codon_exclusion in codon_table.forward_table.keys():
+ if codon_table.forward_table[ptm_codon_inclusion] == ptm['Residue'] and codon_table.forward_table[ptm_codon_exclusion] == ptm['Residue'] and exclusion_ptm_loc-(flank_size*3) >= 0 and len(exclusion_seq) >= exclusion_ptm_loc+(flank_size*3)+3:
+ inclusion_flanking_seq = inclusion_seq[inclusion_ptm_loc-(flank_size*3):inclusion_ptm_loc+(flank_size*3)+3]
+ exclusion_flanking_seq = exclusion_seq[exclusion_ptm_loc-(flank_size*3):exclusion_ptm_loc+(flank_size*3)+3]
+ correct_seq = True
+
+
+ #check to make sure ptm matches expected residue
+ if correct_seq:
+ translate_success_list.append(True)
+
+ #translate flanking sequences
+ inclusion_aa = translate_flanking_sequence(inclusion_flanking_seq, flank_size = flank_size, lowercase_mod=lowercase_mod)
+ exclusion_aa = translate_flanking_sequence(exclusion_flanking_seq, flank_size = flank_size, lowercase_mod=lowercase_mod)
+
+ #append to lists
+ inclusion_seq_list.append(inclusion_aa)
+ exclusion_seq_list.append(exclusion_aa)
+ flank_region_list.append(flank_region_loc)
+ else:
+ translate_success_list.append(False)
+ inclusion_seq_list.append(np.nan)
+ exclusion_seq_list.append(np.nan)
+ flank_region_list.append(flank_region_loc)
+
+ #grab useful info from ptm dataframe
+ if gene is not None:
+ ptms_in_region = ptms_in_region[['Source of PTM', 'Gene', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']].reset_index(drop = True)
+ else:
+ ptms_in_region = ptms_in_region[['Source of PTM', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', 'Modification Class']].reset_index(drop = True)
+ #add flanking sequence information to ptm dataframe
+ ptms_in_region['Inclusion Flanking Sequence'] = inclusion_seq_list
+ ptms_in_region['Exclusion Flanking Sequence'] = exclusion_seq_list
+ ptms_in_region['Region'] = flank_region_list
+ ptms_in_region['Translation Success'] = translate_success_list
+
+ if event_id is not None:
+ ptms_in_region.insert(0, 'Event ID', event_id)
+ if dPSI is not None:
+ ptms_in_region['dPSI'] = dPSI
+ if sig is not None:
+ ptms_in_region['Significance'] = sig
+
+ return ptms_in_region
+
+
+def get_flanking_changes_from_splice_data(splice_data, ptm_coordinates = None, chromosome_col = None, strand_col = None, first_flank_start_col = None, first_flank_end_col = None, spliced_region_start_col = None, spliced_region_end_col = None, second_flank_start_col = None, second_flank_end_col = None, dPSI_col = None, sig_col = None, event_id_col = None, gene_col = None, flank_size = 5, coordinate_type = 'hg38', lowercase_mod = True):
+ """
+ Given a DataFrame containing information about splice events, extract the flanking sequences associated with the PTMs in the flanking regions if there is potential for this to be altered. The DataFrame should contain columns for the chromosome, strand, start and stop locations of the first flanking region, spliced region, and second flanking region. The DataFrame should also contain a column for the event ID associated with the splice event. If the DataFrame does not contain the necessary columns, the function will raise an error.
+
+ Parameters
+ ----------
+ splice_data : pandas.DataFrame
+ DataFrame containing information about splice events
+ ptm_coordinates : pandas.DataFrame
+ DataFrame containing PTM coordinate information used to identify PTMs in the flanking regions
+ chromosome_col : str, optional
+ Column name indicating chromosome, by default None
+ strand_col : str, optional
+ Column name indicating strand, by default None
+ first_flank_start_col : str, optional
+ Column name indicating start location of the first flanking region, by default None
+ first_flank_end_col : str, optional
+ Column name indicating end location of the first flanking region, by default None
+ spliced_region_start_col : str, optional
+ Column name indicating start location of the spliced region, by default None
+ spliced_region_end_col : str, optional
+ Column name indicating end location of the spliced region, by default None
+ second_flank_start_col : str, optional
+ Column name indicating start location of the second flanking region, by default None
+ second_flank_end_col : str, optional
+ Column name indicating end location of the second flanking region, by default None
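+ event_id_col : str, optional
+ Column name indicating event ID, by default None. If None, the DataFrame index is used as the event ID.
+ dPSI_col : str, optional
+ Column name indicating the change in PSI for each event, by default None
+ sig_col : str, optional
+ Column name indicating the significance of each event, by default None
+ gene_col : str, optional
+ Column name indicating the gene associated with each event, by default None
+ flank_size : int, optional
+ Number of amino acids to include flanking the PTM, by default 5
+ coordinate_type : str, optional
+ Coordinate system used for the regions, by default 'hg38'. The other option is 'hg19'.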
+ lowercase_mod : bool, optional
+ Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
+
+ Returns
+ -------
+ pandas.DataFrame
+ DataFrame containing the PTMs near splice boundaries whose flanking sequences differ between the inclusion and exclusion cases, along with the translated flanking sequences for each case
+ """
+ #load ptm data from config if not provided
+ if ptm_coordinates is None and pose_config.ptm_coordinates is not None:
+ ptm_coordinates = pose_config.ptm_coordinates
+ elif ptm_coordinates is None:
+ raise ValueError('ptm_coordinates dataframe not provided and not found in the resource files. Please provide the ptm_coordinates dataframe with pose_config.download_ptm_coordinates() or download the file manually. To avoid needing to download this file each time, run pose_config.download_ptm_coordinates(save = True) to save the file locally within the package directory (will take ~63MB of storage space)')
+
+ #check to make sure all required columns are provided
+ if any(col is None for col in [chromosome_col, strand_col, first_flank_start_col, first_flank_end_col, spliced_region_start_col, spliced_region_end_col, second_flank_start_col, second_flank_end_col]):
+ raise ValueError('Please provide column names for chromosome, strand, first flank start, first flank end, spliced region start, spliced region end, second flank start, and second flank end.')
+
+ #if chromosome is labeled with a 'chr' prefix, remove the prefix
+ if splice_data[chromosome_col].str.contains('chr').any():
+ splice_data[chromosome_col] = splice_data[chromosome_col].str.removeprefix('chr')
+
+
+ results = []
+ for i, event in tqdm.tqdm(splice_data.iterrows(), total = splice_data.shape[0], desc = 'Finding flanking sequences for PTMs nearby splice junctions'):
+ if event_id_col is None:
+ event_id = i
+ else:
+ event_id = event[event_id_col]
+
+ #get gene info
+ chromosome = event[chromosome_col]
+ strand = event[strand_col]
+ gene = event[gene_col] if gene_col is not None else None
+ dPSI = event[dPSI_col] if dPSI_col is not None else None
+ sig = event[sig_col] if sig_col is not None else None
+
+ #extract region info
+ first_flank_region = [event[first_flank_start_col], event[first_flank_end_col]]
+ spliced_region = [event[spliced_region_start_col], event[spliced_region_end_col]]
+ second_flank_region = [event[second_flank_start_col], event[second_flank_end_col]]
+
+ #get flanking changes
+ ptm_flanks = get_flanking_changes(ptm_coordinates, chromosome, strand, first_flank_region, spliced_region, second_flank_region, gene = gene, sig = sig, dPSI = dPSI, event_id = event_id, flank_size = flank_size, coordinate_type = coordinate_type, lowercase_mod=lowercase_mod)
+
+ #append to results
+ results.append(ptm_flanks)
+
+ results = pd.concat(results)
+ #combine and remove any failed translation attempts
+ if not results.empty:
+ results = results[results['Translation Success']]
+
+ #do some quick comparison of flanking sequences
+ if not results.empty:
+ #find flanking sequences that have changed and only keep those
+ results['Matched'] = results['Inclusion Flanking Sequence'] == results['Exclusion Flanking Sequence']
+ results = results[~results['Matched']]
+ results = results.drop(columns=['Matched'])
+ results['Stop Codon Introduced'] = (results['Inclusion Flanking Sequence'].str.contains(r'\*')) | (results['Exclusion Flanking Sequence'].str.contains(r'\*'))
+
+ print(f'{results.shape[0]} PTMs found with potential for altered flanking sequences.')
+ else:
+ print('No PTMs found with potential for altered flanking sequences.')
+ return results
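The final filtering step keeps only PTMs whose flanking sequence differs between isoforms and flags any stop codon. A minimal pure-Python sketch of that comparison follows; `summarize_flank_change_sketch` is a hypothetical helper, the peptide strings are invented, and the function above performs the equivalent operations on pandas columns instead.

```python
def summarize_flank_change_sketch(inclusion_flank, exclusion_flank):
    """Compare translated flanks from the inclusion and exclusion isoforms."""
    return {
        "changed": inclusion_flank != exclusion_flank,
        # '*' is the default stop codon symbol used during translation
        "stop_codon_introduced": "*" in inclusion_flank or "*" in exclusion_flank,
    }


# Invented flanking sequences; the lowercase residue marks the modified site
flank_pairs = [
    ("AKRsPLE", "AKRsPLE"),  # identical in both isoforms, would be dropped
    ("AKRsPLE", "AKRsG*Q"),  # altered, with a stop codon in the exclusion case
]
summaries = [summarize_flank_change_sketch(inc, exc) for inc, exc in flank_pairs]
changed_only = [s for s in summaries if s["changed"]]
print(changed_only)
```

Only the second pair survives the filter, and it is the one reported with a stop codon.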
+
+
+def get_spliceseq_flank_loc(ptm, strand, from_region_coords, to_region_coords, coordinate_type = 'hg19'):
+ """
+ Given ptm information for identifying flanking sequences from splicegraph information, extract the relative location of the ptm in the flanking region (where it is located in translation of the flanking region).
+
+ Parameters
+ ----------
+ ptm : pandas.Series
+ Series containing PTM information
+ strand : int
+ Strand associated with the splice event (1 for forward, -1 for negative)
+ from_region_coords : list
+ List containing the chromosome, strand, start, and stop locations of the first flanking region
+ to_region_coords : list
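+ List containing the chromosome, strand, start, and stop locations of the second flanking region
+ coordinate_type : str, optional
+ Coordinate system of the PTM gene location column to use, by default 'hg19'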
+
+ Returns
+ -------
+ int
+ Relative location of the PTM in the flanking region
+ """
+ if strand == 1 and ptm['Which Flank'] == 'First':
+ return ptm[f'Gene Location ({coordinate_type})'] - from_region_coords[-2]
+ elif strand == 1 and ptm['Which Flank'] == 'Second':
+ return ptm[f'Gene Location ({coordinate_type})'] - to_region_coords[-2]
+ elif strand == -1 and ptm['Which Flank'] == 'First':
+ return from_region_coords[-1] - ptm[f'Gene Location ({coordinate_type})']
+ else:
+ return to_region_coords[-1] - ptm[f'Gene Location ({coordinate_type})']
+
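+def get_ptms_in_splicegraph_flank(gene_name, chromosome, strand, flank_region_start, flank_region_end, coordinate_type = 'hg19', which_flank = 'First', flank_size = 5):
+ """
+ Find PTMs located in a flanking exon of a SpliceSeq event and keep only those within flank_size*3 base pairs of the splice boundary, since only these can have an altered flanking sequence. PTM coordinates are read from pose_config.ptm_coordinates.
+
+ Parameters
+ ----------
+ gene_name, chromosome, strand
+ Gene name, chromosome, and strand (1 or -1) associated with the splice event
+ flank_region_start, flank_region_end : int
+ Start and end coordinates of the flanking region
+ coordinate_type : str, optional
+ Coordinate system used for the region, by default 'hg19'
+ which_flank : str, optional
+ 'First' keeps PTMs near the region end, 'Second' keeps PTMs near the region start, by default 'First'
+ flank_size : int, optional
+ Number of amino acids flanking the PTM, by default 5
+ """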
+ #check for ptms in first flank region
+ flank_ptms = project.find_ptms_in_region(ptm_coordinates = pose_config.ptm_coordinates, chromosome = chromosome, strand = strand, start = flank_region_start, end = flank_region_end, coordinate_type = coordinate_type, gene = gene_name)
+ if not flank_ptms.empty and which_flank == 'First': #if ptms found region, grab those close enough to splice boundary to have impacted flanking sequence
+ flank_ptms = flank_ptms[flank_ptms['Proximity to Region End (bp)'] < flank_size*3]
+ flank_ptms['Which Flank'] = 'First'
+ elif not flank_ptms.empty and which_flank == 'Second': #if ptms found region, grab those close enough to splice boundary to have impacted flanking sequence
+ flank_ptms = flank_ptms[flank_ptms['Proximity to Region Start (bp)'] < flank_size*3]
+ flank_ptms['Which Flank'] = 'Second'
+
+ return flank_ptms
+
+def get_flank_changes_from_splicegraph_single_event(event_row, splicegraph, event_id_col = None, dPSI_col = None, sig_col = None, extra_cols = None, flank_size = 5, coordinate_type = 'hg19'):
+ region_id = event_row[event_id_col] if event_id_col is not None else None
+ dPSI = event_row[dPSI_col] if dPSI_col is not None else None
+ sig = event_row[sig_col] if sig_col is not None else None
+
+ #get region info
+ from_region_coords, spliced_region_coords, to_region_coords = get_spliceseq_event_regions(gene_name = event_row['symbol'], from_exon = event_row['from_exon'], spliced_exons = event_row['exons'], to_exon = event_row['to_exon'], splicegraph = splicegraph)
+ chromosome = from_region_coords[0]
+ strand = from_region_coords[1]
+
+ from_flank_ptms = get_ptms_in_splicegraph_flank(event_row['symbol'], chromosome, strand, from_region_coords[-2], from_region_coords[-1], coordinate_type = coordinate_type, which_flank = 'First', flank_size = flank_size)
+ to_flank_ptms = get_ptms_in_splicegraph_flank(event_row['symbol'], chromosome, strand, to_region_coords[-2], to_region_coords[-1], coordinate_type = coordinate_type, which_flank = 'Second', flank_size = flank_size)
+ ptms_of_interest = pd.concat([from_flank_ptms, to_flank_ptms]).reset_index()
+
+
+ #if any ptms found for event that could have altered flanking sequences extract sequence information
+ if not ptms_of_interest.empty:
+ #add additional context from splice data, if indicated
+ if event_id_col is not None:
+ ptms_of_interest['Region ID'] = region_id
+
+ if dPSI_col is not None:
+ ptms_of_interest['dPSI'] = dPSI
+
+ if sig_col is not None:
+ ptms_of_interest['Significance'] = sig
+
+ if extra_cols is not None:
+ for col in extra_cols:
+ ptms_of_interest[col] = event_row[col]
+
+
+ region_list = [from_region_coords] + spliced_region_coords + [to_region_coords]
+ seqs = di.get_region_sequences_from_list(region_list, coordinate_type = coordinate_type)
+ from_sequence = seqs[0]
+ to_sequence = seqs[-1]
+ spliced_sequence = ''.join(seqs[1:-1]) #combine all sequences from spliced region (may be multiple exons)
+
+ inclusion_sequence = seqs[0] + ''.join(seqs[1:-1]) + seqs[-1] #combine sequences if spliced region is included
+ exclusion_sequence = seqs[0] + seqs[-1] #combine sequences if spliced region is excluded
+
+ #initialize columns for flanking sequences
+ ptms_of_interest['Inclusion Flanking Sequence'] = ''
+ ptms_of_interest['Exclusion Flanking Sequence'] = ''
+ for i, ptm in ptms_of_interest.iterrows():
+ ptm_loc_in_flank = get_spliceseq_flank_loc(ptm, strand, from_region_coords, to_region_coords, coordinate_type = coordinate_type)
+ #grab where ptm is located in both the inclusion and exclusion event
+ inclusion_ptm_loc, exclusion_ptm_loc = get_ptm_locs_in_spliced_sequences(ptm_loc_in_flank, from_sequence, spliced_sequence, to_sequence, strand = strand, which_flank = ptm['Which Flank'], order_by = 'Translation')
+
+ #extract expected flanking sequence based on location in sequence
+ inclusion_flank = get_flanking_sequence(inclusion_ptm_loc, inclusion_sequence, ptm_residue = ptm['Residue'], flank_size = flank_size, full_flanking_seq = False)
+ exclusion_flank = get_flanking_sequence(exclusion_ptm_loc, exclusion_sequence, ptm_residue = ptm['Residue'], flank_size = flank_size, full_flanking_seq = False)
+
+ #add to dataframe
+ ptms_of_interest.loc[i, 'Inclusion Flanking Sequence'] = inclusion_flank
+ ptms_of_interest.loc[i, 'Exclusion Flanking Sequence'] = exclusion_flank
+
+ #trim the expected flanking sequence
+ #ptms_of_interest['Expected Flanking Sequence'] = ptms_of_interest['Expected Flanking Sequence'].apply(lambda x: x[int((len(x)-1)/2-flank_size):int((len(x)-1)/2+flank_size+1)] if x == x else np.nan)
+ #find flanking sequences that have changed and only keep those
+ ptms_of_interest['Matched'] = ptms_of_interest['Inclusion Flanking Sequence'] == ptms_of_interest['Exclusion Flanking Sequence']
+ ptms_of_interest = ptms_of_interest[~ptms_of_interest['Matched']].copy() #copy so that new columns can be added without chained-assignment warnings
+ ptms_of_interest = ptms_of_interest.drop(columns=['Matched'])
+ ptms_of_interest['Stop Codon Introduced'] = (ptms_of_interest['Inclusion Flanking Sequence'].str.contains(r'\*')) | (ptms_of_interest['Exclusion Flanking Sequence'].str.contains(r'\*'))
+
+
+ return ptms_of_interest
+
+[docs]def get_flanking_changes_from_splicegraph(psi_data, splicegraph, ptm_coordinates = None, dPSI_col = None, sig_col = None, event_id_col = None, extra_cols = None, gene_col = 'symbol', flank_size = 5, coordinate_type = 'hg19'):
+ """
+ Given a DataFrame of splice events obtained from SpliceSeq and the corresponding splicegraph, extract the flanking sequences of PTMs located near a splice boundary, where the flanking sequence may be altered. Coordinates of the individual exons are taken from the splicegraph. Columns containing delta PSI or significance values can also be provided, and any other columns to carry through to the output can be listed with the extra_cols parameter.
+
+ Parameters
+ ----------
+ psi_data : pandas.DataFrame
+ DataFrame containing information about splice events obtained from SpliceSeq
+ splicegraph : pandas.DataFrame
+ DataFrame containing information about individual exons and their coordinates
+ ptm_coordinates : pandas.DataFrame
+ DataFrame containing PTM coordinate information, used to identify PTMs in the flanking regions. If None, the ptm_coordinates dataframe from pose_config is used.
+ dPSI_col : str, optional
+ Column name indicating delta PSI value, by default None
+ sig_col : str, optional
+ Column name indicating significance of the event, by default None
+ event_id_col : str, optional
+ Column name indicating event ID, by default None
+ extra_cols : list, optional
+ List of column names for additional information to add to the results, by default None
+ gene_col : str, optional
+ Column name indicating gene symbol of spliced gene, by default 'symbol'
+ flank_size : int, optional
+ Number of amino acids to include flanking the PTM, by default 5
+ coordinate_type : str, optional
+ Coordinate system used for the regions, by default 'hg19'. The other option is 'hg38'.
+
+ Returns
+ -------
+ altered_flanks : pandas.DataFrame
+ DataFrame containing the PTMs whose flanking sequences are altered, along with the flanking sequences that arise when the spliced region is included or excluded
+ """
+ #load ptm data from config if not provided
+ if ptm_coordinates is None and pose_config.ptm_coordinates is not None:
+ ptm_coordinates = pose_config.ptm_coordinates.copy()
+ elif ptm_coordinates is None:
+ raise ValueError('ptm_coordinates dataframe not provided and not found in the resource files. Please provide the ptm_coordinates dataframe with pose_config.download_ptm_coordinates() or download the file manually. To avoid needing to download this file each time, run pose_config.download_ptm_coordinates(save = True) to save the file locally within the package directory (will take ~63MB of storage space)')
+
+ #load spliceseq
+ splicegraph['Region ID'] = splicegraph['Symbol'] + '_' + splicegraph['Exon'].astype(str)
+ splicegraph.index = splicegraph['Region ID'].values
+
+ data_for_flanks = psi_data.drop_duplicates().copy()
+
+ #extract relevant columns
+ relevant_columns = ['as_id', 'splice_type', gene_col, 'from_exon', 'exons', 'to_exon']
+ if event_id_col is not None:
+ relevant_columns.append(event_id_col)
+ if dPSI_col is not None:
+ relevant_columns.append(dPSI_col)
+ if sig_col is not None:
+ relevant_columns.append(sig_col)
+ if extra_cols is not None:
+ relevant_columns.extend(extra_cols)
+
+ data_for_flanks = data_for_flanks[relevant_columns].drop_duplicates()
+ data_for_flanks = data_for_flanks.dropna(subset = ['from_exon', 'to_exon'])
+ data_for_flanks['from_region_id'] = data_for_flanks[gene_col]+'_'+data_for_flanks['from_exon'].astype(str)
+ data_for_flanks['to_region_id'] = data_for_flanks[gene_col]+'_'+data_for_flanks['to_exon'].astype(str)
+
+ #get coordinates for the different regions
+ altered_flanks = []
+ for i, row in tqdm.tqdm(data_for_flanks.iterrows(), total = data_for_flanks.shape[0], desc = 'Finding flanking changes for splicegraph events'):
+ single_event_altered_flanks = get_flank_changes_from_splicegraph_single_event(row, splicegraph, event_id_col = event_id_col, dPSI_col = dPSI_col, sig_col = sig_col, extra_cols = extra_cols, flank_size = flank_size, coordinate_type = coordinate_type)
+
+ altered_flanks.append(single_event_altered_flanks)
+
+ altered_flanks = pd.concat(altered_flanks)
+ return altered_flanks
+
+
+
+
+
+
+
+import pandas as pd
+import numpy as np
+
+#base python packages
+import os
+import time
+
+
+from ptm_pose import database_interfacing as di
+
+#identify package directory
+package_dir = os.path.dirname(os.path.abspath(__file__))
+resource_dir = package_dir + '/Resource_Files/'
+
+#load modification conversion file (allows for conversion between modification subtypes and classes)
+modification_conversion = pd.read_csv(resource_dir + 'modification_conversion.csv')
+
+#load ptm_coordinates dataframe, if present
+if os.path.isfile(resource_dir + 'ptm_coordinates.csv'):
+ ptm_coordinates = pd.read_csv(resource_dir + 'ptm_coordinates.csv',index_col = 0, dtype = {'Chromosome/scaffold name': str, 'PTM Position in Canonical Isoform': int})
+else:
+ print('ptm_coordinates file not found. Please run download_ptm_coordinates() to download the file from GitHub LFS. Set save = True to save the file locally and avoid downloading in the future.')
+ ptm_coordinates = None
+
+[docs]def download_ptm_coordinates(save = False, max_retries = 5, delay = 10):
+ """
+ Download the ptm_coordinates dataframe from GitHub Large File Storage (LFS). By default, the file is not saved locally because of its size (saving is not forced on users, but is highly encouraged). Set save = True to save the file.
+
+ Parameters
+ ----------
+ save : bool, optional
+ Whether to save the file locally into Resource Files directory. The default is False.
+ max_retries : int, optional
+ Number of times to attempt to download the file. The default is 5.
+ delay : int, optional
+ Time to wait between download attempts, in seconds. The default is 10.
+
+ """
+ for i in range(max_retries):
+ try:
+ ptm_coordinates = pd.read_csv('https://github.com/NaegleLab/PTM-POSE/raw/main/PTM_POSE/Resource_Files/ptm_coordinates.csv?download=', index_col = 0, dtype = {'Chromosome/scaffold name': str, 'PTM Position in Canonical Isoform': int})
+ break
+ except Exception: #catch download errors only, so that KeyboardInterrupt and SystemExit still propagate
+ time.sleep(delay)
+ else:
+ raise Exception('Failed to download ptm_coordinates file after ' + str(max_retries) + ' attempts. Please try again.')
+
+
+
+ if save:
+ ptm_coordinates.to_csv(resource_dir + 'ptm_coordinates.csv')
+
+ return ptm_coordinates
+
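+def download_translator(save = False):
+ """
+ Download mapping information between UniProt IDs, gene names, and gene IDs, and construct the translator dataframe. Set save = True to save the dataframe to the Resource Files directory as translator.csv.
+ """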
+ uniprot_to_genename, uniprot_to_geneid = di.get_uniprot_to_gene()
+ translator = pd.DataFrame({'Gene stable ID': uniprot_to_geneid, 'Gene name':uniprot_to_genename})
+ if save:
+ translator.to_csv(resource_dir + 'translator.csv')
+ return translator, uniprot_to_genename, uniprot_to_geneid
+
+#load uniprot translator dataframe, process if need be
+if os.path.isfile(resource_dir + 'translator.csv'):
+ translator = pd.read_csv(resource_dir + 'translator.csv', index_col=0)
+ uniprot_to_genename = translator['Gene name'].to_dict()
+ uniprot_to_geneid = translator['Gene stable ID'].to_dict()
+
+ #replace empty strings with np.nan
+ translator = translator.replace('', np.nan)
+else:
+ print('Downloading mapping information between UniProt and Gene Names from UniProt. To permanently save the translator file, run download_translator(save = True)')
+ translator, uniprot_to_genename, uniprot_to_geneid = download_translator()
+
+
+#additional information
+
+#dictionary to associate annotation column names with different annotation types
+annotation_col_dict = {'PhosphoSitePlus':{'Function':'PSP:ON_FUNCTION', 'Process':'PSP:ON_PROCESS', 'Interactions':'PSP:ON_PROT_INTERACT', 'Disease':'PSP:Disease_Association', 'Kinase':'PSP:Kinase','Perturbation':'PTMsigDB:PERT-PSP'},
+ 'ELM':{'Interactions':'ELM:Interactions', 'Motif Match':'ELM:Motif Matches'},
+ 'PTMcode':{'Intraprotein':'PTMcode:Intraprotein_Interactions', 'Interactions':'PTMcode:Interprotein_Interactions'},
+ 'PTMInt':{'Interactions':'PTMInt:Interaction'},
+ 'RegPhos':{'Kinase':'RegPhos:Kinase'},
+ 'DEPOD':{'Phosphatase':'DEPOD:Phosphatase'},
+ 'PTMsigDB': {'WikiPathway':'PTMsigDB:PATH-WP', 'NetPath':'PTMsigDB:PATH-NP','mSigDB':'PTMsigDB:PATH-BI', 'Perturbation (DIA2)':'PTMsigDB:PERT-P100-DIA2', 'Perturbation (DIA)': 'PTMsigDB:PERT-P100-DIA', 'Perturbation (PRM)':'PTMsigDB:PERT-P100-PRM','Kinase':'PTMsigDB:KINASE-iKiP'}}
+
+annotation_function_dict = {'PhosphoSitePlus': {'Function':'add_PSP_regulatory_site_data', 'Process':'add_PSP_regulatory_site_data', 'Disease':'add_PSP_disease_association', 'Kinase':'add_PSP_kinase_substrate_data', 'Interactions': 'add_PSP_regulatory_site_data', 'Perturbation':'add_PTMsigDB_data'},
+ 'ELM': {'Interactions':'add_ELM_interactions', 'Motif Match':'add_ELM_motif_matches'},
+ 'PTMcode': {'Intraprotein': 'add_PTMcode_intraprotein', 'Interactions':'add_PTMcode_interprotein'},
+ 'PTMInt': {'Interactions':'add_PTMInt_data'},
+ 'RegPhos': {'Kinase': 'add_RegPhos_data'},
+ 'DEPOD': {'Phosphatase':'add_DEPOD_data'},
+ 'PTMsigDB':{'WikiPathway':'add_PTMsigDB_data', 'NetPath':'add_PTMsigDB_data','mSigDB':'add_PTMsigDB_data', 'Perturbation (DIA2)':'add_PTMsigDB_data', 'Perturbation (DIA)': 'add_PTMsigDB_data', 'Perturbation (PRM)':'add_PTMsigDB_data','Kinase':'add_PTMsigDB_data'}}
+
+
+
+#manually curated dictionary to convert phosphositeplus names that are not standard gene names to UniProt IDs
+psp_name_dict = {'Actinfilin':'Q6TDP4','14-3-3 zeta':'P63104','14-3-3 epsilon':'P62258','14-3-3 sigma':'P31947','P130Cas':'P56945','ENaC-beta':'P51168','ENaC-alpha':'P37088','14-3-3 eta':'Q04917','14-3-3 beta':'P31946', '14-3-3 gamma':'P61981', '14-3-3 theta':'P27348','Securin':'O95997','GPIbA':'P07359','occludin':'Q16625','ER-beta':'Q92731','53BP1': 'Q12888','4E-T':'Q9NRA8','53BP2':'Q13625','AP-2 beta':'Q92481','APAF':'O14727','Bcl-xL':'Q07817','C/EBP-epsilon':'Q15744','CREB':'P16220','Calmodulin':'P0DP23','Cortactin':'Q14247','DNAPK':'P78527', 'Diaphanous-1':'O60610', 'ER-alpha':'P03372', 'Exportin-1':'O14980', 'Ezrin':'P15311', 'H3':'Q6NXT2','HSP70':'P0DMV8;P0DMV9','IKKG':'Q9Y6K9', 'Ig-beta':'P40259','Ku80':'P13010','LC8':'Q96FJ2', 'MRLC2V':'P10916', 'Merlin':'P35240','NFkB-p105':'P19838', 'Rb':'P06400', 'RhoGDI alpha':'P52565', 'Rhodopsin':'P08100', 'SHP-1':'P29350', 'SHP-2':'Q06124','SLP76':'Q13094','SMRT':'Q9Y618','SRC-3':'Q9Y6Q9','STI1':'Q9BPY8','Vinculin':'P18206','beclin 1':'Q14457','claspin':'Q9HAW4', 'gp130':'P40189','leupaxin':'O60711','p14ARF':'Q8N726','rubicon':'Q92622','snRNP A':'P09661','snRNP B1':'P08579','snRNP C':'P09234','syntenin':'O00560;Q9H190','talin 1':'Q9Y490', 'ubiquitin':'P0CG47', '4E-BP1':'Q13541', 'ALK2':'Q04771', 'AMPKA1':'Q13131','AurA':'O14965','AurB':'Q96GD4', 'AurC':'Q9UQB9', 'C/EBP-beta':'P17676', 'CAMK1A':'Q14012', 'CHD-3 iso3':'Q12873', 'CK1A':'P48729', 'CK2B':'P67870', 'DAT':'Q01959', 'DJ-1':'Q99497', 'DOR-1':'P41143', 'DYN1':'Q05193','Desmoplakin':'P15924', 'Exportin-4':'Q9C0E2', 'FBPase':'P09467', 'FBPase 2':'O60825', 'G-alpha':'P63096', 'G-alpha 13':'Q14344', 'G-alpha i1':'P63096', 'G-beta 1':'P62873', 'G-beta 2':'P62879', 'G6PI':'P06744', 'GM130':'Q08379', 'GR':'P04150', 'H4':'P62805', 'HP1 alpha':'P45973', 'IkB-alpha':'P25963', 'IkB-beta':'Q15653', 'PPAR-gamma':'P37231', 'Claudin-1':'O95832', 'Claudin-2':'P57739', 'Cofilin-1':'P23528', 'K14':'P02533', 'K18':'P05783', 'K5':'P13647','K8':'P05787','Ku70':'P12956', 
'Moesin':'P26038','N-WASP':'O00401','Nur77':'P22736','P38A':'Q16539','P38B':'Q15759', 'P70S6KB':'P23443','PGC-1 alpha':'Q9UBK2','PKHF1':'Q96S99','P38G':'P53778','PKCI':'P41743','PKCZ':'Q05513', 'PKG1':'Q13976', 'PTP-PEST':'Q05209','Plectin-1':'Q15149','RFA2':'P15927','SERCA2':'P16615','SH2-B-beta':'Q9NRF2', 'SNAP-alpha':'P54920', 'SPT16':'Q9BXB7', 'SPT6':'Q7KZ85','STEP':'P54829','STLK3':'Q9UEW8', 'Snail1':'O95863', 'Snail2':'O43623', 'Stargazin':'P62955','Survivin':'O15392','TARP':'P09693','TK':'P04183','TOM20':'Q15388','TR-alpha':'P10827','Titin':'Q8WZ42','Vimentin':'P08670','WASP':'P42768','ZAP':'Q7Z2W4', 'Zyxin':'Q15942', 'cIAP1':'Q13490','caveolin-1':'Q03135', 'coronin 2A':'Q92828', 'desmin':'P17661','eIF2-alpha':'Q9BY44', 'eIF2-beta':'P20042', 'eIF3-alpha':'O75822', 'eIF3-eta':'P55884', 'eIF3-zeta':'O15371', 'eNOS':'P29474', 'emerin':'P50402', 'epsin 1':'Q9Y6I3', 'glutaminase':'O94925','hnRNP A1':'P09651', 'hnRNP A2/B1':'P22626', 'hnRNP A3':'P51991','hnRNP D0':'Q14103', 'hnRNP E2':'Q15366','hnRNP P2':'P35637','hnRNP U':'Q00839', 'kindlin-2':'Q96AC1', 'kindlin-3':'Q86UX7','lamin A/C':'P02545', 'mucolipin 1':'Q9GZU1','nNOS':'Q8WY41','p21Cip1':'P38936', 'p27Kip1':'P46527','p47phox':'P14598','p90RSK':'Q15418','palladin':'Q8WX93','polybromo 1':'Q86U86', 'syndecan-4':'P31431', 'tensin 1 iso1':'Q9HBL0', 'utrophin':'P46939','DKFZp686L1814':'Q6MZP7', 'EB1':'Q15691', 'EB2':'Q15555', 'G-alpha i3':'P08754','HSP20':'O14558','HSP40':'P25685', 'Hic-5':'O43294', 'Ig-alpha':'P11912', 'LC3A':'Q9H492', 'LC3B':'Q9GZQ8', 'LC3C':'Q9BXW4','NFkB-p100':'Q00653','NFkB-p65':'Q04206','Pnk1':'Q96T60', 'RPT2':'P62191','EB3':'Q9UPY8'}
+
+
+def download_background(annotation_type = 'Function', database = 'PhosphoSitePlus', mod_class = None, collapsed = False):
+ if mod_class is None:
+ fname = f'{database}_{annotation_type}_collapsed.csv' if collapsed else f'{database}_{annotation_type}.csv'
+ else:
+ fname = f'{database}_{annotation_type}_{mod_class}.csv'
+
+ if os.path.exists(resource_dir + '/background_annotations/'+fname):
+ background = pd.read_csv(resource_dir + '/background_annotations/'+fname,index_col = 0).squeeze()
+ return background
+ else:
+ raise FileNotFoundError(f"Specific background file for {annotation_type} in {database} does not exist. Please construct the background with `analyze.construct_background()`")
+
+def flip_uniprot_dict(uniprot_dict):
+ """
+ Given one of the uniprot id to gene name or gene id dictionaries, flip the dictionary so that the gene name or id is the key and the uniprot id is the value
+ """
+ uniprot_dict = pd.DataFrame(uniprot_dict, index = ['Gene']).T.reset_index()
+ uniprot_dict['Gene'] = uniprot_dict['Gene'].str.split(' ')
+ uniprot_dict = uniprot_dict.explode('Gene')
+ uniprot_dict = uniprot_dict.set_index('Gene')['index'].to_dict()
+ return uniprot_dict
+
+import numpy as np
+import pandas as pd
+
+import multiprocessing
+import datetime
+
+from ptm_pose import pose_config
+from ptm_pose import flanking_sequences as fs
+
+from tqdm import tqdm
+
+def find_ptms_in_region(ptm_coordinates, chromosome, strand, start, end, gene = None, coordinate_type = 'hg38'):
+ """
+ Given a genomic region in either hg38 or hg19 coordinates (such as the region encoding an exon of interest), identify PTMs that are mapped to that region and return them as a dataframe. If none are found, return an empty dataframe.
+
+ Parameters
+ ----------
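+ ptm_coordinates: pandas.DataFrame
+ dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
+ chromosome: str
+ chromosome where region is located
+ strand: int
+ DNA strand the region is found on (1 for forward, -1 for reverse)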
+ start: int
+ start position of region on the chromosome/strand (should always be less than end)
+ end: int
+ end position of region on the chromosome/strand (should always be greater than start)
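+ gene: str
+ name of the gene associated with the region. If provided, PTMs that do not map to this gene are removed. Default is None.
+ coordinate_type: str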
+ indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is 'hg38'.
+
+ Returns
+ -------
+ ptms_in_region: pandas.DataFrame
+ dataframe containing all PTMs found in the region. If no PTMs are found, returns an empty dataframe.
+
+ """
+ #restrict to PTMs on the same chromosome and strand
+ ptms_in_region = ptm_coordinates[(ptm_coordinates['Chromosome/scaffold name'] == chromosome) & (ptm_coordinates['Strand'] == strand)].copy()
+
+ if coordinate_type in ['hg19','hg38']:
+ loc_col = f'Gene Location ({coordinate_type})'
+ else:
+ raise ValueError('Coordinate type must be hg38 or hg19')
+
+ #check to make sure the start value is less than the end coordinate. If it is not, treat the end coordinate as the start and the start coordinate as the end
+ if start < end:
+ ptms_in_region = ptms_in_region[(ptms_in_region[loc_col] >= start) & (ptms_in_region[loc_col] <= end)]
+ else:
+ ptms_in_region = ptms_in_region[(ptms_in_region[loc_col] <= start) & (ptms_in_region[loc_col] >= end)]
+
+
+ #extract only PTM information from dataframe and return it (if no ptms are found, return empty dataframe)
+ if not ptms_in_region.empty:
+ #grab uniprot id and residue
+ ptms_in_region = ptms_in_region[['Source of PTM', 'UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform', loc_col, 'Modification', 'Modification Class']]
+ #check if ptm is associated with the same gene (if info is provided). if not, do not add
+ if gene is not None:
+ for i, row in ptms_in_region.iterrows():
+ if ';' in row['UniProtKB Accession']:
+ uni_ids = row['UniProtKB Accession'].split(';')
+ remove = True
+ for uni in uni_ids:
+ if gene in pose_config.uniprot_to_genename[uni.split('-')[0]].split(' '):
+ remove = False
+ break
+
+ if remove:
+ ptms_in_region = ptms_in_region.drop(i) #drop returns a new dataframe, so the result must be reassigned
+ else:
+ if gene not in pose_config.uniprot_to_genename[row['UniProtKB Accession'].split('-')[0]].split(' '):
+ ptms_in_region = ptms_in_region.drop(i)
+
+ #make sure ptms still are present after filtering
+ if ptms_in_region.empty:
+ return pd.DataFrame()
+ else:
+ ptms_in_region.insert(0, 'Gene', gene)
+
+ #calculate proximity to region start and end
+ ptms_in_region['Proximity to Region Start (bp)'] = (ptms_in_region[loc_col] - start).abs()
+ ptms_in_region['Proximity to Region End (bp)'] = (ptms_in_region[loc_col] - end).abs()
+ ptms_in_region['Proximity to Splice Boundary (bp)'] = ptms_in_region.apply(lambda x: min(x['Proximity to Region Start (bp)'], x['Proximity to Region End (bp)']), axis = 1)
+
+
+ return ptms_in_region
+ else:
+ return pd.DataFrame()
+
+def convert_strand_symbol(strand):
+ """
+ Given DNA strand information, make sure the strand information is in integer format (1 for forward, -1 for reverse). This is intended to convert from string format ('+' or '-') to integer format (1 or -1), but will return the input if it is already in integer format.
+
+ Parameters
+ ----------
+ strand: str or int
+ DNA strand information, either as a string ('+' or '-') or an integer (1 or -1)
+
+ Returns
+ -------
+ int
+ DNA strand information as an integer (1 for forward, -1 for reverse)
+ """
+ if isinstance(strand, str):
+ if strand == '+' or strand == '1':
+ return 1
+ elif strand == '-' or strand == '-1':
+ return -1
+ else:
+ return strand
+
+def find_ptms_in_many_regions(region_data, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = 'exonStart_0base', region_end_col = 'exonEnd', gene_col = None, dPSI_col = None, sig_col = None, event_id_col = None, extra_cols = None, annotate_original_df = True, coordinate_type = 'hg38', separate_modification_types = False, taskbar_label = None):
+ """
+ Given a dataframe with a unique region in each row, project PTMs onto the regions. Assumes that the region data will have chromosome, strand, and genomic start/end positions, and each row corresponds to a unique region.
+
+ Parameters
+ ----------
+ ptm_coordinates: pandas.DataFrame
+ dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
+ region_data: pandas.DataFrame
+ dataframe containing region information, including chromosome, strand, and genomic location of regions of interest
+ chromosome_col: str
+ column name in region_data that contains chromosome information. Default is 'chr'. Expects it to be a str with only the chromosome number: 'Y', '1', '2', etc. A 'chr' prefix is removed automatically if present.
+ strand_col: str
+ column name in region_data that contains strand information. Default is 'strand'. Expects it to be a str with '+' or '-', or integers as 1 or -1. Will convert to integers automatically if string format is provided.
+ region_start_col: str
+ column name in region_data that contains the start position of the region of interest. Default is 'exonStart_0base'.
+ region_end_col: str
+ column name in region_data that contains the end position of the region of interest. Default is 'exonEnd'.
+ gene_col: str
+ column name in region_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (in some cases, genomic coordinates overlap between distinct genes). Default is None.
+ dPSI_col: str
+ column name in region_data that contains the delta PSI value for the region. Default is None, which will not include this information in the output.
+ sig_col: str
+ column name in region_data that contains the significance value for the region. Default is None, which will not include this information in the output.
+ event_id_col: str
+ column name in region_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.
+ extra_cols: list
+ list of additional columns in region_data to include in the output dataframe. Default is None, which will not include any additional columns.
+ annotate_original_df: bool
+ Indicate whether to annotate region_data with the PTMs found in each region. Default is True.
+ coordinate_type: str
+ indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is 'hg38'.
+ separate_modification_types: bool
+ Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is False.
+ taskbar_label: str
+ Label to display in the tqdm progress bar. Default is None, which will automatically state "Projecting PTMs onto regions using ----- coordinates".
+
+
+
+ Returns
+ -------
+ region_data: pandas.DataFrame
+ copy of the original region data. If annotate_original_df is True, it contains an additional column 'PTMs' listing the PTMs found in the region of interest, in the format 'UniProtID_ResiduePosition (Modification Class)' and separated by '/'. If no PTMs are found, the value will be np.nan.
+ spliced_ptm_info: pandas.DataFrame
+ Contains the PTMs identified across the different regions
+ """
+ if taskbar_label is None:
+ taskbar_label = 'Projecting PTMs onto regions using ' + coordinate_type + ' coordinates.'
+
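+ #copy to avoid modifying the dataframe provided by the caller
+ region_data = region_data.copy()
+
+ #remove the 'chr' prefix from chromosome names, if present (str.strip would remove any leading or trailing 'c', 'h', or 'r' characters rather than the prefix)
+ if region_data[chromosome_col].str.contains('chr').any():
+ region_data[chromosome_col] = region_data[chromosome_col].str.removeprefix('chr')
+
+
+ spliced_ptm_info = []
+ spliced_ptms_list = []
+ num_ptms_affected = []
+ num_unique_ptm_sites = []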
+
+ #iterate through each row of the splice data and find PTMs in the region
+ for index, row in tqdm(region_data.iterrows(), total = len(region_data), desc = taskbar_label):
+ #grab region information from row
+ chromosome = row[chromosome_col]
+ strand = convert_strand_symbol(row[strand_col])
+ start = row[region_start_col]
+ end = row[region_end_col]
+ #only provide these if column is given
+ gene = row[gene_col] if gene_col is not None else None
+
+ #project ptms onto region
+ ptms_in_region = find_ptms_in_region(ptm_coordinates, chromosome, strand, start, end, gene = gene, coordinate_type = coordinate_type)
+
+
+ #add additional context from splice data, if indicated
+ extra_info = {}
+ if event_id_col is not None:
+ extra_info['Region ID'] = row[event_id_col]
+
+ if dPSI_col is not None:
+ extra_info['dPSI'] = row[dPSI_col]
+
+ if sig_col is not None:
+ extra_info['Significance'] = row[sig_col]
+
+ if extra_cols is not None:
+ for col in extra_cols:
+ extra_info[col] = row[col]
+
+ #add extra info to ptms_in_region
+ ptms_in_region = pd.concat([pd.DataFrame(extra_info, index = ptms_in_region.index), ptms_in_region], axis = 1)
+
+ #if desired, add ptm information to the original splice event dataframe
+ if annotate_original_df:
+ if not ptms_in_region.empty:
+ #split and separate unique modification types
+ if separate_modification_types:
+ ptms_in_region['Modification Class'] = ptms_in_region['Modification Class'].str.split(';')
+ ptms_in_region = ptms_in_region.explode('Modification Class')
+
+ ptms_info = ptms_in_region.apply(lambda x: x['UniProtKB Accession'] + '_' + x['Residue'] + str(x['PTM Position in Canonical Isoform']) + ' (' + x['Modification Class'] + ')', axis = 1)
+ ptms_str = '/'.join(ptms_info.values)
+ spliced_ptms_list.append(ptms_str)
+ num_ptms_affected.append(ptms_in_region.shape[0])
+ num_unique_ptm_sites.append(ptms_in_region.groupby(['UniProtKB Accession', 'Residue', 'PTM Position in Canonical Isoform']).size().shape[0])
+ else:
+ spliced_ptms_list.append(np.nan)
+ num_ptms_affected.append(0)
+ num_unique_ptm_sites.append(0)
+
+ spliced_ptm_info.append(ptms_in_region.copy())
+
+ #combine all PTM information
+ spliced_ptm_info = pd.concat(spliced_ptm_info, ignore_index = True)
+
+ #convert ptm position to float
+ if spliced_ptm_info.shape[0] > 0:
+ spliced_ptm_info['PTM Position in Canonical Isoform'] = spliced_ptm_info['PTM Position in Canonical Isoform'].astype(float)
+
+ #add ptm info to original splice event dataframe
+ if annotate_original_df:
+ region_data['PTMs'] = spliced_ptms_list
+ region_data['Number of PTMs Affected'] = num_ptms_affected
+ region_data['Number of Unique PTM Sites by Position'] = num_unique_ptm_sites
+ region_data['Event Length'] = (region_data[region_end_col] - region_data[region_start_col]).abs()
+ region_data['PTM Density (PTMs/bp)'] = region_data['Number of Unique PTM Sites by Position']/region_data['Event Length']
+
+ return region_data, spliced_ptm_info
+
+[docs]def project_ptms_onto_splice_events(splice_data, ptm_coordinates = None, annotate_original_df = True, chromosome_col = 'chr', strand_col = 'strand', region_start_col = 'exonStart_0base', region_end_col = 'exonEnd', dPSI_col = None, sig_col = None, event_id_col = None, gene_col = None, extra_cols = None, separate_modification_types = False, coordinate_type = 'hg38', taskbar_label = None, PROCESSES = 1):
+ """
+ Given splice event quantification data, project PTMs onto the regions impacted by the splice events. Assumes that the splice event data will have chromosome, strand, and genomic start/end positions for the regions of interest, and each row of splice_data corresponds to a unique region.
+
+ Parameters
+ ----------
+ splice_data: pandas.DataFrame
+ dataframe containing splice event information, including chromosome, strand, and genomic location of regions of interest
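+ ptm_coordinates: pandas.DataFrame
+ dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If None, it will be pulled from pose_config.
+ annotate_original_df: bool
+ Indicate whether to annotate splice_data with the PTMs found in each region. Default is True.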
+ chromosome_col: str
+ column name in splice_data that contains chromosome information. Default is 'chr'. Expects it to be a str with only the chromosome number: 'Y', '1', '2', etc.
+ strand_col: str
+ column name in splice_data that contains strand information. Default is 'strand'. Expects it to be a str with '+' or '-', or integers as 1 or -1. Will convert to integers automatically if string format is provided.
+ region_start_col: str
+ column name in splice_data that contains the start position of the region of interest. Default is 'exonStart_0base'.
+ region_end_col: str
+ column name in splice_data that contains the end position of the region of interest. Default is 'exonEnd'.
+ event_id_col: str
+ column name in splice_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.
+ gene_col: str
+ column name in splice_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (in some cases, genomic coordinates overlap between distinct genes). Default is None.
+ dPSI_col: str
+ column name in splice_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output
+ sig_col: str
+ column name in splice_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
+ extra_cols: list
+ list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
+ coordinate_type: str
+ indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is 'hg38'.
+ separate_modification_types: bool
+ Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is False.
+ taskbar_label: str
+ Label to display in the tqdm progress bar. Default is None, which will automatically state "Projecting PTMs onto regions using ----- coordinates".
+ PROCESSES: int
+ Number of processes to use for multiprocessing. Default is 1 (single processing)
+
+ Returns
+ -------
+ splice_data: pandas.DataFrame
+ copy of the original splice data. If annotate_original_df is True, it contains an additional column 'PTMs' listing the PTMs found in the region of interest, in the format 'UniProtID_ResiduePosition (Modification Class)' and separated by '/'. If no PTMs are found, the value will be np.nan.
+ spliced_ptm_info: pandas.DataFrame
+ Contains the PTMs identified across the different splice events
+ """
+ #load ptm data from config if not provided
+ if ptm_coordinates is None and pose_config.ptm_coordinates is not None:
+ ptm_coordinates = pose_config.ptm_coordinates
+ elif ptm_coordinates is None:
+ raise ValueError('ptm_coordinates dataframe not provided and not found in the resource files. Please provide the ptm_coordinates dataframe with pose_config.download_ptm_coordinates() or download the file manually. To avoid needing to download this file each time, run pose_config.download_ptm_coordinates(save = True) to save the file locally within the package directory (will take ~63MB of storage space)')
+
+ if taskbar_label is None:
+ taskbar_label = 'Projecting PTMs onto splice events using ' + coordinate_type + ' coordinates.'
+
+
+
+ #copy
+ splice_data = splice_data.copy()
+
+ #check columns to make sure they are present and correct data type
+ check_columns(splice_data, chromosome_col=chromosome_col, strand_col=strand_col, region_start_col=region_start_col, region_end_col=region_end_col, dPSI_col=dPSI_col, sig_col=sig_col, event_id_col=event_id_col, gene_col=gene_col, extra_cols=extra_cols)
+
+ if PROCESSES == 1:
+ splice_data, spliced_ptm_info = find_ptms_in_many_regions(splice_data, ptm_coordinates, chromosome_col = chromosome_col, strand_col = strand_col, region_start_col = region_start_col, region_end_col = region_end_col, dPSI_col = dPSI_col, sig_col = sig_col, event_id_col = event_id_col, gene_col = gene_col, extra_cols = extra_cols, annotate_original_df = annotate_original_df, coordinate_type = coordinate_type,taskbar_label = taskbar_label, separate_modification_types=separate_modification_types)
+ elif PROCESSES > 1:
+ #check the number of cores available. If PROCESSES exceeds the number of cores - 1, cap it at one less than the total number of cores (to avoid freezing the machine), but never below 1
+ num_cores = multiprocessing.cpu_count()
+ if PROCESSES > num_cores - 1:
+ PROCESSES = max(1, num_cores - 1)
+
+ #split dataframe into chunks equal to PROCESSES
+ splice_data_split = np.array_split(splice_data, PROCESSES)
+ pool = multiprocessing.Pool(PROCESSES)
+ #run with multiprocessing
+ results = pool.starmap(find_ptms_in_many_regions, [(splice_data_split[i], ptm_coordinates, chromosome_col, strand_col, region_start_col, region_end_col, gene_col, dPSI_col, sig_col, event_id_col, extra_cols, annotate_original_df, coordinate_type, separate_modification_types, taskbar_label) for i in range(PROCESSES)])
+ #release the worker processes once all chunks have finished
+ pool.close()
+ pool.join()
+
+ splice_data = pd.concat([res[0] for res in results])
+ spliced_ptm_info = pd.concat([res[1] for res in results])
+ else:
+ raise ValueError('PROCESSES must be an integer greater than or equal to 1.')
+
+ print(f'PTMs projection successful ({spliced_ptm_info.shape[0]} identified).\n')
+
+ return splice_data, spliced_ptm_info
+
+
+
+[docs]def project_ptms_onto_MATS(ptm_coordinates = None, SE_events = None, fiveASS_events = None, threeASS_events = None, RI_events = None, MXE_events = None, coordinate_type = 'hg38', identify_flanking_sequences = False, dPSI_col = 'meanDeltaPSI', sig_col = 'FDR', separate_modification_types = False, PROCESSES = 1):
+ """
+ Given splice quantification from the MATS algorithm, annotate with PTMs that are found in the differentially included regions.
+
+ Parameters
+ ----------
+ ptm_coordinates: pandas.DataFrame
+ dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
+ SE_events: pandas.DataFrame
+ dataframe containing skipped exon event information from MATS
+ fiveASS_events: pandas.DataFrame
+ dataframe containing 5' alternative splice site event information from MATS
+ threeASS_events: pandas.DataFrame
+ dataframe containing 3' alternative splice site event information from MATS
+ RI_events: pandas.DataFrame
+ dataframe containing retained intron event information from MATS
+ MXE_events: pandas.DataFrame
+ dataframe containing mutually exclusive exon event information from MATS
+ coordinate_type: str
+ indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is 'hg38'.
+ identify_flanking_sequences: bool
+ Indicate whether to look for altered flanking sequences from spliced events, in addition to those directly in the spliced region. Default is False. (not yet active)
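+ dPSI_col: str
+ Column name in the MATS dataframes containing the delta PSI values. Default is 'meanDeltaPSI'.
+ sig_col: str
+ Column name in the MATS dataframes containing the significance values. Default is 'FDR'.
+ separate_modification_types: bool
+ Passed through to project_ptms_onto_splice_events for each event type. Default is False.
+ PROCESSES: int
+ Number of processes to use for multiprocessing. Default is 1. For each event type, a single process is used when there are fewer than 4 events per requested process.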
+ """
+ print(f'Projecting PTMs onto MATS splice events using {coordinate_type} coordinates.')
+ #reformat chromosome name format
+ spliced_events = {}
+
+ spliced_flanks = []
+ spliced_ptms = []
+ if SE_events is not None:
+ if SE_events['chr'].str.contains('chr').any():
+ SE_events['chr'] = SE_events['chr'].apply(lambda x: x[3:])
+
+ SE_events['AS ID'] = "SE_" + SE_events.index.astype(str)
+
+ #check to make sure there is enough information to do multiprocessing if that is desired
+ if PROCESSES*4 > SE_events.shape[0]:
+ SE_processes = 1
+ else:
+ SE_processes = PROCESSES
+
+ spliced_events['SE'], SE_ptms = project_ptms_onto_splice_events(SE_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = 'exonStart_0base', region_end_col = 'exonEnd', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', event_id_col = 'AS ID', coordinate_type=coordinate_type, taskbar_label = "Skipped Exon events", separate_modification_types=separate_modification_types, PROCESSES = SE_processes)
+ SE_ptms['Event Type'] = 'SE'
+ spliced_ptms.append(SE_ptms)
+
+ if identify_flanking_sequences:
+ print('Identifying flanking sequences for skipped exon events.')
+ SE_flanks = fs.get_flanking_changes_from_splice_data(SE_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', spliced_region_start_col = 'exonStart_0base', spliced_region_end_col = 'exonEnd', first_flank_start_col = 'firstFlankingES', first_flank_end_col = 'firstFlankingEE', second_flank_start_col = 'secondFlankingES', second_flank_end_col = 'secondFlankingEE', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', event_id_col = 'AS ID', coordinate_type=coordinate_type)
+ SE_flanks['Event Type'] = 'SE'
+ spliced_flanks.append(SE_flanks)
+
+ else:
+ print('Skipped exon event data (SE_events) not provided, skipping')
+
+ if fiveASS_events is not None:
+ if fiveASS_events['chr'].str.contains('chr').any():
+ fiveASS_events['chr'] = fiveASS_events['chr'].apply(lambda x: x[3:])
+
+ #set the relevant start and end of the spliced-out region, which differ depending on the strand
+ region_start = []
+ region_end = []
+ first_flank_start = []
+ first_flank_end = []
+ second_flank_end = []
+ second_flank_start = []
+ for i, row in fiveASS_events.iterrows():
+ strand = row['strand']
+ if strand == '+':
+ region_start.append(row['shortEE'])
+ region_end.append(row['longExonEnd'])
+ if identify_flanking_sequences:
+ first_flank_start.append(row['shortES'])
+ first_flank_end.append(row['shortEE'])
+ second_flank_start.append(row['flankingES'])
+ second_flank_end.append(row['flankingEE'])
+ else:
+ region_start.append(row['longExonStart_0base'])
+ region_end.append(row['shortES'])
+ if identify_flanking_sequences:
+ second_flank_start.append(row['shortES'])
+ second_flank_end.append(row['shortEE'])
+ first_flank_start.append(row['flankingES'])
+ first_flank_end.append(row['flankingEE'])
+
+ fiveASS_events['event_start'] = region_start
+ fiveASS_events['event_end'] = region_end
+ if identify_flanking_sequences:
+ fiveASS_events['first_flank_start'] = first_flank_start
+ fiveASS_events['first_flank_end'] = first_flank_end
+ fiveASS_events['second_flank_start'] = second_flank_start
+ fiveASS_events['second_flank_end'] = second_flank_end
+
+
+ #set specific as id
+
+ fiveASS_events['AS ID'] = "5ASS_" + fiveASS_events.index.astype(str)
+
+ #check to make sure there is enough information to do multiprocessing if that is desired
+ if PROCESSES*4 > fiveASS_events.shape[0]:
+ fiveASS_processes = 1
+ else:
+ fiveASS_processes = PROCESSES
+
+ #identify PTMs found within spliced regions
+ spliced_events['5ASS'], fiveASS_ptms = project_ptms_onto_splice_events(fiveASS_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = 'event_start', region_end_col = 'event_end', event_id_col = 'AS ID', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', coordinate_type=coordinate_type, taskbar_label = "5' ASS events", separate_modification_types=separate_modification_types, PROCESSES = fiveASS_processes)
+ fiveASS_ptms['Event Type'] = '5ASS'
+ spliced_ptms.append(fiveASS_ptms)
+
+ #identify ptms with altered flanking sequences
+ if identify_flanking_sequences:
+ print("Identifying flanking sequences for 5'ASS events.")
+ fiveASS_flanks = fs.get_flanking_changes_from_splice_data(fiveASS_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', spliced_region_start_col = 'event_start', spliced_region_end_col = 'event_end', first_flank_start_col = 'first_flank_start', first_flank_end_col = 'first_flank_end', second_flank_start_col = 'second_flank_start', second_flank_end_col = 'second_flank_end',dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', event_id_col = 'AS ID',coordinate_type=coordinate_type)
+ fiveASS_flanks['Event Type'] = '5ASS'
+ spliced_flanks.append(fiveASS_flanks)
+ else:
+ print("5' ASS event data (fiveASS_events) not provided, skipping.")
+
+ if threeASS_events is not None:
+ if threeASS_events['chr'].str.contains('chr').any():
+ threeASS_events['chr'] = threeASS_events['chr'].apply(lambda x: x[3:])
+
+ #set the relevant start and end of the spliced-out region, which differ depending on the strand
+ region_start = []
+ region_end = []
+ first_flank_start = []
+ first_flank_end = []
+ second_flank_end = []
+ second_flank_start = []
+ for i, row in threeASS_events.iterrows():
+ strand = row['strand']
+ if strand == '+':
+ region_start.append(row['longExonStart_0base'])
+ region_end.append(row['shortES'])
+ if identify_flanking_sequences:
+ second_flank_start.append(row['flankingES'])
+ second_flank_end.append(row['flankingEE'])
+ first_flank_start.append(row['shortES'])
+ first_flank_end.append(row['shortEE'])
+ else:
+ region_start.append(row['shortEE'])
+ region_end.append(row['longExonEnd'])
+ if identify_flanking_sequences:
+ second_flank_start.append(row['flankingES'])
+ second_flank_end.append(row['flankingEE'])
+ first_flank_start.append(row['shortES'])
+ first_flank_end.append(row['shortEE'])
+
+
+ #save region info
+ threeASS_events['event_start'] = region_start
+ threeASS_events['event_end'] = region_end
+ if identify_flanking_sequences:
+ threeASS_events['first_flank_start'] = first_flank_start
+ threeASS_events['first_flank_end'] = first_flank_end
+ threeASS_events['second_flank_start'] = second_flank_start
+ threeASS_events['second_flank_end'] = second_flank_end
+
+ #add event ids
+ threeASS_events['AS ID'] = "3ASS_" + threeASS_events.index.astype(str)
+
+ #check to make sure there is enough information to do multiprocessing if that is desired
+ if PROCESSES*4 > threeASS_events.shape[0]:
+ threeASS_processes = 1
+ else:
+ threeASS_processes = PROCESSES
+
+ spliced_events['3ASS'], threeASS_ptms = project_ptms_onto_splice_events(threeASS_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = 'event_start', region_end_col = 'event_end', event_id_col = 'AS ID', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', coordinate_type=coordinate_type, taskbar_label = "3' ASS events", separate_modification_types=separate_modification_types, PROCESSES = threeASS_processes)
+ threeASS_ptms['Event Type'] = '3ASS'
+ spliced_ptms.append(threeASS_ptms)
+
+ #identify ptms with altered flanking sequences
+ if identify_flanking_sequences:
+ print("Identifying flanking sequences for 3' ASS events.")
+ threeASS_flanks = fs.get_flanking_changes_from_splice_data(threeASS_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', spliced_region_start_col = 'event_start', spliced_region_end_col = 'event_end', first_flank_start_col = 'first_flank_start', first_flank_end_col = 'first_flank_end', second_flank_start_col = 'second_flank_start', second_flank_end_col = 'second_flank_end', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', event_id_col = 'AS ID', coordinate_type=coordinate_type)
+ threeASS_flanks['Event Type'] = '3ASS'
+ spliced_flanks.append(threeASS_flanks)
+
+
+ else:
+ print("3' ASS event data (threeASS_events) not provided, skipping")
+
+ if RI_events is not None:
+
+ if RI_events['chr'].str.contains('chr').any():
+ RI_events['chr'] = RI_events['chr'].apply(lambda x: x[3:])
+
+ #add event id
+ RI_events['AS ID'] = "RI_" + RI_events.index.astype(str)
+
+ #check to make sure there is enough information to do multiprocessing if that is desired
+ if PROCESSES*4 > RI_events.shape[0]:
+ RI_processes = 1
+ else:
+ RI_processes = PROCESSES
+
+ spliced_events['RI'], RI_ptms = project_ptms_onto_splice_events(RI_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = 'upstreamEE', region_end_col = 'downstreamES', event_id_col = 'AS ID', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', coordinate_type=coordinate_type, taskbar_label = 'Retained Intron Events', separate_modification_types=separate_modification_types, PROCESSES = RI_processes)
+ RI_ptms['Event Type'] = 'RI'
+ spliced_ptms.append(RI_ptms)
+
+ #identify ptms with altered flanking sequences
+ if identify_flanking_sequences:
+ print('Identifying flanking sequences for retained intron events.')
+ RI_flanks = fs.get_flanking_changes_from_splice_data(RI_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', spliced_region_start_col = 'upstreamEE', spliced_region_end_col = 'downstreamES', first_flank_start_col = 'upstreamES', first_flank_end_col = 'upstreamEE', second_flank_start_col = 'downstreamES', second_flank_end_col = 'downstreamEE', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', event_id_col = 'AS ID', coordinate_type=coordinate_type)
+ RI_flanks['Event Type'] = 'RI'
+ spliced_flanks.append(RI_flanks)
+
+ if MXE_events is not None:
+ if MXE_events['chr'].str.contains('chr').any():
+ MXE_events['chr'] = MXE_events['chr'].apply(lambda x: x[3:])
+
+ #check to make sure there is enough information to do multiprocessing if that is desired
+ if PROCESSES*4 > MXE_events.shape[0]:
+ MXE_processes = 1
+ else:
+ MXE_processes = PROCESSES
+
+ #add AS ID
+ MXE_events['AS ID'] = "MXE_" + MXE_events.index.astype(str)
+
+ mxe_ptms = []
+ #first mxe exon
+ spliced_events['MXE_Exon1'], MXE_Exon1_ptms = project_ptms_onto_splice_events(MXE_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = '1stExonStart_0base', region_end_col = '1stExonEnd', event_id_col = 'AS ID', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', coordinate_type=coordinate_type, taskbar_label = 'MXE, First Exon', separate_modification_types=separate_modification_types, PROCESSES = MXE_processes)
+ MXE_Exon1_ptms['Event Type'] = 'MXE (First Exon)'
+ mxe_ptms.append(MXE_Exon1_ptms)
+
+ #second mxe exon
+ spliced_events['MXE_Exon2'], MXE_Exon2_ptms = project_ptms_onto_splice_events(MXE_events, ptm_coordinates, chromosome_col = 'chr', strand_col = 'strand', region_start_col = '2ndExonStart_0base', region_end_col = '2ndExonEnd', event_id_col = 'AS ID', dPSI_col=dPSI_col, sig_col = sig_col, gene_col = 'geneSymbol', coordinate_type=coordinate_type, taskbar_label = 'MXE, Second Exon', separate_modification_types=separate_modification_types, PROCESSES = MXE_processes)
+ MXE_Exon2_ptms['Event Type'] = 'MXE (Second Exon)'
+ mxe_ptms.append(MXE_Exon2_ptms)
+
+ #combine mxe ptms, and then drop any PTMs that were found in both MXE's
+ mxe_ptms = pd.concat([MXE_Exon1_ptms, MXE_Exon2_ptms])
+ mxe_ptms = mxe_ptms.drop_duplicates(subset = ['UniProtKB Accession', 'Source of PTM', 'Residue', 'PTM Position in Canonical Isoform', 'Modification', 'Modification Class', 'dPSI', 'Significance', 'Gene'], keep = False)
+ mxe_ptms['dPSI'] = mxe_ptms.apply(lambda x: x['dPSI']* -1 if x['Event Type'] == 'MXE (Second Exon)' else x['dPSI'], axis = 1)
+
+ #add mxe ptms to spliced_ptms
+ spliced_ptms.append(mxe_ptms)
+
+ spliced_ptms = pd.concat(spliced_ptms)
+ if identify_flanking_sequences:
+ spliced_flanks = pd.concat(spliced_flanks)
+ return spliced_events, spliced_ptms, spliced_flanks
+ else:
+ return spliced_events, spliced_ptms
+
+#def project_ptms_onto_MAJIQ_dPSI(majiq_data, ptm_coordinates = None, coordinate_type = 'hg38', identify_flanking_sequences = False, dPSI_col = 'dPSI', sig_col = 'FDR', separate_modification_types = False, PROCESSES = 1):
+# print('in progress')
+# pass
+
+def add_splicegraph_info(psi_data, splicegraph, purpose = 'inclusion'):
+ psi_data = psi_data[psi_data['splice_type'] != 'ME'].copy()
+
+ if purpose == 'inclusion':
+ #split exons into individual exons
+ psi_data['Individual exon'] = psi_data['exons'].apply(lambda x: x.split(':'))
+ psi_data = psi_data.explode('Individual exon').drop_duplicates()
+ psi_data['Individual exon'] = psi_data['Individual exon'].astype(float)
+
+ #add gene location information to psi data from spliceseq
+ psi_data = psi_data.merge(splicegraph, left_on = ['symbol', 'Individual exon'], right_on = ['Symbol', 'Exon'], how = 'left')
+ psi_data = psi_data.rename(columns = {'Chr_Start': 'spliced_region_start', 'Chr_Stop': 'spliced_region_end'})
+ return psi_data
+ elif purpose == 'flanking':
+ print('Not yet active. Please check back later.')
+ else:
+ raise ValueError('Purpose must be either inclusion or flanking. Please provide the correct purpose for the splicegraph information.')
+
+def project_ptms_onto_SpliceSeq(psi_data, splicegraph, gene_col ='symbol', dPSI_col = None, sig_col = None, extra_cols = None, coordinate_type = 'hg19', separate_modification_types = False, identify_flanking_sequences = False, flank_size = 5, PROCESSES = 1):
+ #remove ME events from this analysis
+ print('Removing ME events from analysis')
+ psi_data = psi_data[psi_data['splice_type'] != 'ME'].copy()
+
+ #split exons into individual exons
+ psi_data['Individual exon'] = psi_data['exons'].apply(lambda x: x.split(':'))
+ psi_data = psi_data.explode('Individual exon').drop_duplicates()
+ psi_data['Individual exon'] = psi_data['Individual exon'].astype(float)
+
+ #add gene location information to psi data from spliceseq
+
+ spliced_data = psi_data.merge(splicegraph, left_on = ['symbol', 'Individual exon'], right_on = ['Symbol', 'Exon'], how = 'left')
+ spliced_data = spliced_data.rename(columns = {'Chr_Start': 'spliced_region_start', 'Chr_Stop': 'spliced_region_end'})
+
+ print('Projecting PTMs onto SpliceSeq data')
+ spliced_data, spliced_ptms = project_ptms_onto_splice_events(spliced_data, chromosome_col = 'Chromosome', strand_col = 'Strand', gene_col = 'symbol', region_start_col = 'spliced_region_start', region_end_col = 'spliced_region_end', event_id_col = 'as_id',dPSI_col = dPSI_col, sig_col = sig_col, extra_cols = extra_cols, separate_modification_types = separate_modification_types, coordinate_type = coordinate_type, PROCESSES = PROCESSES)
+
+ ## add code for extracting flanking sequences (to do)
+ if identify_flanking_sequences:
+ altered_flanks = fs.get_flanking_changes_from_splicegraph(psi_data, splicegraph, dPSI_col = dPSI_col, sig_col = sig_col, extra_cols = extra_cols, gene_col = gene_col, coordinate_type=coordinate_type, flank_size = flank_size)
+
+ return spliced_data, spliced_ptms, altered_flanks
+ else:
+ return spliced_data, spliced_ptms
+
+
+#def project_ptms_onto_TCGA_SpliceSeq(tcga_cancer = 'PRAD'):
+# """
+# In progress. Will download and process TCGA SpliceSeq data for a specific cancer type, and project PTMs onto the spliced regions.
+# """
+# print('Not yet active. Please check back later.')
+# pass
+
+
+def check_columns(splice_data, chromosome_col = None, strand_col = None, region_start_col = None, region_end_col = None, first_flank_start_col = None, first_flank_end_col = None, second_flank_start_col = None, second_flank_end_col = None, gene_col = None, dPSI_col = None, sig_col = None, event_id_col = None, extra_cols = None):
+ """
+ Function to quickly check if the provided column names exist in the dataset and if they are the correct type of data
+ """
+ expected_cols = [chromosome_col, strand_col, region_start_col, region_end_col, first_flank_start_col, first_flank_end_col, second_flank_start_col, second_flank_end_col, gene_col, dPSI_col, sig_col, event_id_col]
+ expected_dtypes = [[str, object], [str,int, object], [int,float], [int,float], [int,float], [int,float], [int,float], [int,float], [str, object], float, float, None]
+
+ #remove cols with None and the corresponding dtype entry
+ expected_dtypes = [dtype for col, dtype in zip(expected_cols, expected_dtypes) if col is not None]
+ expected_cols = [col for col in expected_cols if col is not None]
+
+ #add extra columns to the expected columns list
+ if extra_cols is not None:
+ expected_cols += extra_cols
+ expected_dtypes += [None]*len(extra_cols) #extra columns do not have dtype requirement
+
+
+ #check to make sure columns exist in the dataframe
+ if not all([x in splice_data.columns for x in expected_cols]):
+ raise ValueError('Not all expected columns are present in the splice data. Please check the column names and provide the correct names for the following columns: {}'.format([x for x in expected_cols if x not in splice_data.columns]))
+
+ #check to make sure columns are the correct data type
+ for col, data_type in zip(expected_cols, expected_dtypes):
+ if data_type is None:
+ continue
+ elif isinstance(data_type, list):
+ if splice_data[col].dtype not in data_type:
+ #try converting to the expected data type
+ try:
+ splice_data[col] = splice_data[col].astype(data_type[0])
+ except Exception:
+ raise ValueError('Column {} is not the expected data type. Expected data type is one of {}, but found data type {}'.format(col, data_type, splice_data[col].dtype))
+ else:
+ if splice_data[col].dtype != data_type:
+ #try converting to the expected data type
+ try:
+ splice_data[col] = splice_data[col].astype(data_type)
+ except Exception:
+ raise ValueError('Column {} is not the expected data type. Expected data type is {}, but found data type {}'.format(col, data_type, splice_data[col].dtype))
+
+
+
+
+
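
The process-count logic used above (fall back to one process when an event table has fewer than four rows per requested process, and otherwise leave one core free) can be summarised in a small standalone sketch. The helper name `choose_process_count` and its parameters are hypothetical and not part of the package; it only mirrors the checks visible in `project_ptms_onto_MATS` and `project_ptms_onto_splice_events`.

```python
import os


def choose_process_count(requested, n_events, min_events_per_process=4, n_cores=None):
    """Return how many worker processes to use for a table with `n_events` rows.

    Hypothetical helper: mirrors the `PROCESSES*4 > shape[0]` guard and the
    `num_cores - 1` cap from the module, but is not part of the package.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    # Too few events to be worth splitting: use a single process
    if requested * min_events_per_process > n_events:
        return 1
    # Leave one core free, but never go below one process
    return max(1, min(requested, n_cores - 1))
```

For example, requesting 4 processes for 10 events yields 1, while requesting 16 processes for 1000 events on an 8-core machine yields 7.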
+
+
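
The strand-dependent choice of the spliced region for 5' and 3' alternative splice site events may be easier to follow in isolation. The sketch below restates the same column choices made in `project_ptms_onto_MATS` for a single row given as a plain dictionary; the function names are hypothetical and the coordinates in the usage note are made up for illustration.

```python
def five_ass_region(row):
    """Return (start, end) of the alternatively spliced region of a 5' ASS event."""
    if row["strand"] == "+":
        # Plus strand: region lies between the short exon end and the long exon end
        return row["shortEE"], row["longExonEnd"]
    # Minus strand: region lies between the long exon start and the short exon start
    return row["longExonStart_0base"], row["shortES"]


def three_ass_region(row):
    """Return (start, end) of the alternatively spliced region of a 3' ASS event."""
    if row["strand"] == "+":
        # Plus strand: region lies between the long exon start and the short exon start
        return row["longExonStart_0base"], row["shortES"]
    # Minus strand: region lies between the short exon end and the long exon end
    return row["shortEE"], row["longExonEnd"]
```

With a made-up plus-strand row where the short exon spans 100-200 and the long exon spans 100-250, `five_ass_region` returns `(200, 250)`.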
| | geneSymbol | chr | strand | exonStart_0base | exonEnd | meanDeltaPSI | FDR |
|---|---|---|---|---|---|---|---|
| 0 | SPAG9 | chr17 | - | 49053223 | 49053262 | 0.227 | 0 |
| 1 | ARHGAP17 | chr16 | - | 24950684 | 24950918 | 0.413 | 0 |
| 2 | ITGA6 | chr2 | + | 173366499 | 173366629 | -0.361 | 0 |
| 3 | KRAS | chr12 | - | 25368370 | 25368494 | -0.068 | 0 |
| 4 | TCIRG1 | chr11 | + | 67817953 | 67818131 | 0.368 | 0 |

| | geneSymbol | chr | strand | exonStart_0base | exonEnd | meanDeltaPSI | FDR | PTMs | Number of PTMs Affected | Number of Unique PTM Sites by Position | Event Length | PTM Density (PTMs/bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SPAG9 | 17 | - | 49053223 | 49053262 | 0.227 | 0 | NaN | 0 | 0 | 39 | 0.0 |
| 1 | ARHGAP17 | 16 | - | 24950684 | 24950918 | 0.413 | 0 | Q68EM7_S575.0 (Phosphorylation)/Q68EM7_S570.0 ... | 6 | 1 | 234 | 0.004274 |
| 2 | ITGA6 | 2 | + | 173366499 | 173366629 | -0.361 | 0 | P23229_Ynan (Phosphorylation)/P23229_Tnan (Pho... | 7 | 4 | 130 | 0.030769 |
| 3 | KRAS | 12 | - | 25368370 | 25368494 | -0.068 | 0 | P01116_C186 (Methylation)/P01116_C180 (Palmito... | 3 | 2 | 124 | 0.016129 |
| 4 | TCIRG1 | 11 | + | 67817953 | 67818131 | 0.368 | 0 | NaN | 0 | 0 | 178 | 0.0 |

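
The `Event Length` and `PTM Density (PTMs/bp)` columns in the table above are consistent with a simple calculation: the length is `exonEnd - exonStart_0base`, and the density is the number of unique PTM sites divided by that length. The sketch below reproduces the displayed values under that assumption; `ptm_density` is a hypothetical helper, not the package's own function.

```python
def ptm_density(start_0base, end, n_unique_sites):
    """Return (event_length, ptm_density) for one spliced region.

    Assumes density = unique PTM sites / region length in bp, which matches
    the values shown in the table but is inferred, not taken from the source.
    """
    length = end - start_0base
    if length <= 0:
        raise ValueError("region end must be greater than region start")
    return length, n_unique_sites / length


# ARHGAP17 row from the table: one unique site in a 234 bp exon
length, density = ptm_density(24950684, 24950918, 1)
print(length, round(density, 6))
```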
| | dPSI | Significance | Gene | Source of PTM | UniProtKB Accession | Residue | PTM Position in Canonical Isoform | Gene Location (hg19) | Modification | Modification Class | Proximity to Region Start (bp) | Proximity to Region End (bp) | Proximity to Splice Boundary (bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.413 | 0.0 | ARHGAP17 | Q68EM7-1_S575 | Q68EM7 | S | 575.0 | 24950686.0 | Phosphoserine | Phosphorylation | 2.0 | 232.0 | 2.0 |
| 1 | 0.413 | 0.0 | ARHGAP17 | Q68EM7-1_S570 | Q68EM7 | S | 570.0 | 24950701.0 | Phosphoserine | Phosphorylation | 17.0 | 217.0 | 17.0 |
| 2 | 0.413 | 0.0 | ARHGAP17 | Q68EM7-1_S560 | Q68EM7 | S | 560.0 | 24950731.0 | Phosphoserine | Phosphorylation | 47.0 | 187.0 | 47.0 |
| 3 | 0.413 | 0.0 | ARHGAP17 | Q68EM7-1_S553 | Q68EM7 | S | 553.0 | 24950752.0 | Phosphoserine | Phosphorylation | 68.0 | 166.0 | 68.0 |
| 4 | 0.413 | 0.0 | ARHGAP17 | Q68EM7-1_S547 | Q68EM7 | S | 547.0 | 24950770.0 | Phosphoserine | Phosphorylation | 86.0 | 148.0 | 86.0 |

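
The three proximity columns above can be reproduced from the PTM's genomic location and the region boundaries of the ARHGAP17 exon (24950684 to 24950918). The following is a minimal sketch, assuming the splice-boundary proximity is simply the smaller of the two distances; the function name `ptm_proximity` is hypothetical.

```python
def ptm_proximity(ptm_location, region_start, region_end):
    """Return distances (bp) from a PTM to the region start, region end, and nearest boundary."""
    to_start = ptm_location - region_start
    to_end = region_end - ptm_location
    # Assumption: the nearest splice boundary is whichever edge is closer
    return to_start, to_end, min(to_start, to_end)


# First row of the table: Q68EM7-1_S575 at 24950686
print(ptm_proximity(24950686, 24950684, 24950918))
```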
| | Event ID | Source of PTM | Residue | PTM Position in Canonical Isoform | Inclusion Sequence | Exclusion Sequence | Region | Translation Success | Matched |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | P01116-2_T148;P01116-1_T148 | T | 148 | ETSAKtRQESG | ETSAKtRQGC* | Second | True | False |
| 1 | 3 | P01116-1_K147;P01116-2_K147 | K | 147 | IETSAkTRQES | IETSAkTRQGC | Second | True | False |
| 0 | 8 | Q9UPQ0-1_S746 | S | 746 | LPNLNsQGVAW | LPNLNsQGGFS | First | True | False |
| 1 | 8 | Q9UPQ0-10_S750;Q9UPQ0-6_S596;Q9UPQ0-1_S750 | S | 750 | PSQVDsPSSEK | ILKVDsPSSEK | Second | True | False |
| 0 | 11 | P62847-1_K129 | K | NaN | NVGAGkKSVSW | NVGAGkKAEGV | First | True | False |

| | Gene | UniProtKB Accession | Residue | PTM Position in Canonical Isoform | Modification Class | PSP:ON_PROCESS |
|---|---|---|---|---|---|---|
| 145 | CEACAM1 | P13688 | S | 461.0 | Phosphorylation | apoptosis, altered |
| 184 | YAP1 | P46937 | K | 342.0 | Ubiquitination | carcinogenesis, altered |
| 217 | TSC2 | P49815 | S | 981.0 | Phosphorylation | carcinogenesis, inhibited; cell growth, inhibi... |
| 395 | SPHK2 | Q9NRA0 | S | 387.0 | Phosphorylation | cell motility, altered |
| 407 | SPHK2 | Q9NRA0 | T | 614.0 | Phosphorylation | cell motility, altered |

| | PSP:ON_PROCESS | count |
|---|---|---|
| 0 | cell motility, altered | 3 |
| 1 | cell growth, induced | 2 |
| 2 | apoptosis, altered | 1 |
| 3 | carcinogenesis, altered | 1 |
| 4 | carcinogenesis, inhibited | 1 |
| 5 | cell growth, inhibited | 1 |
| 6 | autophagy, inhibited | 1 |
| 7 | signaling pathway regulation | 1 |
| 8 | cytoskeletal reorganization | 1 |
| 9 | cell adhesion, inhibited | 1 |

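
A count table like the one above can be obtained by splitting each PTM's semicolon-separated annotation string into individual terms and tallying them. The sketch below shows one way to do this with the standard library; `count_annotations` is a hypothetical helper, and the input strings in the usage note are illustrative rather than the notebook's actual data.

```python
from collections import Counter


def count_annotations(values, sep=";"):
    """Tally individual annotation terms across semicolon-separated strings."""
    counts = Counter()
    for value in values:
        # Skip missing annotations (None, NaN, or other non-string entries)
        if not isinstance(value, str):
            continue
        for term in value.split(sep):
            term = term.strip()
            if term:
                counts[term] += 1
    return counts


# Illustrative input only
example = ["cell motility, altered", "cell motility, altered; apoptosis, altered", None]
print(count_annotations(example).most_common())
```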
| | database | annotation_type | column |
|---|---|---|---|
| 5 | Combined | Interactions | Combined:Interactions |
| 8 | Combined | Kinase | Combined:Kinase |
| 1 | DEPOD | Phosphatase | DEPOD:Phosphatase |
| 2 | ELM | Interactions | ELM:Interactions |
| 0 | PhosphoSitePlus | Interactions | PSP:ON_PROT_INTERACT |
| 3 | PhosphoSitePlus | Disease | PSP:Disease_Association |
| 4 | PhosphoSitePlus | Process | PSP:ON_PROCESS |
| 6 | PhosphoSitePlus | Function | PSP:ON_FUNCTION |
| 7 | RegPhos | Kinase | RegPhos:Kinase |

| | Gene | UniProtKB Accession | Residue | PTM Position in Canonical Isoform | Modification Class | PSP:ON_PROCESS | dPSI | Significance | Impact |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BCAR1 | P56945 | Y | 267.0 | Phosphorylation | cell growth, induced | -0.07 | 0.0458775672499 | Excluded |
| 1 | BCAR1 | P56945 | Y | 287.0 | Phosphorylation | cell growth, induced | -0.07 | 0.0458775672499 | Excluded |
| 2 | BIN1 | O00499 | T | 348.0 | Phosphorylation | signaling pathway regulation | -0.112 | 0.0233903490744 | Excluded |
| 3 | CEACAM1 | P13688 | S | 461.0 | Phosphorylation | apoptosis, altered | 0.525 | 1.73943268451e-09 | Included |
| 4 | CTTN | Q14247 | K | 272.0 | Acetylation | cell motility, inhibited | 0.09 | 0.0355211287599 | Included |
| 5 | CTTN | Q14247 | S | 298.0 | Phosphorylation | cell motility, altered; cytoskeletal reorganiz... | 0.09 | 0.0355211287599 | Included |
| 6 | SPHK2 | Q9NRA0 | S | 387.0 | Phosphorylation | cell motility, altered | 0.253 | 0.0129400018182 | Included |
| 7 | SPHK2 | Q9NRA0 | T | 614.0 | Phosphorylation | cell motility, altered | 0.253 | 0.0129400018182 | Included |
| 8 | TSC2 | P49815 | S | 981.0 | Phosphorylation | carcinogenesis, inhibited; cell growth, inhibi... | -0.219 | 4.18472157275e-05 | Excluded |
| 9 | YAP1 | P46937 | K | 342.0 | Ubiquitination | carcinogenesis, altered | -0.188;-0.161 | 0.000211254197372;4.17884655686e-07 | Excluded |

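
In the table above, the `Impact` column tracks the sign of `dPSI`: positive values are labelled `Included` and negative values `Excluded`, including the YAP1 row where several events are joined with `;`. A minimal sketch of that rule follows. The function name `classify_impact` and the `Ambiguous` label for mixed signs are assumptions made for illustration; the source does not show how the package handles mixed-sign cases.

```python
def classify_impact(dpsi, sep=";"):
    """Label a PTM as Included or Excluded from one or more dPSI values."""
    values = [float(v) for v in str(dpsi).split(sep)]
    if all(v > 0 for v in values):
        return "Included"
    if all(v < 0 for v in values):
        return "Excluded"
    # Hypothetical label: mixed signs or zero are not covered by the table above
    return "Ambiguous"


print(classify_impact("-0.188;-0.161"))
print(classify_impact(0.525))
```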
| PSP:ON_PROCESS | All Impacted | Included | Excluded | Altered Flank |
|---|---|---|---|---|
| cell motility, altered | 3 | 3.0 | 0.0 | 0.0 |
| cell growth, induced | 2 | 0.0 | 2.0 | 0.0 |
| signaling pathway regulation | 2 | 0.0 | 2.0 | 0.0 |
| apoptosis, altered | 1 | 1.0 | 0.0 | 0.0 |
| cell motility, inhibited | 1 | 1.0 | 0.0 | 0.0 |
| cytoskeletal reorganization | 1 | 1.0 | 0.0 | 0.0 |
| cell adhesion, inhibited | 1 | 1.0 | 0.0 | 0.0 |
| carcinogenesis, inhibited | 1 | 0.0 | 1.0 | 0.0 |
| cell growth, inhibited | 1 | 0.0 | 1.0 | 0.0 |
| autophagy, inhibited | 1 | 0.0 | 1.0 | 0.0 |
| carcinogenesis, altered | 1 | 0.0 | 1.0 | 0.0 |

| | database | annotation_type | column |
|---|---|---|---|
| 4 | Combined | Interactions | Combined:Interactions |
| 5 | Combined | Kinase | Combined:Kinase |
| 2 | DEPOD | Phosphatase | DEPOD:Phosphatase |
| 3 | ELM | Interactions | ELM:Interactions |
| 0 | PhosphoSitePlus | Process | PSP:ON_PROCESS |
| 1 | PhosphoSitePlus | Interactions | PSP:ON_PROT_INTERACT |
| 6 | PhosphoSitePlus | Disease | PSP:Disease_Association |
| 8 | PhosphoSitePlus | Function | PSP:ON_FUNCTION |
| 7 | RegPhos | Kinase | RegPhos:Kinase |

| PSP:ON_PROCESS | Fraction Impacted | p-value | Adjusted p-value | PTM |
|---|---|---|---|---|
| cell motility | 5/1078 | 0.052579 | 0.420633 | ABI1_S392;CTTN_K272;CTTN_S298;SPHK2_S387;SPHK2... |
| cell adhesion | 2/324 | 0.122466 | 0.489864 | CTTN_S298;MPZL1_Y241 |
| cell growth | 4/1793 | 0.427134 | 1.000000 | BCAR1_Y267;BCAR1_Y287;BCAR1_Y306;TSC2_S981 |
| autophagy | 1/306 | 0.434215 | 0.868429 | TSC2_S981 |
| cytoskeletal reorganization | 2/796 | 0.435637 | 0.868429 | ABI1_S392;CTTN_S298 |
| apoptosis | 2/1179 | 0.644065 | 0.868429 | CEACAM1_S461;CEACAM1_T457 |
| signaling pathway regulation | 2/1206 | 0.656208 | 0.868429 | BIN1_T348;TSC2_S981 |
| carcinogenesis | 2/1501 | 0.768091 | 0.868429 | TSC2_S981;YAP1_K342 |

 | Gene_set | Term | Overlap | P-value | Adjusted P-value | Old P-value | Old Adjusted P-value | Odds Ratio | Combined Score | Genes | Type | Genes with Differentially Included PTMs only | Genes with PTM with Altered Flanking Sequence only | Genes with Both |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GO_Biological_Process_2023 | Regulation Of Neurogenesis (GO:0050767) | 5/67 | 0.000018 | 0.011675 | 0 | 0 | 17.392181 | 189.619722 | YAP1;APLP2;DOCK7;NUMB;NF2 | Differentially Included + Altered Flanking Seq... | YAP1;APLP2 | NF2 | DOCK7;NUMB |
1 | GO_Biological_Process_2023 | Enzyme-Linked Receptor Protein Signaling Pathw... | 6/124 | 0.000031 | 0.011675 | 0 | 0 | 11.055131 | 114.642865 | CSF1;FGFR3;FGFR2;PTPRF;BCAR1;MPZL1 | Differentially Included + Altered Flanking Seq... | FGFR2;CSF1;FGFR3 | | MPZL1;BCAR1;PTPRF |
2 | GO_Biological_Process_2023 | Protein Localization To Cell-Cell Junction (GO... | 3/15 | 0.000048 | 0.011675 | 0 | 0 | 52.901596 | 525.813416 | TJP1;LSR;SCRIB | Differentially Included + Altered Flanking Seq... | | LSR | SCRIB;TJP1 |
3 | GO_Biological_Process_2023 | Regulation Of Cell Migration (GO:0030334) | 10/434 | 0.000049 | 0.011675 | 0 | 0 | 5.280579 | 52.425684 | TJP1;CEACAM1;CSF1;ADAM15;LIMCH1;APLP2;NUMB;ITG... | Differentially Included + Altered Flanking Seq... | APLP2;CSF1;ITGA6 | NF2 | ADAM15;NUMB;LIMCH1;BCAR1;TJP1;CEACAM1 |
4 | GO_Biological_Process_2023 | Integrin-Mediated Signaling Pathway (GO:0007229) | 5/85 | 0.000058 | 0.011675 | 0 | 0 | 13.466712 | 131.282293 | CEACAM1;ADAM15;ITGA6;CD47;BCAR1 | Differentially Included + Altered Flanking Seq... | ITGA6;CD47 | | ADAM15;CEACAM1;BCAR1 |
 | Modified Gene | Interacting Gene | Residue | Type | Source | dPSI | Regulation Change |
---|---|---|---|---|---|---|---|
0 | ADAM15 | HCK | Y735;Y715 | REGULATES | PSP/RegPhos | 0.181;-0.052 | +;- |
1 | ADAM15 | LCK | Y715 | REGULATES | PSP/RegPhos | 0.181;-0.052 | +;- |
2 | ADAM15 | SRC | Y735;Y715 | REGULATES | PSP/RegPhos | 0.181;-0.052 | +;- |
3 | BCAR1 | SRC | Y267;Y287 | REGULATES | PSP/RegPhos | -0.07 | - |
4 | BIN1 | MAPT | T348 | INDUCES | PhosphoSitePlus;PTMInt | -0.112 | - |
 | UniProtKB Accession | Residue | PTM Position in Canonical Isoform | Modification Class | Inclusion Sequence | Exclusion Sequence | Sequence Identity | Altered Positions | Residue Change | Altered Flank Side |
---|---|---|---|---|---|---|---|---|---|---|
0 | P01116 | T | 148 | Phosphorylation | ETSAKtRQESG | ETSAKtRQGC* | NaN | NaN | NaN | NaN |
1 | P01116 | K | 147 | Acetylation | IETSAkTRQES | IETSAkTRQGC | 0.818182 | [4.0, 5.0] | [E->G, S->C] | C-term only |
2 | P01116 | K | 147 | Ubiquitination | IETSAkTRQES | IETSAkTRQGC | 0.818182 | [4.0, 5.0] | [E->G, S->C] | C-term only |
3 | Q9UPQ0 | S | 746 | Phosphorylation | LPNLNsQGVAW | LPNLNsQGGFS | 0.727273 | [3.0, 4.0, 5.0] | [V->G, A->F, W->S] | C-term only |
4 | Q9UPQ0 | S | 750 | Phosphorylation | PSQVDsPSSEK | ILKVDsPSSEK | 0.727273 | [-5.0, -4.0, -3.0] | [P->I, S->L, Q->K] | N-term only |
 | Gene | UniProtKB Accession | Residue | PTM Position in Canonical Isoform | Modification Class | Inclusion Sequence | Exclusion Sequence | Motif only in Inclusion | Motif only in Exclusion | Altered Positions | Residue Change |
---|---|---|---|---|---|---|---|---|---|---|---|
22 | MLPH | Q9BV36 | S | 337 | Phosphorylation | RGRASsESQDL | RGRASsESQGS | LIG_14-3-3_CanoR_1 | NaN | [4.0, 5.0] | [D->G, L->S] |
23 | MLPH | Q9BV36 | S | 339 | Phosphorylation | RASSEsQDL*A | RASSEsQGSRC | LIG_14-3-3_CanoR_1 | NaN | NaN | NaN |
50 | CEACAM1 | P13688 | T | 457 | Phosphorylation | LHFGKtGRGKR | LHFGKtGRLRT | NaN | LIG_14-3-3_CterR_2 | [3.0, 4.0, 5.0] | [G->L, K->R, R->T] |
67 | ENAH | Q8N8S7 | S | 512 | Phosphorylation | KSPVIsRTGFS | KSPVIsRTKIH | LIG_14-3-3_CterR_2 | NaN | [3.0, 4.0, 5.0] | [G->K, F->I, S->H] |
93 | LMO7 | Q8WWI1-3 | S | 356 | Phosphorylation | ADGTFsRTLSK | ADGTFsRE*VH | LIG_14-3-3_CterR_2 | NaN | NaN | NaN |
129 | MAP3K7 | O43318 | T | 403 | Phosphorylation | RIAATtGLFQA | RIAATtGQRTA | LIG_14-3-3_CanoR_1 | NaN | [2.0, 3.0, 4.0] | [L->Q, F->R, Q->T] |
141 | LMO7 | Q8WWI1-3 | T | 354 | Phosphorylation | TEADGtFSR*S | TEADGtFSRE* | LIG_14-3-3_CterR_2 | NaN | NaN | NaN |
' + + '' + + Documentation.gettext("Hide Search Matches") + + "
" + ) + ); + }, + + /** + * helper function to hide the search marks again + */ + hideSearchWords: () => { + document + .querySelectorAll("#searchbox .highlight-link") + .forEach((el) => el.remove()); + document + .querySelectorAll("span.highlighted") + .forEach((el) => el.classList.remove("highlighted")); + const url = new URL(window.location); + url.searchParams.delete("highlight"); + window.history.replaceState({}, "", url); + }, + + /** + * helper function to focus on search bar + */ + focusSearchBar: () => { + document.querySelectorAll("input[name=q]")[0]?.focus(); + }, + + /** + * Initialise the domain index toggle buttons + */ + initDomainIndexTable: () => { + const toggler = (el) => { + const idNumber = el.id.substr(7); + const toggledRows = document.querySelectorAll(`tr.cg-${idNumber}`); + if (el.src.substr(-9) === "minus.png") { + el.src = `${el.src.substr(0, el.src.length - 9)}plus.png`; + toggledRows.forEach((el) => (el.style.display = "none")); + } else { + el.src = `${el.src.substr(0, el.src.length - 8)}minus.png`; + toggledRows.forEach((el) => (el.style.display = "")); + } + }; + + const togglerElements = document.querySelectorAll("img.toggler"); + togglerElements.forEach((el) => + el.addEventListener("click", (event) => toggler(event.currentTarget)) + ); + togglerElements.forEach((el) => (el.style.display = "")); + if (DOCUMENTATION_OPTIONS.COLLAPSE_INDEX) togglerElements.forEach(toggler); + }, + + initOnKeyListeners: () => { + // only install a listener if it is really needed + if ( + !DOCUMENTATION_OPTIONS.NAVIGATION_WITH_KEYS && + !DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS + ) + return; + + const blacklistedElements = new Set([ + "TEXTAREA", + "INPUT", + "SELECT", + "BUTTON", + ]); + document.addEventListener("keydown", (event) => { + if (blacklistedElements.has(document.activeElement.tagName)) return; // bail for input elements + if (event.altKey || event.ctrlKey || event.metaKey) return; // bail with special keys + + if (!event.shiftKey) { 
+ switch (event.key) { + case "ArrowLeft": + if (!DOCUMENTATION_OPTIONS.NAVIGATION_WITH_KEYS) break; + + const prevLink = document.querySelector('link[rel="prev"]'); + if (prevLink && prevLink.href) { + window.location.href = prevLink.href; + event.preventDefault(); + } + break; + case "ArrowRight": + if (!DOCUMENTATION_OPTIONS.NAVIGATION_WITH_KEYS) break; + + const nextLink = document.querySelector('link[rel="next"]'); + if (nextLink && nextLink.href) { + window.location.href = nextLink.href; + event.preventDefault(); + } + break; + case "Escape": + if (!DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS) break; + Documentation.hideSearchWords(); + event.preventDefault(); + } + } + + // some keyboard layouts may need Shift to get / + switch (event.key) { + case "/": + if (!DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS) break; + Documentation.focusSearchBar(); + event.preventDefault(); + } + }); + }, +}; + +// quick alias for translations +const _ = Documentation.gettext; + +_ready(Documentation.init); diff --git a/_static/documentation_options.js b/_static/documentation_options.js new file mode 100644 index 0000000..10e567e --- /dev/null +++ b/_static/documentation_options.js @@ -0,0 +1,14 @@ +var DOCUMENTATION_OPTIONS = { + URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'), + VERSION: '1', + LANGUAGE: 'en', + COLLAPSE_INDEX: false, + BUILDER: 'html', + FILE_SUFFIX: '.html', + LINK_SUFFIX: '.html', + HAS_SOURCE: true, + SOURCELINK_SUFFIX: '', + NAVIGATION_WITH_KEYS: false, + SHOW_SEARCH_SUMMARY: true, + ENABLE_SEARCH_SHORTCUTS: false, +}; \ No newline at end of file diff --git a/_static/file.png b/_static/file.png new file mode 100644 index 0000000..a858a41 Binary files /dev/null and b/_static/file.png differ diff --git a/_static/images/logo_binder.svg b/_static/images/logo_binder.svg new file mode 100644 index 0000000..45fecf7 --- /dev/null +++ b/_static/images/logo_binder.svg @@ -0,0 +1,19 @@ + + + diff --git 
a/_static/images/logo_colab.png b/_static/images/logo_colab.png new file mode 100644 index 0000000..b7560ec Binary files /dev/null and b/_static/images/logo_colab.png differ diff --git a/_static/images/logo_deepnote.svg b/_static/images/logo_deepnote.svg new file mode 100644 index 0000000..fa77ebf --- /dev/null +++ b/_static/images/logo_deepnote.svg @@ -0,0 +1 @@ + diff --git a/_static/images/logo_jupyterhub.svg b/_static/images/logo_jupyterhub.svg new file mode 100644 index 0000000..60cfe9f --- /dev/null +++ b/_static/images/logo_jupyterhub.svg @@ -0,0 +1 @@ + diff --git a/_static/jquery-3.6.0.js b/_static/jquery-3.6.0.js new file mode 100644 index 0000000..fc6c299 --- /dev/null +++ b/_static/jquery-3.6.0.js @@ -0,0 +1,10881 @@ +/*! + * jQuery JavaScript Library v3.6.0 + * https://jquery.com/ + * + * Includes Sizzle.js + * https://sizzlejs.com/ + * + * Copyright OpenJS Foundation and other contributors + * Released under the MIT license + * https://jquery.org/license + * + * Date: 2021-03-02T17:08Z + */ +( function( global, factory ) { + + "use strict"; + + if ( typeof module === "object" && typeof module.exports === "object" ) { + + // For CommonJS and CommonJS-like environments where a proper `window` + // is present, execute the factory and get jQuery. + // For environments that do not have a `window` with a `document` + // (such as Node.js), expose a factory as module.exports. + // This accentuates the need for the creation of a real `window`. + // e.g. var jQuery = require("jquery")(window); + // See ticket #14549 for more info. + module.exports = global.document ? + factory( global, true ) : + function( w ) { + if ( !w.document ) { + throw new Error( "jQuery requires a window with a document" ); + } + return factory( w ); + }; + } else { + factory( global ); + } + +// Pass this if window is not defined yet +} )( typeof window !== "undefined" ? 
window : this, function( window, noGlobal ) { + +// Edge <= 12 - 13+, Firefox <=18 - 45+, IE 10 - 11, Safari 5.1 - 9+, iOS 6 - 9.1 +// throw exceptions when non-strict code (e.g., ASP.NET 4.5) accesses strict mode +// arguments.callee.caller (trac-13335). But as of jQuery 3.0 (2016), strict mode should be common +// enough that all such attempts are guarded in a try block. +"use strict"; + +var arr = []; + +var getProto = Object.getPrototypeOf; + +var slice = arr.slice; + +var flat = arr.flat ? function( array ) { + return arr.flat.call( array ); +} : function( array ) { + return arr.concat.apply( [], array ); +}; + + +var push = arr.push; + +var indexOf = arr.indexOf; + +var class2type = {}; + +var toString = class2type.toString; + +var hasOwn = class2type.hasOwnProperty; + +var fnToString = hasOwn.toString; + +var ObjectFunctionString = fnToString.call( Object ); + +var support = {}; + +var isFunction = function isFunction( obj ) { + + // Support: Chrome <=57, Firefox <=52 + // In some browsers, typeof returns "function" for HTML