-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Coordinate retrieval errors on peptide belonging to multiple transcripts #123
Comments
Thanks for the detailed report @manogenome ! I'll have a closer look into it. |
The coordinates are all correctly calculated, but the problem is the mapping of Uniprot IDs to Ensembl Protein IDs: eprt <- proteins(edb, filter = ~ uniprot == "P04406")
## re-order them matching the order in `stacked`
eprt <- eprt[match(eprt$protein_id, levels(stacked$stack)), ]
eprt
DataFrame with 5 rows and 4 columns
tx_id protein_id protein_sequence uniprot_id
<character> <character> <character> <character>
1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR.. P04406
2 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA.. P04406
3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR.. P04406
4 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR.. P04406
5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA.. P04406 so, two of the Ensembl protein IDs that are assigned to P04406 have in fact a different amino acid sequence. Not surprisingly, the peptide sequence for the subset is thus also different: substring(eprt$protein_sequence, 85, 107)
[1] "IKWGDAGAEYVVESTGVFTTMEK" "DAPMFVMGVNHEKYDNSLKIISN"
[3] "IKWGDAGAEYVVESTGVFTTMEK" "IKWGDAGAEYVVESTGVFTTMEK"
[5] "DAPMFVMGVNHEKYDNSLKIISN" But this fits what you get translating the genomic sequence. So, I don't have a solution for your problem - the annotations and mappings between identifiers are provided by Ensembl (or Uniprot). |
Thank you, @jorainer , for the detailed answer!! This has clarified a few things for me in narrowing down the source of my problems, which I'm currently validating; will post an update on this later this week. |
Excellent! Thank for the feedback! Cross-database mapping of identifiers has always been a problem in the genomics(/proteomics/metabolomics) field. |
Turns out, there’s a discordance in ensembldb’s proteinToGenome() output between Further inspection of this peptide’s coordinate on the five matching Ensembl proteins showed that only on the three sequences did the coordinates actually match the position on the UniProt sequence, P04406. When these five peptide coordinates were used by proteinToGenome() function with Ensembl Visualizing these coordinates along with annotation tracks revealed that two transcript isoforms that encoded this peptide had in fact had a different translational start points as opposed to the other three, but all fell within the same exon (ENSE00003678358) on the genome. We can conclude proteinToGenome() function need to take this into account to correctly retrieve genomic coordinates for such isoforms when using peptide coordinates on the canonical uniport protein as its input. We can also observe that the error gets propagated from proteinToTranscript() to proteinToGenome() function by looking at the result, requiring further investigation. To address this we will need the canonical status of the transcripts to check for exon ids in the isoforms to perform coordinate shift if they've different translational start positions. From version 104, Ensembl has included this canonical tag for the transcripts. I think it would be useful to include this tag as metadata for downstream analysis. loading librarieslibrary(BSgenome.Hsapiens.UCSC.hg38)
library(GenomicRanges)
library(ensembldb)
library(AnnotationHub)
Hsapiens <- BSgenome.Hsapiens.UCSC.hg38
ah <- AnnotationHub()
#> snapshotDate(): 2021-10-20
qr <- query(ah, c("EnsDb", "sapiens", "97"))
edb <- qr[[1]]
#> loading from cache
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.4
#> |Creation time: Sat Jul 6 01:28:30 2019
#> |ensembl_version: 97
#> |ensembl_host: localhost
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.1
#> | No. of genes: 67667.
#> | No. of transcripts: 248916.
#> |Protein data available. 1) protein_id based genome coordinate retrieval## retrieve matching Ensembl proteins for P04406
prts <- proteins(edb, filter = UniprotFilter("P04406"))
prts
#> DataFrame with 5 rows and 4 columns
#> tx_id protein_id protein_sequence uniprot_id
#> <character> <character> <character> <character>
#> 1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR.. P04406
#> 2 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR.. P04406
#> 3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR.. P04406
#> 4 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA.. P04406
#> 5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA.. P04406
## peptide's start coordinates of ENSPs
prts$start <-
nchar(do.call(
"rbind",
base::strsplit(prts$protein_sequence, 'IKWGDAGAEYVVESTGVFTTMEK')
)[, 1]) + 1
## peptide end coordinates of ENSPs
prts$end <- prts$start + nchar('IKWGDAGAEYVVESTGVFTTMEK') - 1
prts
#> DataFrame with 5 rows and 6 columns
#> tx_id protein_id protein_sequence uniprot_id start
#> <character> <character> <character> <character> <numeric>
#> 1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR.. P04406 85
#> 2 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR.. P04406 85
#> 3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR.. P04406 85
#> 4 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA.. P04406 43
#> 5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA.. P04406 43
#> end
#> <numeric>
#> 1 107
#> 2 107
#> 3 107
#> 4 65
#> 5 65
## peptide ranges on ENSPs
ir <- IRanges(start = prts$start, end = prts$end, names = prts$protein_id)
ir
#> IRanges object with 5 ranges and 0 metadata columns:
#> start end width
#> <integer> <integer> <integer>
#> ENSP00000229239 85 107 23
#> ENSP00000380070 85 107 23
#> ENSP00000380068 85 107 23
#> ENSP00000380067 43 65 23
#> ENSP00000478864 43 65 23
## peptide coordinates on ENSTs
trl <- proteinToTranscript(ir, edb, idType = "protein_id")
#> Fetching CDS for 5 proteins ... 5 found
#> Checking CDS and protein sequence lengths ... 5/5 OK
tr <- unlist(trl, use.names = FALSE)
tr
#> IRanges object with 5 ranges and 5 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000229239 329 397 69 | ENSP00000229239
#> ENST00000396861 404 472 69 | ENSP00000380070
#> ENST00000396859 332 400 69 | ENSP00000380068
#> ENST00000396858 392 460 69 | ENSP00000380067
#> ENST00000619601 127 195 69 | ENSP00000478864
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000229239 ENST00000229239 TRUE 85 107
#> ENST00000396861 ENST00000396861 TRUE 85 107
#> ENST00000396859 ENST00000396859 TRUE 85 107
#> ENST00000396858 ENST00000396858 TRUE 43 65
#> ENST00000619601 ENST00000619601 TRUE 43 65
## peptide coordinates on ENSG
grl <- proteinToGenome(ir, edb, idType = "protein_id")
#> Fetching CDS for 5 proteins ... 5 found
#> Checking CDS and protein sequence lengths ... 5/5 OK
gr <- unlist(GRangesList(grl))
gr
#> GRanges object with 5 ranges and 7 metadata columns:
#> seqnames ranges strand | protein_id
#> <Rle> <IRanges> <Rle> | <character>
#> ENSP00000229239 12 6536936-6537004 + | ENSP00000229239
#> ENSP00000380070 12 6536936-6537004 + | ENSP00000380070
#> ENSP00000380068 12 6536936-6537004 + | ENSP00000380068
#> ENSP00000380067 12 6536936-6537004 + | ENSP00000380067
#> ENSP00000478864 12 6536936-6537004 + | ENSP00000478864
#> tx_id exon_id exon_rank cds_ok
#> <character> <character> <integer> <logical>
#> ENSP00000229239 ENST00000229239 ENSE00003678358 5 TRUE
#> ENSP00000380070 ENST00000396861 ENSE00003678358 5 TRUE
#> ENSP00000380068 ENST00000396859 ENSE00003678358 4 TRUE
#> ENSP00000380067 ENST00000396858 ENSE00003678358 4 TRUE
#> ENSP00000478864 ENST00000619601 ENSE00003678358 3 TRUE
#> protein_start protein_end
#> <integer> <integer>
#> ENSP00000229239 85 107
#> ENSP00000380070 85 107
#> ENSP00000380068 85 107
#> ENSP00000380067 43 65
#> ENSP00000478864 43 65
#> -------
#> seqinfo: 1 sequence from GRCh38 genome
## extract genomic sequences
seqlevelsStyle(gr) <- "UCSC"
sequence <- getSeq(Hsapiens,gr)
sequence
#> DNAStringSet object of length 5:
#> width seq names
#> [1] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000229239
#> [2] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380070
#> [3] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380068
#> [4] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380067
#> [5] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000478864
## shows the retrieved coordinates code for the same peptide
translate(sequence)
#> AAStringSet object of length 5:
#> width seq names
#> [1] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000229239
#> [2] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000380070
#> [3] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000380068
#> [4] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000380067
#> [5] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000478864 2) uniprot_id based genome coordinate retrievaledb <- qr[[1]]
#> loading from cache
## let's define IRanges for the peptide
ir <- IRanges(start = 85, end = 107)
names(ir) <- "P04406"
## peptide coordinates on ENSTs (we can see two transcripts with incorrect coords)
trl <- proteinToTranscript(ir, edb, idType = "uniprot_id")
#> Fetching CDS for 1 proteins ...
#> 1 found
#> Checking CDS and protein sequence lengths ... 1/1 OK
tr <- unlist(trl, use.names = FALSE)
tr
#> IRanges object with 5 ranges and 6 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000229239 329 397 69 | ENSP00000229239
#> ENST00000396858 518 586 69 | ENSP00000380067
#> ENST00000396859 332 400 69 | ENSP00000380068
#> ENST00000396861 404 472 69 | ENSP00000380070
#> ENST00000619601 253 321 69 | ENSP00000478864
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000229239 ENST00000229239 TRUE 85 107
#> ENST00000396858 ENST00000396858 TRUE 85 107
#> ENST00000396859 ENST00000396859 TRUE 85 107
#> ENST00000396861 ENST00000396861 TRUE 85 107
#> ENST00000619601 ENST00000619601 TRUE 85 107
#> uniprot_id
#> <character>
#> ENST00000229239 P04406
#> ENST00000396858 P04406
#> ENST00000396859 P04406
#> ENST00000396861 P04406
#> ENST00000619601 P04406
## peptide coordinates on ENSG
prt_gn <- proteinToGenome(ir, edb, idType = "uniprot_id")
#> Fetching CDS for 1 proteins ... 1 found
#> Checking CDS and protein sequence lengths ... 1/1 OK
gr <- stack(prt_gn[[1]], index.var = "stack")
gr
#> GRanges object with 7 ranges and 9 metadata columns:
#> seqnames ranges strand | stack uniprot_id
#> <Rle> <IRanges> <Rle> | <Rle> <character>
#> [1] 12 6536936-6537004 + | ENSP00000229239 P04406
#> [2] 12 6537152-6537216 + | ENSP00000380067 P04406
#> [3] 12 6537309-6537312 + | ENSP00000380067 P04406
#> [4] 12 6536936-6537004 + | ENSP00000380068 P04406
#> [5] 12 6536936-6537004 + | ENSP00000380070 P04406
#> [6] 12 6537152-6537216 + | ENSP00000478864 P04406
#> [7] 12 6537309-6537312 + | ENSP00000478864 P04406
#> tx_id protein_id exon_id exon_rank cds_ok
#> <character> <character> <character> <integer> <logical>
#> [1] ENST00000229239 ENSP00000229239 ENSE00003678358 5 TRUE
#> [2] ENST00000396858 ENSP00000380067 ENSE00003562150 5 TRUE
#> [3] ENST00000396858 ENSP00000380067 ENSE00003663529 6 TRUE
#> [4] ENST00000396859 ENSP00000380068 ENSE00003678358 4 TRUE
#> [5] ENST00000396861 ENSP00000380070 ENSE00003678358 5 TRUE
#> [6] ENST00000619601 ENSP00000478864 ENSE00003562150 4 TRUE
#> [7] ENST00000619601 ENSP00000478864 ENSE00003663529 5 TRUE
#> protein_start protein_end
#> <integer> <integer>
#> [1] 85 107
#> [2] 85 107
#> [3] 85 107
#> [4] 85 107
#> [5] 85 107
#> [6] 85 107
#> [7] 85 107
#> -------
#> seqinfo: 1 sequence from GRCh38 genome
## extract genomic sequences
seqlevelsStyle(gr) <- "UCSC"
sequence <- getSeq(Hsapiens,gr)
sequence
#> DNAStringSet object of length 7:
#> width seq
#> [1] 69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [2] 65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
#> [3] 4 CAAT
#> [4] 69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [5] 69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [6] 65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
#> [7] 4 CAAT
## confirms we have indeed retrieved incorrect coordinates for two transcripts
## with different translational start positions.
translate(sequence)
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[2]]': last 2 bases were ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[3]]': last base was ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[6]]': last 2 bases were ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[7]]': last base was ignored
#> AAStringSet object of length 7:
#> width seq
#> [1] 23 IKWGDAGAEYVVESTGVFTTMEK
#> [2] 21 DAPMFVMGVNHEKYDNSLKII
#> [3] 1 Q
#> [4] 23 IKWGDAGAEYVVESTGVFTTMEK
#> [5] 23 IKWGDAGAEYVVESTGVFTTMEK
#> [6] 21 DAPMFVMGVNHEKYDNSLKII
#> [7] 1 Q Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.1 (2021-08-10)
#> os Ubuntu 18.04.5 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Brussels
#> date 2021-12-16
#> pandoc 2.11.4 @ /usr/lib/rstudio-server/bin/pandoc/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> AnnotationDbi * 1.56.2 2021-11-09 [1] Bioconductor
#> AnnotationFilter * 1.18.0 2021-10-26 [1] Bioconductor
#> AnnotationHub * 3.2.0 2021-10-26 [1] Bioconductor
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1)
#> Biobase * 2.54.0 2021-10-26 [1] Bioconductor
#> BiocFileCache * 2.2.0 2021-10-26 [1] Bioconductor
#> BiocGenerics * 0.40.0 2021-10-26 [1] Bioconductor
#> BiocIO 1.4.0 2021-10-26 [1] Bioconductor
#> BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.1)
#> BiocParallel 1.28.2 2021-11-25 [1] Bioconductor
#> BiocVersion 3.14.0 2021-05-19 [1] Bioconductor
#> biomaRt 2.50.1 2021-11-21 [1] Bioconductor
#> Biostrings * 2.62.0 2021-10-26 [1] Bioconductor
#> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.1)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.1)
#> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.1)
#> blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.1)
#> BSgenome * 1.62.0 2021-10-26 [1] Bioconductor
#> BSgenome.Hsapiens.UCSC.hg38 * 1.4.4 2021-11-26 [1] Bioconductor
#> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1)
#> cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.1)
#> crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1)
#> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.1)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.1)
#> dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.1)
#> DelayedArray 0.20.0 2021-10-26 [1] Bioconductor
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1)
#> dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.1)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.1)
#> ensembldb * 2.18.2 2021-11-08 [1] Bioconductor
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.1)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.1)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1)
#> filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.1)
#> fs 1.5.1 2021-11-30 [1] CRAN (R 4.1.1)
#> generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.1)
#> GenomeInfoDb * 1.30.0 2021-10-26 [1] Bioconductor
#> GenomeInfoDbData 1.2.7 2021-09-28 [1] Bioconductor
#> GenomicAlignments 1.30.0 2021-10-26 [1] Bioconductor
#> GenomicFeatures * 1.46.1 2021-10-27 [1] Bioconductor
#> GenomicRanges * 1.46.1 2021-11-18 [1] Bioconductor
#> glue 1.5.1 2021-11-30 [1] CRAN (R 4.1.1)
#> hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> httpuv 1.6.3 2021-09-09 [1] CRAN (R 4.1.1)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.1)
#> interactiveDisplayBase 1.32.0 2021-10-26 [1] Bioconductor
#> IRanges * 2.28.0 2021-10-26 [1] Bioconductor
#> KEGGREST 1.34.0 2021-10-26 [1] Bioconductor
#> knitr 1.36 2021-09-29 [1] CRAN (R 4.1.1)
#> later 1.3.0 2021-08-18 [1] CRAN (R 4.1.1)
#> lattice 0.20-44 2021-05-02 [4] CRAN (R 4.1.0)
#> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.1)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.1)
#> Matrix 1.3-4 2021-06-01 [4] CRAN (R 4.1.0)
#> MatrixGenerics 1.6.0 2021-10-26 [1] Bioconductor
#> matrixStats 0.61.0 2021-09-17 [1] CRAN (R 4.1.1)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.1.1)
#> pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1)
#> png 0.1-7 2013-12-03 [1] CRAN (R 4.1.1)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.1)
#> progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.1)
#> promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.1)
#> ProtGenerics 1.26.0 2021-10-26 [1] Bioconductor
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.1)
#> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.1)
#> RCurl 1.98-1.5 2021-09-17 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1)
#> restfulr 0.0.13 2017-08-06 [1] CRAN (R 4.1.1)
#> rjson 0.2.20 2018-06-08 [1] CRAN (R 4.1.1)
#> rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.1)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1)
#> Rsamtools 2.10.0 2021-10-26 [1] Bioconductor
#> RSQLite 2.2.9 2021-12-06 [1] CRAN (R 4.1.1)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1)
#> rtracklayer * 1.54.0 2021-10-26 [1] Bioconductor
#> S4Vectors * 0.32.3 2021-11-21 [1] Bioconductor
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1)
#> shiny 1.7.1 2021-10-02 [1] CRAN (R 4.1.1)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1)
#> SummarizedExperiment 1.24.0 2021-10-26 [1] Bioconductor
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.1)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.1)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.1)
#> withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
#> xfun 0.28 2021-11-04 [1] CRAN (R 4.1.1)
#> XML 3.99-0.8 2021-09-17 [1] CRAN (R 4.1.1)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.1)
#> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.1)
#> XVector * 0.34.0 2021-10-26 [1] Bioconductor
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.1)
#> zlibbioc 1.40.0 2021-10-26 [1] Bioconductor
#>
#> [1] /home/manoj/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ────────────────────────────────────────────────────────────────────────────── |
Thanks for this very comprehensive and detailed summary @manogenome ! So, if I understand correctly, restricting genomic mapping using canonical transcript of a gene would solve this? |
Yes, this would help us correctly retrieve the genome coordinate for the peptide as we use uniprot_id and peptide coordinates as inputs. We would still need genome coordinates for the peptide on all the isoforms. Now that we know the exon id(s) for the peptide on the canonical transcript, we can simply assign (instead of recalculating) the genome coordinates to all the transcript isoforms that share this peptide's exon id(s) and drop the other isoforms. This would require some validation. But first, we will need the canonical tag (Ensembl version >= 104). |
Once you have genomic coordinates you could also use the I have now a version of a With the update it would be e.g. possible to filter the |
- Extract also `is_canonical` field for transcripts from the Perl API. - Add the `TxIsCanonicalFilter` filter.
Thanks, @jorainer for adding this tag. You're right this tag will only work for newer Ensembl releases >= 104. I just cross-checked the hyperlinks for P04406-1 and P04406-2 in the below table. We can see I also went through the Uniprot reviewed fasta file for humans. It only had a |
You can install the updated version of The You could now restrict/subset the whole edb <- filter(edb, filter = ~ tx_is_canonical == 1) |
Thank you, @jorainer !! The updated package, database and the new |
Thanks @jorainer - will that flag be available on the database itself in the package? |
The annotation is available as an additional column to the transcript database table (
|
After running
proteinToGenome
, I'm observing a case where a peptide, arising from a gene with multiple isoforms, is found to have a mix of within exon and junction coordinates. Upon closer inspection, some of those coordinates turned out to be incorrect.To illustrate the issue, I calculated the genome coordinates for the peptide, "IKWGDAGAEYVVESTGVFTTMEK", from the protein, P04006 (UniProt id). This returned seven genomic ranges -- three as within exon ranges on three transcripts and the remaining four as belonging to a junction on two transcripts.
Upon extracting the genomic sequence corresponding to the above genomic ranges for the peptide and translating it, we observe that only the three within exon ranges returned the actual peptide, "IKWGDAGAEYVVESTGVFTTMEK". The junction ranges don't match the supplied peptide.
Manually inspecting IRanges returned by
proteinToTranscript
showed that the junction ranges were shifted from the actual ranges and are incorrect. I suspect the bug to be hiding somewhere in this step. Below I've added a reproducible code to demonstrate the issue using this peptide as an example.The text was updated successfully, but these errors were encountered: