-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, : subscript is out of bounds #126
Comments
The error is due to > ensembldb::transcripts(edb, filter = AnnotationFilter::UniprotFilter("Q1MSJ5"))
GRanges object with 2 ranges and 9 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <character>
ENST00000262210 8 67064368-67196263 + | ENST00000262210
ENST00000519668 8 67083848-67195611 + | ENST00000519668
tx_biotype tx_cds_seq_start tx_cds_seq_end
<character> <integer> <integer>
ENST00000262210 protein_coding 67064399 67195593
ENST00000519668 protein_coding 67095665 67195593
gene_id tx_support_level tx_id_version
<character> <integer> <character>
ENST00000262210 ENSG00000104218 1 ENST00000262210.9
ENST00000519668 ENSG00000104218 1 ENST00000519668.1
tx_name uniprot_id
<character> <character>
ENST00000262210 ENST00000262210 Q1MSJ5
ENST00000519668 ENST00000519668 Q1MSJ5
-------
seqinfo: 1 sequence from GRCh38 genome But some of the ranges must be out of bounds: > i <- which(x$names == "Q1MSJ5")
> ir <- IRanges(start = x$start[i], end = x$end[i], names = x$names[i])
> ir
IRanges object with 381 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
Q1MSJ5 1 15 15
Q1MSJ5 16 26 11
Q1MSJ5 31 36 6
Q1MSJ5 37 49 13
Q1MSJ5 57 69 13
... ... ... ...
Q1MSJ5 1137 1168 32
Q1MSJ5 1156 1172 17
Q1MSJ5 1167 1174 8
Q1MSJ5 1169 1175 7
Q1MSJ5 1173 1200 28
> ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id")
Fetching CDS for 381 proteins ... 377 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5, Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
4: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
5: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. I'll dig deeper and report back. |
So this is actually rather weird. Elements 69 and 70 are empty (which leads to the out of bounds error - see below for details), but the error only happens when they are with a non-empty element: > ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
> ensembldb::proteinToGenome(ir[69], edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning message:
In ensembldb::proteinToGenome(ir[69], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5
> ensembldb::proteinToGenome(ir[70], edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning message:
In ensembldb::proteinToGenome(ir[70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5
> ensembldb::proteinToGenome(ir[69:70], edb, idType = "uniprot_id")
Fetching CDS for 2 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning message:
In ensembldb::proteinToGenome(ir[69:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
> ensembldb::proteinToGenome(ir[69:71], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[69:71], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. So back to the error, that is thrown here: are_ok <- unlist(lapply(cds_genome, function(z) {
if (is(z, "GRangesList"))
all(z[[1]]$cds_ok) ## error when z has no ranges
else NA
})) Below, I debug the code and z is element 69 Browse[1]> debug at /tmp/proteinToX.R!1gB7hU#20: are_ok <- unlist(lapply(cds_genome, function(z) {
if (is(z, "GRangesList"))
all(z[[1]]$cds_ok)
else NA
}))
Browse[2]> z <- cds_genome[[2]]
Browse[2]> z
GRangesList object of length 0:
<0 elements>
Browse[2]> is(z, "GRangesList")
[1] TRUE
Browse[2]> z[[1]]
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds This can be fixed with are_ok <- unlist(lapply(cds_genome, function(z) {
if (is(z, "GRangesList") & length(z) > 0)
all(z[[1]]$cds_ok)
else NA
})) which will come as a PR in a moment (plus another minor issue related that). But it's not over... > ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/3 OK
Error in `[[<-`(`*tmp*`, name, value = 1201L) :
1 elements in value to replace 0 elements
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. Will continue to investigate and report. Coming back to the oddity above, as to why the code worked with only empty cds_genome
$Q1MSJ5
NULL
$Q1MSJ5
NULL which would, as anticipated, graciously produce are_ok
Q1MSJ5 Q1MSJ5
NA NA |
Continuing my stroll along the res <- mapply(
cds_genome, as(coords_cds, "IRangesList"), as(x, "IRangesList"),
FUN = function(gnm, cds, prt) {
if (is.null(gnm)) {
GRanges()
} else {
## Unlist because we'd like to have a GRanges here. Will split
## again later.
maps <- unlist(.to_genome(gnm, cds))
## Don't want to have GRanges names!
names(maps) <- NULL
mcols(maps)$protein_start <- start(prt) ## ERROR
mcols(maps)$protein_end <- end(prt) ## ERROR
maps[order(maps$exon_rank)]
}
}) Given that we have an empty Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements This can be fixes with an additional condition in the test (PR to follow): ...
if (is.null(gnm) | !length(gnm)) {
GRanges()
} else {
... > ensembldb::proteinToGenome(ir[68:71], edb, idType = "uniprot_id")
Fetching CDS for 4 proteins ... 2 found
Checking CDS and protein sequence lengths ... 2/4 OK
$Q1MSJ5
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67195453-67195527 + | Q1MSJ5 ENST00000262210
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000262210 ENSE00002101366 29 TRUE 1176
protein_end
<integer>
[1] 1200
-------
seqinfo: 1 sequence from GRCh38 genome
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRangesList object of length 2:
$ENSP00000262210
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67064399-67064476 + | Q1MSJ5 ENST00000262210
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000262210 ENSE00003513069 1 TRUE 1
protein_end
<integer>
[1] 26
-------
seqinfo: 1 sequence from GRCh38 genome
$ENSP00000430092
GRanges object with 2 ranges and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67095665-67095732 + | Q1MSJ5 ENST00000519668
[2] 8 67103037-67103046 + | Q1MSJ5 ENST00000519668
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000430092 ENSE00003677216 4 TRUE 1
[2] ENSP00000430092 ENSE00001284106 5 TRUE 1
protein_end
<integer>
[1] 26
[2] 26
-------
seqinfo: 1 sequence from GRCh38 genome
Warning messages:
1: In ensembldb::proteinToGenome(ir[68:71], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/3 OK
$Q1MSJ5
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67195453-67195527 + | Q1MSJ5 ENST00000262210
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000262210 ENSE00002101366 29 TRUE 1176
protein_end
<integer>
[1] 1200
-------
seqinfo: 1 sequence from GRCh38 genome
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. What I'm worried about is that the originally empty |
Thanks for the detailed report @lgatto ! The main reason seems to me that there's an issue between the CDS of the transcripts and the protein sequence. That is something I have already observed and that I can not really explain. This is also the main reason for this additional check I'm performing that compares the length of the CDS with the length of the protein sequence - if they don't match you'll get the Could not find a CDS whith the expected length for protein warnings - and in this cases it can well be that the peptide sequence is within the protein sequence but, because the CDS does not match the protein sequence, outside of the CDS (and the CDS is used to map coordinates to the genome). |
Note that there are two different types of Uniprot-to-Ensembl protein mappings (info provided by Ensembl): > listUniprotMappingTypes(edb)
[1] "DIRECT" "SEQUENCE_MATCH" Maybe worth to also try filtering the edb <- filter(edb, filter = ~ uniprot_mapping_type == "DIRECT") |
There's also the issue documented in #123 where some of the peptide coordinates along the UniProt proteins might actually not be mapped to the ENSEMBL canonical transcripts. Thanks for the uniprot mapping types hint! |
fix: all_ok check for empyt GRangesList (see issue #126)
I get the following error where start/end/names indicate start/end ranges of peptides along UnitProt identifiers (the same as discussed over slack yesterday) and
edb
isIndeed, I confirm the first warning:
Could you help with the error, @jorainer
The text was updated successfully, but these errors were encountered: