-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create more realistic examples #77
Comments
Hi @mmokrejs, Thank you for opening this. I am not sure I follow the entire issue, and I'll try to answer from what I understand. Please correct me if I misunderstood something.
So the first step for your usecase is to parse the rows and extract the intervals data_raw = [('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49207868, 49208160)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208292, 49208636)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208292, 49208453)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24]),
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208292, 49208559)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208292, 49208636)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208292, 49208466)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])]
regions = []
exon_names = []
widths = []
starts = []
for dr in data_raw:
widths.extend([abs(x[0] - x[1]) for x in dr[3]])
starts.extend([x[0] if x[1] > x[0] else x[1] for i, x in enumerate(dr[3])])
exon_names.extend([dr[1]] * len(dr[3]))
regions.extend([dr[0]] * len(dr[3])) Now that we have some of the information extracted, lets create a from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
gr = GenomicRanges(
seqnames=[
"chr11"
] * len(starts),
ranges=IRanges(start=starts, width=widths),
strand=["-"] * len(starts),
mcols=BiocFrame(
{
"exon": exon_names,
"region": regions
}
),
)
print(gr)
## OUTPUT
GenomicRanges with 26 ranges and 26 metadata columns
seqnames ranges strand exon region
<str> <IRanges> <ndarray[int8]> <list> <list>
[0] chr11 49200301 - 49200441 - | DA340864; NC_000011.10
[1] chr11 49206067 - 49206172 - | DA340864; NC_000011.10
[2] chr11 49207868 - 49208160 - | DA340864; NC_000011.10
... ... ... | ... ...
[23] chr11 49200255 - 49200441 - | DA308616; NC_000011.10
[24] chr11 49206067 - 49206172 - | DA308616; NC_000011.10
[25] chr11 49208292 - 49208466 - | DA308616; NC_000011.10
------
seqinfo(1 sequences): chr11 Lets reduce these intervals into non-overlapping regions, print(gr.reduce(with_reverse_map=True))
## OUTPUT
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand revmap
<str> <IRanges> <ndarray[int8]> <list>
[0] chr11 49192792 - 49192894 - | [9, 22]
[1] chr11 49200255 - 49200441 - | [10, 23, 6, 19, 13, 3, 0, 16]
[2] chr11 49206067 - 49206172 - | [1, 4, 7, 11, 14, 17, 20, 24]
[3] chr11 49207868 - 49208160 - | [2]
[4] chr11 49208292 - 49208636 - | [12, 25, 8, 15, 21, 5, 18]
------
seqinfo(1 sequences): chr11
By default the |
Hi @jkanche , thank you for your excellent answer, I edited the code to match better biological terms and I fixed/swapped the coordinates of the trailing exons (my mistake in the initial example). data_raw = [
('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49208160, 49207868)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208636, 49208292)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208559, 49208292)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208453, 49208292)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24]),
('NC_000011.10', 'DA259004;', None, [(32091250, 32091265), (81783019, 81783034), (89639236, 89639342), (89645003, 89645189), (89652496, 89652598)], [(116, 131), (145, 160), (162, 268), (269, 455), (456, 558)], ['100%', '100%', '100%', '98%', '99%'], ['==', '==', '->', '->', None], [None, None, '(GT/AG)', '(GT/AG)', None], [None, None, 24, 24, 24]),
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208559, 49208292)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208636, 49208292)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208559, 49208292)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208466, 49208292)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])
]
chromosome_accessions = []
mRNA_accessions = []
widths = []
starts = []
for dr in data_raw:
widths.extend([abs(x[0] - x[1]) for x in dr[3]])
starts.extend([x[0] if x[1] > x[0] else x[1] for i, x in enumerate(dr[3])]) # this will break if a gene spans origin (zero) in a circular chromosome
mRNA_accessions.extend([dr[1] for splice_var in data_raw])
chromosome_accessions.extend([dr[0] for splice_var in data_raw])
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
import string
def generate_exon_names(count):
names = []
for letter1 in string.ascii_uppercase:
names.append('exon_%s' % letter1)
if count > len(string.ascii_uppercase):
for letter1 in string.ascii_uppercase:
for letter2 in string.ascii_uppercase:
names.append('exon_%s%s' % (letter1, letter2))
return names[:count]
splice_variants = [(i, splice_var[3]) for i, splice_var in enumerate(data_raw)]
strands = []
for splice_variant in splice_variants:
#print("splice variant: %s" % str(splice_variant))
i, exons = splice_variant
#print("All exons %s" % str(exons))
for exon in exons:
#print("Individual exon %s" % str(exon))
if data_raw[i][2] == '(complement)':
strands.append('-')
elif data_raw[i][2] is None:
strands.append('+')
else:
raise("Unexpected strand: %s" % str(data_raw[i][2]))
gr = GenomicRanges(
seqnames=[
"chr11"
] * len(starts),
ranges=IRanges(start=starts, width=widths),
strand=strands,
mcols=BiocFrame(
{
"accession": mRNA_accessions,
"exon": generate_exon_names(len(starts)),
"chr_accession": chromosome_accessions
}
),
)
print(gr) Here I show which ot the splice variants can be merged:
>>> print(gr)
GenomicRanges with 31 ranges and 31 metadata columns
seqnames ranges strand accession exon chr_accession
<str> <IRanges> <ndarray[int8]> <list> <list> <list>
[0] chr11 49200301 - 49200441 - | DA340864; exon_A NC_000011.10
[1] chr11 49206067 - 49206172 - | DA340864; exon_B NC_000011.10
[2] chr11 49207868 - 49208160 - | DA340864; exon_C NC_000011.10
... ... ... | ... ... ...
[28] chr11 49200255 - 49200441 - | DA308616; exon_AC NC_000011.10
[29] chr11 49206067 - 49206172 - | DA308616; exon_AD NC_000011.10
[30] chr11 49208292 - 49208466 - | DA308616; exon_AE NC_000011.10
------
seqinfo(1 sequences): chr11
>>> print(gr.reduce(with_reverse_map=True))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/foo/genomicranges/lib/python3.11/site-packages/genomicranges/GenomicRanges.py", line 1561, in reduce
output._mcols.set_column("revmap", rev_map, in_place=True)
File "/foo/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py", line 936, in set_column
return self.set_columns({column: value}, in_place=in_place)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/foo/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py", line 964, in set_columns
raise ValueError(
ValueError: Length of `value`, does not match the number of the rows,need to be 10 but provided 15.
>>> |
@mmokrejs I fixed this bug yesterday, you may need to update the package to 0.4.14. |
OK, 0.4.14 fixed the 0.4.12 issue. So now I have: >>> print(gr)
GenomicRanges with 31 ranges and 31 metadata columns
seqnames ranges strand accession exon chr_accession
<str> <IRanges> <ndarray[int8]> <list> <list> <list>
[0] chr11 49200301 - 49200441 - | DA340864; exon_A NC_000011.10
[1] chr11 49206067 - 49206172 - | DA340864; exon_B NC_000011.10
[2] chr11 49207868 - 49208160 - | DA340864; exon_C NC_000011.10
... ... ... | ... ... ...
[28] chr11 49200255 - 49200441 - | DA308616; exon_AC NC_000011.10
[29] chr11 49206067 - 49206172 - | DA308616; exon_AD NC_000011.10
[30] chr11 49208292 - 49208466 - | DA308616; exon_AE NC_000011.10
------
seqinfo(1 sequences): chr11
>>> print(gr.reduce(with_reverse_map=True))
GenomicRanges with 10 ranges and 10 metadata columns
seqnames ranges strand revmap
<str> <IRanges> <ndarray[int8]> <list>
[0] chr11 32091250 - 32091265 + | [13]
[1] chr11 81783019 - 81783034 + | [14]
[2] chr11 89639236 - 89639342 + | [15]
[3] chr11 89645003 - 89645189 + | [16]
[4] chr11 89652496 - 89652598 + | [17]
[5] chr11 49192792 - 49192894 - | [9, 27]
[6] chr11 49200255 - 49200441 - | [10, 28, 6, 24, 18, 3, 0, 21]
[7] chr11 49206067 - 49206172 - | [1, 4, 7, 11, 19, 22, 25, 29]
[8] chr11 49207868 - 49208160 - | [2]
[9] chr11 49208292 - 49208636 - | [12, 30, 8, 20, 26, 5, 23]
------
seqinfo(1 sequences): chr11
>>> But, I did not want to have isolated list of exons but to retain compatible (actually existing) range combinations . Assumably the 8 items from |
Basically, it does not do what I need. It should act on the lists of ranges (read on the fragments of splice variants defined by a combination of exonic ranges). The very first and very last exon might be incomplete, hence why I initially wrote that |
The |
Chiming in here. I find it very unusual to see start positions to be greater than the end position, irrespective of the strand. Certainly this is not the case for any standard formats (GFF, BED), nor is it the case for the R As for the As for the question of what you're trying to do: I don't think the
Seems fairly complex. |
Hi @LTLA ,
About half of the genes in a genome are on a minus strand.
Please realize the first exon (exon A) is on the far right, while last exon is on far left. Start of an exon will always be where it actually starts, that is where a ribosome starts to synthesize a protein. You cannot change it hoping that biological reality will adjust to broken programmers assumptions. There is no way to enumerate the exons from "left" nor number them from lower to higher coordinate positions. Take the below as authoritative example of a gene annotated on a minus strand. Notably, the exonic ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000356696.csv
When you parse any annotation format like GBK or GFF, the coordinates of exons and of genes in general are
This is exactly why I think
Propagation of broken underlying design does not justify anything. This leads to nowhere, like if we discuss 0-based numbering still being retained in python as it is compatible with the slicing syntax and in BED format.
Each genome annotation/prediction website tries to merge compatible splice variants to a minimum. It is not difficult, just checking if the inner exon junction have exactly same values and once they differ, the two variants in question cannot be merged. Alternatively, if all the inner exon junctions are same then union of the inner exons and (pre/ap)pend those wider of the first and last exons. |
Hi,
I thought I can use
GenomicRanges
to create its objects based on[(start1,end1), (start2,end2), (start3,end3), (start4,end4), (start5,end5)]
exon positions and then unify these splice variants into less "rows" while ensuringstart[2,3,4,5]
must match exactly andend[1,2,3,4]
must match exactly. The examples in README.md but even docs/tutorial.md are simply comfusing. I understand it is easy to programmatically smash in some number forstart
s andend
s (orwidth
s, huh?) but neither of the examples seems to carefully demonstrate how one can represent several splice variants sharing several exons.Moreover, the https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.union seems to do something what I do not want.
Then https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.reduce does something what I do not understand.
What is scary for me that IRanges accept
start
andwidth
instead ofstart
andend
. I fear if I specify on minus strand start=21, end=10 some unknown code breaks becausestart > end
and I fearstart=21
,width=9
will result instart=21, end=30
.The documentation is so large that I do not even want to study if this originates from
IRanges
orR
implementation ofgenomicranges
else.Please make the examples realistic.
These shoukd be a helper method so one does not need to learn the internals of the implementation and simply pass-in a list of tuples and be done with instantiation of a
GenomicRanges
object.My data are in the following format, some examples:
I will have thousand of those and want to meaningfully merge the splice variants if they have exactly same exons except
start1
andend5
values which do not have to be exactly same. The resulting unified splice variant(s) should retain these two values to accommodate the widest splice variant from the input.Ideally I want to numsort by start and end positions, left to right. Please note this gene is on minus strand.
Thank you for any effort to make the package and its documentation more usefull to biologists.
The text was updated successfully, but these errors were encountered: