Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create more realistic examples #77

Open
Tracked by #79
mmokrejs opened this issue Apr 10, 2024 · 8 comments
Open
Tracked by #79

Create more realistic examples #77

mmokrejs opened this issue Apr 10, 2024 · 8 comments
Assignees

Comments

@mmokrejs
Copy link
Contributor

mmokrejs commented Apr 10, 2024

Hi,
I thought I can use GenomicRanges to create its objects based on [(start1,end1), (start2,end2), (start3,end3), (start4,end4), (start5,end5)] exon positions and then unify these splice variants into less "rows" while ensuring start[2,3,4,5] must match exactly and end[1,2,3,4] must match exactly. The examples in README.md but even docs/tutorial.md are simply comfusing. I understand it is easy to programmatically smash in some number for starts and ends (or widths, huh?) but neither of the examples seems to carefully demonstrate how one can represent several splice variants sharing several exons.

Moreover, the https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.union seems to do something what I do not want.

Then https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.reduce does something what I do not understand.

What is scary for me that IRanges accept start and width instead of start and end. I fear if I specify on minus strand start=21, end=10 some unknown code breaks because start > end and I fear start=21, width=9 will result in start=21, end=30.

The documentation is so large that I do not even want to study if this originates from IRanges or R implementation of genomicranges else.

Please make the examples realistic.

>>> GenomicRanges.set_ranges([(49200441, 49200301), (49206172, 49206067), (49207868, 49208160)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: GenomicRanges.set_ranges() missing 1 required positional argument: 'ranges'
>>>

These shoukd be a helper method so one does not need to learn the internals of the implementation and simply pass-in a list of tuples and be done with instantiation of a GenomicRanges object.

My data are in the following format, some examples:

('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49207868, 49208160)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24])
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208292, 49208636)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24])
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24])
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208292, 49208453)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208292, 49208559)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24])
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208292, 49208636)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24])
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24])
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208292, 49208466)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])

I will have thousand of those and want to meaningfully merge the splice variants if they have exactly same exons except start1 and end5 values which do not have to be exactly same. The resulting unified splice variant(s) should retain these two values to accommodate the widest splice variant from the input.

Ideally I want to numsort by start and end positions, left to right. Please note this gene is on minus strand.

Thank you for any effort to make the package and its documentation more usefull to biologists.

@jkanche
Copy link
Member

jkanche commented Apr 12, 2024

Hi @mmokrejs, Thank you for opening this. I am not sure I follow the entire issue, and I'll try to answer from what I understand. Please correct me if I misunderstood something.

IRanges is to specify intervals. Its irrelevant to this class if the interval is on the positive strand or the negative strand. and yes in R one can specify intervals using a combination of (start, width) or (start, end) pairs. But in the Python implementation this is limited to (start, width) pairs.

So the first step for your usecase is to parse the rows and extract the intervals

data_raw = [('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49207868, 49208160)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208292, 49208636)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208292, 49208453)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24]),
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208292, 49208559)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208292, 49208636)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208292, 49208466)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])]

regions = []
exon_names = []
widths = []
starts = []

for dr in data_raw:
    widths.extend([abs(x[0] - x[1]) for x in dr[3]])
    starts.extend([x[0] if x[1] > x[0] else x[1] for i, x in enumerate(dr[3])])
    exon_names.extend([dr[1]] * len(dr[3]))
    regions.extend([dr[0]] * len(dr[3]))

Now that we have some of the information extracted, lets create a GenomicsRanges object,

from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random

gr = GenomicRanges(
    seqnames=[
        "chr11"
    ] * len(starts),
    ranges=IRanges(start=starts, width=widths),
    strand=["-"] * len(starts),
    mcols=BiocFrame(
        {
            "exon": exon_names,
            "region": regions
        }
    ),
)

print(gr)

## OUTPUT
GenomicRanges with 26 ranges and 26 metadata columns
     seqnames              ranges          strand        exon       region
        <str>           <IRanges> <ndarray[int8]>      <list>       <list>
 [0]    chr11 49200301 - 49200441               - | DA340864; NC_000011.10
 [1]    chr11 49206067 - 49206172               - | DA340864; NC_000011.10
 [2]    chr11 49207868 - 49208160               - | DA340864; NC_000011.10
          ...                 ...             ... |       ...          ...
[23]    chr11 49200255 - 49200441               - | DA308616; NC_000011.10
[24]    chr11 49206067 - 49206172               - | DA308616; NC_000011.10
[25]    chr11 49208292 - 49208466               - | DA308616; NC_000011.10
------
seqinfo(1 sequences): chr11

Lets reduce these intervals into non-overlapping regions,

print(gr.reduce(with_reverse_map=True))

## OUTPUT
GenomicRanges with 5 ranges and 5 metadata columns
    seqnames              ranges          strand                          revmap
       <str>           <IRanges> <ndarray[int8]>                          <list>
[0]    chr11 49192792 - 49192894               - |                       [9, 22]
[1]    chr11 49200255 - 49200441               - | [10, 23, 6, 19, 13, 3, 0, 16]
[2]    chr11 49206067 - 49206172               - | [1, 4, 7, 11, 14, 17, 20, 24]
[3]    chr11 49207868 - 49208160               - |                           [2]
[4]    chr11 49208292 - 49208636               - |    [12, 25, 8, 15, 21, 5, 18]
------
seqinfo(1 sequences): chr11

revmap contains indices from the original GenomicRanges object that map to the reduced intervals.

By default the reduce method computes non-overlapping intervals for every unique (seqname, strand) combination in the object, similar to bioconductor's implementation. Let me know if this is incorrect to what you were expecting. It would also help if you can provide R snippets, so it helps me with debugging and tests.

@mmokrejs
Copy link
Contributor Author

mmokrejs commented Apr 12, 2024

Hi @jkanche , thank you for your excellent answer, I edited the code to match better biological terms and I fixed/swapped the coordinates of the trailing exons (my mistake in the initial example).

data_raw = [
('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49208160, 49207868)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208636, 49208292)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208559, 49208292)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208453, 49208292)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24]),
('NC_000011.10', 'DA259004;', None, [(32091250, 32091265), (81783019, 81783034), (89639236, 89639342), (89645003, 89645189), (89652496, 89652598)], [(116, 131), (145, 160), (162, 268), (269, 455), (456, 558)], ['100%', '100%', '100%', '98%', '99%'], ['==', '==', '->', '->', None], [None, None, '(GT/AG)', '(GT/AG)', None], [None, None, 24, 24, 24]),
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208559, 49208292)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208636, 49208292)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208559, 49208292)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208466, 49208292)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])
]

chromosome_accessions = []
mRNA_accessions = []
widths = []
starts = []

for dr in data_raw:
    widths.extend([abs(x[0] - x[1]) for x in dr[3]])
    starts.extend([x[0] if x[1] > x[0] else x[1] for i, x in enumerate(dr[3])]) # this will break if a gene spans origin (zero) in a circular chromosome
    mRNA_accessions.extend([dr[1] for splice_var in data_raw])
    chromosome_accessions.extend([dr[0] for splice_var in data_raw])


from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
import string

def generate_exon_names(count):
    names = []
    for letter1 in string.ascii_uppercase:
        names.append('exon_%s' % letter1)
    if count > len(string.ascii_uppercase):
        for letter1 in string.ascii_uppercase:
            for letter2 in string.ascii_uppercase:
                names.append('exon_%s%s' % (letter1, letter2))
    return names[:count]

splice_variants = [(i, splice_var[3]) for i, splice_var in enumerate(data_raw)]
strands = []
for splice_variant in splice_variants:
    #print("splice variant: %s" % str(splice_variant))
    i, exons = splice_variant
    #print("All exons %s" % str(exons))
    for exon in exons:
        #print("Individual exon %s" % str(exon))
        if data_raw[i][2] == '(complement)':
            strands.append('-')
        elif data_raw[i][2] is None:
            strands.append('+')
        else:
            raise("Unexpected strand: %s" % str(data_raw[i][2]))

  
gr = GenomicRanges(
    seqnames=[
        "chr11"
    ] * len(starts),
    ranges=IRanges(start=starts, width=widths),
    strand=strands,
    mcols=BiocFrame(
        {
            "accession": mRNA_accessions,
            "exon": generate_exon_names(len(starts)),
            "chr_accession": chromosome_accessions
        }
    ),
)

print(gr)

Here I show which ot the splice variants can be merged:

[0]                         (49200441, 49200301), (49206172, 49206067), (49208160, 49207868)
[1]                         (49200441, 49200294), (49206172, 49206067), (49208636, 49208292)
[2]                         (49200441, 49200263), (49206172, 49206067), (49208559, 49208292)  can be merged with data_raw[6]
[3] (49192894, 49192792),   (49200441, 49200255), (49206172, 49206067) cannot be merged with data_raw[7] because 49192792 != 49192799
[4]                         (49200441, 49200267), (49206172, 49206067), (49208559, 49208292)
[5]                         (49200441, 49200318), (49206172, 49206067), (49208636, 49208292)
[6]                         (49200441, 49200263), (49206172, 49206067), (49208559, 49208292)
[7] (49192894, 49192799),   (49200441, 49200255), (49206172, 49206067), (49208466, 49208292)
>>> print(gr)
GenomicRanges with 31 ranges and 31 metadata columns
     seqnames              ranges          strand   accession    exon chr_accession
        <str>           <IRanges> <ndarray[int8]>      <list>  <list>        <list>
 [0]    chr11 49200301 - 49200441               - | DA340864;  exon_A  NC_000011.10
 [1]    chr11 49206067 - 49206172               - | DA340864;  exon_B  NC_000011.10
 [2]    chr11 49207868 - 49208160               - | DA340864;  exon_C  NC_000011.10
          ...                 ...             ... |       ...     ...           ...
[28]    chr11 49200255 - 49200441               - | DA308616; exon_AC  NC_000011.10
[29]    chr11 49206067 - 49206172               - | DA308616; exon_AD  NC_000011.10
[30]    chr11 49208292 - 49208466               - | DA308616; exon_AE  NC_000011.10
------
seqinfo(1 sequences): chr11
>>> print(gr.reduce(with_reverse_map=True))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/foo/genomicranges/lib/python3.11/site-packages/genomicranges/GenomicRanges.py", line 1561, in reduce
    output._mcols.set_column("revmap", rev_map, in_place=True)
  File "/foo/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py", line 936, in set_column
    return self.set_columns({column: value}, in_place=in_place)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/foo/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py", line 964, in set_columns
    raise ValueError(
ValueError: Length of `value`, does not match the number of the rows,need to be 10 but provided 15.
>>> 

@jkanche
Copy link
Member

jkanche commented Apr 12, 2024

@mmokrejs I fixed this bug yesterday, you may need to update the package to 0.4.14.

@mmokrejs
Copy link
Contributor Author

mmokrejs commented Apr 12, 2024

OK, 0.4.14 fixed the 0.4.12 issue. So now I have:

>>> print(gr)
GenomicRanges with 31 ranges and 31 metadata columns
     seqnames              ranges          strand   accession    exon chr_accession
        <str>           <IRanges> <ndarray[int8]>      <list>  <list>        <list>
 [0]    chr11 49200301 - 49200441               - | DA340864;  exon_A  NC_000011.10
 [1]    chr11 49206067 - 49206172               - | DA340864;  exon_B  NC_000011.10
 [2]    chr11 49207868 - 49208160               - | DA340864;  exon_C  NC_000011.10
          ...                 ...             ... |       ...     ...           ...
[28]    chr11 49200255 - 49200441               - | DA308616; exon_AC  NC_000011.10
[29]    chr11 49206067 - 49206172               - | DA308616; exon_AD  NC_000011.10
[30]    chr11 49208292 - 49208466               - | DA308616; exon_AE  NC_000011.10
------
seqinfo(1 sequences): chr11
>>> print(gr.reduce(with_reverse_map=True))
GenomicRanges with 10 ranges and 10 metadata columns
    seqnames              ranges          strand                          revmap
       <str>           <IRanges> <ndarray[int8]>                          <list>
[0]    chr11 32091250 - 32091265               + |                          [13]
[1]    chr11 81783019 - 81783034               + |                          [14]
[2]    chr11 89639236 - 89639342               + |                          [15]
[3]    chr11 89645003 - 89645189               + |                          [16]
[4]    chr11 89652496 - 89652598               + |                          [17]
[5]    chr11 49192792 - 49192894               - |                       [9, 27]
[6]    chr11 49200255 - 49200441               - | [10, 28, 6, 24, 18, 3, 0, 21]
[7]    chr11 49206067 - 49206172               - | [1, 4, 7, 11, 19, 22, 25, 29]
[8]    chr11 49207868 - 49208160               - |                           [2]
[9]    chr11 49208292 - 49208636               - |    [12, 30, 8, 20, 26, 5, 23]
------
seqinfo(1 sequences): chr11
>>> 

But, I did not want to have isolated list of exons but to retain compatible (actually existing) range combinations . Assumably the 8 items from data_raw would shrink down to 7 items. Let me think what the reduce actually did.

@mmokrejs
Copy link
Contributor Author

mmokrejs commented Apr 12, 2024

Basically, it does not do what I need. It should act on the lists of ranges (read on the fragments of splice variants defined by a combination of exonic ranges). The very first and very last exon might be incomplete, hence why I initially wrote that start1 and end5 positions are not important for the decision if the two listings of ranges can be merged (in the virtual example I initially spotted both splice variants consist of 5 exons). If we speak of this 5-exonic example and being on minus strand of the genome, the start1 being greater and endX being smaller should be incorporated after the union. All the internal exon junctions must be exactly same (read the start[2,3,4,5] must match exactly in the pair of splice variants to be eventually unified together and likewise, end[1,2,3,4] must match exactly. In brief, the inner start/end values must match exactly while the very first start and very last end values may differ and the widest result of the union should be output as a result. Am I clearer now?

@mmokrejs
Copy link
Contributor Author

The abs(x[0] - x[1]) is asking for a trouble. It would be much better if python crashed somewhere on a negative value. Moreover I think x[0] - x[1] + 1 should be there if x[0] > x[1] else x[1] - x[0] + 1. This will break anyway if the range crosses zero (bacterial circular chromosome origin). The (start, width) definition is bad and error-prone that it should be dismounted.

@LTLA
Copy link
Member

LTLA commented Apr 12, 2024

Chiming in here.

I find it very unusual to see start positions to be greater than the end position, irrespective of the strand. Certainly this is not the case for any standard formats (GFF, BED), nor is it the case for the R GRanges (aside from the corner case of zero-width ranges due to the closed nature of the interval definition). We will not be supporting this.

As for the (start, width) choice; well, you either have to specify the width or the end to define an interval, and one is as good as the other. Your background may suggest to you that end is a "obvious" choice, but this opinion is not necessarily shared. For example, I regularly use width instead of end when handling interval data, because the use of unsigned integer widths guarantees the validity of the interval without requiring a check for end >= start. Both the R and Python implementations also use width under the hood to represent IRanges, so it is natural for the constructor to accept width rather than end if the former is already available. You could argue that it may be more user-friendly to support end in the IRanges constructor, seeing as how all of the standard formats use end rather than width; I wouldn't disagree with that, but that would be in addition to the existing width= option.

As for the question of what you're trying to do: I don't think the GRanges generic functions, either here or in the R package, will give you an easy "one-click" solution. Off the top of my head; it would be something like:

  1. find_overlaps with type="equal" (note: not yet supported in the Python version) on the GRanges containing the exons, to find all groups of shared exons across splice variants.
  2. For each variant, iterate across its exons, extract the set of variants with the same exon, and find the intersection. Each variant in the intersection contains a superset of the current variant's exons.
  3. Decide which ones you want to keep.

Seems fairly complex.

@mmokrejs
Copy link
Contributor Author

mmokrejs commented Apr 15, 2024

Hi @LTLA ,
thank you for joining in.

I find it very unusual to see start positions to be greater than the end position, irrespective of the strand. Certainly this is not

About half of the genes in a genome are on a minus strand.

the case for any standard formats (GFF, BED), nor is it the case for the R GRanges (aside from the corner case of zero-width
ranges due to the closed nature of the interval definition). We will not be supporting this.

Please realize the first exon (exon A) is on the far right, while last exon is on far left. Start of an exon will always be where it actually starts, that is where a ribosome starts to synthesize a protein. You cannot change it hoping that biological reality will adjust to broken programmers assumptions. There is no way to enumerate the exons from "left" nor number them from lower to higher coordinate positions. Take the below as authoritative example of a gene annotated on a minus strand. Notably, the exonic start positions are > those of end and it cannot be any other way.

http://www.ensembl.org/Homo_sapiens/Transcript/Exons?g=ENSG00000086205;r=11:49146636-49208532;t=ENST00000356696

Screenshot 2024-04-15 at 13-25-33 Transcript ENST00000356696 7 (FOLH1-203) - Exons - Homo_sapiens - Ensembl genome browser 111

ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000356696.csv
ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000340334.csv
ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000256999.csv

As for the (start, width) choice; well, you either have to specify the width or the end to define an interval, and one is as good as the other. Your background may suggest to you that end is a "obvious" choice, but this opinion is not necessarily shared.

When you parse any annotation format like GBK or GFF, the coordinates of exons and of genes in general are start and end positions. The width must be calculated as an extra work and moreover, is prone to mistakes if there is uncertainty whether the width should be applied rightwards or leftwards from the start.

For example, I regularly use width instead of end when handling interval data, because the use of unsigned integer widths guarantees the validity of the interval without requiring a check for end >= start. Both the R and Python implementations also

This is exactly why I think start, width should be deprecated. Those asking for a trouble may continue using it but assumingly those working with minus strand annotation will burn their hands sooner or later.

use width under the hood to represent IRanges, so it is natural for the constructor to accept width rather than end if the

Propagation of broken underlying design does not justify anything. This leads to nowhere, like if we discuss 0-based numbering still being retained in python as it is compatible with the slicing syntax and in BED format.

former is already available. You could argue that it may be more user-friendly to support end in the IRanges constructor, seeing as how all of the standard formats use end rather than width; I wouldn't disagree with that, but that would be in addition to the existing width= option.

As for the question of what you're trying to do: I don't think the GRanges generic functions, either here or in the R package, will give you an easy "one-click" solution. Off the top of my head; it would be something like:

1. `find_overlaps` with `type="equal"` (note: not yet supported in the Python version) on the `GRanges` containing the exons, to find all groups of shared exons across splice variants.

2. For each variant, iterate across its exons, extract the set of variants with the same exon, and find the intersection. Each variant in the intersection contains a superset of the current variant's exons.

3. Decide which ones you want to keep.

Seems fairly complex.

Each genome annotation/prediction website tries to merge compatible splice variants to a minimum. It is not difficult, just checking if the inner exon junction have exactly same values and once they differ, the two variants in question cannot be merged. Alternatively, if all the inner exon junctions are same then union of the inner exons and (pre/ap)pend those wider of the first and last exons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants