Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for null intervals #59

Closed
mtivadar opened this issue Aug 22, 2017 · 5 comments
Closed

support for null intervals #59

mtivadar opened this issue Aug 22, 2017 · 5 comments
Labels
Q&A Questions and answers

Comments

@mtivadar
Copy link

it should be nice to support null intervals, like [4, 4]

@chaimleib
Copy link
Owner

chaimleib commented Sep 12, 2017

There have been many requests to include points within the range search. I tried doing this once, but the algorithms for traversing and modifying the tree really depend on intervals containing some range. So, if you would like to include 4 in your interval tree, you could add the interval [4, 5), which would start at 4 inclusively and end at 5 without including it, which is how all Intervals work in the IntervalTree. They include the start value, but not the end value.

This behavior is analogous to string or array indexing:

>>> s = "abcdef"
>>> s[4:5]
"e"
>>> s[4:4]
""

>>> a = [0, 1, 2, 3, 4, 5]
>>> a[4:5]
[4]
>>> a[4:4]
[]

If your range limits are not restricted to discrete values, for example, they can be reals with fractional parts, then for now, I would suggest coding around sortedcontainers.SortedDict. This is how I would like to add point support in a future release, but to a new class with a different name, since this is different in behavior from the current interval tree.

@kotoroshinoto
Copy link

kotoroshinoto commented Apr 11, 2018

in the case where left is included and right is excluded a position described as i:i would refer to the location between i-1 and i, not including either i-1 or i themselves.

to give context for this, a description of an insertion mutation in a genome would refer to a position between nucleotides as the location of origin for the variant, because that "range" doesn't exist in the reference sequence, but rather has a 'location' that lies between existing positions

such a position would NOT be encompassed by ranges that touch either side of it, but would be if they crossed over it.

lets take position 3:3 for instance. It is a "between" position that lies in the potential insertion gap between 2 and 3, not including 2 OR 3. It would NOT overlap with either 1:3 (which technically ends at 2 if we're dealing only with integers) or even with 3:6 (because this range begins at 3 itself, not the gap between 2 and 3).

2:3 would overlap it, 3:3 explicitly would as well.

@chaimleib
Copy link
Owner

chaimleib commented Dec 17, 2018

In order to support using floats and other non-integer ranges, Intervals include the lower bound, but not the upper bound, so Interval(2, 3) is 2 ≤ x < 3. This is a deep dependency in the search, insertion and balancing algorithms, and other methods of IntervalTree.

If you tried to create Interval(3, 3), you would get 3 ≤ x < 3. There is no x that satisfies this inequality, since x = 3 won't satisfy x < 3. The interval includes no points, and is equivalent to the null set.

If I were to make a new kind of tree where point searches did not include boundaries, weird and awkward behavior would result. For example, if I asked it "What events do I have at 3 o'clock?", an event beginning at 3 would not appear.

There do exist other libraries that support all types of intervals, inclusive or not inclusive of either the lower or upper bound. However, the added flexibility results in a very complicated API and even more complicated internal logic for dealing with intervals. I opted to only support half-open intervals, as defined above.

@chaimleib chaimleib pinned this issue Dec 18, 2018
@chaimleib chaimleib added the Q&A Questions and answers label Mar 5, 2019
@gbuckholtz
Copy link

gbuckholtz commented Jun 4, 2020

It is useful for me to have null intervals in the context I am using the intervaltree. I am making an interval tree of CIDR format IP addresses. 1.0.254.0/23 for example. There is a degenerate case of 1.0.254.0/32 that boils down to a single IP address rather than a range.

I did read your first reply and bumped the upper range of the degenerate case to be [lower:lower+1] to set the interval.

@GillesJ
Copy link

GillesJ commented Jun 5, 2020

Handling of null intervals is a better fit conceptually where the intervals represent unit positions in data sets.
My use-case is in natural language processing where sequences of words in text are labelled. Often you want to assess if annotations in running text overlap. Any text dataset has a lot of one-word length annotations. [Word spans] are typically represented by intervals of word positions, e.g. [0,1] = [Word spans]. As mentioned, you can transform the positional representation to boundary representation by "end_position + 1" in combination with exclusive right boundary matching.
However the programmer has to always be cognizant of this representative transformation.

I used to use Sympy for text problems because it is a better fit conceptually as it casts null intervals to FiniteSet objects. But Sympy is way too slow and memory intensive on large datasets. Intervaltree.py is fast and light, so I will keep using this package over Sympy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Q&A Questions and answers
Projects
None yet
Development

No branches or pull requests

5 participants