Conditional merge_neighbors #130

Open
tthorleifson opened this issue Jun 28, 2023 · 1 comment

Comments

@tthorleifson

The current merge_neighbors is distance-based, which makes it useful only for the narrow use case of dealing with "sliver" intervals, essentially by merging them into adjacent, longer intervals. A more generally useful behavior would be to merge adjacent intervals based on some condition evaluated against the adjacent intervals' data objects. For instance, consider a case where the data objects are dictionaries. Under a conditional merge_neighbors, two adjacent intervals would be merged only when the values for a given key in their respective data objects are equal to each other, or equal to some specified value. This type of conditional merge is independent of interval length; it depends instead on the data content of adjacent intervals.
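
As a sketch of what such a condition might look like in code, it could simply be a small predicate over the data dictionaries of two adjacent intervals; the key name "Category" below is only a placeholder, not anything defined by the library:

def same_value_for_key(data_a, data_b, key="Category"):
    # Merge two neighbors only when both data dictionaries carry the same value for the key.
    return data_a.get(key) == data_b.get(key)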

This type of conditional merge can currently be performed, but only with rather cumbersome and clunky code. It is necessary to iterate over the sorted intervals of a tree, comparing the data object of the current interval to that of the previous interval; if the condition is met, delete both intervals, construct a new interval that covers the span of the two deleted intervals, build a new data object for the "merged" interval from the deleted intervals' data objects, and insert the "merged" interval back into the tree. It is tedious and not particularly performant. Adding the ability to specify a merge condition in the arguments to merge_neighbors would be very helpful.
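
To make that workaround concrete, here is a minimal sketch of a generic helper along those lines. The name merge_neighbors_if is hypothetical (not part of the library); it assumes the tree's intervals are non-overlapping and contiguous where merging is wanted, and it builds a new tree rather than modifying the input in place:

from intervaltree import Interval, IntervalTree

def merge_neighbors_if(tree, condition, data_combiner):
    # Hypothetical helper, not part of intervaltree.
    # condition(prev_data, curr_data) -> bool decides whether two adjacent intervals merge.
    # data_combiner(prev_data, curr_data) -> combined data object for the merged interval.
    merged = IntervalTree()
    current = None  # The interval currently being accumulated.
    for iv in sorted(tree):
        if current is not None and iv.begin == current.end and condition(current.data, iv.data):
            # Extend the accumulated interval and combine the two data objects.
            current = Interval(current.begin, iv.end, data_combiner(current.data, iv.data))
        else:
            if current is not None:
                merged.add(current)
            current = iv
    if current is not None:
        merged.add(current)
    return merged

With the predicate sketched earlier, a call could then look like merge_neighbors_if(tree, same_value_for_key, lambda a, b: a), keeping the first interval's data for each merged span.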

@tthorleifson
Author

What follows is a code sample of doing a conditional merge with data aggregation. This is a pipeline example. Consider the pipeline, "Route 100." It has two High Consequence Areas (HCAs) along its length, and no Moderate Consequence Areas (MCAs). (HCAs and MCAs are those portions of a natural gas transmission pipeline where the consequences of a release could be unpleasant, due to the presence of people and the possibility of ignition.) There have also been eight leaks recorded over time along the line. Our task is to summarize the count of leaks that have occurred within each HCA range. The IntervalTree object is well suited to performing this type of analysis.

Note that we are using Python dictionaries as the data objects for the interval objects. This is fine for this simple example, but in actual practice, when dealing with thousands of miles of pipelines in a large system, using pandas DataFrames would make more sense.

The example shown here comes from a Jupyter notebook, attached below (zipped), along with a simple pictorial representation of the data:

(https://github.com/chaimleib/intervaltree/files/11923340/IntervalTree.Point.Dissolve.Test.zip)
[Image: Route 100 HCAs and Leaks]

"""
Import IntervalTree classes.
"""
from intervaltree import Interval, IntervalTree

"""
Build the HCA interval tree.
"""
hca_interval_tree = IntervalTree()
hca_interval_tree[0:20000] = {"HCAType": "Not HCA or MCA"}
hca_interval_tree[20000:28000] = {"HCAType": "HCA"}
hca_interval_tree[28000:36000] = {"HCAType": "Not HCA or MCA"}
hca_interval_tree[36000:48000] = {"HCAType": "HCA"}
hca_interval_tree[48000:65000] = {"HCAType": "Not HCA or MCA"}
print(hca_interval_tree)

Print output:
IntervalTree([Interval(0, 20000, {'HCAType': 'Not HCA or MCA'}), Interval(20000, 28000, {'HCAType': 'HCA'}), Interval(28000, 36000, {'HCAType': 'Not HCA or MCA'}), Interval(36000, 48000, {'HCAType': 'HCA'}), Interval(48000, 65000, {'HCAType': 'Not HCA or MCA'})])

"""
Build the Leaks interval tree.
Note that IntervalTree does not accept true "point events." Point events are considered to be Null intervals, 
and that's a no-no.
All intervals in a tree must have a finite length greater than 0. The work around here is create one foot long linear events.
This is not ideal, and you'll have to be careful with point events at the end of a line, remembering to subtract 1, 
rather than adding 1.
"""
leaks_interval_tree = IntervalTree()
leaks_interval_tree[4000:4001] = {"LeakID": "Leak 1"}
leaks_interval_tree[10000:10001] = {"LeakID": "Leak 2"}
leaks_interval_tree[22000:22001] = {"LeakID": "Leak 3"}
leaks_interval_tree[27000:27001] = {"LeakID": "Leak 4"}
leaks_interval_tree[37000:37001] = {"LeakID": "Leak 5"}
leaks_interval_tree[40000:40001] = {"LeakID": "Leak 6"}
leaks_interval_tree[45000:45001] = {"LeakID": "Leak 7"}
leaks_interval_tree[54000:54001] = {"LeakID": "Leak 8"}
print(leaks_interval_tree)

Print output:
IntervalTree([Interval(4000, 4001, {'LeakID': 'Leak 1'}), Interval(10000, 10001, {'LeakID': 'Leak 2'}), Interval(22000, 22001, {'LeakID': 'Leak 3'}), Interval(27000, 27001, {'LeakID': 'Leak 4'}), Interval(37000, 37001, {'LeakID': 'Leak 5'}), Interval(40000, 40001, {'LeakID': 'Leak 6'}), Interval(45000, 45001, {'LeakID': 'Leak 7'}), Interval(54000, 54001, {'LeakID': 'Leak 8'})])
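
As a brief aside on the end-of-line caveat mentioned above: a leak recorded exactly at station 65000 (the very end of Route 100) could not be represented as [65000, 65001) without spilling past the route extent, so its one-foot stand-in interval would be anchored backward instead. This is only an illustration, not part of the Route 100 data:

"""
Illustration of the end-of-line caveat (not part of the example data above).
"""
end_of_line_example = IntervalTree()
end_of_line_example[64999:65000] = {"LeakID": "Hypothetical leak at station 65000"}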

"""
Build the Route interval tree.
"""
route_interval_tree = IntervalTree()
route_interval_tree[0:65000] = {"RouteID": "Route 100"}
print(route_interval_tree)

Print output:
IntervalTree([Interval(0, 65000, {'RouteID': 'Route 100'})])

"""
Rock some dynamic segmentation.
First, union all the trees.
"""
dynseg_tree = route_interval_tree | leaks_interval_tree
dynseg_tree |= hca_interval_tree
"""
Now, dynseg.
"""
dynseg_tree.split_overlaps()
print(dynseg_tree)

Print output:
IntervalTree([Interval(0, 4000, {'HCAType': 'Not HCA or MCA'}), Interval(0, 4000, {'RouteID': 'Route 100'}), Interval(4000, 4001, {'LeakID': 'Leak 1'}), Interval(4000, 4001, {'RouteID': 'Route 100'}), Interval(4000, 4001, {'HCAType': 'Not HCA or MCA'}), Interval(4001, 10000, {'RouteID': 'Route 100'}), Interval(4001, 10000, {'HCAType': 'Not HCA or MCA'}), Interval(10000, 10001, {'LeakID': 'Leak 2'}), Interval(10000, 10001, {'RouteID': 'Route 100'}), Interval(10000, 10001, {'HCAType': 'Not HCA or MCA'}), Interval(10001, 20000, {'RouteID': 'Route 100'}), Interval(10001, 20000, {'HCAType': 'Not HCA or MCA'}), Interval(20000, 22000, {'RouteID': 'Route 100'}), Interval(20000, 22000, {'HCAType': 'HCA'}), Interval(22000, 22001, {'RouteID': 'Route 100'}), Interval(22000, 22001, {'HCAType': 'HCA'}), Interval(22000, 22001, {'LeakID': 'Leak 3'}), Interval(22001, 27000, {'RouteID': 'Route 100'}), Interval(22001, 27000, {'HCAType': 'HCA'}), Interval(27000, 27001, {'RouteID': 'Route 100'}), Interval(27000, 27001, {'LeakID': 'Leak 4'}), Interval(27000, 27001, {'HCAType': 'HCA'}), Interval(27001, 28000, {'RouteID': 'Route 100'}), Interval(27001, 28000, {'HCAType': 'HCA'}), Interval(28000, 36000, {'RouteID': 'Route 100'}), Interval(28000, 36000, {'HCAType': 'Not HCA or MCA'}), Interval(36000, 37000, {'RouteID': 'Route 100'}), Interval(36000, 37000, {'HCAType': 'HCA'}), Interval(37000, 37001, {'RouteID': 'Route 100'}), Interval(37000, 37001, {'HCAType': 'HCA'}), Interval(37000, 37001, {'LeakID': 'Leak 5'}), Interval(37001, 40000, {'RouteID': 'Route 100'}), Interval(37001, 40000, {'HCAType': 'HCA'}), Interval(40000, 40001, {'RouteID': 'Route 100'}), Interval(40000, 40001, {'HCAType': 'HCA'}), Interval(40000, 40001, {'LeakID': 'Leak 6'}), Interval(40001, 45000, {'RouteID': 'Route 100'}), Interval(40001, 45000, {'HCAType': 'HCA'}), Interval(45000, 45001, {'RouteID': 'Route 100'}), Interval(45000, 45001, {'LeakID': 'Leak 7'}), Interval(45000, 45001, {'HCAType': 'HCA'}), Interval(45001, 48000, {'RouteID': 'Route 100'}), Interval(45001, 48000, {'HCAType': 'HCA'}), Interval(48000, 54000, {'HCAType': 'Not HCA or MCA'}), Interval(48000, 54000, {'RouteID': 'Route 100'}), Interval(54000, 54001, {'HCAType': 'Not HCA or MCA'}), Interval(54000, 54001, {'RouteID': 'Route 100'}), Interval(54000, 54001, {'LeakID': 'Leak 8'}), Interval(54001, 65000, {'HCAType': 'Not HCA or MCA'}), Interval(54001, 65000, {'RouteID': 'Route 100'})])

"""
Now, merge the overlaps, combining the data objects into a single dictionary for each unique interval.
The combine function simply combines the dictionary objects for the overlapping intervals into a single dictionary 
for a given unique interval in the output.
Note that although we used merge_overlaps here, merge_equals would work just as well.
"""
def combine(a,b):
    return a | b
dynseg_tree.merge_overlaps(data_reducer=combine, data_initializer={})
print (dynseg_tree)

Print output:
IntervalTree([Interval(0, 4000, {'HCAType': 'Not HCA or MCA', 'RouteID': 'Route 100'}), Interval(4000, 4001, {'LeakID': 'Leak 1', 'RouteID': 'Route 100', 'HCAType': 'Not HCA or MCA'}), Interval(4001, 10000, {'RouteID': 'Route 100', 'HCAType': 'Not HCA or MCA'}), Interval(10000, 10001, {'LeakID': 'Leak 2', 'RouteID': 'Route 100', 'HCAType': 'Not HCA or MCA'}), Interval(10001, 20000, {'RouteID': 'Route 100', 'HCAType': 'Not HCA or MCA'}), Interval(20000, 22000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(22000, 22001, {'RouteID': 'Route 100', 'HCAType': 'HCA', 'LeakID': 'Leak 3'}), Interval(22001, 27000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(27000, 27001, {'RouteID': 'Route 100', 'LeakID': 'Leak 4', 'HCAType': 'HCA'}), Interval(27001, 28000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(28000, 36000, {'RouteID': 'Route 100', 'HCAType': 'Not HCA or MCA'}), Interval(36000, 37000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(37000, 37001, {'RouteID': 'Route 100', 'HCAType': 'HCA', 'LeakID': 'Leak 5'}), Interval(37001, 40000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(40000, 40001, {'RouteID': 'Route 100', 'HCAType': 'HCA', 'LeakID': 'Leak 6'}), Interval(40001, 45000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(45000, 45001, {'RouteID': 'Route 100', 'LeakID': 'Leak 7', 'HCAType': 'HCA'}), Interval(45001, 48000, {'RouteID': 'Route 100', 'HCAType': 'HCA'}), Interval(48000, 54000, {'HCAType': 'Not HCA or MCA', 'RouteID': 'Route 100'}), Interval(54000, 54001, {'HCAType': 'Not HCA or MCA', 'RouteID': 'Route 100', 'LeakID': 'Leak 8'}), Interval(54001, 65000, {'HCAType': 'Not HCA or MCA', 'RouteID': 'Route 100'})])
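
One small portability note on the combine function above: the dictionary union operator (a | b) requires Python 3.9 or newer. On older versions, an equivalent combiner can be written with dict unpacking, where later values likewise win on key conflicts:

"""
Equivalent combiner for Python versions older than 3.9, where dicts do not support the | operator.
"""
def combine_pre39(a, b):
    return {**a, **b}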

"""
Dissolve back to the original HCA range intervals. Aggregate the count of leaks in each HCA range interval. 
The methodology utilized here is very similar to the data_reducer functionality in merge_overlaps, in that it functions 
as a data accumulator. However, instead of working across a series of vertical intervals with the same begin and end, 
we are working horizontally across adjacent intervals in the tree, merging adjacent intervals that have the same 
HCAType value.
Note that because leaks are point events, the LeakID key appears only in the dictionaries of those intervals that 
actually contain a leak. So, you have to be mindful of that when accumulating leak counts by HCA range. 
For our purposes, simply checking for presence of a Leak feature data key suffices.
"""

"""
Initialize the global variables to be accessed within the merge_neighbors_conditional function.
"""
dissolve_tree = IntervalTree() # This tree will contain our dissolve results.
begin_prev_interval = dynseg_tree.begin() # Begin coordinate of the interval currently being accumulated.
end_prev_interval = dynseg_tree.begin() # Running end coordinate of the interval currently being accumulated.
prev_interval_hca_type = sorted(dynseg_tree)[0].data.get("HCAType") # HCAType of the interval currently being accumulated.
leak_count = 0 # Running leak count for the interval currently being accumulated.

def merge_neighbors_conditional(curr_interval, output_tree): # Define the adjacent-interval dissolve function.
    global begin_prev_interval
    global end_prev_interval
    global prev_interval_hca_type
    global leak_count
    end_prev_interval = curr_interval.begin
    curr_interval_hca_type = curr_interval.data.get("HCAType")
    if curr_interval_hca_type != prev_interval_hca_type: # HCAType changed: write out the accumulated interval to the output tree...
        output_tree[begin_prev_interval:end_prev_interval] = {"HCAType": prev_interval_hca_type, "LeakCount": leak_count}
        # ...then reinitialize the accumulators for the next merged interval.
        begin_prev_interval = curr_interval.begin
        end_prev_interval = curr_interval.end
        prev_interval_hca_type = curr_interval_hca_type
        if "LeakID" in curr_interval.data:
            leak_count = 1
        else:
            leak_count = 0
    else: # No change in HCAType, so just keep on accumulating.
        if "LeakID" in curr_interval.data:
            leak_count += 1
        end_prev_interval = curr_interval.end

"""
Process the dynseg tree, interval by interval, using the merge_neighbors_conditional function.
"""
for interval_obj in sorted(dynseg_tree):
    merge_neighbors_conditional(interval_obj, dissolve_tree)

"""
Write out the last interval to the dissolve tree.
"""
dissolve_tree[begin_prev_interval:end_prev_interval] = {"HCAType": prev_interval_hca_type, "LeakCount": leak_count}
print(dissolve_tree)

Here's what the final print output looks like:
IntervalTree([Interval(0, 20000, {'HCAType': 'Not HCA or MCA', 'LeakCount': 2}), Interval(20000, 28000, {'HCAType': 'HCA', 'LeakCount': 2}), Interval(28000, 36000, {'HCAType': 'Not HCA or MCA', 'LeakCount': 0}), Interval(36000, 48000, {'HCAType': 'HCA', 'LeakCount': 3}), Interval(48000, 65000, {'HCAType': 'Not HCA or MCA', 'LeakCount': 1})])
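
Tying this back to the feature request: if merge_neighbors accepted a neighbor-comparison condition alongside data_reducer, the manual dissolve above could collapse to a reducer plus a one-line condition. The "check" keyword below is purely hypothetical (it does not exist in the current API); it is shown only to sketch the shape such an argument might take:

"""
Purely hypothetical sketch: "check" is not a real merge_neighbors parameter today.
"""
def count_leaks(acc, curr):
    # Carry the HCAType forward and accumulate a leak count as neighbors are merged.
    return {"HCAType": curr.get("HCAType"),
            "LeakCount": acc.get("LeakCount", 0) + (1 if "LeakID" in curr else 0)}

# dynseg_tree.merge_neighbors(
#     data_reducer=count_leaks,
#     data_initializer={"LeakCount": 0},
#     check=lambda prev_data, curr_data: prev_data.get("HCAType") == curr_data.get("HCAType"))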

