extend edges method #2651

petrelharp · 2022-11-28T21:12:33Z

Here we (me and @hfr1tz3 and @avabamf) have implemented the "extend_edges" and related things in tskit.

Here's the description of what this does (copied from the draft docstring):

        Returns a new tree sequence in which the span covered by ancestral nodes
        is "extended" to regions of the genome according to the following rule:
        If an ancestral segment corresponding to node `n` has parent `p` and
        child `c` on some portion of the genome, and on an adjacent segment of
        genome `p` is the immediate parent of `c`, then `n` is inserted into the
        edge from `p` to `c`. This involves extending the span of the edges
        from `p` to `n` and `n` to `c` and reducing the span of the edge from
        `p` to `c`. Since the latter edge may be removed entirely, this process
        reduces (or at least does not increase) the number of edges in the tree
        sequence.
        The method works by iterating over the genome to look for edges that can
        be extended in this way; the maximum number of such iterations is
        controlled by ``max_iter``.
        The rationale is that we know that `n` carries a portion of the segment
        of ancestral genome inherited by `c` from `p`, and so likely carries
        the *entire* inherited segment (since the implication otherwise would
        be that distinct recombined segments were passed down separately from
        `p` to `c`).

python/tskit/trees.py

petrelharp · 2023-02-20T17:23:06Z

@hfr1tz3 I think you need to add .ipynb_checkpoints to the .gitignore (and un-do the adding of that last file); something like

cd python/tskit
git rm .ipynb_checkpoints/trees-checkpoint.py
echo ".ipynb_checkpoints" >> .gitignore
git add .gitignore
git commit -m 'remove checkpoint thing'

should do it.

python/tskit/.ipynb_checkpoints/trees-checkpoint.py

petrelharp · 2023-02-27T20:31:53Z

Nice fix! That looks like an annoying one to find. I have a suggestion; see my comment.

petrelharp · 2023-04-27T19:46:13Z

TODO: implement this test (it needs both forwards and backwards):

petrelharp · 2023-07-21T19:13:39Z

TODO:

test forwards + backwards python code
implement backwards in C also
move python code to testing
make TreeSequence method call the TableCollection method (which calls the C)
test again (which now calls C code)

petrelharp · 2023-07-21T22:47:00Z

Reminder: once c code is edited, to test changes locally you've got to run make in the python/ subdir. (And, currently C code isn't being run.)

python/tskit/trees.py

petrelharp · 2023-08-04T16:37:24Z

OK - I think this is ready to go.

FYI, we loop over 'edges in' and 'edges out', and it seems like perhaps we should be building the trees there (and thus using the new C tree iteration stuff), but I think not, actually, here's why: the way the algorithm works is it looks for single edges a->c that can be replaced (or at least shortened) by extending adjacent pairs of edges a->b->c; so it looks for places where two edges a->b->c are being removed and an edge a->c is being added; this is a question only about the 'edges in' and 'edges out', not about the entire tree. Some things we have to do are to "check if two edges form a chain (a->b->c)" and "check if the edge a->c is in the set of new edges (and hence in the tree)"; this would at least be simpler to do using the tree, and would be much more efficient if there were lots of edges to add and remove. However, it would make the code more complex, since we'd then have to check "is this a newly added edge", for instance.

Hm, so this does mean that this is O(k^3) where k is the number of edges added/removed in a tree transition; for msprime trees this is O(1) but for other trees it might not be.

petrelharp · 2023-08-05T05:26:59Z

I've had a good think about how to use the trees to make the algorithm not O(k^3), and it's possible but not obvious (to me yet). Unless it becomes obvious I'd like to wait to see if it's an problem in practice.

And, I don't know what's up with codecov? The actions say they've uploaded the results, e.g. to https://app.codecov.io/github/tskit-dev/tskit/commit/8973c7b86ada7a19decd9f88031ea045dfd95e6a but at that link it says that

Commit YAML is invalid
Coverage data is unable to be displayed, as the commit YAML appears to be invalid.

petrelharp · 2023-08-10T18:44:31Z

I'm not seeing the codecov report for some reason, but here it is for this branch:
https://app.codecov.io/github/tskit-dev/tskit/tree/extend_edges

TODO:

test the max_iter argument
test the 'tables not indexed' error https://app.codecov.io/github/tskit-dev/tskit/blob/extend_edges/c%2Ftskit%2Ftables.c#L12988

petrelharp · 2023-08-10T19:25:19Z

Note: to run the code on this branch locally (in, say, a different project):

make sure you've got this branch checked out
go to the python/ subdirectory
run make
run pip install .

This should replace tskit with this one; to make sure you're using the right version you can first uninstall tskit, so if it didn't work then it'll be obvious.

petrelharp · 2023-08-11T16:11:53Z

TODO:

add this to the docs
use the tree positioning stuff (see e.g. around l.60 in python/tests/test_tree_positioning.py for how to use it) to do the iteration

hfr1tz3 · 2023-08-15T20:57:36Z

I am trying to run new empirical tests with our edge extend, but I am getting an error for some reason.
TypeError: cannot unpack non-iterable TreeSequence object
We should clear this up on Thursday.

petrelharp · 2023-08-15T23:19:09Z

Can you post the code that produces that error, and/or the information above the error that says at whichi point the error occurs?

hfr1tz3 · 2023-08-16T02:26:11Z

Can you post the code that produces that error, and/or the information above the error that says at whichi point the error occurs?

In last_truth on our num_edges repository cell 3, I installed our extend edges method and tried to run the code. However, I think I may be just calling the function wrong. I will look into it more tomorrow

petrelharp · 2023-08-17T04:56:37Z

I have - after quite a bit of staring at things - modified the algorithm to use the new "tree position" method of iterating over trees. (The staring did result in better code and more robust tests, however.) However, I need some input here - probably from @jeromekelleher? So, right now (a) the python implementation in test_topology.py uses "tree position", but (b) the c version in tables.c uses I/O; and (c) in trees.c there is a C translation that I got most of the way through porting over before realizing I needed some guidance, namely:

The tsk_tree_position_t stuff needs a tree sequence, not just tables. This makes sense since it has to do with trees. However, tree sequence methods return a whole new tree sequence, while this algorithm pretty naturally modifies things more or less in place - it only makes a copy of the edge table. In fact, it rebuilds the edge table and the indexes a bunch of times. I feel like it is more natural for this reason to have extend edges be a table collection method (that operates in-place); but then to use the tree position thing we'll have to build a tree sequence within the function (and, tables.c doesn't even import trees.h). What's the right way to set this up?

jeromekelleher · 2023-08-17T07:08:29Z

I see where you're coming from @petrelharp, but I don't see the point of making this an in-place TableCollection method. Why not a TreeSequence method that returns a new TreeSequence? You need everything you need for the tables to be a valid TreeSequence anyway. This is never going to be as performance-critical as (say) simplify, so the extra copy incurred isn't going to amount to anything. I think that would simplify things a bit?

The implementation of tsk_treeseq_simplify is a good model for how to copy the tables, modify then (with some code in trees.c), and then pass ownership to the new returned tsk_treeseq_t object.

jeromekelleher

I haven't got my head around the algorithm fully, but it's much easier to understand now!

I vote for ditching the tsk_table_collection version - surely this is a premature optimisation?

c/tskit/trees.c

jeromekelleher · 2023-08-17T07:14:52Z

c/tskit/trees.c

+        remove_unextended(&edges_in_head, &edges_in_tail);
+        remove_unextended(&edges_out_head, &edges_out_tail);
+
+        for (tj = tree_pos.out.start; tj != tree_pos.out.stop; tj += sign_int) {


You could use

for (tj = tree_pos.out.start; tj != tree_pos.out.stop; tj += tree_pos.direction) { }

which is partly the intention of having direction done like it is.

petrelharp · 2023-08-17T13:04:14Z

I totally agree that the performance here isn't critical. But I do think that there's a conceptual hole here: how do we write a method that modifies tables in place and uses the trees to do so? (I mean, there's no technical barrier to that, but that's not how the API is set up.) A good example of something where that's exactly what we need to do is compute_mutation_parents, which uses the I/O method of constructing the trees.

petrelharp · 2023-08-17T13:08:46Z

It seems like the way to do this (eg in compute_mutation_parents) would be to (a) make a tree sequence from the tables; (b) do the stuff to the tables; (c) forget about the tree sequence, leaving just the tables. I just don't have my head entirely around whether that's the right way to go.

jeromekelleher · 2023-08-17T13:27:37Z

I totally agree that the performance here isn't critical. But I do think that there's a conceptual hole here: how do we write a method that modifies tables in place and uses the trees to do so? (I mean, there's no technical barrier to that, but that's not how the API is set up.) A good example of something where that's exactly what we need to do is compute_mutation_parents, which uses the I/O method of constructing the trees.

I think the number of things we need to do with the tables in place that require iterating over the trees will be pretty small, going forward. So, I'd vote we just leave stuff like compute_mutation_parents etc as they are, and concentrate on using the tree_pos stuff for things that are logically operations on a tree sequence.

We don't have to depend on the tree sequence for the tree_position thing --- we're really only using the breakpoints array, which we only really need for long-range jumping. But, I think the abstraction is getting very leaky when we do things on both the tree sequence and table collection, just for the sake of avoiding a copy, or creating a treeseq_t object (which can be as cheap as validating a set of tables has the required tree properties).

petrelharp · 2023-08-17T14:05:27Z

We don't have to depend on the tree sequence for the tree_position thing --- we're really only using the breakpoints array, which we only really need for long-range jumping. But, I think the abstraction is getting very leaky when we do things on both the tree sequence and table collection, just for the sake of avoiding a copy, or creating a treeseq_t object (which can be as cheap as validating a set of tables has the required tree properties).

Seems like then if we were to write something like compute_mutation_parents with tree_positions, then we'd create a treeseq_t object - that seems fine? I am pretty tempted to do things that way here, because (a) it'd take less re-plumbing, and (b) one of the use cases for this method is that it compresses the tree sequence.

petrelharp · 2023-08-17T14:12:50Z

I am pretty tempted to do things that way here, because (a) it'd take less re-plumbing, and (b) one of the use cases for this method is that it compresses the tree sequence.

Never mind, since we wouldn't be able to do the no-copy thing in python really anyhow.

petrelharp · 2023-08-17T16:00:06Z

Okay - this is ready for review, @jeromekelleher. valgrind reports a leak in the C tests related to the tables - I must be missing something silly there, but I am not sure what.

jeromekelleher · 2023-08-18T13:32:21Z

I've gone over this and refactored things a bit. The memory management at the C level was really quite tricky, but hopefully it's settled now.

Some high-level stuff:

There's some problem with mutations. I think you probably need to keep track of the edges that are extended, and then remap the mutations?
Testing is pretty weak. I think we need a few more fully worked examples where we have the input and the exact expected output, after one or two iterations. I know this is tedious, but it really pays off in terms of helping the code reader understand what's going on
We're missing some checks like "the other tables shouldn't be affected" (but see point above about mutations)

jeromekelleher

A few minor comments. I'm not sure what the right pattern will be in order to update the mutation table too, but I guess that'll take some figuring out first.

jeromekelleher · 2023-08-18T13:39:03Z

c/tskit/trees.c

+    tsk_id_t e, e1, e2, e_in;
+    tsk_blkalloc_t edge_list_heap;
+    double *near_side, *far_side;
+    edge_list_t *edges_in_head, *edges_in_tail;


If I was starting this from scratch, I would't bother with linked lists and do something like this instead:

tsk_bool_t extended = calloc(num_edges, sizeof(*extended)); tsk_id_t *edges_in = mallco(num_edges, sizeof(*edges_in)); tsk_id_t *edges_out = mallco(num_edges, sizeof(*edges_out)); tsk_size_t num_edge_in, num_edges_out;

and keep track of the lists like that. Repacking to get rid of unextended edges will be slightly trickier, but there's less faddling around with linked lists, memory allocation and so on, and it would almost certainly be more efficient.

But like I say - you might not want to get into this, since it's working.

c/tskit/trees.h

python/tests/test_extend_edges.py

jeromekelleher · 2023-08-18T13:51:01Z

One thing that might not be immediately obvious - I made a new file test_extend_edges.py

petrelharp · 2023-08-18T16:00:27Z

Oh riiiight, MUTATIONS. Thanks very much; I'll have a go at this.

petrelharp · 2023-08-18T16:12:59Z

Ah, I see about the memory management. Makes sense.

petrelharp · 2023-08-31T22:25:24Z

TODO: put mutations on the edges; proposed method:

for each edge to be postponed,
for each mutation on said edges,
figure out what edge the mutation should be on

petrelharp · 2023-09-17T21:39:11Z

Okay! Assuming I can get the CI errors to go away, I think this is ready to go:

I have re-written the algorithm so it is no longer O(n^3), where n is the number of edges differing between adjacent trees. Now it is O(n), in the usual case anyhow. This should make it more possible to apply it to Relate-produced tree sequences.
It does mutations now: it currently just errors if mutations do not have times, and directs the user to impute_mutation_times; we could always just keep mutations at the nodes they are assigned to, which would be equivalent to the (currently default) method of impute_mutation_times that gives mutations the time of their nodes; but it seems statistically much better to distribute mtuation times evenly over the edges, so we're making the user do that step explicitly.
More tests are possible, but I feel good about it the current set of them.
Migrations are a can of worms - really, they are additional information that constrain what edges can be extended and how -- so they are disallowed.

Edit: anyhow know what's going on with the doc build?

petrelharp · 2023-10-11T15:51:39Z

This is ready for review: @jeromekelleher? @benjeffery?

codecov · 2023-10-11T16:28:46Z

Codecov Report

Merging #2651 (ae44e89) into main (6e99c11) will decrease coverage by 0.07%.
The diff coverage is 82.55%.

@@            Coverage Diff             @@
##             main    #2651      +/-   ##
==========================================
- Coverage   89.75%   89.69%   -0.07%     
==========================================
  Files          30       30              
  Lines       29902    30160     +258     
  Branches     5803     5860      +57     
==========================================
+ Hits        26840    27053     +213     
- Misses       1755     1778      +23     
- Partials     1307     1329      +22

Flag	Coverage Δ
c-tests	`86.09% <82.89%> (-0.06%)`	⬇️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`67.90% <70.00%> (+<0.01%)`	⬆️
python-tests	`98.92% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
c/tskit/core.c	`95.62% <100.00%> (+0.03%)`	⬆️
c/tskit/core.h	`100.00% <ø> (ø)`
python/tskit/trees.py	`98.62% <100.00%> (+<0.01%)`	⬆️
python/_tskitmodule.c	`88.64% <76.92%> (-0.05%)`	⬇️
c/tskit/trees.c	`90.17% <82.43%> (-0.42%)`	⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e99c11...ae44e89. Read the comment docs.

petrelharp · 2023-10-17T05:49:59Z

Darn it, looks like we need more test coverage.

jeromekelleher

LGTM. I'm not seeing much you can do to update coverage.

Two minor comments:

We should probably do some checks on max_iter in the C code. This will actually make it easier to bump up test coverage in, e.g., the Python C API
Do we really want to error on unspecified mutation times? It would probably be simpler to "min" impute if unknown, and that's what people will end up doing most of the time anyway?

Edit - whoops, just saw your point about impute mutation times above. Ignore comment 2 then!

python/tests/test_extend_edges.py

jeromekelleher · 2023-10-17T08:49:33Z

python/tests/test_extend_edges.py

+    # their time; requires mutation times be nonmissing and the mutation times
+    # be >= their nodes' times.
+
+    assert np.all(~tskit.is_unknown_time(mutations.time)), "times must be known"


Can use ts.impute_mutation_timeson the input ts instead of the hard requirement?

could, but as noted above let's leave this to the end user (we have a note about this in the docstring)

python/tests/test_extend_edges.py

python/tests/test_lowlevel.py

c/tskit/trees.c

c/tskit/trees.h

c/tskit/trees.c

petrelharp · 2023-10-26T03:31:28Z

Note: tskit-dev/tsdate#317 depends on this.

petrelharp · 2023-10-26T05:14:13Z

I think this is ready to go, except for the doc build failure. Codecov says that I'm not testing the

+                 ret = TSK_ERR_DISALLOWED_UNKNOWN_MUTATION_TIME;

line, but I totally am - not sure what's up with that?

Edit: Well, I added a C test for this; should be okay now.

petrelharp · 2023-10-30T19:13:23Z

Note: I've also done like tskit-dev/msprime#1395 in this PR to get it to pass the docs build.

petrelharp · 2023-10-30T19:20:58Z

Okay! I think this is ready to merge!

jeromekelleher

LGTM, just spotted on mistake.

I think we can squash down to one commit (or 3 max, if you really want to keep my changes?)

c/tskit/trees.c

petrelharp · 2023-11-06T22:20:10Z

Okay! I fixed that problem, added another C test that codecov pointed out, rebased & squashed; I think we're ready to go!

petrelharp force-pushed the extend_edges branch from c00d375 to 66ca353 Compare January 23, 2023 20:23

petrelharp commented Feb 20, 2023

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp commented Feb 20, 2023

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp commented Feb 27, 2023

View reviewed changes

python/tskit/.ipynb_checkpoints/trees-checkpoint.py Outdated Show resolved Hide resolved

petrelharp commented Jul 27, 2023

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp force-pushed the extend_edges branch from 811453d to 0301bee Compare August 3, 2023 16:32

petrelharp marked this pull request as ready for review August 4, 2023 16:12

jeromekelleher reviewed Aug 17, 2023

View reviewed changes

jeromekelleher reviewed Aug 18, 2023

View reviewed changes

petrelharp force-pushed the extend_edges branch from b3de586 to ebb79ac Compare August 18, 2023 23:10

duncanMR mentioned this pull request Aug 23, 2023

Add tsk_tree_position_step function to C API #2825

Closed

petrelharp force-pushed the extend_edges branch from 90139ee to cbf2dda Compare September 17, 2023 20:52

petrelharp force-pushed the extend_edges branch from 40419ff to 75f4b10 Compare October 11, 2023 15:50

jeromekelleher reviewed Oct 17, 2023

View reviewed changes

petrelharp force-pushed the extend_edges branch 2 times, most recently from 8fc98b1 to b9b0f16 Compare October 30, 2023 19:11

jeromekelleher reviewed Nov 6, 2023

View reviewed changes

c/tskit/trees.c Outdated Show resolved Hide resolved

Implement extend edges

ae44e89

petrelharp force-pushed the extend_edges branch from 37f5387 to ae44e89 Compare November 6, 2023 22:19

jeromekelleher approved these changes Nov 7, 2023

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 7, 2023

mergify bot merged commit b06a718 into tskit-dev:main Nov 7, 2023
22 of 23 checks passed

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 7, 2023

extend edges method #2651

extend edges method #2651

Conversation

petrelharp commented Nov 28, 2022 • edited Loading

petrelharp commented Feb 20, 2023

petrelharp commented Feb 27, 2023

petrelharp commented Apr 27, 2023

petrelharp commented Jul 21, 2023

petrelharp commented Jul 21, 2023

petrelharp commented Aug 4, 2023

petrelharp commented Aug 5, 2023

petrelharp commented Aug 10, 2023

petrelharp commented Aug 10, 2023

petrelharp commented Aug 11, 2023

hfr1tz3 commented Aug 15, 2023

petrelharp commented Aug 15, 2023

hfr1tz3 commented Aug 16, 2023

petrelharp commented Aug 17, 2023

jeromekelleher commented Aug 17, 2023 • edited Loading

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Aug 17, 2023

Choose a reason for hiding this comment

petrelharp commented Aug 17, 2023

petrelharp commented Aug 17, 2023

jeromekelleher commented Aug 17, 2023

petrelharp commented Aug 17, 2023

petrelharp commented Aug 17, 2023

petrelharp commented Aug 17, 2023

jeromekelleher commented Aug 18, 2023

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Aug 18, 2023 • edited Loading

Choose a reason for hiding this comment

jeromekelleher commented Aug 18, 2023

petrelharp commented Aug 18, 2023

petrelharp commented Aug 18, 2023

petrelharp commented Aug 31, 2023

petrelharp commented Sep 17, 2023 • edited Loading

petrelharp commented Oct 11, 2023

codecov bot commented Oct 11, 2023 • edited Loading

Codecov Report

petrelharp commented Oct 17, 2023

jeromekelleher left a comment • edited Loading

Choose a reason for hiding this comment

jeromekelleher Oct 17, 2023

Choose a reason for hiding this comment

petrelharp Oct 26, 2023

Choose a reason for hiding this comment

petrelharp commented Oct 26, 2023

petrelharp commented Oct 26, 2023 • edited Loading

petrelharp commented Oct 30, 2023

petrelharp commented Oct 30, 2023

jeromekelleher left a comment

Choose a reason for hiding this comment

petrelharp commented Nov 6, 2023

petrelharp commented Nov 28, 2022 •

edited

Loading

jeromekelleher commented Aug 17, 2023 •

edited

Loading

jeromekelleher Aug 18, 2023 •

edited

Loading

petrelharp commented Sep 17, 2023 •

edited

Loading

codecov bot commented Oct 11, 2023 •

edited

Loading

jeromekelleher left a comment •

edited

Loading

petrelharp commented Oct 26, 2023 •

edited

Loading