Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend edges method #2651

Merged
merged 1 commit into from
Nov 7, 2023
Merged

extend edges method #2651

merged 1 commit into from
Nov 7, 2023

Conversation

petrelharp
Copy link
Contributor

@petrelharp petrelharp commented Nov 28, 2022

Here we (me and @hfr1tz3 and @avabamf) have implemented the "extend_edges" and related things in tskit.

Here's the description of what this does (copied from the draft docstring):

        Returns a new tree sequence in which the span covered by ancestral nodes
        is "extended" to regions of the genome according to the following rule:
        If an ancestral segment corresponding to node `n` has parent `p` and
        child `c` on some portion of the genome, and on an adjacent segment of
        genome `p` is the immediate parent of `c`, then `n` is inserted into the
        edge from `p` to `c`. This involves extending the span of the edges
        from `p` to `n` and `n` to `c` and reducing the span of the edge from
        `p` to `c`. Since the latter edge may be removed entirely, this process
        reduces (or at least does not increase) the number of edges in the tree
        sequence.
        The method works by iterating over the genome to look for edges that can
        be extended in this way; the maximum number of such iterations is
        controlled by ``max_iter``.
        The rationale is that we know that `n` carries a portion of the segment
        of ancestral genome inherited by `c` from `p`, and so likely carries
        the *entire* inherited segment (since the implication otherwise would
        be that distinct recombined segments were passed down separately from
        `p` to `c`).

python/tskit/trees.py Outdated Show resolved Hide resolved
python/tskit/trees.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

@hfr1tz3 I think you need to add .ipynb_checkpoints to the .gitignore (and un-do the adding of that last file); something like

cd python/tskit
git rm .ipynb_checkpoints/trees-checkpoint.py
echo ".ipynb_checkpoints" >> .gitignore
git add .gitignore
git commit -m 'remove checkpoint thing'

should do it.

@petrelharp
Copy link
Contributor Author

Nice fix! That looks like an annoying one to find. I have a suggestion; see my comment.

@petrelharp
Copy link
Contributor Author

TODO: implement this test (it needs both forwards and backwards):
Uploading Screenshot from 2023-04-27 12-45-37.png…

@petrelharp
Copy link
Contributor Author

TODO:

  1. test forwards + backwards python code
  2. implement backwards in C also
  3. move python code to testing
  4. make TreeSequence method call the TableCollection method (which calls the C)
  5. test again (which now calls C code)

@petrelharp
Copy link
Contributor Author

Reminder: once c code is edited, to test changes locally you've got to run make in the python/ subdir. (And, currently C code isn't being run.)

python/tskit/trees.py Outdated Show resolved Hide resolved
@petrelharp petrelharp marked this pull request as ready for review August 4, 2023 16:12
@petrelharp
Copy link
Contributor Author

OK - I think this is ready to go.

FYI, we loop over 'edges in' and 'edges out', and it seems like perhaps we should be building the trees there (and thus using the new C tree iteration stuff), but I think not, actually, here's why: the way the algorithm works is it looks for single edges a->c that can be replaced (or at least shortened) by extending adjacent pairs of edges a->b->c; so it looks for places where two edges a->b->c are being removed and an edge a->c is being added; this is a question only about the 'edges in' and 'edges out', not about the entire tree. Some things we have to do are to "check if two edges form a chain (a->b->c)" and "check if the edge a->c is in the set of new edges (and hence in the tree)"; this would at least be simpler to do using the tree, and would be much more efficient if there were lots of edges to add and remove. However, it would make the code more complex, since we'd then have to check "is this a newly added edge", for instance.

Hm, so this does mean that this is O(k^3) where k is the number of edges added/removed in a tree transition; for msprime trees this is O(1) but for other trees it might not be.

@petrelharp
Copy link
Contributor Author

I've had a good think about how to use the trees to make the algorithm not O(k^3), and it's possible but not obvious (to me yet). Unless it becomes obvious I'd like to wait to see if it's an problem in practice.

And, I don't know what's up with codecov? The actions say they've uploaded the results, e.g. to https://app.codecov.io/github/tskit-dev/tskit/commit/8973c7b86ada7a19decd9f88031ea045dfd95e6a but at that link it says that

Commit YAML is invalid
Coverage data is unable to be displayed, as the commit YAML appears to be invalid.

@petrelharp
Copy link
Contributor Author

I'm not seeing the codecov report for some reason, but here it is for this branch:
https://app.codecov.io/github/tskit-dev/tskit/tree/extend_edges

TODO:

@petrelharp
Copy link
Contributor Author

Note: to run the code on this branch locally (in, say, a different project):

  • make sure you've got this branch checked out
  • go to the python/ subdirectory
  • run make
  • run pip install .

This should replace tskit with this one; to make sure you're using the right version you can first uninstall tskit, so if it didn't work then it'll be obvious.

@petrelharp
Copy link
Contributor Author

TODO:

  • add this to the docs
  • use the tree positioning stuff (see e.g. around l.60 in python/tests/test_tree_positioning.py for how to use it) to do the iteration

@hfr1tz3
Copy link
Contributor

hfr1tz3 commented Aug 15, 2023

I am trying to run new empirical tests with our edge extend, but I am getting an error for some reason.
TypeError: cannot unpack non-iterable TreeSequence object
We should clear this up on Thursday.

@petrelharp
Copy link
Contributor Author

Can you post the code that produces that error, and/or the information above the error that says at whichi point the error occurs?

@hfr1tz3
Copy link
Contributor

hfr1tz3 commented Aug 16, 2023

Can you post the code that produces that error, and/or the information above the error that says at whichi point the error occurs?

In last_truth on our num_edges repository cell 3, I installed our extend edges method and tried to run the code. However, I think I may be just calling the function wrong. I will look into it more tomorrow

@petrelharp
Copy link
Contributor Author

I have - after quite a bit of staring at things - modified the algorithm to use the new "tree position" method of iterating over trees. (The staring did result in better code and more robust tests, however.) However, I need some input here - probably from @jeromekelleher? So, right now (a) the python implementation in test_topology.py uses "tree position", but (b) the c version in tables.c uses I/O; and (c) in trees.c there is a C translation that I got most of the way through porting over before realizing I needed some guidance, namely:

The tsk_tree_position_t stuff needs a tree sequence, not just tables. This makes sense since it has to do with trees. However, tree sequence methods return a whole new tree sequence, while this algorithm pretty naturally modifies things more or less in place - it only makes a copy of the edge table. In fact, it rebuilds the edge table and the indexes a bunch of times. I feel like it is more natural for this reason to have extend edges be a table collection method (that operates in-place); but then to use the tree position thing we'll have to build a tree sequence within the function (and, tables.c doesn't even import trees.h). What's the right way to set this up?

@jeromekelleher
Copy link
Member

jeromekelleher commented Aug 17, 2023

I see where you're coming from @petrelharp, but I don't see the point of making this an in-place TableCollection method. Why not a TreeSequence method that returns a new TreeSequence? You need everything you need for the tables to be a valid TreeSequence anyway. This is never going to be as performance-critical as (say) simplify, so the extra copy incurred isn't going to amount to anything. I think that would simplify things a bit?

The implementation of tsk_treeseq_simplify is a good model for how to copy the tables, modify then (with some code in trees.c), and then pass ownership to the new returned tsk_treeseq_t object.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't got my head around the algorithm fully, but it's much easier to understand now!

I vote for ditching the tsk_table_collection version - surely this is a premature optimisation?

c/tskit/trees.c Show resolved Hide resolved
c/tskit/trees.c Outdated
remove_unextended(&edges_in_head, &edges_in_tail);
remove_unextended(&edges_out_head, &edges_out_tail);

for (tj = tree_pos.out.start; tj != tree_pos.out.stop; tj += sign_int) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use

for (tj = tree_pos.out.start; tj != tree_pos.out.stop; tj += tree_pos.direction) {
}

which is partly the intention of having direction done like it is.

@petrelharp
Copy link
Contributor Author

I totally agree that the performance here isn't critical. But I do think that there's a conceptual hole here: how do we write a method that modifies tables in place and uses the trees to do so? (I mean, there's no technical barrier to that, but that's not how the API is set up.) A good example of something where that's exactly what we need to do is compute_mutation_parents, which uses the I/O method of constructing the trees.

@petrelharp
Copy link
Contributor Author

It seems like the way to do this (eg in compute_mutation_parents) would be to (a) make a tree sequence from the tables; (b) do the stuff to the tables; (c) forget about the tree sequence, leaving just the tables. I just don't have my head entirely around whether that's the right way to go.

@jeromekelleher
Copy link
Member

I totally agree that the performance here isn't critical. But I do think that there's a conceptual hole here: how do we write a method that modifies tables in place and uses the trees to do so? (I mean, there's no technical barrier to that, but that's not how the API is set up.) A good example of something where that's exactly what we need to do is compute_mutation_parents, which uses the I/O method of constructing the trees.

I think the number of things we need to do with the tables in place that require iterating over the trees will be pretty small, going forward. So, I'd vote we just leave stuff like compute_mutation_parents etc as they are, and concentrate on using the tree_pos stuff for things that are logically operations on a tree sequence.

We don't have to depend on the tree sequence for the tree_position thing --- we're really only using the breakpoints array, which we only really need for long-range jumping. But, I think the abstraction is getting very leaky when we do things on both the tree sequence and table collection, just for the sake of avoiding a copy, or creating a treeseq_t object (which can be as cheap as validating a set of tables has the required tree properties).

@petrelharp
Copy link
Contributor Author

We don't have to depend on the tree sequence for the tree_position thing --- we're really only using the breakpoints array, which we only really need for long-range jumping. But, I think the abstraction is getting very leaky when we do things on both the tree sequence and table collection, just for the sake of avoiding a copy, or creating a treeseq_t object (which can be as cheap as validating a set of tables has the required tree properties).

Seems like then if we were to write something like compute_mutation_parents with tree_positions, then we'd create a treeseq_t object - that seems fine? I am pretty tempted to do things that way here, because (a) it'd take less re-plumbing, and (b) one of the use cases for this method is that it compresses the tree sequence.

@petrelharp
Copy link
Contributor Author

I am pretty tempted to do things that way here, because (a) it'd take less re-plumbing, and (b) one of the use cases for this method is that it compresses the tree sequence.

Never mind, since we wouldn't be able to do the no-copy thing in python really anyhow.

@petrelharp
Copy link
Contributor Author

Okay - this is ready for review, @jeromekelleher. valgrind reports a leak in the C tests related to the tables - I must be missing something silly there, but I am not sure what.

@jeromekelleher
Copy link
Member

I've gone over this and refactored things a bit. The memory management at the C level was really quite tricky, but hopefully it's settled now.

Some high-level stuff:

  • There's some problem with mutations. I think you probably need to keep track of the edges that are extended, and then remap the mutations?
  • Testing is pretty weak. I think we need a few more fully worked examples where we have the input and the exact expected output, after one or two iterations. I know this is tedious, but it really pays off in terms of helping the code reader understand what's going on
  • We're missing some checks like "the other tables shouldn't be affected" (but see point above about mutations)

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments. I'm not sure what the right pattern will be in order to update the mutation table too, but I guess that'll take some figuring out first.

tsk_id_t e, e1, e2, e_in;
tsk_blkalloc_t edge_list_heap;
double *near_side, *far_side;
edge_list_t *edges_in_head, *edges_in_tail;
Copy link
Member

@jeromekelleher jeromekelleher Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I was starting this from scratch, I would't bother with linked lists and do something like this instead:

tsk_bool_t extended = calloc(num_edges, sizeof(*extended));
tsk_id_t *edges_in = mallco(num_edges, sizeof(*edges_in));
tsk_id_t *edges_out = mallco(num_edges, sizeof(*edges_out));
tsk_size_t num_edge_in, num_edges_out;

and keep track of the lists like that. Repacking to get rid of unextended edges will be slightly trickier, but there's less faddling around with linked lists, memory allocation and so on, and it would almost certainly be more efficient.

But like I say - you might not want to get into this, since it's working.

c/tskit/trees.h Show resolved Hide resolved
c/tskit/trees.h Show resolved Hide resolved
python/tests/test_extend_edges.py Show resolved Hide resolved
@jeromekelleher
Copy link
Member

One thing that might not be immediately obvious - I made a new file test_extend_edges.py

@petrelharp
Copy link
Contributor Author

Oh riiiight, MUTATIONS. Thanks very much; I'll have a go at this.

@petrelharp
Copy link
Contributor Author

Ah, I see about the memory management. Makes sense.

@petrelharp
Copy link
Contributor Author

TODO: put mutations on the edges; proposed method:

  • for each edge to be postponed,
  • for each mutation on said edges,
  • figure out what edge the mutation should be on

@petrelharp
Copy link
Contributor Author

petrelharp commented Sep 17, 2023

Okay! Assuming I can get the CI errors to go away, I think this is ready to go:

  1. I have re-written the algorithm so it is no longer O(n^3), where n is the number of edges differing between adjacent trees. Now it is O(n), in the usual case anyhow. This should make it more possible to apply it to Relate-produced tree sequences.
  2. It does mutations now: it currently just errors if mutations do not have times, and directs the user to impute_mutation_times; we could always just keep mutations at the nodes they are assigned to, which would be equivalent to the (currently default) method of impute_mutation_times that gives mutations the time of their nodes; but it seems statistically much better to distribute mtuation times evenly over the edges, so we're making the user do that step explicitly.
  3. More tests are possible, but I feel good about it the current set of them.
  4. Migrations are a can of worms - really, they are additional information that constrain what edges can be extended and how -- so they are disallowed.

Edit: anyhow know what's going on with the doc build?

@petrelharp
Copy link
Contributor Author

This is ready for review: @jeromekelleher? @benjeffery?

@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Merging #2651 (ae44e89) into main (6e99c11) will decrease coverage by 0.07%.
The diff coverage is 82.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2651      +/-   ##
==========================================
- Coverage   89.75%   89.69%   -0.07%     
==========================================
  Files          30       30              
  Lines       29902    30160     +258     
  Branches     5803     5860      +57     
==========================================
+ Hits        26840    27053     +213     
- Misses       1755     1778      +23     
- Partials     1307     1329      +22     
Flag Coverage Δ
c-tests 86.09% <82.89%> (-0.06%) ⬇️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.90% <70.00%> (+<0.01%) ⬆️
python-tests 98.92% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
c/tskit/core.c 95.62% <100.00%> (+0.03%) ⬆️
c/tskit/core.h 100.00% <ø> (ø)
python/tskit/trees.py 98.62% <100.00%> (+<0.01%) ⬆️
python/_tskitmodule.c 88.64% <76.92%> (-0.05%) ⬇️
c/tskit/trees.c 90.17% <82.43%> (-0.42%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e99c11...ae44e89. Read the comment docs.

@petrelharp
Copy link
Contributor Author

Darn it, looks like we need more test coverage.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm not seeing much you can do to update coverage.

Two minor comments:

  • We should probably do some checks on max_iter in the C code. This will actually make it easier to bump up test coverage in, e.g., the Python C API
  • Do we really want to error on unspecified mutation times? It would probably be simpler to "min" impute if unknown, and that's what people will end up doing most of the time anyway?

Edit - whoops, just saw your point about impute mutation times above. Ignore comment 2 then!

python/tests/test_extend_edges.py Outdated Show resolved Hide resolved
# their time; requires mutation times be nonmissing and the mutation times
# be >= their nodes' times.

assert np.all(~tskit.is_unknown_time(mutations.time)), "times must be known"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use ts.impute_mutation_timeson the input ts instead of the hard requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could, but as noted above let's leave this to the end user (we have a note about this in the docstring)

python/tests/test_extend_edges.py Outdated Show resolved Hide resolved
python/tests/test_extend_edges.py Outdated Show resolved Hide resolved
python/tests/test_lowlevel.py Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.h Show resolved Hide resolved
c/tskit/trees.c Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

Note: tskit-dev/tsdate#317 depends on this.

@petrelharp
Copy link
Contributor Author

petrelharp commented Oct 26, 2023

I think this is ready to go, except for the doc build failure. Codecov says that I'm not testing the

+                 ret = TSK_ERR_DISALLOWED_UNKNOWN_MUTATION_TIME;

line, but I totally am - not sure what's up with that?

Edit: Well, I added a C test for this; should be okay now.

@petrelharp petrelharp force-pushed the extend_edges branch 2 times, most recently from 8fc98b1 to b9b0f16 Compare October 30, 2023 19:11
@petrelharp
Copy link
Contributor Author

Note: I've also done like tskit-dev/msprime#1395 in this PR to get it to pass the docs build.

@petrelharp
Copy link
Contributor Author

Okay! I think this is ready to merge!

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just spotted on mistake.

I think we can squash down to one commit (or 3 max, if you really want to keep my changes?)

c/tskit/trees.c Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

Okay! I fixed that problem, added another C test that codecov pointed out, rebased & squashed; I think we're ready to go!

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 7, 2023
@mergify mergify bot merged commit b06a718 into tskit-dev:main Nov 7, 2023
22 of 23 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants