Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modular incremental trees #2786

Merged
merged 3 commits into from
Jul 18, 2023

Conversation

jeromekelleher
Copy link
Member

Try to centralise our tree positioning code by bringing the logic into TreePosition class. See #2778 for discussion.

@molpopgen - the idea here is to try and pull out all the of the code that we use in the tsk_tree_t class for tree positioning into this class, so that it can then be shared by incremental algorithms as well. We should then be able to do efficient skipping to a particular tree, which will be useful for parallel algorithms.

WIP

@molpopgen
Copy link
Member

Sounds great. IMO, it is important to have that core logic separated from updating all those tree arrays!

@codecov
Copy link

codecov bot commented Jul 7, 2023

Codecov Report

Merging #2786 (2c264be) into main (67f2ef3) will increase coverage by 0.02%.
The diff coverage is 96.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2786      +/-   ##
==========================================
+ Coverage   89.93%   89.96%   +0.02%     
==========================================
  Files          30       30              
  Lines       29171    29274     +103     
  Branches     5697     5706       +9     
==========================================
+ Hits        26236    26335      +99     
- Misses       1666     1669       +3     
- Partials     1269     1270       +1     
Flag Coverage Δ
c-tests 86.49% <96.25%> (+0.07%) ⬆️
lwt-tests 80.13% <ø> (ø)
python-c-tests 67.12% <ø> (ø)
python-tests 99.02% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/trees.c 90.88% <95.48%> (+0.08%) ⬆️
c/tskit/haplotype_matching.c 90.67% <100.00%> (+0.26%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67f2ef3...2c264be. Read the comment docs.

@jeromekelleher
Copy link
Member Author

I think this is ready for review and merging now.

The idea is to add a new class TreePosition (or tsk_tree_position_t in C) which takes care of the details of keeping track of which edges you need to insert and remove from the current tree as you move along the sequence. This is already a substantial improvement over what we have now because it allows us to reuse code in incremental algorithms, and we can also now define incremental algorithms that go backwards as well as forwards.

I've implemented the basic next/prev operations here in Python and C, and used the core code to replace the bits of the library code that use the tsk_tree_diff_iterator. There's lots more that can be done with this, which I'll open some issues to track.

One key thing that needs to be mentioned here is that I've incorporated quite a clean way of seeking to arbitrary locations along the sequence while touching only the edges that need to be changed in order to transform the current tree into the target tree. However this has raise this issue that these edges are not necessarily inserted in time-increasing order. This needs to be thought about a bit, which is why I haven't gone ahead and switched over to using this code for the tsk_tree_t class itself.

@molpopgen - any chance you could take a look and see what you think?

@molpopgen
Copy link
Member

I'll take a look, but it'll be next week.

@jeromekelleher
Copy link
Member Author

@duncanMR good to get your eyes on this also please

@duncanMR
Copy link
Contributor

@jeromekelleher I will start examining the code today. Have you found a test case where the edges aren't inserted in time-decreasing order?

@jeromekelleher
Copy link
Member Author

Excellent. I haven't picked out specific examples where the edges aren't time sorted in seek_forward, and that would be a great thing to do in a follow up.

I stopped working on the random seek aspect once I realised the sortedness problem, and decided to focus on getting the basic code merged so we could start using it. So it's in a half finished state at the moment

@molpopgen
Copy link
Member

Let's see if I got this:

  1. The new type book-keeps index ranges of output/input edges. This is essentially the edge diffs for the new tree.
  2. Nodes can get out of (time) order because the input/output index vectors are sorted first by position and then time, and input/output do time in different orders from one another?

@jeromekelleher
Copy link
Member Author

The new type book-keeps index ranges of output/input edges. This is essentially the edge diffs for the new tree.

Yep - but without having to allocate any extra memory because we're using the existing index arrays to represent the ranges.

Nodes can get out of (time) order because the input/output index vectors are sorted first by position and then time, and input/output do time in different orders from one another?

This is subtle. The input/output arrays are sorted by position and time, which guarantees that we insert/remove the edges in the correct order for adjacent trees. However, when we're seeking to non-adjacent trees (or randomly into the middle of the sequence, like seek_from_null) we lose this guarantee.

@jeromekelleher
Copy link
Member Author

Can I get a review here please @benjeffery? I want to merge this so downstream stuff can go in too

@jeromekelleher
Copy link
Member Author

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Jul 18, 2023

rebase

✅ Branch has been successfully rebased

@jeromekelleher jeromekelleher merged commit ab5ef8f into tskit-dev:main Jul 18, 2023
19 checks passed
Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, just a couple of small comments.



@dataclasses.dataclass
class Interval:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the existing Interval class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just simpler to keep track of what it does for this kind of prototype/test code, and not have to worry about external APIs and so on

left_current_index = self->in.stop;
right_current_index = self->out.stop;
} else {
left_current_index = self->out.stop + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some missing coverage here and below.

@jeromekelleher
Copy link
Member Author

Whoops, didn't spot these missing coverage bits. @duncanMR could you try and get these covered in C tests as part of #2794 please?

@jeromekelleher jeromekelleher deleted the incremental-trees branch July 18, 2023 14:08
@duncanMR
Copy link
Contributor

Sure, will do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants