Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two locus branch stats python prototype #2912

Closed
wants to merge 1 commit into from

Conversation

lkirk
Copy link
Contributor

@lkirk lkirk commented Mar 21, 2024

Description

This is a python prototype for computing two-locus branch statistics. There are a number of things missing from this prototype, notably:

  • Windows / Positions
  • Sample Sets
  • Polarisation

Right now, the code outputs the statistical results in units of branch length, (to get the true stats, one will need to multiply by $\mu^2$ or by the product of the total branch lengths of each compared tree).

Hopefully, I stopped before things got too complicated, though I did integrate this method with bit arrays (soon to be bitsets?). The initial commit for this PR (though undocumented) contains a working algorithm with python sets if that's easier to read.

I implemented this almost exactly as I plan to implement it in C, so we should discuss whether or not this iteration approach / state storage approach works or if there is a more modern way of doing things. I remember discussions about a more modern approach to tree diff iteration in C, but I don't recall where the example is. In any case, it will be easy to update to whatever iteration pattern is preferred.

Copy link

codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.65%. Comparing base (e6483fc) to head (1761e05).

❗ Current head 1761e05 differs from pull request most recent head 850600c. Consider uploading reports for the commit 850600c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2912      +/-   ##
==========================================
- Coverage   89.62%   86.65%   -2.97%     
==========================================
  Files          29       11      -18     
  Lines       30176    22934    -7242     
  Branches     5874     4255    -1619     
==========================================
- Hits        27044    19874    -7170     
+ Misses       1793     1754      -39     
+ Partials     1339     1306      -33     
Flag Coverage Δ
c-tests 86.21% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 88.72% <ø> (ø)
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 18 files with indirect coverage changes

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening up the PR nice and early @lkirk, this is really helpful.

So, I guess I don't see the point of TreeState object - couldn't we do most of this with the standard Tree API using traversals? It's not clear to me that the minor savings that your approach would give would amount to much when compared with all the function evaluations etc, and so it may just be needless complexity.

I'd like to see a version of this code that is as simple as possible, which adds as few new things on top of the existing APIs as possible.

Note that the C level tree object now has access to the edges-in and edges-out via the tsk_tree_position_t struct, so getting access to this stuff should be much simpler now.

See the python/tests/test_tree_positioning.py file for a starting point on this approach, and #2786 (and linked issues) for discussion on how this works.

total_branch_len: int = 0 # cumulative branch length for the current tree

def __init__(self, ts):
self.parents = [tskit.NULL] * ts.num_nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May was well use numpy array here?

Also we usually call this array parent


def __init__(self, ts):
self.parents = [tskit.NULL] * ts.num_nodes
self.nodes = [tskit.NULL] * ts.num_nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about self.node_in_tree = np.zeros(ts.num_nodes , dtype=bool)

I don't see the advantage of the -1/+1 encoding, and this is a bit easier to understand?

@jeromekelleher
Copy link
Member

Note - the specific comments I made are not particularly relevant given the high-level ones. I just made them as I was going, and thought they might be worth keeping anyway. Feel free to ignore.

@lkirk
Copy link
Contributor Author

lkirk commented Mar 26, 2024

@jeromekelleher Thanks for taking a look. The TreeState object was just a way of storing state so that I could easily track and duplicate it between comparisons of trees. I wasn't exactly sure how to gather that data effectively, but tsk_tree_position_t is exactly what I was looking for. I can imagine replacing what I have with calls from that API.

EDIT: I read through your comment more carefully and saw the bit about the python tree positioning. I'll give that a try.

@lkirk
Copy link
Contributor Author

lkirk commented Mar 29, 2024

@jeromekelleher I did a pass over the code and simplified things a bit. I'm still using my TreeState object for now because it makes copying the state easy. I've switched to using the TreePosition object (this is great, definitely makes edge addition/removal more fail-safe). I've also removed some redundant looping over parents. I tried to make compute_stat_update a bit easier to read, but the bitsets make it complicated.

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

@jeromekelleher
Copy link
Member

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling tsk_tree_next, so that we don't end up duplicating quite so much logic around tree transitions. I would probably implement the tsk_treeseq_branch_general_stat using this now, if I was doing it again. However, sometimes you do need to update edge-by-edge, and so maintaining the parent is necessary.

As a side-note, I do want to expose the tsk_tree_position_t to Python at some point, and come up with an efficient way of doing these incremental style algorithms in numba as discussed in #2778. Ultimately, I would like to reach the point where doing stuff like these LD calculations can be achieved efficiently using numba, so that tree sequence algorithms don't have to be implemented in C (tree algorithms can totally be done in numba at the moment). This is a key thing I want to illustrate in the paper.

@lkirk
Copy link
Contributor Author

lkirk commented Apr 2, 2024

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling tsk_tree_next, so that we don't end up duplicating quite so much logic around tree transitions. I would probably implement the tsk_treeseq_branch_general_stat using this now, if I was doing it again. However, sometimes you do need to update edge-by-edge, and so maintaining the parent is necessary.

As a side-note, I do want to expose the tsk_tree_position_t to Python at some point, and come up with an efficient way of doing these incremental style algorithms in numba as discussed in #2778. Ultimately, I would like to reach the point where doing stuff like these LD calculations can be achieved efficiently using numba, so that tree sequence algorithms don't have to be implemented in C (tree algorithms can totally be done in numba at the moment). This is a key thing I want to illustrate in the paper.

Great, I'm glad this is converging on something we're happy with. I'm going to clean this up and integrate it with the rest of the prototype code, adding some realistic tests along the way.

On the subject of the C sketch, I suppose there's a couple of things to consider:

  1. It might be more efficient to update as on the fly instead of using tsk_tree_next, but I'm not sure if any of those additional cycles will be relevant given the computational burden of intersecting sets. As we've discussed in the past, sometimes simple outweighs efficient. That said, I'd like to do a quick sketch with tree_next and tree_position_next to see if it really matters.

  2. I'd be interested in working with numba, but I think I'd like to produce an initial cut with C because I can do that quickly and I have a couple of other downstream analyses I'd really like to get to. That said, I'd be happy to follow up on this and help to re-implement some things in numba.

@lkirk
Copy link
Contributor Author

lkirk commented Apr 18, 2024

Okay, things are shaping up here. A couple of notes:

  1. I'm going to do another pass in subsequent PRs to implement positions/sample sets/polarisation.
  2. The "naive_matrix" function is very slow and inefficient. I want to rip it out once I implement the C code. It's there right now as an orthogonal check to the current implementation. However, there are some caveats: 1) it can't cover all of the test cases because it requires MRCAs. 2) It's too slow to be run in the CI right now. I've capped the slowest test at 3 minutes (I hope that's acceptable amount of time. I can also just disable them altogether if that's preferable).

I think this is ready for another round of review, I'm removing the draft status.

@lkirk lkirk marked this pull request as ready for review April 18, 2024 00:08
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM. I'm happy to merge if you'd like to squash

@lkirk
Copy link
Contributor Author

lkirk commented Apr 23, 2024

Great, thank you for taking a look. I've squashed, it's ready when you get the chance.

Oh, hm. I'm seeing some cache issues with the tests. -- I couldn't immediately see a way to rerun these.

@jeromekelleher
Copy link
Member

Feel free to open a fresh PR if that's easier?

Currently, this algorithm creates a matrix of LD, performing a pairwise
comparison of all trees in the tree sequence.

This implementation lacks windows/positions, sample sets and
polarisation. The outputs of the code produce results in units of branch
length, needing to be multiplied by mu^2 or divided by product of the
total branch length of the two trees.

This algorithm works by keeping a running sum of the statistic between
two trees, updating each time we encounter a branch addition or removal.
The tricky part is that we have to remove or add LD contributed by
samples that already existed or that will remain under a given node
after the addition/removal of branches.

We include a validation against the original formulation of this
problem, by including an implementation that was described in McVean
2002. The original formulation computing the covariance of tMRCAs of
2, 3, and 4 samples of individuals on the trees in question. This
implementation has several limitations 1) it is very slow and 2) it does
not work for trees that are decapitated, because certain samples do not
have MRCAs.
@benjeffery
Copy link
Member

Feel free to open a fresh PR if that's easier?

For future reference - the cache is shared between PRs, so this usually doesn't help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants