Two locus branch stats python prototype #2912

lkirk · 2024-03-21T20:58:24Z

Description

This is a python prototype for computing two-locus branch statistics. There are a number of things missing from this prototype, notably:

Windows / Positions
Sample Sets
Polarisation

Right now, the code outputs the statistical results in units of branch length, (to get the true stats, one will need to multiply by $\mu^2$ or by the product of the total branch lengths of each compared tree).

Hopefully, I stopped before things got too complicated, though I did integrate this method with bit arrays (soon to be bitsets?). The initial commit for this PR (though undocumented) contains a working algorithm with python sets if that's easier to read.

I implemented this almost exactly as I plan to implement it in C, so we should discuss whether or not this iteration approach / state storage approach works or if there is a more modern way of doing things. I remember discussions about a more modern approach to tree diff iteration in C, but I don't recall where the example is. In any case, it will be easy to update to whatever iteration pattern is preferred.

codecov · 2024-03-21T21:49:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.65%. Comparing base (e6483fc) to head (1761e05).

❗ Current head 1761e05 differs from pull request most recent head 850600c. Consider uploading reports for the commit 850600c to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2912      +/-   ##
==========================================
- Coverage   89.62%   86.65%   -2.97%     
==========================================
  Files          29       11      -18     
  Lines       30176    22934    -7242     
  Branches     5874     4255    -1619     
==========================================
- Hits        27044    19874    -7170     
+ Misses       1793     1754      -39     
+ Partials     1339     1306      -33

Flag	Coverage Δ
c-tests	`86.21% <ø> (ø)`
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`88.72% <ø> (ø)`
python-tests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

see 18 files with indirect coverage changes

jeromekelleher

Thanks for opening up the PR nice and early @lkirk, this is really helpful.

So, I guess I don't see the point of TreeState object - couldn't we do most of this with the standard Tree API using traversals? It's not clear to me that the minor savings that your approach would give would amount to much when compared with all the function evaluations etc, and so it may just be needless complexity.

I'd like to see a version of this code that is as simple as possible, which adds as few new things on top of the existing APIs as possible.

Note that the C level tree object now has access to the edges-in and edges-out via the tsk_tree_position_t struct, so getting access to this stuff should be much simpler now.

See the python/tests/test_tree_positioning.py file for a starting point on this approach, and #2786 (and linked issues) for discussion on how this works.

jeromekelleher · 2024-03-25T09:40:04Z

python/tests/test_ld_matrix.py

+    total_branch_len: int = 0  # cumulative branch length for the current tree
+
+    def __init__(self, ts):
+        self.parents = [tskit.NULL] * ts.num_nodes


May was well use numpy array here?

Also we usually call this array parent

jeromekelleher · 2024-03-25T09:41:46Z

python/tests/test_ld_matrix.py

+
+    def __init__(self, ts):
+        self.parents = [tskit.NULL] * ts.num_nodes
+        self.nodes = [tskit.NULL] * ts.num_nodes


How about self.node_in_tree = np.zeros(ts.num_nodes , dtype=bool)

I don't see the advantage of the -1/+1 encoding, and this is a bit easier to understand?

jeromekelleher · 2024-03-25T12:02:20Z

Note - the specific comments I made are not particularly relevant given the high-level ones. I just made them as I was going, and thought they might be worth keeping anyway. Feel free to ignore.

lkirk · 2024-03-26T03:44:50Z

@jeromekelleher Thanks for taking a look. The TreeState object was just a way of storing state so that I could easily track and duplicate it between comparisons of trees. I wasn't exactly sure how to gather that data effectively, but tsk_tree_position_t is exactly what I was looking for. I can imagine replacing what I have with calls from that API.

EDIT: I read through your comment more carefully and saw the bit about the python tree positioning. I'll give that a try.

lkirk · 2024-03-29T16:52:20Z

@jeromekelleher I did a pass over the code and simplified things a bit. I'm still using my TreeState object for now because it makes copying the state easy. I've switched to using the TreePosition object (this is great, definitely makes edge addition/removal more fail-safe). I've also removed some redundant looping over parents. I tried to make compute_stat_update a bit easier to read, but the bitsets make it complicated.

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

jeromekelleher · 2024-04-02T09:33:18Z

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling tsk_tree_next, so that we don't end up duplicating quite so much logic around tree transitions. I would probably implement the tsk_treeseq_branch_general_stat using this now, if I was doing it again. However, sometimes you do need to update edge-by-edge, and so maintaining the parent is necessary.

As a side-note, I do want to expose the tsk_tree_position_t to Python at some point, and come up with an efficient way of doing these incremental style algorithms in numba as discussed in #2778. Ultimately, I would like to reach the point where doing stuff like these LD calculations can be achieved efficiently using numba, so that tree sequence algorithms don't have to be implemented in C (tree algorithms can totally be done in numba at the moment). This is a key thing I want to illustrate in the paper.

lkirk · 2024-04-02T19:03:45Z

What do you think about this? I'm imagining a C implementation that uses tsk_tree_position_next to iterate through trees, using the tsk_tree_position_t struct to do what I'm doing in this python prototype -- or is that not an API that I should be consuming? Were you thinking it would be better to use tsk_tree_next? I think it's nice that we can update the stat during edge removal, similar to tsk_treeseq_branch_general_stat.

I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling tsk_tree_next, so that we don't end up duplicating quite so much logic around tree transitions. I would probably implement the tsk_treeseq_branch_general_stat using this now, if I was doing it again. However, sometimes you do need to update edge-by-edge, and so maintaining the parent is necessary.

As a side-note, I do want to expose the tsk_tree_position_t to Python at some point, and come up with an efficient way of doing these incremental style algorithms in numba as discussed in #2778. Ultimately, I would like to reach the point where doing stuff like these LD calculations can be achieved efficiently using numba, so that tree sequence algorithms don't have to be implemented in C (tree algorithms can totally be done in numba at the moment). This is a key thing I want to illustrate in the paper.

Great, I'm glad this is converging on something we're happy with. I'm going to clean this up and integrate it with the rest of the prototype code, adding some realistic tests along the way.

On the subject of the C sketch, I suppose there's a couple of things to consider:

It might be more efficient to update as on the fly instead of using tsk_tree_next, but I'm not sure if any of those additional cycles will be relevant given the computational burden of intersecting sets. As we've discussed in the past, sometimes simple outweighs efficient. That said, I'd like to do a quick sketch with tree_next and tree_position_next to see if it really matters.
I'd be interested in working with numba, but I think I'd like to produce an initial cut with C because I can do that quickly and I have a couple of other downstream analyses I'd really like to get to. That said, I'd be happy to follow up on this and help to re-implement some things in numba.

lkirk · 2024-04-18T00:06:40Z

Okay, things are shaping up here. A couple of notes:

I'm going to do another pass in subsequent PRs to implement positions/sample sets/polarisation.
The "naive_matrix" function is very slow and inefficient. I want to rip it out once I implement the C code. It's there right now as an orthogonal check to the current implementation. However, there are some caveats: 1) it can't cover all of the test cases because it requires MRCAs. 2) It's too slow to be run in the CI right now. I've capped the slowest test at 3 minutes (I hope that's acceptable amount of time. I can also just disable them altogether if that's preferable).

I think this is ready for another round of review, I'm removing the draft status.

jeromekelleher

Nice, LGTM. I'm happy to merge if you'd like to squash

lkirk · 2024-04-23T14:30:05Z

Great, thank you for taking a look. I've squashed, it's ready when you get the chance.

Oh, hm. I'm seeing some cache issues with the tests. -- I couldn't immediately see a way to rerun these.

jeromekelleher · 2024-04-23T14:36:42Z

Feel free to open a fresh PR if that's easier?

Currently, this algorithm creates a matrix of LD, performing a pairwise comparison of all trees in the tree sequence. This implementation lacks windows/positions, sample sets and polarisation. The outputs of the code produce results in units of branch length, needing to be multiplied by mu^2 or divided by product of the total branch length of the two trees. This algorithm works by keeping a running sum of the statistic between two trees, updating each time we encounter a branch addition or removal. The tricky part is that we have to remove or add LD contributed by samples that already existed or that will remain under a given node after the addition/removal of branches. We include a validation against the original formulation of this problem, by including an implementation that was described in McVean 2002. The original formulation computing the covariance of tMRCAs of 2, 3, and 4 samples of individuals on the trees in question. This implementation has several limitations 1) it is very slow and 2) it does not work for trees that are decapitated, because certain samples do not have MRCAs.

benjeffery · 2024-04-24T21:27:49Z

Feel free to open a fresh PR if that's easier?

For future reference - the cache is shared between PRs, so this usually doesn't help.

lkirk force-pushed the two-locus-branch-stats branch from 8081cb2 to 9c41292 Compare March 21, 2024 21:12

jeromekelleher reviewed Mar 25, 2024

View reviewed changes

lkirk marked this pull request as ready for review April 18, 2024 00:08

jeromekelleher approved these changes Apr 23, 2024

View reviewed changes

lkirk force-pushed the two-locus-branch-stats branch from 640c864 to 1761e05 Compare April 23, 2024 14:28

lkirk force-pushed the two-locus-branch-stats branch from 1761e05 to 850600c Compare April 23, 2024 16:11

lkirk closed this Apr 23, 2024

lkirk mentioned this pull request Apr 23, 2024

Two locus branch stats python prototype #2934

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two locus branch stats python prototype #2912

Two locus branch stats python prototype #2912

lkirk commented Mar 21, 2024 •

edited

Loading

codecov bot commented Mar 21, 2024 •

edited

Loading

jeromekelleher left a comment

jeromekelleher Mar 25, 2024

jeromekelleher Mar 25, 2024

jeromekelleher commented Mar 25, 2024

lkirk commented Mar 26, 2024 •

edited

Loading

lkirk commented Mar 29, 2024

jeromekelleher commented Apr 2, 2024

lkirk commented Apr 2, 2024

lkirk commented Apr 18, 2024

jeromekelleher left a comment

lkirk commented Apr 23, 2024 •

edited

Loading

jeromekelleher commented Apr 23, 2024

benjeffery commented Apr 24, 2024

Two locus branch stats python prototype #2912

Two locus branch stats python prototype #2912

Conversation

lkirk commented Mar 21, 2024 • edited Loading

Description

codecov bot commented Mar 21, 2024 • edited Loading

Codecov Report

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Mar 25, 2024

Choose a reason for hiding this comment

jeromekelleher Mar 25, 2024

Choose a reason for hiding this comment

jeromekelleher commented Mar 25, 2024

lkirk commented Mar 26, 2024 • edited Loading

lkirk commented Mar 29, 2024

jeromekelleher commented Apr 2, 2024

lkirk commented Apr 2, 2024

lkirk commented Apr 18, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

lkirk commented Apr 23, 2024 • edited Loading

jeromekelleher commented Apr 23, 2024

benjeffery commented Apr 24, 2024

lkirk commented Mar 21, 2024 •

edited

Loading

codecov bot commented Mar 21, 2024 •

edited

Loading

lkirk commented Mar 26, 2024 •

edited

Loading

lkirk commented Apr 23, 2024 •

edited

Loading