Two locus branch stats python prototype #2912
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2912 +/- ##
==========================================
- Coverage 89.62% 86.65% -2.97%
==========================================
Files 29 11 -18
Lines 30176 22934 -7242
Branches 5874 4255 -1619
==========================================
- Hits 27044 19874 -7170
+ Misses 1793 1754 -39
+ Partials 1339 1306 -33
Thanks for opening up the PR nice and early @lkirk, this is really helpful.
So, I guess I don't see the point of the TreeState object: couldn't we do most of this with the standard Tree API using traversals? It's not clear to me that the minor savings your approach would give would amount to much compared with all the function evaluations etc., so it may just be needless complexity.
I'd like to see a version of this code that is as simple as possible, adding as few new things on top of the existing APIs as possible.
Note that the C-level tree object now has access to the edges-in and edges-out via the tsk_tree_position_t struct, so getting access to this stuff should be much simpler now.
See the python/tests/test_tree_positioning.py file for a starting point on this approach, and #2786 (and linked issues) for discussion on how this works.
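As a rough illustration of the edges-in/edges-out idea, here is a minimal pure-Python sketch that keeps a parent array up to date across tree transitions. The `Edge` record, the `apply_diff` helper, and the hand-written `diffs` list are hypothetical stand-ins for illustration only; in tskit the equivalent information would come from something like `ts.edge_diffs()`, and this is not the code from the PR.

```python
from collections import namedtuple

NULL = -1  # stands in for tskit.NULL

# Hypothetical minimal edge record with only the fields used here.
Edge = namedtuple("Edge", ["parent", "child"])


def apply_diff(parent, edges_out, edges_in):
    """Update the parent array in place for one tree transition."""
    # Remove outgoing edges first, then add incoming ones.
    for e in edges_out:
        parent[e.child] = NULL
    for e in edges_in:
        parent[e.child] = e.parent
    return parent


# Two hand-written transitions over 5 nodes (samples 0, 1, 2; internal 3, 4).
diffs = [
    ([], [Edge(3, 0), Edge(3, 1), Edge(4, 2), Edge(4, 3)]),
    ([Edge(3, 1), Edge(4, 2)], [Edge(3, 2), Edge(4, 1)]),
]

parent = [NULL] * 5
history = []
for edges_out, edges_in in diffs:
    apply_diff(parent, edges_out, edges_in)
    history.append(list(parent))  # snapshot of the tree after each transition
```

Each snapshot in `history` is the parent array of one tree, so any per-tree quantity can be updated from the changed edges alone instead of being recomputed from scratch.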
python/tests/test_ld_matrix.py
Outdated
total_branch_len: int = 0 # cumulative branch length for the current tree

def __init__(self, ts):
self.parents = [tskit.NULL] * ts.num_nodes
May as well use a numpy array here?
Also we usually call this array parent
python/tests/test_ld_matrix.py
Outdated
def __init__(self, ts):
self.parents = [tskit.NULL] * ts.num_nodes
self.nodes = [tskit.NULL] * ts.num_nodes
How about self.node_in_tree = np.zeros(ts.num_nodes, dtype=bool)?
I don't see the advantage of the -1/+1 encoding, and this is a bit easier to understand.
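For illustration, the two suggestions above (a numpy `parent` array and a boolean `node_in_tree` array) might look roughly like the sketch below. The `insert_edge` helper and the fixed `num_nodes` are assumptions made for this example; the prototype would take the size from `ts.num_nodes`.

```python
import numpy as np

NULL = -1  # same value as tskit.NULL

num_nodes = 5  # would be ts.num_nodes in the prototype
parent = np.full(num_nodes, NULL, dtype=np.int32)
node_in_tree = np.zeros(num_nodes, dtype=bool)


def insert_edge(parent, node_in_tree, p, c):
    """Record edge p -> c and mark both endpoints as present in the tree."""
    parent[c] = p
    node_in_tree[c] = True
    node_in_tree[p] = True


# Hypothetical edges: node 3 becomes the parent of samples 0 and 1.
insert_edge(parent, node_in_tree, 3, 0)
insert_edge(parent, node_in_tree, 3, 1)

present = np.flatnonzero(node_in_tree)  # ids of the nodes currently in the tree
```

A boolean mask makes membership checks read as plain truth values and can be used directly for indexing, which is the readability gain over a -1/+1 encoding.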
Note - the specific comments I made are not particularly relevant given the high-level ones. I just made them as I was going, and thought they might be worth keeping anyway. Feel free to ignore.
@jeromekelleher Thanks for taking a look. The EDIT: I read through your comment more carefully and saw the bit about the python tree positioning. I'll give that a try.
@jeromekelleher I did a pass over the code and simplified things a bit. I'm still using my What do you think about this? I'm imagining a C implementation that uses |
I like it, nice and clean. I think your sketch sounds about right. Note that you can now find out the edges that have changes after calling As a side-note, I do want to expose the |
Great, I'm glad this is converging on something we're happy with. I'm going to clean this up and integrate it with the rest of the prototype code, adding some realistic tests along the way. On the subject of the C sketch, I suppose there's a couple of things to consider:
Okay, things are shaping up here. A couple of notes:
I think this is ready for another round of review; I'm removing the draft status.
Nice, LGTM. I'm happy to merge if you'd like to squash
Great, thank you for taking a look. I've squashed, it's ready when you get the chance. Oh, hm. I'm seeing some cache issues with the tests, and I couldn't immediately see a way to rerun these.
Feel free to open a fresh PR if that's easier?
Currently, this algorithm creates a matrix of LD by performing a pairwise comparison of all trees in the tree sequence. This implementation lacks windows/positions, sample sets and polarisation. The code outputs results in units of branch length, which need to be multiplied by mu^2 or divided by the product of the total branch lengths of the two trees.

The algorithm works by keeping a running sum of the statistic between two trees, updating it each time we encounter a branch addition or removal. The tricky part is that we have to remove or add the LD contributed by samples that already existed, or that will remain, under a given node after the addition/removal of branches.

We include a validation against the original formulation of this problem, in the form of an implementation of the approach described in McVean 2002. The original formulation computes the covariance of tMRCAs of 2, 3, and 4 samples of individuals on the trees in question. This validation implementation has two limitations: 1) it is very slow and 2) it does not work for trees that are decapitated, because certain samples do not have MRCAs.
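The unit conversion mentioned in the commit message can be sketched as follows. This is a minimal pure-Python illustration, assuming the statistic is a square matrix indexed by tree and that the total branch length of each tree is available as a list; the function names and the toy values are hypothetical and not part of the prototype.

```python
def scale_by_mutation_rate(stat, mu):
    """Multiply every entry of a branch-unit matrix by mu^2."""
    return [[value * mu**2 for value in row] for row in stat]


def normalise_by_branch_length(stat, total_branch_length):
    """Divide entry (i, j) by the product of the total branch lengths of trees i and j."""
    n = len(total_branch_length)
    return [
        [
            stat[i][j] / (total_branch_length[i] * total_branch_length[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy values for two trees; not real output from the prototype.
stat = [[4.0, 2.0], [2.0, 9.0]]
total_branch_length = [2.0, 3.0]

scaled = scale_by_mutation_rate(stat, mu=0.5)
normalised = normalise_by_branch_length(stat, total_branch_length)
```

Keeping the raw matrix in branch-length units and applying one of these conversions afterwards leaves the choice of scaling to the caller.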
For future reference - the cache is shared between PRs, so this usually doesn't help.
Description
This is a python prototype for computing two-locus branch statistics. There are a number of things missing from this prototype, notably:
Right now, the code outputs the statistical results in units of branch length (to get the true stats, one will need to multiply by $\mu^2$ or normalise by the product of the total branch lengths of each compared tree).
Hopefully, I stopped before things got too complicated, though I did integrate this method with bit arrays (soon to be bitsets?). The initial commit for this PR (though undocumented) contains a working algorithm with python sets if that's easier to read.
I implemented this almost exactly as I plan to implement it in C, so we should discuss whether or not this iteration approach / state storage approach works or if there is a more modern way of doing things. I remember discussions about a more modern approach to tree diff iteration in C, but I don't recall where the example is. In any case, it will be easy to update to whatever iteration pattern is preferred.