-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
time windows in statistics #2948
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2948 +/- ##
==========================================
- Coverage 89.85% 89.82% -0.04%
==========================================
Files 29 29
Lines 32128 32217 +89
Branches 5763 5784 +21
==========================================
+ Hits 28868 28938 +70
- Misses 1859 1869 +10
- Partials 1401 1410 +9
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Note: it is not clear how to do this for The proposal will probably be to compute a site stat that sums over mutations, not alleles, but we'll start with |
Next step:
Also maybe:
|
a small nudge here that i mentioned to @petrelharp in passing-- it would be great to have an expectation from theory as to what time stratified quantities like the SFS should be under the (standard, neutral) coalescent |
Some thoughts after working on time windows. After these edits the moment the output of, let's say, the AFS is a still 2D array of windows, same for time_windows, when using either of them individually. However, when using windows and time_windows at the same time, the output is a 3D array, with the following shape: [num_windows][num_time_windows][sample_size]. When windows or time_windows are None, associated dimensions are dropped accordingly. Some ideas:
|
A note on the potential confusion between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! One question about a possible refactor, and suggesting moving the "general stat" stuff to a different PR.
python/tests/test_tree_stats.py
Outdated
for u in tree.nodes() | ||
) | ||
sigma[tree.index, j, :] = s * tree.span | ||
for j in range(1, len(time_windows) - 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for j in range(1, len(time_windows) - 1): | |
for j in range(1, tw): |
python/tests/test_tree_stats.py
Outdated
return windowed_tree_stat(ts, sigma, windows, span_normalise=span_normalise) | ||
out = windowed_tree_stat(ts, sigma, windows, span_normalise=span_normalise) | ||
if drop_time_windows: | ||
# beware: this assumes the first dimension is windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment can be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm but perhaps replaced by
assert len(out.shape) == 3
@@ -144,39 +144,93 @@ def windowed_tree_stat(ts, stat, windows, span_normalise=True): | |||
return A | |||
|
|||
|
|||
# Timewindows test | |||
def naive_branch_general_stat( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function looks good; however, I think this PR is going to be just about the AFS, not general stats - so, maybe this code should be put aside in a separate PR?
if polarised: | ||
s = sum(tree.branch_length(u) * f(x[u]) for u in tree.nodes()) | ||
sigma = np.zeros((ts.num_trees, tw, m)) | ||
for j, upper_time in enumerate(time_windows[1:]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that time_windows[0]
is 0
, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be fixed by setting sigma
not to 0 but to -1 times the value calculated from ts.decapitate(time_windows[0])
.
python/tests/test_tree_stats.py
Outdated
# Warning: when using Windows and TimeWindows, | ||
# the output has three dimensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can delete this
python/tests/test_tree_stats.py
Outdated
c = fold(c, out_dim) | ||
index = tuple([window_index] + list(c)) | ||
result[index] += x | ||
def update_result(window_index, u, right, time_windows): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time_windows
isn't being changed by this function
def update_result(window_index, u, right, time_windows): | |
def update_result(window_index, u, right): |
python/tests/test_tree_stats.py
Outdated
# interval between child and parent inside the window | ||
t_v = branch_length[u] + time[u] | ||
tw_branch_length = min(time_windows[k_tw + 1], t_v) - max( | ||
time_windows[0], time[u] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops?
time_windows[0], time[u] | |
time_windows[k_tw], time[u] |
python/tests/test_tree_stats.py
Outdated
for k_tw, _ in enumerate(time_windows[:-1]): | ||
if 0 < count[u, -1] < ts.num_samples: | ||
# interval between child and parent inside the window | ||
t_v = branch_length[u] + time[u] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're losing the advantage of cacheing branch_length[u]
; this might as well be time[v]
(with v
passed in also).
Alternatives:
- let
branch_length
be a (num nodes x num time windows) array instead of just a vector, so that we'd have
tw_branch_length = branch_length[u, k_tw]
- Do the calculation down a few lines, something like this (this is not right):
u = edge.child
v = edge.parent
t_c = time[u]
t_p = time[v]
time_window_index = 0
while t_p < time_windows[time_window_index + 1]:
while v != -1:
tw_branch_length = min(time_windows[k_tw + 1], t_p) - max(time_windows[k_tw], t_c)
update_result(window_index, time_window_index, v, t_left, tw_branch_length)
count[v] -= count[u]
t_c = t_p
v = parent[v]
t_p = time[v]
time_window_index += 1
The advantage to this is that computation isn't increased by a factor of (num time windows). The disadvantage might be that the code is harder to understand?
window_index += 1 | ||
tree_index += 1 | ||
|
||
assert window_index == windows.shape[0] - 1 | ||
if span_normalise: | ||
for j in range(num_windows): | ||
result[j] /= windows[j + 1] - windows[j] | ||
|
||
if drop_time_windows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see suggestions above
I've added this work to the next release milestone. Hoping to get a release out in a week or two, if that is too ambitious for this let me know. |
Probably too ambitious, but we might have something in by then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Some suggestions, mostly minor; let's discuss getting the tests in there.
python/tskit/trees.py
Outdated
@@ -7637,6 +7637,7 @@ def parse_windows(self, windows): | |||
# Note: need to make sure windows is a string or we try to compare the | |||
# target with a numpy array elementwise. | |||
if windows is None: | |||
# initiate default spanning windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# initiate default spanning windows |
if strip_win: | ||
stat = stat[0, :, :] | ||
elif strip_timewin: | ||
stat = stat[:, 0, :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like you can't have both, ie windows=None, time_windows=None
?
python/tskit/trees.py
Outdated
if (stat.shape == () and windows is None) or ( | ||
stat.shape == () and time_windows is None | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (stat.shape == () and windows is None) or ( | |
stat.shape == () and time_windows is None | |
): | |
if (stat.shape == () and windows is None and time_windows is None): |
I think the intention of this rule is so that if you do like
ts.diversity([0,1,2])
then you get a single number, not a length-1 array, but if anyone is supplying windows explicitly (or time windows!) then they should get an array with the number of dimensions they expect.
We should write the bit in the docs that includes time windows, so we've got this clear?
python/_tskitmodule.c
Outdated
@@ -9077,7 +9077,7 @@ parse_windows( | |||
npy_intp *shape; | |||
|
|||
windows_array = (PyArrayObject *) PyArray_FROMANY( | |||
windows, NPY_FLOAT64, 1, 1, NPY_ARRAY_IN_ARRAY); | |||
windows, NPY_FLOAT64, 1, 1, NPY_ARRAY_IN_ARRAY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this change (and others like it) done by linting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, or probably an indentation that I made by mistake. I put it back how it was before.
python/_tskitmodule.c
Outdated
"span_normalise", "polarised", NULL }; | ||
PyObject *sample_set_sizes = NULL; | ||
PyObject *sample_sets = NULL; | ||
PyObject *windows = NULL; | ||
char *mode = NULL; | ||
PyObject *time_windows = NULL; | ||
char *mode = "NULL"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
char *mode = "NULL"; | |
char *mode = NULL; |
c/tskit/trees.c
Outdated
} | ||
increment_nd_array_value(afs, num_sample_sets, result_dims, coordinate, x); | ||
if (parent[u] != -1){ | ||
t_v = time[parent[u]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps also t_u
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no t_u variable, do you think that's necessary? I'm using time[u] here.
c/tskit/trees.c
Outdated
if (!polarised){ | ||
fold(coordinate, result_dims, num_sample_sets); | ||
} | ||
tw_branch_length = MIN(time_windows[time_window_index + 1], t_v) - MAX(time_windows[0], time[u]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be
tw_branch_length = MIN(time_windows[time_window_index + 1], t_v) - MAX(time_windows[0], time[u]); | |
tw_branch_length = MIN(time_windows[time_window_index + 1], t_v) - MAX(time_windows[time_window_index], time[u]); |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm - the tests below should be catching this if it is indeed wrong, but it sure looks wrong to me - I'm not sure what's going on?
c/tskit/trees.c
Outdated
if (parent[u] != -1){ | ||
t_v = time[parent[u]]; | ||
if (0 < all_samples && all_samples < self->num_samples) { | ||
for (time_window_index = 0; time_window_index < num_time_windows; time_window_index++){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of edges are recent, so we might avoid substantial work if we do like
time_window_index = 0;
while (time_window_index < num_time_windows && time_windows[time_window_index] < t_v){
...
time_window_index++;
}
|
||
|
||
class TestTimeWindows(TestBranchAlleleFrequencySpectrum): | ||
def test_four_taxa_test_case(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't really be in this class, since it's testing general_stat
, not the AFS; perhaps leave a comment? Or move it along with the general_stat
code above to a new PR for future work?
) | ||
self.assertArrayAlmostEqual(x, true_x) | ||
|
||
def test_afs_branch(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very useful, but it's hard to tell exactly what's being tested. For instance, there's no call to ts.allele_frequency_spectrum
here, I think? Perhaps this could be rearranged? Simplified? Commented? There's also some references to self.mode
, which might be confusing since this is branch-only?
@@ -1258,23 +1270,24 @@ verify_afs(tsk_treeseq_t *ts) | |||
sample_set_sizes[0] = n - 2; | |||
sample_set_sizes[1] = 2; | |||
ret = tsk_treeseq_allele_frequency_spectrum( | |||
ts, 2, sample_set_sizes, samples, 0, NULL, 0, NULL, 0, result); | |||
ts, 2, sample_set_sizes, samples, 0, NULL, 0, NULL, 0, result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh-oh, are these tabs?
c/tskit/trees.c
Outdated
ret = tsk_treeseq_check_time_windows( | ||
num_time_windows, time_windows); | ||
if (stat_site | ||
&& tsk_memcmp(time_windows, default_time_windows, sizeof(double)) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this is a bit awkward - what if instead we used num_time_windows=0
to mean "default/no time windows"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But time_windows are always initialized by default as [0, inf], so num_time_windows=2, comparing to the default was the clearest I found for now. But maybe the problem lies in the initialization caused by the parsing of the windows in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wait - we're already in the else
clause where we know that time_windows != NULL
. So I don't think we need to check this at all - just throw the error?
&& tsk_memcmp(time_windows, default_time_windows, sizeof(double)) != 0) { | |
) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if someone explicitly specifies time_windows = [0, np.inf], mode="node"
then they'll get the error, but that's okay - the error message says "you can't specify time windows", not "time windows must be [0, Inf)" (I think that's what it says anyhow).
Here @tforest and I are starting in on adding time windows to statistics. We're starting with what was sketched out in #683, and will explain things in more detail here when we're farther along (ignore this for now).