Two Locus Statistics C API #2805
Conversation
tsk_treeseq_free(&ts);
}
c/tskit/core.c
Outdated
-    offset = (int64_t)(array[lower] < value);
-    return (tsk_size_t)(lower + offset);
+    offset = (int64_t) (array[lower] < value);
+    return (tsk_size_t) (lower + offset);
I'm not sure why clang-format is adding this space, but I'm fairly certain that my editor is picking up the clang-format config.
c/tskit/trees.c
Outdated
double norm[state_dim];
double *hap_weight_row;
tsk_size_t row_len = num_alleles[site_b] * state_dim;
// TODO: is this stack allocation dangerous??
Beyond using stack memory like this, it seems as though Windows doesn't like my use of VLAs. I'm not sure if this is an inherent issue with MSVC or an issue with the C standard specified in the build. I see ignoring unknown option '-std=c99'
in the build logs here: https://github.com/tskit-dev/tskit/actions/runs/5592422530/job/15156184424?pr=2805#step:12:35
It does seem as though MSVC is compliant up until C17, so I'd expect there to be support for VLAs, but maybe I'm not understanding the issue.
AFAIK, MSVC supports through at least C11, but without VLAs. Some Googling seems to agree, at least as of 2021: https://www.reddit.com/r/C_Programming/comments/kzskyi/why_cant_use_a_variable_to_declare_an_arrays_size/
Another thing to consider: VLAs are legal, but rarely used in C. They are not valid C++ code, and tskit may be compiled with a C++ compiler. Doing so is probably rare in practice, but we don't want to prevent it.
Ah, good to know. Thanks @molpopgen I think I'll have to rework this a bit to remove the reliance on VLAs. Since it's in a pretty hot loop, I was thinking about pre-allocating some memory and passing it in.
I haven't looked in detail at your changes, but we can certainly start simple and run perf
once tests are passing, etc., to see where improvements are likely.
My 10c is that allocing anything on the stack that isn't a fixed, small size just isn't worth the hassle. It's not portable, unpredictable when it fails and doesn't give much/any perf advantages.
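The fix being discussed could be sketched like this (hypothetical names, not the PR's actual code): replace the per-iteration VLA with a heap buffer allocated once outside the hot loop and passed in.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: instead of `double norm[state_dim];` (a VLA, which
 * MSVC rejects and which can silently overflow the stack for large
 * state_dim), allocate the scratch buffer once on the heap and reuse it
 * across loop iterations. */
static double *
alloc_norm_buffer(size_t state_dim)
{
    /* calloc zero-initializes, matching a freshly memset VLA. */
    return calloc(state_dim, sizeof(double));
}

static void
use_norm_buffer(double *norm, size_t state_dim)
{
    /* Stand-in for the real per-site computation. */
    for (size_t j = 0; j < state_dim; j++) {
        norm[j] = 1.0 / (double) state_dim;
    }
}
```

The caller allocates once, checks for NULL, loops, then frees in the usual goto-out cleanup path.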
c/tskit/trees.c
Outdated
sample_count_stat_params_t args = *(sample_count_stat_params_t *) params;
double n;
const double *state_row;
for (tsk_size_t j = 0; j < state_dim; j++) {
-    for (tsk_size_t j = 0; j < state_dim; j++) {
+    tsk_bug_assert(result_dim == state_dim);
+    for (tsk_size_t j = 0; j < state_dim; j++) {
Should all of the summary functions be making this assertion? I was a little confused about result vs state dims when looking over the codebase. My understanding is that these might not be the same if we're computing a branch vs site statistic.
Hm, well, the result and state dims aren't different between branch and site stats. And, maybe we don't want to put in this assert for performance reasons here, but should at least comment that we're making that assumption? I haven't gone to check other summary functions.
c/tskit/trees.c
Outdated
for (tsk_size_t t = 0; t < self->num_trees; t++) {
    for (tsk_size_t s = 0; s < self->tree_sites_length[t]; s++) {
This could just loop over the sites - the sites in the table are in the same order as encountered when iterating over (trees; sites within trees).
Oh, this is good to know, I will make this change as well
See https://tskit.dev/tskit/docs/latest/data-model.html#valid-tree-sequence-requirements (and https://tskit.dev/tskit/docs/latest/data-model.html#table-definitions ) for ordering requirements on the tables.
c/tskit/trees.c
Outdated
}

// Number of pairs w/ replacement (sites)
*result_size = (num_sites * (1 + num_sites)) >> (tsk_size_t) 1;
Is there a good reason to divide by two by bitshifting? It's more readable IMO to write / 2.
I didn't realize until recently that you could do truncated division with unsigned integers. It seems that this works with c99 on clang and gcc, I can change this to division so it's more readable. Here's a simple playground session showing my findings.
+1 on this - any compiler worth its salt will replace with a bitshift if there's a literal integer_value / 2 in there.
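To illustrate the point (a standalone sketch, not the PR's code): for unsigned integers, division by 2 truncates exactly like the shift, so the pairs-with-replacement count can be written either way and the compiler will emit the same instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pairs with replacement among n sites: n * (n + 1) / 2.
 * For unsigned types, `/ 2` truncates identically to `>> 1`. */
static uint64_t
n_pairs_with_replacement(uint64_t n)
{
    return (n * (n + 1)) / 2;
}
```

For the SGDP example discussed below (1,198,947 sites), this gives 718,737,553,878 pairs, i.e. about 5.7TB at 8 bytes per statistic.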
c/tskit/trees.c
Outdated
if (ret != 0) {
    goto out;
}
result_offset += state_dim;
This should be result_dim, no? (as written, state_dim always equals result_dim, but doing so doesn't make things simpler, I don't think?)
Yes, you're right. I will change this.
I'm working through this. You've done an impressive amount! There are some very interesting things, e.g. the encoding of descendant samples as bitstrings.
again, nice work!
@petrelharp thank you for your review, I appreciate all of the comments. I'll address one at a time below:
You're right, weights are always 1 or 0. I agree, this is more of a sample set stat, I didn't really think about the distinction until you brought this up, I can rename to
Page 6 of my design document defines these statistics. Is there another way I should be defining these to be more clear?
I see, would this be a scenario where a sample was deleted from a single tree? It's good to hear about these types of edge cases, I didn't think it was possible to do something like that.
Yes, I'm gathering all samples with a specific allele state before computing the statistics. I've organized things this way so that we only have to do one pass over the tree sequence before computing statistics. This also means one tree traversal per tree in the tree sequence, so the computational complexity should be roughly the same as single-site statistics (O(n + p (log n)^2)). I ran this code on the entire chromosome 1 for the SGDP dataset and the alleles X samples matrix took ~175MB of memory. That's a dataset with 554 samples and 1,198,947 mutations. In total, ~750MB of memory was used before proceeding to the computation of the stats matrix. In contrast, the amount of data needed to store the upper triangle of the statistics for that dataset is about 5.7TB (1198947 * (1198947 + 1) / 2 * 8 / 1e12) -- This number has me second guessing my math, but I am referring to 718 billion pairwise statistics, each represented by a double. That said, I think that once windows are implemented, I'd only compute the matrix for the data specified in the window, which would drop the size even more. Sorry if that was long-winded. In other words, if we want to know how much memory the allele matrix occupies, we can do the following calculation (RoundUp is a mathematica function):
I'm still not sure what we consider very large, but in my opinion, the size of this intermediate data pales in comparison to the size of the stats output.
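The memory estimate described above can be made concrete with a small sketch (a hypothetical helper, not tskit API; assumes one ancestral-allele row per site in addition to one row per mutation, and 64-bit bit-array chunks):

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes for the alleles-by-samples bit matrix:
 * one row per allele (num_mutations derived + num_sites ancestral),
 * each row holding ceil(num_samples / 64) 8-byte words. */
static uint64_t
allele_matrix_bytes(uint64_t num_sites, uint64_t num_mutations, uint64_t num_samples)
{
    uint64_t words_per_row = (num_samples + 63) / 64;
    return (num_sites + num_mutations) * words_per_row * 8;
}
```

Plugging in the SGDP chromosome 1 figures (1,198,947 sites and mutations, 554 samples) gives roughly 173MB, consistent with the ~175MB observed.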
We tried a variety of different norming methods for obtaining reasonable results from the various two-locus statistics. One indicator of correctness for using "total" for non-ratio statistics is that the sum of non-polarized D adds up to 0. If we norm by the haplotype frequency, and compute non-polarized D, we end up with a non-zero value. In this manner, we went through and validated our norming methods with the expected extremes of each two-locus statistic. I don't think we found any papers suggesting to normalize things this way, but it's the only method that produces results that appear correct for our handful of test cases. I can provide a more detailed notebook that outlines these results a bit better if that would help.
I will batch all of the suggested changes in the next few days. Thanks again for all of the feedback.
Excellent, thanks! Let's see:
Well, do we have a use case that uses non-0/1 weights? If not, then we should not do the general case.
Right, I saw that, and that is helpful, but it still leaves questions - it's not precise enough for someone to compute exactly the same thing from scratch (especially, someone who doesn't have an idea beforehand of what is supposed to be computed). For instance, how does one deal with multiallelic sites, especially for the statistics with ratios (e.g. r)? For an example, see the docstring for Y3: it says "The average density of sites at which a differs from b and c, per unit of chromosome length.". This is (or tries to be :) clear enough that it's easy and obvious how to compute the statistic in a naive way. (The definitions for your stats won't be so concise, but oh well.)
This is how we represent missing data: https://tskit.dev/tskit/docs/latest/data-model.html#sec-data-model-missing-data i.e., a sample that wasn't genotyped for a stretch of chromosome (we haven't dealt with this in the usual stats API yet, but would like to).
Oh, not long-winded at all, detail is good. =) So, I guess the downside here is that computing the LD matrix for a fixed number of sites scales (in memory) with the number of samples. Hm, so, the internal state is comparable to the (full) output when the number of samples is comparable to the number of sites (in a window). Good to know. To be clear, I think the approach here is very clever, and I don't think you should change it; we just want to be clear about the implications. It wouldn't scale to Biobank-scale data (~1M individuals), but that's okay; it's probably faster for smaller datasets than other approaches, and moving forward we could switch off to another algorithm for big computations. This does make me think about something else we might want in the API, though: in practice, won't people want to specify a specific set of SNPs to compute this with? For instance, just removing the singletons would reduce the size of the output substantially. Perhaps we want to pass in a "site mask" that says which sites to use?
That would help - I am still fuzzy on exactly how the normalization works (at least, I need the definition?).
This is great, well done @lkirk!
I've made a light review pass and noted a few implementation details around the use of bitsets and memory allocations that I think will need to be changed. I've not got deep into all the details, I'll try and get my head around it a bit better in the next pass.
c/tskit/trees.c
Outdated
@@ -489,7 +489,7 @@ tsk_treeseq_init(

if (tsk_treeseq_get_time_units_length(self) == strlen(TSK_TIME_UNITS_UNCALIBRATED)
    && !strncmp(tsk_treeseq_get_time_units(self), TSK_TIME_UNITS_UNCALIBRATED,
        strlen(TSK_TIME_UNITS_UNCALIBRATED))) {
Looks like you've got the wrong version of clang-format here. See here
(There's a long paper trail of us complaining that clang-format has such terrible compatibility 🤮 )
I saw a thread here that discusses upgrading. Are we considering an upgrade? If not, I can install version 6 and reformat.
Simplest to install 6 I think, we're not going to change this in the short term. Anyway, whatever version of clang-format we'd settle on would probably be slightly different to the version you're using and you'd need to change anyway.
c/tskit/trees.c
Outdated
const tsk_bit_array_t *A_samples, *B_samples;
tsk_size_t w_A = 0, w_B = 0, w_AB = 0;
tsk_bit_array_t *ss_row; // ss_ prefix refers to a sample set
tsk_bit_array_t ss_A_samples[num_sample_chunks], ss_B_samples[num_sample_chunks],
I don't think we can allocate on the stack like this, we'll eventually overflow for large sample sizes, and it's very hard to tell in advance when it's going to happen. It's just not worth the trouble.
(You might notice that we do a lot of allocing from the stack in the test suite - but we can get away with a lot of stuff there if it makes the code easier to write and read because the inputs are fixed)
In my refactoring of tsk_bit_array_t, I got rid of all of the VLAs here. We should be clear of the stack-allocated VLAs now.
I've refactored tsk_bit_array_t, taken care of the VLAs, and have addressed most of the comments, which I will mark as done to clean things up.
A few minor comments on memory safety. Haven't had a chance to go through fully.
c/tskit/trees.c
Outdated
for (mutation_index = 0; mutation_index < site->mutations_length; mutation_index++) {
    mutation = site->mutations[mutation_index];
    /* Compute the allele index for this derived state value. */
    allele = 0;
For-loop a bit more concise?
for (allele = 0; allele < num_alleles; allele++) {
    if (mutation.derived_state_length == allele_lengths[allele]...
        break;
    }
}
Indeed it would be. I copied this from get_allele_weights; FWIW, could fix this in that function too.
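A self-contained version of the suggested loop might look like this (hypothetical names; in the real code the comparison is against the site's allele table and the elided condition is the length-plus-memcmp match):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of derived_state in the allele table, or num_alleles
 * if it is not present (i.e. this derived state is a new allele). */
static size_t
find_allele(const char **alleles, const size_t *allele_lengths,
    size_t num_alleles, const char *derived_state, size_t derived_state_length)
{
    size_t allele;
    for (allele = 0; allele < num_alleles; allele++) {
        if (derived_state_length == allele_lengths[allele]
            && memcmp(derived_state, alleles[allele], derived_state_length) == 0) {
            break;
        }
    }
    return allele;
}
```

The loop index doubling as the "not found" sentinel is what makes the for-loop form more concise than the hand-rolled while/increment version.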
See point about using tsk_safe_free - safety first when it comes to handling memory!
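The idiom being referred to is roughly the following (a sketch with a stand-in for tsk_safe_free, which frees a pointer only if it is non-NULL, making it safe to call in error paths on pointers that were never successfully allocated):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for a NULL-tolerant free. */
static void
safe_free(void *p)
{
    if (p != NULL) {
        free(p);
    }
}

/* Typical error-goto pattern: all pointers are initialised to NULL up
 * front, so the single cleanup path can free them unconditionally. */
static int
allocate_two_buffers(size_t n, int fail_second)
{
    int ret = 0;
    double *a = NULL;
    double *b = NULL;

    a = calloc(n, sizeof(*a));
    if (a == NULL) {
        ret = -1;
        goto out;
    }
    if (fail_second) { /* simulate a mid-function failure */
        ret = -1;
        goto out;
    }
    b = calloc(n, sizeof(*b));
    if (b == NULL) {
        ret = -1;
        goto out;
    }
out:
    safe_free(a);
    safe_free(b); /* b may still be NULL here; safe_free handles that */
    return ret;
}
```

This is why declarations should come in one chunk before any goto: every pointer reachable from the cleanup label must have a defined (NULL) value on every path.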
Sorry I'm a bit late in responding to your comments @petrelharp.
I do remember discussing something with @apragsdale, but I don't think we'd discussed anything concrete. Let's discuss when we meet, but I'm leaning towards simplifying things to counts-only for now, then expanding once we have a clear need.
I've created a small demo notebook that might help illustrate how exactly we're doing the normalization. It can be found here. In addition, please see my explanation at the bottom. Hopefully this helps with understanding.
The solution you've proposed seems straightforward, but I'm not sure how it would tie into normalization; I'd have to do some testing to see what makes sense.
I'm not exactly sure what you mean by the "full" output here. If we have a set with 1M samples/1M sites/1M mutations (meaning we're dealing w/ biallelic data)... We expect to use 250GB to store the states, but we'd still need 4TB to store the actual statistics. In any case, I think a potential solution to scaling this is in the windowing/masking. If you could compute these stats in chunks, parallelized on a cluster, that'd probably be somewhat ideal. Especially given that windowed computation wouldn't see much overhead because you could limit many of the operations to the bounds of the window/mask. An iterative algo could also work but would be much slower (I think).
I think that's what I was getting at with the windows part of the design (hinted in some of the code, but not actually implemented). However, the one issue with providing a window is that we'd need to return the sites back to the user once we've intersected their coordinates with the window coordinates. Otherwise the user wouldn't know which sites are making up the results matrix. (And I think a single window makes sense here, ideally we can chop out an NxM matrix from the entire results matrix, reducing the memory/computational requirements to a manageable size). So, what you're saying about a site mask sounds more like what we want to do. By using a sites mask, the user knows what sites are going into the matrix. The problem comes down to the api, how much work do we reasonably want to put onto the user when it comes to intersecting/storing/manipulating sites. And, how do we create a sites mask that is reasonably efficient in terms of memory? (I suppose we could use bit masks to store sites in a mask as well....).
The notebook is linked above, but I'll also explain with actual words:
"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles. For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply each by 1/9.
"Hap Weighted" normalization: we multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison. Using the same A/B loci as above, let's say that we are comparing 10 samples. If A1-B1 has no counts, we will multiply the stat result by 0. If A1-B2 has 5 counts, then we'll multiply the stat result by 1/2 (5/10).
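As a toy numeric sketch of the two schemes just described (hypothetical helpers, not the PR's API):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* "Total" normalization: every allele pair in a two-locus comparison gets
 * equal weight 1 / (num_a_alleles * num_b_alleles), e.g. 1/9 for a pair of
 * triallelic loci. */
static double
total_weight(size_t num_a_alleles, size_t num_b_alleles)
{
    return 1.0 / (double) (num_a_alleles * num_b_alleles);
}

/* "Hap weighted" normalization: each allele pair is weighted by the
 * frequency of that AB haplotype among the samples, so unobserved
 * haplotypes contribute nothing. */
static double
hap_weight(size_t hap_count, size_t num_samples)
{
    return (double) hap_count / (double) num_samples;
}
```

The per-pair statistic is multiplied by the chosen weight and the products summed over all allele pairs at the two loci.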
We met over zoom and decided that the best course of action is to wrap this up (it's getting a bit messy). In order to wrap things up, we're prioritizing the steps necessary for merge. I'll list out what I'll be doing when I get back from vacation:
Ah, that makes sense. And, the notebook is helpful (and, a great start for test cases!). And, okay. My feeling is that "hap weighted" or maybe "allele freq weighted" makes much more sense than "total", since "total" can be strongly affected by rare alleles (e.g. sequencing error). And in this case, the weighting is just a function of the frequencies, so can be computed within.
Want me to open an issue?
Interesting, I hadn't considered the point about rare variants. I think the main draw to using "total" is the fact that it seems to be correct for D and other statistics (demo'd a bit at the end of that notebook). In any case, please feel free to open an issue, I agree things could be simplified, I think @apragsdale might have an opinion here as well. I'll get to writing other issues when I return.
(force-pushed b56d9b8 to b3a005b)
Some minor comments, which I didn't get a chance to finish. More later!
c/tskit/trees.c
Outdated
const tsk_treeseq_t *restrict ts = self->tree_sequence;

const tsk_size_t num_samples = tsk_treeseq_get_num_samples(ts);
const tsk_size_t num_tree_sites = ts->tree_sites_length[self->index];
These are stored in the tree (prob not much point in storing in a variable):
-    const tsk_size_t num_tree_sites = ts->tree_sites_length[self->index];
+    const tsk_size_t num_tree_sites = self->sites_length;
+    const tsk_site_t *restrict tree_sites = self->sites;
Good point, I eliminated that variable entirely.
c/tskit/trees.c
Outdated
}

static int
get_mutation_samples(const tsk_tree_t *self, const tsk_size_t *site_offsets,
A minor style point: if the first arg is self, then we'd usually give the function a "method name" like tsk_tree_get_mutation_samples, just so it's a bit easier to follow what "self" is. For stuff like this you might just want to rename self to tree.
As this will be moving to stats.h, I guess it's better to think of it as a utility function that has tree as an arg, rather than a method of the Tree class. But it's just splitting hairs.
I think it'd be tough to use this function in isolation; I would consider it to be an internal function until decided otherwise. In that vein, I've changed this parameter to tree instead of self.
c/tskit/trees.c
Outdated
tsk_size_t num_mutations, mut_samples_offset, mut_offset;
tsk_size_t s, m, n;

tsk_memset(&mut_samples, 0, sizeof(mut_samples));
This is code in the middle of declarations.
Ah, I think I overcorrected a bit; I moved them before any goto call, but after all declarations.
c/tskit/trees.c
Outdated
// Get the number of mutations
num_mutations = 0;
for (s = 0; s < num_tree_sites; s++) {
Don't need to get a pointer, could do something like
for (s = 0; s < tree->num_sites; s++) {
    num_mutations += tree->sites[s].mutations_length;
}
Fixed
(force-pushed 91c485c to aafb78f)
Codecov Report
@@ Coverage Diff @@
## main #2805 +/- ##
==========================================
- Coverage 89.79% 89.76% -0.03%
==========================================
Files 30 30
Lines 29474 29892 +418
Branches 5737 5803 +66
==========================================
+ Hits 26465 26832 +367
- Misses 1726 1753 +27
- Partials 1283 1307 +24
LGTM. I think the implementation is still overly complicated, but using the in-built mutation tracking in the tree code should simplify quite a bit.
I think we'll want to evolve the implementation a bit, but I've created some issues to track. We're good to merge after this round of comments, IMO.
c/tskit/trees.c
Outdated
get_mutation_samples(const tsk_tree_t *tree, const tsk_size_t *site_offsets,
    tsk_bit_array_t *allele_samples, tsk_size_t **num_alleles)
{
    int ret = 0;
Nitpick, but not sure this whitespace within the declarations section helps understanding? Usually all declarations are done in one chunk, which makes it easier to see where the code begins (and you start thinking about memory safety and the potential state of variables after error-gotos)
You're right, I have a habit of breaking things up with whitespace, but sometimes I can go overboard. I've cleaned things up.
c/tskit/trees.c
Outdated
// We currently perform one preorder traversal per mutation, but there
// might be more clever ways to do this. The preorder traversals begin at
// the focal mutation's node.
for (m = 0; m < num_mutations; m++) {
I don't get why you need to use the arrays here: why not use the site and mutation objects within the tree struct?
for (s = 0; s < tree->sites_length; s++) {
    site = tree->sites[s];
    for (m = 0; m < site.mutations_length; m++) {
        node = site.mutations[m].node;
        // do traversal rooted at node.
    }
}
See my comment about offsets, I've cleaned things up, using the suggested loop.
c/tskit/trees.c
Outdated
int ret = 0;
tsk_tree_t tree;

ret = tsk_tree_init(&tree, self, TSK_TS_INIT_BUILD_INDEXES | TSK_NO_SAMPLE_COUNTS);
TSK_TS_INIT_BUILD_INDEXES isn't a valid option for tree_init.
To me the get_mutation_samples function is more complicated than it needs to be, and it would be clearer to merge it into this function (as it does very little at the moment).
I did a refactor of all of this as you've suggested. It seems much simpler. I think with some of the suggestions about the bit array refactor, this code will be simplified even more.
c/tskit/trees.c
Outdated
const tsk_size_t num_samples = self->num_samples;
const tsk_size_t max_alleles = self->tables->mutations.num_rows + num_sites;
tsk_size_t *restrict site_offsets = tsk_malloc(num_sites * sizeof(*site_offsets));
tsk_size_t *restrict num_alleles = tsk_malloc(num_sites * sizeof(*num_alleles));
There's no real point in making this a restrict pointer in this function, we're not doing any tight loops on it, and just passing it on to other functions. Declaring as a regular pointer means we can call tsk_safe_free and reduce the code a bit.
Fixed.
c/tskit/trees.c
Outdated
// Initialize the allele_samples with all samples in the ancestral allele
tsk_bit_array_get_row(&allele_samples, num_alleles_cumsum, &allele_samples_row);
tsk_bit_array_add(&allele_samples_row, &all_samples_bits);
// Store the allele offset for each site
Do we need to do this stuff with site_offsets, if we're iterating over the mutations using the inbuilt structs from the tree (as suggested above)?
These offsets were put in place to reduce the number of times I'd have to compute the offsets and to keep all of the code in sync. Now that things have been simplified and I'm not using offsets in 3 places, I think it's best to get rid of the malloc'd array of offsets. We still need offsets to index into the array of allele_samples, but now it's loop counters instead of a whole malloc'd array.
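The running-counter idea described above can be sketched in isolation. This is a hedged illustration with hypothetical names (total_alleles, site_mutation_counts), not the PR's actual code: each site contributes one ancestral allele plus one allele per mutation, and the running total doubles as the row offset of that site into the flattened allele_samples matrix.

```c
#include <assert.h>
#include <stddef.h>

/* Walk the sites once, recording each site's starting row in the flattened
   allele matrix as a running counter rather than a precomputed offset array.
   Passing offsets_out == NULL skips recording and just returns the total. */
static size_t
total_alleles(
    const size_t *site_mutation_counts, size_t num_sites, size_t *offsets_out)
{
    size_t s, offset = 0;

    for (s = 0; s < num_sites; s++) {
        if (offsets_out != NULL) {
            offsets_out[s] = offset;
        }
        /* One ancestral allele plus one derived allele per mutation. */
        offset += 1 + site_mutation_counts[s];
    }
    return offset;
}
```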
LGTM - let's squash n merge 🎉
(Can you squash down to one commit please @lkirk?)
Force-pushed from 585955f to 5fdf850 (Compare)
I think we need one more update to get test coverage @lkirk - if we merge now these cases will be forgotten and end up never being covered.
Most will be covered by provoking error conditions. The test_simplest_divergence_matrix function can be a good pattern to follow there for provoking errors.
Let's just remove the code for windows for now. I think the C code, certainly, should have a fixed set of sites provided as input which will simplify things. But let's leave that till next update.
&& tsk_memcmp(
       mutation.derived_state, alleles[allele], allele_lengths[allele])
       == 0) {
    break;
We should be covering this, really. Do we need an example with some extra mutations or something?
I added a test with a couple of back-mutations, that covers it
all_samples->data[all_samples->size - 1]
    = remainder_samples ? ~(all << remainder_samples) : all;
for (i = 0; i < all_samples->size - 1; i++) {
    all_samples->data[i] = all;
We should definitely be covering this - need a sample size > 2?
I added a test with 35 samples (>32), which covers it.
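The trailing-chunk masking being tested here can be sketched on its own. This is a hedged illustration with hypothetical names (bit_chunk_t, last_chunk_all_set), not tskit's actual tsk_bit_array_t code; it assumes 32-bit chunks, so 35 samples need two chunks, with only the low 3 bits set in the last one.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_CHUNK 32

typedef uint32_t bit_chunk_t;

/* Number of 32-bit chunks needed to hold num_samples bits. */
static uint32_t
num_chunks(uint32_t num_samples)
{
    return (num_samples + BITS_PER_CHUNK - 1) / BITS_PER_CHUNK;
}

/* Value of the final chunk when every sample is present: if num_samples is
   not a multiple of 32, only the low (num_samples % 32) bits are set, so a
   popcount over the whole array equals the true sample count. */
static bit_chunk_t
last_chunk_all_set(uint32_t num_samples)
{
    uint32_t remainder = num_samples % BITS_PER_CHUNK;
    bit_chunk_t all = ~(bit_chunk_t) 0;
    return remainder ? ~(all << remainder) : all;
}

/* Simple per-chunk popcount (Kernighan's loop). */
static uint32_t
popcount32(bit_chunk_t v)
{
    uint32_t count = 0;
    while (v) {
        v &= v - 1;
        count++;
    }
    return count;
}

/* Total set bits in an "all samples present" array of num_samples bits. */
static uint32_t
count_all_samples(uint32_t num_samples)
{
    uint32_t i, total = 0, n = num_chunks(num_samples);
    for (i = 0; i < n - 1; i++) {
        total += popcount32(~(bit_chunk_t) 0); /* full chunks */
    }
    total += popcount32(last_chunk_all_set(num_samples));
    return total;
}
```

Without the mask, the final chunk would report 32 phantom samples instead of 3 for a 35-sample tree sequence.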
if (tsk_bit_array_contains(
        &bits_row, (tsk_bit_array_value_t) sample_index)) {
    ret = TSK_ERR_DUPLICATE_SAMPLE;
    goto out;
We should be covering this
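Provoking this error path just means passing the same sample index twice in a sample set. The bit-array mechanics behind the check can be sketched like this (hypothetical names, not the real tsk_bit_array_t API): membership is a test-bit, insertion is a set-bit, and a duplicate is an insertion whose test-bit is already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_BITS 32

/* Test whether a sample index is already in the set. */
static bool
bit_array_contains(const uint32_t *bits, uint32_t index)
{
    return (bits[index / CHUNK_BITS] >> (index % CHUNK_BITS)) & 1u;
}

/* Add a sample index; returns false if it was already present,
   mirroring the duplicate-sample check in the quoted hunk. */
static bool
bit_array_add_unique(uint32_t *bits, uint32_t index)
{
    if (bit_array_contains(bits, index)) {
        return false; /* would map to TSK_ERR_DUPLICATE_SAMPLE */
    }
    bits[index / CHUNK_BITS] |= 1u << (index % CHUNK_BITS);
    return true;
}
```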
c/tskit/trees.c
Outdated
ret = tsk_treeseq_check_sample_sets(
    self, num_sample_sets, sample_set_sizes, sample_sets);
if (ret != 0) {
And this - pass in some bad samples
    stat_site = true;
}
/* It's an error to specify more than one mode */
if (stat_site + stat_branch > 1) {
All these input errors should be covered
if (result_dim < 1) {
    ret = TSK_ERR_BAD_RESULT_DIMS;
    goto out;
}
Since we're not sure about how the windows thing is going to work, I vote we just assert(left_windows == NULL && right_windows == NULL); for now, and delete the uncovered code.
c/tskit/trees.c
Outdated
    result_size, result);
} else {
    // TODO: branch statistics
    ret = TSK_ERR_GENERIC;
should be BAD_STAT_MODE, and should be covered
LGTM, happy to merge when you're satisfied with the code coverage.
Force-pushed from b90dfa1 to 225d38f (Compare)
I think I've gotten all of the interesting test cases. It's annoying that I'm only able to partially hit this case and the one above it, but I think I'm satisfied with the outcome. Unfortunately, I'm slightly (2%) under the threshold for passing the coverage check. I think that's mostly due to the number of out-of-memory errors that this code can throw. Let me know if you disagree.
Re-squashed and ready
@Mergifyio rebase
We also fully implement two-site statistics, mirroring the functionality of the LdCalculator. There are a number of tasks that need to be done before this is production ready, but it is a start. We've opened a number of issues in github for improving the work that was done for this PR and for implementing the python API.

The general approach for the two-site statistics is to gather all of the samples below each mutation over the whole tree sequence (we will be implementing windows in the near future). To do this, we iterate over each tree in the tree sequence once and perform a preorder traversal starting at each mutation's node, storing the samples we encounter along the way. We then perform another step that examines the parentage of each mutation and removes samples that exist in child mutations. With the samples for each allele, we perform intersections to get haplotype weights, which are then passed to summary functions that fill out a stats matrix.

We also implement a few happy-path tests that cover the functionality of sample sets and multi-allelic sites, and we cover all of the non-memory-related error cases in the C API. We plan to do more in-depth testing once the python API is implemented.

This PR also introduces the concept of a "bit array" in the form of tsk_bit_array_t, which needs some further refinement. In essence, this data structure stores all of the possible samples in a bit set where one sample takes up one bit of information and storage space. This enables a few functions that are critical to this code: 1) it's a relatively memory-efficient method of storing sample identities; 2) we can efficiently intersect the arrays (I believe much of this is autovectorized by gcc); 3) we can use tricks like SWAR (SIMD within a register) to count the number of samples in a set. It's important to note that this is also where the algorithm spends ~50% of its time. It can use more improvement in both the API design and the performance characteristics.
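The intersect-and-count operations described above can be sketched roughly as follows. These are hypothetical stand-ins, not the real tsk_bit_array_t API, and the count uses Kernighan's loop per word for brevity where a real SWAR kernel (or a compiler popcount builtin) would be faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word-by-word intersection of two sample bit sets: the result holds the
   samples carrying both alleles (a haplotype count's raw material). Tight
   loops like this are good candidates for compiler autovectorization. */
static void
bit_array_intersect(
    const uint64_t *a, const uint64_t *b, uint64_t *out, size_t n_words)
{
    size_t i;
    for (i = 0; i < n_words; i++) {
        out[i] = a[i] & b[i];
    }
}

/* Count the samples in a bit set. Clears the lowest set bit each pass, so
   the inner loop runs once per set bit rather than once per bit. */
static uint64_t
bit_array_count(const uint64_t *a, size_t n_words)
{
    size_t i;
    uint64_t v, count = 0;
    for (i = 0; i < n_words; i++) {
        for (v = a[i]; v != 0; v &= v - 1) {
            count++;
        }
    }
    return count;
}
```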
✅ Branch has been successfully rebased
Force-pushed from 225d38f to 7ec1ff3 (Compare)
I'm not sure why mergify doesn't like this, just going to merge manually.
Description
As discussed with @jeromekelleher, I've opened up this PR to add the code necessary to create a C API for two locus statistics. This PR implements a general framework for multi-allelic site statistics and summary functions for r, r2, D, D2, D', Dz, pi2 to get things started. I've described this work in a design document contained in my prototype repository. This PR diverges in the main entrypoint (I discussed with @jeromekelleher, he thinks that this should have a separate entrypoint from tsk_treeseq_general_stat), but everything else in the document is accurate.

I've implemented tests that cover the happy paths. They ensure the correctness of multiple sample sets and the summary functions. I haven't done a deep dive into the test coverage, but I know that there is at least one important branch that is not covered; I'll have to look into that a bit more closely to understand why it's not being hit.
One thing left to implement is windows. Our current thinking for site statistics is that we'd like to have one window on the left and one window on the right so that we can cut any slice out of the correlation matrix. This leaves a few questions in terms of API design that we can get to in another PR. I think this one is large enough as is.
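To make concrete what the summary functions compute, here is a hedged sketch (hypothetical names and signatures; the real implementations live in c/tskit/trees.c) of D and r2 for a pair of biallelic loci, derived from haplotype and allele counts via the textbook definitions D = p_AB - p_A * p_B and r2 = D^2 / (p_A q_A p_B q_B).

```c
#include <assert.h>
#include <math.h>

/* Coefficient of linkage disequilibrium between alleles A and B, from the
   AB haplotype count, the two marginal allele counts, and the sample size. */
static double
compute_D(double n_AB, double n_A, double n_B, double n)
{
    double p_AB = n_AB / n;
    double p_A = n_A / n;
    double p_B = n_B / n;
    return p_AB - p_A * p_B;
}

/* Squared correlation r2 = D^2 / (p_A (1-p_A) p_B (1-p_B)); returns 0 when
   either locus is monomorphic and the denominator vanishes. */
static double
compute_r2(double n_AB, double n_A, double n_B, double n)
{
    double p_A = n_A / n;
    double p_B = n_B / n;
    double D = compute_D(n_AB, n_A, n_B, n);
    double denom = p_A * (1 - p_A) * p_B * (1 - p_B);
    return denom > 0 ? (D * D) / denom : 0;
}
```

In the PR's framework these quantities would be filled into one cell of the two-locus stats matrix per pair of alleles; perfect LD (every A on a B background, both at frequency 1/2) gives r2 = 1, and independent loci give D = 0.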
PR Checklist: