
Two Locus Statistics C API #2805

Merged · 1 commit · Sep 4, 2023

Conversation

@lkirk (Contributor) commented Jul 18, 2023

Description

As discussed with @jeromekelleher, I've opened this PR to add the code necessary to create a C API for two-locus statistics. This PR implements a general framework for multi-allelic site statistics, plus summary functions for r, r2, D, D2, D', Dz, and pi2 to get things started. I've described this work in a design document contained in my prototype repository. This PR diverges from that document in its main entrypoint (after discussing with @jeromekelleher, we agreed this should have a separate entrypoint from tsk_treeseq_general_stat), but everything else in the document is accurate.

I've implemented tests that cover the happy paths. They ensure the correctness of multiple sample sets and the summary functions. I haven't done a deep dive into the test coverage, but I know there is at least one important branch that is not covered; I'll have to look into that a bit more closely to understand why it's not being hit.

One thing left to implement is windows. Our current thinking for site statistics is that we'd like to have one window on the left and one window on the right so that we can cut any slice out of the correlation matrix. This leaves a few questions in terms of API design that we can get to in another PR. I think this one is large enough as is.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content, if appropriate. <- I'll probably cover this when I implement the Python API.
  • Changelogs, if there are API changes. <- I'll need a bit of help with this

@lkirk changed the title from "Two Locus tats c api" to "Two Locus Statistics C API" on Jul 18, 2023

tsk_treeseq_free(&ts);
}

Contributor Author:

Here's the tree for the correlated test
[image]


tsk_treeseq_free(&ts);
}

Contributor Author:

Here's the tree for the uncorrelated test
[image]

c/tskit/core.c Outdated
-offset = (int64_t)(array[lower] < value);
-return (tsk_size_t)(lower + offset);
+offset = (int64_t) (array[lower] < value);
+return (tsk_size_t) (lower + offset);
Contributor Author:

I'm not sure why clang format is adding this space, but I'm fairly certain that my editor is picking up the clang-format config.

@petrelharp (Contributor):

For easy reference on github: see #432 and #1900.

c/tskit/trees.c Outdated
double norm[state_dim];
double *hap_weight_row;
tsk_size_t row_len = num_alleles[site_b] * state_dim;
// TODO: is this stack allocation dangerous??
Contributor Author:

Beyond using stack memory like this, it seems as though Windows doesn't like my use of VLAs. I'm not sure if this is an inherent issue with MSVC or an issue with the C standard definition in the build. I see `ignoring unknown option '-std=c99'` in the build logs here: https://github.com/tskit-dev/tskit/actions/runs/5592422530/job/15156184424?pr=2805#step:12:35
It does seem as though MSVC is compliant up until C17, so I'd expect there to be support for VLAs, but maybe I'm not understanding the issue.

Member:

AFAIK, MSVC supports through at least C11, but without VLAs. Some Googling seems to agree, at least as of 2021: https://www.reddit.com/r/C_Programming/comments/kzskyi/why_cant_use_a_variable_to_declare_an_arrays_size/

Member:

Another thing to consider: VLAs are legal, but rarely used in C. They are not valid C++ code, and tskit may be compiled with a C++ compiler. Doing so is probably rare in practice, but we don't want to prevent it.

Contributor Author:

Ah, good to know, thanks @molpopgen. I think I'll have to rework this a bit to remove the reliance on VLAs. Since it's in a pretty hot loop, I was thinking about pre-allocating some memory and passing it in.

Member:

I haven't looked in detail at your changes, but we can certainly start simple and run perf once tests are passing, etc., to see where improvements are likely.

Member:

My 10c is that allocing anything on the stack that isn't a fixed, small size just isn't worth the hassle. It's not portable, unpredictable when it fails, and doesn't give much (if any) perf advantage.

c/tskit/trees.c Outdated
sample_count_stat_params_t args = *(sample_count_stat_params_t *) params;
double n;
const double *state_row;
for (tsk_size_t j = 0; j < state_dim; j++) {
Contributor:

Suggested change:
-for (tsk_size_t j = 0; j < state_dim; j++) {
+tsk_bug_assert(result_dim == state_dim);
+for (tsk_size_t j = 0; j < state_dim; j++) {

Contributor Author:

Should all of the summary functions be making this assertion? I was a little confused about result vs state dims when looking over the codebase. My understanding is that these might not be the same if we're computing a branch vs site statistic.

Contributor:

Hm, well, the result and state dims aren't different between branch and site stats. And maybe we don't want to put in this assert for performance reasons, but we should at least comment that we're making that assumption? I haven't gone to check other summary functions.

c/tskit/trees.c Outdated
Comment on lines 2594 to 2595
for (tsk_size_t t = 0; t < self->num_trees; t++) {
for (tsk_size_t s = 0; s < self->tree_sites_length[t]; s++) {
Contributor:

This could just loop over the sites - the sites in the table are in the same order as encountered when iterating over (trees; sites within trees).

Contributor Author:

Oh, this is good to know, I will make this change as well


c/tskit/trees.c Outdated
}

// Number of pairs w/ replacement (sites)
*result_size = (num_sites * (1 + num_sites)) >> (tsk_size_t) 1;
Contributor:

Is there a good reason to divide by two by bitshifting? It's more readable IMO to write / 2.

Contributor Author:

I didn't realize until recently that you could do truncated division with unsigned integers. It seems that this works with C99 on clang and gcc; I can change this to division so it's more readable. Here's a simple playground session showing my findings.

Member:

+1 on this - any compiler worth its salt will replace it with a bitshift if there's a literal integer_value / 2 in there.

c/tskit/trees.c Outdated
if (ret != 0) {
goto out;
}
result_offset += state_dim;
Contributor:

This should be result_dim, no? (as written, state_dim always equals result_dim, but doing so doesn't make things simpler, I don't think?)

Contributor Author:

Yes, you're right. I will change this.

@petrelharp (Contributor):

I'm working through this. You've done an impressive amount! There are some very interesting things, e.g. the encoding of descendant samples as bitstrings.

  • IIUC (and I might not), the weights are always effectively 0/1, or at least it's only using whether a particular weight is 0 or nonzero, right? This makes it a bit misleading to have the weights being double, as we don't use the values, and means that these aren't "general stats" but rather "sample set stats". Have I got this right? If so and if the generic function is exposed, it should probably be tsk_treeseq_two_locus_sample_set_stat not general_stat.

  • It would help me to have the precise definitions of these seven statistics written down somewhere (indeed, this could be a draft for the docstrings). In particular,

  • Perhaps we're being too clever in only keeping track of AB, Ab, and aB? If we also kept track of ab and then computed n = AB + Ab + aB + ab for each site instead of using the total sample size passed in through args, then we would be able to handle missing data, something that we'd like to be able to do.

  • I see you're filling out the entire (num mutations) x (num samples) array first then using it to compute each pairwise stat. Thanks to the clever binary representation, this isn't terrible, memory-wise. I suppose you thought about taking the other approach of iterating over pairs of trees - could you provide some background on your thinking on that? This is a speed-memory tradeoff; how much memory would it take to do this for a reasonably large example? How about in very large examples where we don't want the entire LD matrix, but only, say, the diagonal?

  • Nice job turning up the references on how to compute multiallelic LD. However, do we really need a special normalization, which is adding some complexity to the code? For instance, couldn't r2 be computed with no normalization if we just modify r2_summary_func to multiply the result by pAB? I am probably missing something, though: in particular, the "total" normalization scheme doesn't fit into this. However, I'm not convinced that "total" normalization (i.e., averaging over all haplotypes) is the way to go - has anyone suggested doing things this way? What's the rationale here?

again, nice work!

@lkirk commented Jul 24, 2023

@petrelharp, thank you for your review; I appreciate all of the comments. I'll address them one at a time below:

> IIUC (and I might not), the weights are always effectively 0/1, or at least it's only using whether a particular weight is 0 or nonzero, right? This makes it a bit misleading to have the weights being double, as we don't use the values, and means that these aren't "general stats" but rather "sample set stats". Have I got this right? If so and if the generic function is exposed, it should probably be tsk_treeseq_two_locus_sample_set_stat not general_stat.

You're right, weights are always 1 or 0. I agree that this is more of a sample set stat; I didn't really think about the distinction until you brought it up, so I can rename it to tsk_treeseq_two_locus_sample_set_stat. I designed the API to be general enough to accept weights if we wanted to specify them in some other statistic. That said, we'd need to alter the implementation in compute_general_two_site_stat_result to properly utilize the weights. So, as it stands right now, the implementation for weights is half done and misleading in terms of data types. I'm happy to finish the job of implementing weights in a way that they'd be useful later on (especially since compute_general_two_site_stat_result needs some reworking anyways, per my discussion with molpopgen). I'm also happy to simplify the API and remove the weights as doubles.

> It would help me to have the precise definitions of these seven statistics written down somewhere (indeed, this could be a draft for the docstrings). In particular,

Page 6 of my design document defines these statistics. Is there another way I should be defining these to be more clear?

> Perhaps we're being too clever in only keeping track of AB, Ab, and aB? If we also kept track of ab and then computed n = AB + Ab + aB + ab for each site instead of using the total sample size passed in through args, then we would be able to handle missing data, something that we'd like to be able to do.

I see, would this be a scenario where a sample was deleted from a single tree? It's good to hear about these types of edge cases; I didn't think it was possible to do something like that.

> I see you're filling out the entire (num mutations) x (num samples) array first then using it to compute each pairwise stat. Thanks to the clever binary representation, this isn't terrible, memory-wise. I suppose you thought about taking the other approach of iterating over pairs of trees - could you provide some background on your thinking on that? This is a speed-memory tradeoff; how much memory would it take to do this for a reasonably large example? How about in very large examples where we don't want the entire LD matrix, but only, say, the diagonal?

Yes, I'm gathering all samples with a specific allele state before computing the statistics. I've organized things this way so that we only have to do one pass over the tree sequence before computing statistics. This also means one tree traversal per tree in the tree sequence, so the computational complexity should be roughly the same as single-site statistics (O(n + p (log n)^2)). I ran this code on the entire chromosome 1 of the SGDP dataset, and the alleles x samples matrix took ~175MB of memory. That's a dataset with 554 samples and 1,198,947 mutations. In total, ~750MB of memory was used before proceeding to the computation of the stats matrix. In contrast, the amount of data needed to store the upper triangle of the statistics for that dataset is about 5.7TB (1198947 * (1198947 + 1) / 2 * 8 / 1e12). This number has me second-guessing my math, but I am referring to 718 billion pairwise statistics, each represented by a double.

That said, I think that once windows are implemented, I'd only compute the matrix for the data specified in the window, which would drop the size even more.

Sorry if that was long-winded. In other words, if we want to know how much memory (in MB) the allele matrix occupies, we can do the following calculation (RoundUp is a Mathematica function):

(RoundUp[num_samples / 32]) * (num_mutations + num_sites)  / 1e6 * 4

I'm still not sure what we consider very large, but in my opinion, the size of this intermediate data pales in comparison to the size of the stats output.

> Nice job turning up the references on how to compute multiallelic LD. However, do we really need a special normalization, which is adding some complexity to the code? For instance, couldn't r2 be computed with no normalization if we just modify r2_summary_func to multiply the result by pAB? I am probably missing something, though: in particular, the "total" normalization scheme doesn't fit into this. However, I'm not convinced that "total" normalization (i.e., averaging over all haplotypes) is the way to go - has anyone suggested doing things this way? What's the rationale here?

We tried a variety of different norming methods for obtaining reasonable results from the various two-locus statistics. One indicator of correctness for using "total" for non-ratio statistics is that the sum of non-polarized D adds up to 0. If we norm by the haplotype frequency, and compute non-polarized D, we end up with a non-zero value. In this manner, we went through and validated our norming methods with the expected extremes of each two-locus statistic. I don't think we found any papers suggesting to normalize things this way, but it's the only method that produces results that appear correct for our handful of test cases. I can provide a more detailed notebook that outlines these results a bit better if that would help.

@lkirk left a comment:

I will batch all of the suggested changes in the next few days. Thanks again for all of the feedback.


@petrelharp (Contributor):

Excellent, thanks! Let's see:

> I'm happy to finish the job of implementing weights in a way that they'd be useful later on (especially since compute_general_two_site_stat_result needs some reworking anyways -- per my discussion with molpopgen). I'm also happy to simplify the API and remove the weights as doubles.

Well, do we have a use case that uses non-0/1 weights? If not, then we should not do the general case.

>> It would help me to have the precise definitions of these seven statistics written down somewhere (indeed, this could be a draft for the docstrings).
>
> Page 6 of my design document defines these statistics. Is there another way I should be defining these to be more clear?

Right, I saw that, and that is helpful, but it still leaves questions - it's not precise enough for someone to compute exactly the same thing from scratch (especially someone who doesn't have an idea beforehand of what is supposed to be computed). For instance, how does one deal with multiallelic sites, especially for the statistics with ratios (e.g. r)? For an example, see the docstring for Y3: it says "The average density of sites at which a differs from b and c, per unit of chromosome length." This is (or tries to be :) clear enough that it's easy and obvious how to compute the statistic in a naive way. (The definitions for your stats won't be so concise, but oh well.)

> I see, would this be a scenario where a sample was deleted from a single tree?

This is how we represent missing data: https://tskit.dev/tskit/docs/latest/data-model.html#sec-data-model-missing-data i.e., a sample that wasn't genotyped for a stretch of chromosome (we haven't dealt with this in the usual stats API yet, but would like to).

> Sorry if that was long winded. In other words, if we want to know how much memory the allele matrix occupies, we can do the following calculation (RoundUp is a mathematica function):
>
> (RoundUp[num_samples / 32]) * (num_mutations + num_sites) / 1e6 * 4

Oh, not long-winded at all, detail is good. =) So, I guess the downside here is that computing the LD matrix for a fixed number of sites scales (in memory) with the number of samples. Hm, so, the internal state is comparable to the (full) output when the number of samples is comparable to the number of sites (in a window). Good to know.

To be clear, I think the approach here is very clever, and I don't think you should change it; we just want to be clear about the implications. It wouldn't scale to Biobank-scale data (~1M individuals), but that's okay; it's probably faster for smaller datasets than other approaches, and moving forward we could switch off to another algorithm for big computations.

This does make me think about something else we might want in the API, though: in practice, won't people want to specify a specific set of SNPs to compute this with? For instance, just removing the singletons would reduce the size of the output substantially. Perhaps we want to pass in a "site mask" that says which sites to use?

> ... normalization... I can provide a more detailed notebook that outlines these results a bit better if that would help.

That would help - I am still fuzzy on exactly how the normalization works (at least, I need the definition?).

@jeromekelleher (Member) left a comment:

This is great, well done @lkirk!

I've made a light review pass and noted a few implementation details around the use of bitsets and memory allocations that I think will need to be changed. I've not gotten deep into all the details; I'll try to get my head around it a bit better in the next pass.

c/tskit/trees.c Outdated
@@ -489,7 +489,7 @@ tsk_treeseq_init(

if (tsk_treeseq_get_time_units_length(self) == strlen(TSK_TIME_UNITS_UNCALIBRATED)
&& !strncmp(tsk_treeseq_get_time_units(self), TSK_TIME_UNITS_UNCALIBRATED,
strlen(TSK_TIME_UNITS_UNCALIBRATED))) {
Member:

Looks like you've got the wrong version of clang-format here. See here

(There's a long paper trail of us complaining that clang-format has such terrible compatibility 🤮 )

Contributor Author:

I saw a thread here that discusses upgrading. Are we considering an upgrade? If not, I can install version 6 and reformat.

Member:

Simplest to install 6 I think, we're not going to change this in the short term. Anyway, whatever version of clang-format we'd settle on would probably be slightly different to the version you're using and you'd need to change anyway.

c/tskit/trees.c Outdated
const tsk_bit_array_t *A_samples, *B_samples;
tsk_size_t w_A = 0, w_B = 0, w_AB = 0;
tsk_bit_array_t *ss_row; // ss_ prefix refers to a sample set
tsk_bit_array_t ss_A_samples[num_sample_chunks], ss_B_samples[num_sample_chunks],
Member:

I don't think we can allocate on the stack like this, we'll eventually overflow for large sample sizes, and it's very hard to tell in advance when it's going to happen. It's just not worth the trouble.

Member:

(You might notice that we do a lot of allocing from the stack in the test suite - but we can get away with a lot of stuff there if it makes the code easier to write and read because the inputs are fixed)

Contributor Author:

In my refactoring of tsk_bit_array_t, I got rid of all of the VLAs here. We should be clear of the stack allocated VLAs now.


@lkirk left a comment:

I've refactored tsk_bit_array_t, taken care of the VLAs, and have addressed most of the comments, which I will mark as done to clean things up.


@jeromekelleher (Member) left a comment:

A few minor comments on memory safety. Haven't had a chance to go through fully.

c/tskit/trees.c Outdated
for (mutation_index = 0; mutation_index < site->mutations_length; mutation_index++) {
mutation = site->mutations[mutation_index];
/* Compute the allele index for this derived state value. */
allele = 0;
Member:

For-loop a bit more concise?

for (allele = 0; allele < num_alleles; allele++) {
    if (mutation.derived_state_length == allele_lengths[allele]...
        break;
    }
}

Contributor Author:

Indeed it would be. I copied this from get_allele_weights, FWIW; we could fix this in that function too.

@jeromekelleher (Member) left a comment:

See point about using tsk_safe_free - safety first when it comes to handling memory!

@lkirk commented Aug 10, 2023

Sorry I'm a bit late in responding to your comments @petrelharp.

> Well, do we have a use case that uses non-0/1 weights? If not, then we should not do the general case.

I do remember discussing something with @apragsdale, but I don't think we'd discussed anything concrete. Let's discuss when we meet, but I'm leaning towards simplifying things to counts-only for now, then expanding once we have a clear need.

> Right, I saw that, and that is helpful, but it still leaves questions - it's not precise enough for someone to compute exactly the same thing from scratch (especially, someone who doesn't have an idea beforehand of what is supposed to be computed).

I've created a small demo notebook that might help illustrate how exactly we're doing the normalization. It can be found here. In addition, please see my explanation at the bottom. Hopefully this helps with understanding.

> This is how we represent missing data: https://tskit.dev/tskit/docs/latest/data-model.html#sec-data-model-missing-data i.e., a sample that wasn't genotyped for a stretch of chromosome (we haven't dealt with this in the usual stats API yet, but would like to).

The solution you've proposed seems straightforward. I'm not sure how it would tie into normalization; I'd have to do some testing to see what makes sense.

> Oh, not long-winded at all, detail is good. =) So, I guess the downside here is that computing the LD matrix for a fixed number of sites scales (in memory) with the number of samples.
>
> Hm, so, the internal state is comparable to the (full) output when the number of samples is comparable to the number of sites (in a window). Good to know.
>
> To be clear, I think the approach here is very clever, and I don't think you should change it; we just want to be clear about the implications. It wouldn't scale to Biobank-scale data (~1M individuals), but that's okay; it's probably faster for smaller datasets than other approaches, and moving forward we could switch off to another algorithm for big computations.

I'm not exactly sure what you mean by the "full" output here. If we have a set with 1M samples / 1M sites / 1M mutations (meaning we're dealing w/ biallelic data), we expect to use 250GB to store the states, but we'd still need 4TB to store the actual statistics. In any case, I think a potential solution to scaling this is in the windowing/masking. If you could compute these stats in chunks, parallelized on a cluster, that'd probably be somewhat ideal. Especially given that windowed computation wouldn't see much overhead because you could limit many of the operations to the bounds of the window/mask. An iterative algo could also work but would be much slower (I think).

This does make me think about something else we might want in the API, though: in practice, won't people want to specify a specific set of SNPs to compute this with? For instance, just removing the singletons would reduce the size of the output substantially. Perhaps we want to pass in a "site mask" that says which sites to use?

I think that's what I was getting at with the windows part of the design (hinted in some of the code, but not actually implemented). However, the one issue with providing a window is that we'd need to return the sites back to the user once we've intersected their coordinates with the window coordinates. Otherwise the user wouldn't know which sites are making up the results matrix. (And I think a single window makes sense here, ideally we can chop out an NxM matrix from the entire results matrix, reducing the memory/computational requirements to a manageable size).

So, what you're saying about a site mask sounds more like what we want to do. By using a sites mask, the user knows what sites are going into the matrix. The problem comes down to the API: how much work do we reasonably want to put onto the user when it comes to intersecting/storing/manipulating sites? And how do we create a sites mask that is reasonably efficient in terms of memory? (I suppose we could use bit masks to store sites in a mask as well.)

That would help - I am still fuzzy on exactly how the normalization works (at least, I need the definition).

The notebook is linked above, but I'll also explain with actual words:

"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles.

For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply them by 1/9

"Hap Weighted" normalization: We multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison.

Using the same A/B locus as above, let's say that we are comparing 10 samples. If A1-B1 has no counts, we will multiply the stat result by 0. If A1-B2 has 5 counts, then we'll multiply the stat result by 1/2 (5/10).
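The two schemes above can be sketched in a few lines of Python. This is a toy illustration only: the function name, the `hap_counts` layout, and the trivial "statistic" are all made up for the example, and the real implementation works on bit arrays of samples rather than count matrices.

```python
# Toy illustration of the two normalization schemes described above.
# `stat` is a stand-in for any two-locus summary function; the haplotype
# counts and allele numbers are hypothetical.
def two_locus_stat(hap_counts, num_samples, stat, norm="total"):
    num_a = len(hap_counts)     # alleles at locus A
    num_b = len(hap_counts[0])  # alleles at locus B
    result = 0.0
    for i in range(num_a):
        for j in range(num_b):
            value = stat(hap_counts[i][j], num_samples)
            if norm == "total":
                # weight every allele pair equally: 1 / (num_a * num_b)
                weight = 1 / (num_a * num_b)
            else:  # "hap_weighted"
                # weight by the observed frequency of the Ai-Bj haplotype
                weight = hap_counts[i][j] / num_samples
            result += weight * value
    return result

# 3x3 alleles, 10 samples; A1-B1 has 0 counts (weight 0),
# A1-B2 has 5 counts (weight 1/2), as in the example above.
counts = [[0, 5, 0], [2, 0, 0], [0, 0, 3]]
freq = lambda c, n: c / n  # trivial "statistic": haplotype frequency
print(two_locus_stat(counts, 10, freq, norm="total"))
```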

@lkirk
Contributor Author

lkirk commented Aug 10, 2023

We met over zoom and decided that the best course of action is to wrap this up (it's getting a bit messy). In order to wrap things up, we're prioritizing the steps necessary for merge. I'll list out what I'll be doing when I get back from vacation:

  1. Rebase with main so that the circleci:build-32 test passes (see guide in stdpopsim)
  2. Finish up test coverage, covering bit arrays (I think the line that hits the nonbinary samples case will be eliminated when I adopt api functions).
  3. Use api functions for iteration + built in preorder
  4. Address memory safety issues
  5. Final bit array counting algo (the portable one -- it offers significant performance gains)
  6. Capture the remaining discussion points/concerns in github issues.

@petrelharp
Contributor

"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles.

For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply them by 1/9

"Hap Weighted" normalization: We multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison.

Ah, that makes sense. And, the notebook is helpful (and, a great start for test cases!).

And, okay. My feeling is that "hap weighted" or maybe "allele freq weighted" makes much more sense than "total", since "total" can be strongly affected by rare alleles (eg sequencing error). And in this case, the weighting is just a function of the frequencies, so can be computed within f instead of outside of it, which would remove the need for doing norm_f separately. But, normalization isn't adding a lot to the complexity of the code, so we can discuss this later. (Maybe you all have a use case where we need 'total' normalization?)

Want me to open an issue?

@lkirk
Contributor Author

lkirk commented Aug 10, 2023

"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles.
For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply them by 1/9
"Hap Weighted" normalization: We multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison.

Ah, that makes sense. And, the notebook is helpful (and, a great start for test cases!).

And, okay. My feeling is that "hap weighted" or maybe "allele freq weighted" makes much more sense than "total", since "total" can be strongly affected by rare alleles (eg sequencing error). And in this case, the weighting is just a function of the frequencies, so can be computed within f instead of outside of it, which would remove the need for doing norm_f separately. But, normalization isn't adding a lot to the complexity of the code, so we can discuss this later. (Maybe you all have a use case where we need 'total' normalization?)

Want me to open an issue?

Interesting, I hadn't considered the point about rare variants. I think the main draw of using "total" is the fact that it seems to be correct for D and other statistics (demo'd a bit at the end of that notebook). In any case, please feel free to open an issue; I agree things could be simplified, and I think @apragsdale might have an opinion here as well. I'll get to writing the other issues when I return.

Member

@jeromekelleher jeromekelleher left a comment

Some minor comments, which I didn't get a chance to finish. More later!

c/tskit/trees.c Outdated
const tsk_treeseq_t *restrict ts = self->tree_sequence;

const tsk_size_t num_samples = tsk_treeseq_get_num_samples(ts);
const tsk_size_t num_tree_sites = ts->tree_sites_length[self->index];
Member

These are stored in the tree
(prob not much point in storing it in a variable)

Suggested change
const tsk_size_t num_tree_sites = ts->tree_sites_length[self->index];
const tsk_size_t num_tree_sites = self->sites_length;
const tsk_site_t *restrict tree_sites = self->sites;

Contributor Author

Good point, I eliminated that variable entirely.

c/tskit/trees.c Outdated
}

static int
get_mutation_samples(const tsk_tree_t *self, const tsk_size_t *site_offsets,
Member

A minor style point: if the first arg is self, then we'd usually give the function a "method name" like tsk_tree_get_mutation_samples, just so it's a bit easier to follow what "self" is. For stuff like this you might just want to rename self to tree.

Member

As this will be moving to stats.h, I guess it's better to think of it as a utility function that has tree as an arg, rather than a method of the Tree class. But it's just splitting hairs.

Contributor Author

I think it'd be tough to use this function in isolation; I would consider it to be an internal function until decided otherwise. In that vein, I've changed this parameter to tree instead of self.

c/tskit/trees.c Outdated
tsk_size_t num_mutations, mut_samples_offset, mut_offset;
tsk_size_t s, m, n;

tsk_memset(&mut_samples, 0, sizeof(mut_samples));
Member

This is code in the middle of declarations.

Contributor Author

Ah, I think I overcorrected a bit: I moved them before any goto call, but after all declarations.

c/tskit/trees.c Outdated

// Get the number of mutations
num_mutations = 0;
for (s = 0; s < num_tree_sites; s++) {
Member

Don't need to get a pointer, could do something like

for (s = 0; s < tree->num_sites; s++) {
    num_mutations += tree->sites[s].mutations_length;
}

Contributor Author

Fixed

@lkirk lkirk force-pushed the two-site-stats-c-api branch 2 times, most recently from 91c485c to aafb78f on August 29, 2023 18:22
@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Merging #2805 (b90dfa1) into main (d57ec83) will decrease coverage by 0.03%.
The diff coverage is 87.79%.

❗ Current head b90dfa1 differs from pull request most recent head 7ec1ff3. Consider uploading reports for the commit 7ec1ff3 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2805      +/-   ##
==========================================
- Coverage   89.79%   89.76%   -0.03%     
==========================================
  Files          30       30              
  Lines       29474    29892     +418     
  Branches     5737     5803      +66     
==========================================
+ Hits        26465    26832     +367     
- Misses       1726     1753      +27     
- Partials     1283     1307      +24     
Flag Coverage Δ
c-tests 86.15% <87.79%> (+0.05%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.88% <ø> (ø)
python-tests 99.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/trees.c 90.58% <87.36%> (-0.33%) ⬇️
c/tskit/core.c 95.58% <92.10%> (-0.21%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d57ec83...7ec1ff3. Read the comment docs.

Member

@jeromekelleher jeromekelleher left a comment

LGTM. I think the implementation is still overly complicated, but using the in-built mutation tracking in the tree code should simplify quite a bit.

I think we'll want to evolve the implementation a bit, but I've created some issues to track. We're good to merge after this round of comments, IMO.

c/tskit/core.c Show resolved Hide resolved
c/tskit/trees.c Outdated
get_mutation_samples(const tsk_tree_t *tree, const tsk_size_t *site_offsets,
tsk_bit_array_t *allele_samples, tsk_size_t **num_alleles)
{
int ret = 0;
Member

Nitpick, but not sure this whitespace within the declarations section helps understanding? Usually all declarations are done in one chunk, which makes it easier to see where the code begins (and you start thinking about memory safety and the potential state of variables after error-gotos)

Contributor Author

@lkirk lkirk Aug 31, 2023

You're right, I have a habit of breaking things up with whitespace, but sometimes I can go overboard. I've cleaned things up.

c/tskit/trees.c Outdated
// We currently perform one preorder traversal per mutation, but there
// might be more clever ways to do this. The preorder traversals begin at
// the focal mutation's node.
for (m = 0; m < num_mutations; m++) {
Member

I don't get why you need to use the arrays here: why not use the site and mutation objects within the tree struct?

for (s = 0; s < tree->sites_length; s++) {
     site = tree->sites[s];
     for (m = 0; m < site.mutations_length; m++) {
           node = site.mutations[m].node;
           // do traversal rooted at node. 
    }
}

Contributor Author

See my comment about offsets, I've cleaned things up, using the suggested loop.

c/tskit/trees.c Outdated
int ret = 0;
tsk_tree_t tree;

ret = tsk_tree_init(&tree, self, TSK_TS_INIT_BUILD_INDEXES | TSK_NO_SAMPLE_COUNTS);
Member

TSK_TS_INIT_BUILD_INDEXES isn't a valid option for tree_init.

Member

To me the get_mutation_samples function is more complicated than it needs to be, and it would be clearer to merge it into this function (as it does very little at the moment)

Contributor Author

I did a refactor of all of this as you've suggested. It seems much simpler. I think with some of the suggestions about the bit array refactor, this code will be simplified even more.

c/tskit/trees.c Outdated
const tsk_size_t num_samples = self->num_samples;
const tsk_size_t max_alleles = self->tables->mutations.num_rows + num_sites;
tsk_size_t *restrict site_offsets = tsk_malloc(num_sites * sizeof(*site_offsets));
tsk_size_t *restrict num_alleles = tsk_malloc(num_sites * sizeof(*num_alleles));
Member

There's no real point in making this a restrict pointer in this function; we're not doing any tight loops on it, just passing it on to other functions. Declaring it as a regular pointer means we can call tsk_safe_free and reduce the code a bit.

Contributor Author

Fixed.

c/tskit/trees.c Outdated
// Initialize the allele_samples with all samples in the ancestral allele
tsk_bit_array_get_row(&allele_samples, num_alleles_cumsum, &allele_samples_row);
tsk_bit_array_add(&allele_samples_row, &all_samples_bits);
// Store the allele offset for each site
Member

Do we need to do this stuff with site_offsets, if we're iterating over the mutations using the inbuilt structs from the tree (as suggested above)?

Contributor Author

These offsets were put in place to reduce the number of times I'd have to compute the offsets and to make sure all of the code stays in sync. Now that things have been simplified and I'm not using offsets in 3 places, I think it's best to get rid of the malloc'd array of offsets. We still need offsets to index into the array of allele_samples, but now it's loop counters instead of a whole malloc'd array.

Member

@jeromekelleher jeromekelleher left a comment

LGTM - let's squash n merge 🎉

(Can you squash down to one commit please @lkirk?)

Member

@jeromekelleher jeromekelleher left a comment

I think we need one more update to get test coverage @lkirk - if we merge now these cases will be forgotten and end up never being covered.

Most will be covered by provoking error conditions. The test_simplest_divergence_matrix function can be a good pattern to follow there for provoking errors.

Let's just remove the code for windows for now. I think the C code, certainly, should have a fixed set of sites provided as input which will simplify things. But let's leave that till next update.

&& tsk_memcmp(
mutation.derived_state, alleles[allele], allele_lengths[allele])
== 0) {
break;
Member

We should be covering this, really. Do we need an example with some extra mutations or something?

Contributor Author

I added a test with a couple of back-mutations, that covers it

all_samples->data[all_samples->size - 1]
= remainder_samples ? ~(all << remainder_samples) : all;
for (i = 0; i < all_samples->size - 1; i++) {
all_samples->data[i] = all;
Member

We should definitely be covering this - need a sample size > 2?

Contributor Author

I added a test with 35 samples, which is >32 samples, which covers it

if (tsk_bit_array_contains(
&bits_row, (tsk_bit_array_value_t) sample_index)) {
ret = TSK_ERR_DUPLICATE_SAMPLE;
goto out;
Member

We should be covering this

c/tskit/trees.c Outdated

ret = tsk_treeseq_check_sample_sets(
self, num_sample_sets, sample_set_sizes, sample_sets);
if (ret != 0) {
Member

And this - pass in some bad samples

stat_site = true;
}
/* It's an error to specify more than one mode */
if (stat_site + stat_branch > 1) {
Member

All these input errors should be covered

if (result_dim < 1) {
ret = TSK_ERR_BAD_RESULT_DIMS;
goto out;
}
Member

Since we're not sure about how the windows thing is going to work I vote we just assert(left_windows == NULL && right_windows == NULL); for now, and delete the uncovered code.

c/tskit/trees.c Outdated
result_size, result);
} else {
// TODO: branch statistics
ret = TSK_ERR_GENERIC;
Member

should be BAD_STAT_MODE, and should be covered

@jeromekelleher
Member

LGTM, happy to merge when you're satisfied with the code coverage.

@lkirk
Contributor Author

lkirk commented Sep 4, 2023

I think I've gotten all of the interesting test cases. It's annoying that I'm only able to partially hit this case and the one above it, but I think I'm satisfied with the outcome. Unfortunately, I'm slightly (2%) under the threshold for passing the coverage check. I think that's mostly to do with the number of out-of-memory errors that this code can throw. Let me know if you disagree.

@lkirk
Contributor Author

lkirk commented Sep 4, 2023

Re-squashed and ready

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Sep 4, 2023
@jeromekelleher
Member

@Mergifyio rebase

We also fully implement two-site statistics, mirroring the functionality
of the LdCalculator.

There are a number of tasks that need to be done before this is
production ready, but it is a start. We've opened a number of issues in
github for improving the work that was done for this PR and for
implementing the python API.

The general approach for the two-site statistics is to gather all of the
samples below each mutation over the whole tree sequence (we will be
implementing windows in the near future). To do this, we iterate over
each tree in the tree sequence once and perform a preorder traversal
starting at each mutation's node, storing the samples we encounter along
the way. We then perform another step that examines the parentage of
each mutation and removes samples that exist in child mutations. With
the samples for each allele, we perform intersections to get haplotype
weights, which are then passed to summary functions that fill out a
stats matrix.
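The gathering step described above can be sketched on a toy tree. Everything here is hypothetical for illustration (node labels, the `children` map, the mutation list); the real code traverses tskit's internal tree structures with bit arrays rather than Python sets:

```python
# Sketch of the per-mutation sample-gathering step described above, on a
# toy tree. Node names, the parent/children map, and the mutation list
# are hypothetical; real code would use tskit's tree structures.
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
samples = {2, 3, 4}               # sample (leaf) nodes
mutations = [(1, "T"), (4, "G")]  # (node, derived state); node 4 is below node 1

def samples_below(node):
    # preorder traversal collecting sample nodes in the subtree
    out, stack = set(), [node]
    while stack:
        u = stack.pop()
        if u in samples:
            out.add(u)
        stack.extend(children[u])
    return out

allele_samples = {node: samples_below(node) for node, _ in mutations}
# Remove samples captured by mutations on child branches: the sample below
# the mutation at node 4 carries "G", not the "T" from the mutation at node 1.
allele_samples[1] -= allele_samples[4]
print(allele_samples)  # {1: {3}, 4: {4}}
```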

We also implement a few happy-path tests that exercise the functionality
of sample sets and multi-allelic sites, and we cover all of the
non-memory-related error cases in the C API. We plan to do more in-depth
testing once the python API is implemented.

This PR also introduces the concept of a "bit array" in the form of
tsk_bit_array_t, which needs some further refinement. In essence, this
data structure stores all of the possible samples in a bit set where one
sample takes up one bit of information and storage space. This enables a
few operations that are critical to this code: 1) it's a relatively
memory-efficient method of storing sample identities; 2) we can
efficiently intersect the arrays (I believe much of this is
autovectorized by gcc); 3) we can use tricks like SWAR (SIMD within a
register) to count the number of samples in a set. It's important to note
that this is also where the algorithm spends ~50% of its time. It could
use more improvement in both the API design and the performance
characteristics.
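The bit-array idea can be sketched in a few lines. This is a conceptual illustration only — the word size, helper names, and popcount shortcut here are stand-ins, not the tsk_bit_array_t API:

```python
# Minimal sketch of the bit-array idea: each sample occupies one bit in
# an array of fixed-width words, which makes set intersection a per-word
# bitwise AND and set size a per-word popcount.
WORD_BITS = 32

def make_bit_array(num_samples, members):
    words = [0] * ((num_samples + WORD_BITS - 1) // WORD_BITS)
    for s in members:
        words[s // WORD_BITS] |= 1 << (s % WORD_BITS)
    return words

def intersect(a, b):
    # bitwise AND, word by word; in C this loop vectorizes well
    return [x & y for x, y in zip(a, b)]

def count(words):
    # per-word popcount (the C code uses a SWAR popcount instead)
    return sum(bin(w).count("1") for w in words)

a = make_bit_array(40, {0, 5, 33, 39})
b = make_bit_array(40, {5, 7, 39})
print(count(intersect(a, b)))  # 2 (samples 5 and 39 are in both sets)
```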
@mergify
Contributor

mergify bot commented Sep 4, 2023

rebase

✅ Branch has been successfully rebased

@jeromekelleher
Member

I'm not sure why mergify doesn't like this; just going to merge manually.

@jeromekelleher jeromekelleher merged commit d7acf1d into tskit-dev:main Sep 4, 2023
19 of 21 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Sep 4, 2023