
Two Locus Statistics C API #2805

Merged · 1 commit · Sep 4, 2023

Conversation

@lkirk (Contributor) commented Jul 18, 2023

Description

As discussed with @jeromekelleher, I've opened this PR to add the code necessary to create a C API for two-locus statistics. This PR implements a general framework for multi-allelic site statistics, plus summary functions for r, r2, D, D2, D', Dz, and pi2 to get things started. I've described this work in a design document contained in my prototype repository. This PR diverges from that document in its main entrypoint (after discussing with @jeromekelleher, we agreed this should have a separate entrypoint from tsk_treeseq_general_stat), but everything else in the document is accurate.

I've implemented tests that cover the happy paths. They ensure the correctness of multiple sample sets and the summary functions. I haven't done a deep dive into the test coverage, but I know there is at least one important branch that is not covered; I'll have to look into that a bit more closely to understand why it's not being hit.

One thing left to implement is windows. Our current thinking for site statistics is that we'd like to have one window on the left and one window on the right so that we can cut any slice out of the correlation matrix. This leaves a few questions in terms of API design that we can get to in another PR. I think this one is large enough as is.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content, if appropriate. <- I'll probably cover this when I implement the Python API.
  • Changelogs, if there are API changes. <- I'll need a bit of help with this

@lkirk changed the title from "Two Locus tats c api" to "Two Locus Statistics C API" on Jul 18, 2023

tsk_treeseq_free(&ts);
}

Contributor Author:

Here's the tree for the correlated test
[image]


tsk_treeseq_free(&ts);
}

Contributor Author:

Here's the tree for the uncorrelated test
[image]

c/tskit/core.c Outdated
-offset = (int64_t)(array[lower] < value);
-return (tsk_size_t)(lower + offset);
+offset = (int64_t) (array[lower] < value);
+return (tsk_size_t) (lower + offset);
Contributor Author:

I'm not sure why clang format is adding this space, but I'm fairly certain that my editor is picking up the clang-format config.

@petrelharp (Contributor):

For easy reference on github: see #432 and #1900.

c/tskit/trees.c Outdated
double norm[state_dim];
double *hap_weight_row;
tsk_size_t row_len = num_alleles[site_b] * state_dim;
// TODO: is this stack allocation dangerous??
Contributor Author:

Beyond using stack memory like this, it seems as though Windows doesn't like my use of VLAs. I'm not sure if this is an inherent issue with MSVC or an issue with the C standard definition in the build. I see `ignoring unknown option '-std=c99'` in the build logs here: https://github.com/tskit-dev/tskit/actions/runs/5592422530/job/15156184424?pr=2805#step:12:35
It does seem as though MSVC is compliant up until C17, so I'd expect there to be support for VLAs, but maybe I'm not understanding the issue.

Member:

AFAIK, MSVC supports through at least C11, but without VLAs. Some Googling seems to agree, at least as of 2021: https://www.reddit.com/r/C_Programming/comments/kzskyi/why_cant_use_a_variable_to_declare_an_arrays_size/

Member:

Another thing to consider: VLAs are legal, but rarely used in C. They are not valid C++ code, and tskit may be compiled with a C++ compiler. Doing so is probably rare in practice, but we don't want to prevent it.

Contributor Author:

Ah, good to know, thanks @molpopgen. I think I'll have to rework this a bit to remove the reliance on VLAs. Since it's in a pretty hot loop, I was thinking about pre-allocating some memory and passing it in.

Member:

I haven't looked in detail at your changes, but we can certainly start simple and run perf once tests are passing, etc., to see where improvements are likely.

Member:

My 10c is that allocing anything on the stack that isn't a fixed, small size just isn't worth the hassle. It's not portable, unpredictable when it fails, and doesn't give much (if any) perf advantage.

c/tskit/trees.c Outdated
sample_count_stat_params_t args = *(sample_count_stat_params_t *) params;
double n;
const double *state_row;
for (tsk_size_t j = 0; j < state_dim; j++) {
Contributor:

Suggested change:
-for (tsk_size_t j = 0; j < state_dim; j++) {
+tsk_bug_assert(result_dim == state_dim);
+for (tsk_size_t j = 0; j < state_dim; j++) {

Contributor Author:

Should all of the summary functions be making this assertion? I was a little confused about result vs state dims when looking over the codebase. My understanding is that these might not be the same if we're computing a branch vs site statistic.

Contributor:

Hm, well, the result and state dims aren't different between branch and site stats. And maybe we don't want to put in this assert for performance reasons, but we should at least comment that we're making that assumption? I haven't gone to check other summary functions.

c/tskit/trees.c Outdated
Comment on lines 2594 to 2595
for (tsk_size_t t = 0; t < self->num_trees; t++) {
for (tsk_size_t s = 0; s < self->tree_sites_length[t]; s++) {
Contributor:

This could just loop over the sites - the sites in the table are in the same order as encountered when iterating over (trees; sites within trees).

Contributor Author:

Oh, this is good to know, I will make this change as well


c/tskit/trees.c Outdated
}

// Number of pairs w/ replacement (sites)
*result_size = (num_sites * (1 + num_sites)) >> (tsk_size_t) 1;
Contributor:

Is there a good reason to divide by two by bitshifting? It's more readable IMO to write / 2.

Contributor Author:

I didn't realize until recently that you could do truncated division with unsigned integers. It seems that this works with C99 on clang and gcc; I can change this to division so it's more readable. Here's a simple playground session showing my findings.

Member:

+1 on this - any compiler worth its salt will replace it with a bitshift if there's a literal integer_value / 2 in there.

c/tskit/trees.c Outdated
if (ret != 0) {
goto out;
}
result_offset += state_dim;
Contributor:

This should be result_dim, no? (as written, state_dim always equals result_dim, but doing so doesn't make things simpler, I don't think?)

Contributor Author:

Yes, you're right. I will change this.

@petrelharp (Contributor):

I'm working through this. You've done an impressive amount! There are some very interesting things, e.g. the encoding of descendant samples as bitstrings.

  • IIUC (and I might not), the weights are always effectively 0/1, or at least it's only using whether a particular weight is 0 or nonzero, right? This makes it a bit misleading to have the weights being double, as we don't use the values, and means that these aren't "general stats" but rather "sample set stats". Have I got this right? If so and if the generic function is exposed, it should probably be tsk_treeseq_two_locus_sample_set_stat not general_stat.

  • It would help me to have the precise definitions of these seven statistics written down somewhere (indeed, this could be a draft for the docstrings). In particular,

  • Perhaps we're being too clever in only keeping track of AB, Ab, and aB? If we also kept track of ab and then computed n = AB + Ab + aB + ab for each site instead of using the total sample size passed in through args, then we would be able to handle missing data, something that we'd like to be able to do.

  • I see you're filling out the entire (num mutations) x (num samples) array first then using it to compute each pairwise stat. Thanks to the clever binary representation, this isn't terrible, memory-wise. I suppose you thought about taking the other approach of iterating over pairs of trees - could you provide some background on your thinking on that? This is a speed-memory tradeoff; how much memory would it take to do this for a reasonably large example? How about in very large examples where we don't want the entire LD matrix, but only, say, the diagonal?

  • Nice job turning up the references on how to compute multiallelic LD. However, do we really need a special normalization, which is adding some complexity to the code? For instance, couldn't r2 be computed with no normalization if we just modify r2_summary_func to multiply the result by pAB? I am probably missing something, though: in particular, the "total" normalization scheme doesn't fit into this. However, I'm not convinced that "total" normalization (i.e., averaging over all haplotypes) is the way to go - has anyone suggested doing things this way? What's the rationale here?

again, nice work!

@lkirk commented Jul 24, 2023

@petrelharp, thank you for your review; I appreciate all of the comments. I'll address them one at a time below:

> IIUC (and I might not), the weights are always effectively 0/1, or at least it's only using whether a particular weight is 0 or nonzero, right? This makes it a bit misleading to have the weights being double, as we don't use the values, and means that these aren't "general stats" but rather "sample set stats". Have I got this right? If so and if the generic function is exposed, it should probably be tsk_treeseq_two_locus_sample_set_stat not general_stat.

You're right, weights are always 1 or 0. I agree that this is more of a sample set stat; I didn't really think about the distinction until you brought it up, so I can rename it to tsk_treeseq_two_locus_sample_set_stat. I designed the API to be general enough to accept weights if we wanted to specify them in some other statistic. That said, we'd need to alter the implementation in compute_general_two_site_stat_result to properly utilize the weights. So, as it stands right now, the implementation for weights is half done and misleading in terms of data types. I'm happy to finish the job of implementing weights in a way that they'd be useful later on (especially since compute_general_two_site_stat_result needs some reworking anyways, per my discussion with molpopgen). I'm also happy to simplify the API and remove the weights as doubles.

> It would help me to have the precise definitions of these seven statistics written down somewhere (indeed, this could be a draft for the docstrings). In particular,

Page 6 of my design document defines these statistics. Is there another way I should be defining these to be more clear?

> Perhaps we're being too clever in only keeping track of AB, Ab, and aB? If we also kept track of ab and then computed n = AB + Ab + aB + ab for each site instead of using the total sample size passed in through args, then we would be able to handle missing data, something that we'd like to be able to do.

I see, would this be a scenario where a sample was deleted from a single tree? It's good to hear about these types of edge cases; I didn't think it was possible to do something like that.

> I see you're filling out the entire (num mutations) x (num samples) array first then using it to compute each pairwise stat. Thanks to the clever binary representation, this isn't terrible, memory-wise. I suppose you thought about taking the other approach of iterating over pairs of trees - could you provide some background on your thinking on that? This is a speed-memory tradeoff; how much memory would it take to do this for a reasonably large example? How about in very large examples where we don't want the entire LD matrix, but only, say, the diagonal?

Yes, I'm gathering all samples with a specific allele state before computing the statistics. I've organized things this way so that we only have to do one pass over the tree sequence before computing statistics. This also means one tree traversal per tree in the tree sequence, so the computational complexity should be roughly the same as single-site statistics (O(n + p (log n)^2)). I ran this code on the entire chromosome 1 of the SGDP dataset, and the alleles x samples matrix took ~175MB of memory. That's a dataset with 554 samples and 1,198,947 mutations. In total, ~750MB of memory was used before proceeding to the computation of the stats matrix. In contrast, the amount of data needed to store the upper triangle of the statistics for that dataset is about 5.7TB (1198947 * (1198947 + 1) / 2 * 8 / 1e12). This number has me second-guessing my math, but I am referring to 718 billion pairwise statistics, each represented by a double.

That said, I think that once windows are implemented, I'd only compute the matrix for the data specified in the window, which would drop the size even more.

Sorry if that was long-winded. In other words, if we want to know how much memory (in MB) the allele matrix occupies, we can do the following calculation (RoundUp is a Mathematica function):

(RoundUp[num_samples / 32]) * (num_mutations + num_sites)  / 1e6 * 4

I'm still not sure what we consider very large, but in my opinion, the size of this intermediate data pales in comparison to the size of the stats output.

> Nice job turning up the references on how to compute multiallelic LD. However, do we really need a special normalization, which is adding some complexity to the code? For instance, couldn't r2 be computed with no normalization if we just modify r2_summary_func to multiply the result by pAB? I am probably missing something, though: in particular, the "total" normalization scheme doesn't fit into this. However, I'm not convinced that "total" normalization (i.e., averaging over all haplotypes) is the way to go - has anyone suggested doing things this way? What's the rationale here?

We tried a variety of different norming methods for obtaining reasonable results from the various two-locus statistics. One indicator of correctness for using "total" for non-ratio statistics is that the sum of non-polarized D adds up to 0. If we norm by the haplotype frequency, and compute non-polarized D, we end up with a non-zero value. In this manner, we went through and validated our norming methods with the expected extremes of each two-locus statistic. I don't think we found any papers suggesting to normalize things this way, but it's the only method that produces results that appear correct for our handful of test cases. I can provide a more detailed notebook that outlines these results a bit better if that would help.

@lkirk left a comment:

I will batch all of the suggested changes in the next few days. Thanks again for all of the feedback.


@petrelharp (Contributor):

Excellent, thanks! Let's see:

> I'm happy to finish the job of implementing weights in a way that they'd be useful later on (especially since compute_general_two_site_stat_result needs some reworking anyways -- per my discussion with molpopgen). I'm also happy to simplify the API and remove the weights as doubles.

Well, do we have a use case that uses non-0/1 weights? If not, then we should not do the general case.

>> It would help me to have the precise definitions of these seven statistics written down somewhere (indeed, this could be a draft for the docstrings).
>
> Page 6 of my design document defines these statistics. Is there another way I should be defining these to be more clear?

Right, I saw that, and that is helpful, but it still leaves questions - it's not precise enough for someone to compute exactly the same thing from scratch (especially someone who doesn't have an idea beforehand of what is supposed to be computed). For instance, how does one deal with multiallelic sites, especially for the statistics with ratios (e.g. r)? For an example, see the docstring for Y3: it says "The average density of sites at which a differs from b and c, per unit of chromosome length." This is (or tries to be :) clear enough that it's easy and obvious how to compute the statistic in a naive way. (The definitions for your stats won't be so concise, but oh well.)

> I see, would this be a scenario where a sample was deleted from a single tree?

This is how we represent missing data: https://tskit.dev/tskit/docs/latest/data-model.html#sec-data-model-missing-data i.e., a sample that wasn't genotyped for a stretch of chromosome (we haven't dealt with this in the usual stats API yet, but would like to).

> Sorry if that was long winded. In other words, if we want to know how much memory the allele matrix occupies, we can do the following calculation (RoundUp is a mathematica function):
>
> (RoundUp[num_samples / 32]) * (num_mutations + num_sites) / 1e6 * 4

Oh, not long-winded at all, detail is good. =) So, I guess the downside here is that computing the LD matrix for a fixed number of sites scales (in memory) with the number of samples. Hm, so, the internal state is comparable to the (full) output when the number of samples is comparable to the number of sites (in a window). Good to know.

To be clear, I think the approach here is very clever, and I don't think you should change it; we just want to be clear about the implications. It wouldn't scale to Biobank-scale data (~1M individuals), but that's okay; it's probably faster for smaller datasets than other approaches, and moving forward we could switch off to another algorithm for big computations.

This does make me think about something else we might want in the API, though: in practice, won't people want to specify a specific set of SNPs to compute this with? For instance, just removing the singletons would reduce the size of the output substantially. Perhaps we want to pass in a "site mask" that says which sites to use?

> ... normalization... I can provide a more detailed notebook that outlines these results a bit better if that would help.

That would help - I am still fuzzy on exactly how the normalization works (at least, I need the definition?).

@jeromekelleher (Member) left a comment:

This is great, well done @lkirk!

I've made a light review pass and noted a few implementation details around the use of bitsets and memory allocations that I think will need to be changed. I've not gotten deep into all the details; I'll try to get my head around it a bit better in the next pass.

c/tskit/trees.c Outdated
@@ -489,7 +489,7 @@ tsk_treeseq_init(

if (tsk_treeseq_get_time_units_length(self) == strlen(TSK_TIME_UNITS_UNCALIBRATED)
&& !strncmp(tsk_treeseq_get_time_units(self), TSK_TIME_UNITS_UNCALIBRATED,
strlen(TSK_TIME_UNITS_UNCALIBRATED))) {
Member:

Looks like you've got the wrong version of clang-format here. See here

(There's a long paper trail of us complaining that clang-format has such terrible compatibility 🤮 )

Contributor Author:

I saw a thread here that discusses upgrading. Are we considering an upgrade? If not, I can install version 6 and reformat.

Member:

Simplest to install 6 I think, we're not going to change this in the short term. Anyway, whatever version of clang-format we'd settle on would probably be slightly different to the version you're using and you'd need to change anyway.

c/tskit/trees.c Outdated
const tsk_bit_array_t *A_samples, *B_samples;
tsk_size_t w_A = 0, w_B = 0, w_AB = 0;
tsk_bit_array_t *ss_row; // ss_ prefix refers to a sample set
tsk_bit_array_t ss_A_samples[num_sample_chunks], ss_B_samples[num_sample_chunks],
Member:

I don't think we can allocate on the stack like this, we'll eventually overflow for large sample sizes, and it's very hard to tell in advance when it's going to happen. It's just not worth the trouble.

Member:

(You might notice that we do a lot of allocing from the stack in the test suite - but we can get away with a lot of stuff there if it makes the code easier to write and read because the inputs are fixed)

Contributor Author:

In my refactoring of tsk_bit_array_t, I got rid of all of the VLAs here. We should be clear of the stack allocated VLAs now.


@lkirk left a comment:

I've refactored tsk_bit_array_t, taken care of the VLAs, and have addressed most of the comments, which I will mark as done to clean things up.


@jeromekelleher (Member) left a comment:

A few minor comments on memory safety. Haven't had a chance to go through fully.

c/tskit/trees.c Outdated
for (mutation_index = 0; mutation_index < site->mutations_length; mutation_index++) {
mutation = site->mutations[mutation_index];
/* Compute the allele index for this derived state value. */
allele = 0;
Member:

For-loop a bit more concise?

for (allele = 0; allele < num_alleles; allele++) {
    if (mutation.derived_state_length == allele_lengths[allele]...
        break;
    }
}

Contributor Author:

Indeed it would be. I copied this from get_allele_weights, FWIW; we could fix this in that function too.

@jeromekelleher (Member) left a comment:

See point about using tsk_safe_free - safety first when it comes to handling memory!

@lkirk commented Aug 10, 2023

Sorry I'm a bit late in responding to your comments @petrelharp.

> Well, do we have a use case that uses non-0/1 weights? If not, then we should not do the general case.

I do remember discussing something with @apragsdale, but I don't think we'd discussed anything concrete. Let's discuss when we meet, but I'm leaning towards simplifying things to counts-only for now, then expanding once we have a clear need.

> Right, I saw that, and that is helpful, but it still leaves questions - it's not precise enough for someone to compute exactly the same thing from scratch (especially, someone who doesn't have an idea beforehand of what is supposed to be computed).

I've created a small demo notebook that might help illustrate how exactly we're doing the normalization. It can be found here. In addition, please see my explanation at the bottom. Hopefully this helps with understanding.

> This is how we represent missing data: https://tskit.dev/tskit/docs/latest/data-model.html#sec-data-model-missing-data i.e., a sample that wasn't genotyped for a stretch of chromosome (we haven't dealt with this in the usual stats API yet, but would like to).

The solution you've proposed seems straightforward. I'm not sure how it would tie into normalization; I'd have to do some testing to see what makes sense.

> Oh, not long-winded at all, detail is good. =) So, I guess the downside here is that computing the LD matrix for a fixed number of sites scales (in memory) with the number of samples.
>
> Hm, so, the internal state is comparable to the (full) output when the number of samples is comparable to the number of sites (in a window). Good to know.
>
> To be clear, I think the approach here is very clever, and I don't think you should change it; we just want to be clear about the implications. It wouldn't scale to Biobank-scale data (~1M individuals), but that's okay; it's probably faster for smaller datasets than other approaches, and moving forward we could switch off to another algorithm for big computations.

I'm not exactly sure what you mean by the "full" output here. If we have a set with 1M samples / 1M sites / 1M mutations (meaning we're dealing w/ biallelic data), we expect to use 250GB to store the states, but we'd still need 4TB to store the actual statistics. In any case, I think a potential solution to scaling this is in the windowing/masking. If you could compute these stats in chunks, parallelized on a cluster, that'd probably be somewhat ideal. Especially given that windowed computation wouldn't see much overhead because you could limit many of the operations to the bounds of the window/mask. An iterative algo could also work but would be much slower (I think).

This does make me think about something else we might want in the API, though: in practice, won't people want to specify a specific set of SNPs to compute this with? For instance, just removing the singletons would reduce the size of the output substantially. Perhaps we want to pass in a "site mask" that says which sites to use?

I think that's what I was getting at with the windows part of the design (hinted in some of the code, but not actually implemented). However, the one issue with providing a window is that we'd need to return the sites back to the user once we've intersected their coordinates with the window coordinates. Otherwise the user wouldn't know which sites are making up the results matrix. (And I think a single window makes sense here, ideally we can chop out an NxM matrix from the entire results matrix, reducing the memory/computational requirements to a manageable size).

So, what you're saying about a site mask sounds more like what we want to do. By using a sites mask, the user knows what sites are going into the matrix. The problem comes down to the API: how much work do we reasonably want to put onto the user when it comes to intersecting/storing/manipulating sites? And how do we create a sites mask that is reasonably efficient in terms of memory? (I suppose we could use bit masks to store sites in a mask as well.)

That would help - I am still fuzzy on exactly how the normalization works (at least, I need the definition).

The notebook is linked above, but I'll also explain with actual words:

"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles.

For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply them by 1/9

"Hap Weighted" normalization: We multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison.

Using the same A/B locus as above, let's say that we are comparing 10 samples. If A1-B1 has no counts, we will multiply the stat result by 0. If A1-B2 has 5 counts, then we'll multiply the stat result by 1/2 (5/10).
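The two schemes above can be sketched in a few lines of Python. This is a toy illustration only: the function name, the `hap_counts` layout, and the trivial "statistic" are all made up for the example, and the real implementation works on bit arrays of samples rather than count matrices.

```python
# Toy illustration of the two normalization schemes described above.
# `stat` is a stand-in for any two-locus summary function; the haplotype
# counts and allele numbers are hypothetical.
def two_locus_stat(hap_counts, num_samples, stat, norm="total"):
    num_a = len(hap_counts)     # alleles at locus A
    num_b = len(hap_counts[0])  # alleles at locus B
    result = 0.0
    for i in range(num_a):
        for j in range(num_b):
            value = stat(hap_counts[i][j], num_samples)
            if norm == "total":
                # weight every allele pair equally: 1 / (num_a * num_b)
                weight = 1 / (num_a * num_b)
            else:  # "hap_weighted"
                # weight by the observed frequency of the Ai-Bj haplotype
                weight = hap_counts[i][j] / num_samples
            result += weight * value
    return result

# 3x3 alleles, 10 samples; A1-B1 has 0 counts (weight 0),
# A1-B2 has 5 counts (weight 1/2), as in the example above.
counts = [[0, 5, 0], [2, 0, 0], [0, 0, 3]]
freq = lambda c, n: c / n  # trivial "statistic": haplotype frequency
print(two_locus_stat(counts, 10, freq, norm="total"))
```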

@lkirk
Contributor Author

lkirk commented Aug 10, 2023

We met over zoom and decided that the best course of action is to wrap this up (it's getting a bit messy). In order to wrap things up, we're prioritizing the steps necessary for merge. I'll list out what I'll be doing when I get back from vacation:

  1. Rebase with main so that the circleci:build-32 test passes (see guide in stdpopsim)
  2. Finish up test coverage, covering bit arrays (I think the line that hits the nonbinary samples case will be eliminated when I adopt api functions).
  3. Use api functions for iteration + built in preorder
  4. Address memory safety issues
  5. Final bit array counting algo (the portable one -- it offers significant performance gains)
  6. Capture the remaining discussion points/concerns in github issues.

@petrelharp
Contributor

"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles.

For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply them by 1/9

"Hap Weighted" normalization: We multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison.

Ah, that makes sense. And, the notebook is helpful (and, a great start for test cases!).

And, okay. My feeling is that "hap weighted" or maybe "allele freq weighted" makes much more sense than "total", since "total" can be strongly affected by rare alleles (eg sequencing error). And in this case, the weighting is just a function of the frequencies, so can be computed within f instead of outside of it, which would remove the need for doing norm_f separately. But, normalization isn't adding a lot to the complexity of the code, so we can discuss this later. (Maybe you all have a use case where we need 'total' normalization?)

Want me to open an issue?

@lkirk
Contributor Author

lkirk commented Aug 10, 2023

"Total" normalization: multiply the result of each comparison between alleles by the proportion of alleles.
For example, if we are comparing an A locus with 3 alleles and a B locus with 3 alleles, we'd compute each statistic for A1-B1/A1-B2/A1-B3/A2-B1/... and we'd multiply them by 1/9
"Hap Weighted" normalization: We multiply the result of each comparison between alleles by the proportion of haplotypes for each comparison.

Ah, that makes sense. And, the notebook is helpful (and, a great start for test cases!).

And, okay. My feeling is that "hap weighted" or maybe "allele freq weighted" makes much more sense than "total", since "total" can be strongly affected by rare alleles (eg sequencing error). And in this case, the weighting is just a function of the frequencies, so can be computed within f instead of outside of it, which would remove the need for doing norm_f separately. But, normalization isn't adding a lot to the complexity of the code, so we can discuss this later. (Maybe you all have a use case where we need 'total' normalization?)

Want me to open an issue?

Interesting, I hadn't considered the point about rare variants. I think the main draw of using "total" is the fact that it seems to be correct for D and other statistics (demo'd a bit at the end of that notebook). In any case, please feel free to open an issue; I agree things could be simplified, and I think @apragsdale might have an opinion here as well. I'll get to writing the other issues when I return.

Member

@jeromekelleher jeromekelleher left a comment

Some minor comments, which I didn't get a chance to finish. More later!

c/tskit/trees.c Outdated
const tsk_treeseq_t *restrict ts = self->tree_sequence;

const tsk_size_t num_samples = tsk_treeseq_get_num_samples(ts);
const tsk_size_t num_tree_sites = ts->tree_sites_length[self->index];
Member

These are stored in the tree
(prob not much point in storing it in a variable)

Suggested change
const tsk_size_t num_tree_sites = ts->tree_sites_length[self->index];
const tsk_size_t num_tree_sites = self->sites_length;
const tsk_site_t *restrict tree_sites = self->sites;

Contributor Author

Good point, I eliminated that variable entirely.

c/tskit/trees.c Outdated
}

static int
get_mutation_samples(const tsk_tree_t *self, const tsk_size_t *site_offsets,
Member

A minor style point: if the first arg is self, then we'd usually give the function a "method name" like tsk_tree_get_mutation_samples, just so it's a bit easier to follow what "self" is. For stuff like this you might just want to rename self to tree.

Member

As this will be moving to stats.h, I guess it's better to think of it as a utility function that has tree as an arg, rather than a method of the Tree class. But it's just splitting hairs.

Contributor Author

I think it'd be tough to use this function in isolation; I would consider it to be an internal function until decided otherwise. In that vein, I've changed this parameter to tree instead of self.

c/tskit/trees.c Outdated
tsk_size_t num_mutations, mut_samples_offset, mut_offset;
tsk_size_t s, m, n;

tsk_memset(&mut_samples, 0, sizeof(mut_samples));
Member

This is code in the middle of declarations.

Contributor Author

Ah, I think I overcorrected a bit: I moved them before any goto call, but after all declarations.

c/tskit/trees.c Outdated

// Get the number of mutations
num_mutations = 0;
for (s = 0; s < num_tree_sites; s++) {
Member

Don't need to get a pointer, could do something like

for (s = 0; s < tree->num_sites; s++) {
    num_mutations += tree->sites[s].mutations_length;
}

Contributor Author

Fixed

@lkirk lkirk force-pushed the two-site-stats-c-api branch 2 times, most recently from 91c485c to aafb78f on August 29, 2023 18:22
@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Merging #2805 (b90dfa1) into main (d57ec83) will decrease coverage by 0.03%.
The diff coverage is 87.79%.

❗ Current head b90dfa1 differs from pull request most recent head 7ec1ff3. Consider uploading reports for the commit 7ec1ff3 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2805      +/-   ##
==========================================
- Coverage   89.79%   89.76%   -0.03%     
==========================================
  Files          30       30              
  Lines       29474    29892     +418     
  Branches     5737     5803      +66     
==========================================
+ Hits        26465    26832     +367     
- Misses       1726     1753      +27     
- Partials     1283     1307      +24     
Flag Coverage Δ
c-tests 86.15% <87.79%> (+0.05%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.88% <ø> (ø)
python-tests 99.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/trees.c 90.58% <87.36%> (-0.33%) ⬇️
c/tskit/core.c 95.58% <92.10%> (-0.21%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d57ec83...7ec1ff3. Read the comment docs.

Member

@jeromekelleher jeromekelleher left a comment

LGTM. I think the implementation is still overly complicated, but using the in-built mutation tracking in the tree code should simplify quite a bit.

I think we'll want to evolve the implementation a bit, but I've created some issues to track. We're good to merge after this round of comments, IMO.

c/tskit/core.c Show resolved Hide resolved
c/tskit/trees.c Outdated
get_mutation_samples(const tsk_tree_t *tree, const tsk_size_t *site_offsets,
tsk_bit_array_t *allele_samples, tsk_size_t **num_alleles)
{
int ret = 0;
Member

Nitpick, but not sure this whitespace within the declarations section helps understanding? Usually all declarations are done in one chunk, which makes it easier to see where the code begins (and you start thinking about memory safety and the potential state of variables after error-gotos)

Contributor Author

@lkirk lkirk Aug 31, 2023

You're right, I have a habit of breaking things up with whitespace, but sometimes I can go overboard. I've cleaned things up.

c/tskit/trees.c Outdated
// We currently perform one preorder traversal per mutation, but there
// might be more clever ways to do this. The preorder traversals begin at
// the focal mutation's node.
for (m = 0; m < num_mutations; m++) {
Member

I don't get why you need to use the arrays here: why not use the site and mutation objects within the tree struct?

for (s = 0; s < tree->sites_length; s++) {
     site = tree->sites[s];
     for (m = 0; m < site.mutations_length; m++) {
           node = site.mutations[m].node;
           // do traversal rooted at node. 
    }
}

Contributor Author

See my comment about offsets, I've cleaned things up, using the suggested loop.

c/tskit/trees.c Outdated
int ret = 0;
tsk_tree_t tree;

ret = tsk_tree_init(&tree, self, TSK_TS_INIT_BUILD_INDEXES | TSK_NO_SAMPLE_COUNTS);
Member

TSK_TS_INIT_BUILD_INDEXES isn't a valid option for tree_init.

Member

To me the get_mutation_samples function is more complicated than it needs to be, and it would be clearer to merge it into this function (as it does very little at the moment)

Contributor Author

I did a refactor of all of this as you've suggested. It seems much simpler. I think with some of the suggestions about the bit array refactor, this code will be simplified even more.

c/tskit/trees.c Outdated
const tsk_size_t num_samples = self->num_samples;
const tsk_size_t max_alleles = self->tables->mutations.num_rows + num_sites;
tsk_size_t *restrict site_offsets = tsk_malloc(num_sites * sizeof(*site_offsets));
tsk_size_t *restrict num_alleles = tsk_malloc(num_sites * sizeof(*num_alleles));
Member

There's no real point in making this a restrict pointer in this function; we're not doing any tight loops on it, just passing it on to other functions. Declaring it as a regular pointer means we can call tsk_safe_free and reduce the code a bit.

Contributor Author

Fixed.

c/tskit/trees.c Outdated
// Initialize the allele_samples with all samples in the ancestral allele
tsk_bit_array_get_row(&allele_samples, num_alleles_cumsum, &allele_samples_row);
tsk_bit_array_add(&allele_samples_row, &all_samples_bits);
// Store the allele offset for each site
Member

Do we need to do this stuff with site_offsets, if we're iterating over the mutations using the inbuilt structs from the tree (as suggested above)?

Contributor Author

These offsets were put in place to reduce the number of times I'd have to compute the offsets and to make sure all of the code stays in sync. Now that things have been simplified and I'm not using offsets in 3 places, I think it's best to get rid of the malloc'd array of offsets. We still need offsets to index into the array of allele_samples, but now it's loop counters instead of a whole malloc'd array.

Member

@jeromekelleher jeromekelleher left a comment

LGTM - let's squash n merge 🎉

(Can you squash down to one commit please @lkirk?)

Member

@jeromekelleher jeromekelleher left a comment

I think we need one more update to get test coverage @lkirk - if we merge now these cases will be forgotten and end up never being covered.

Most will be covered by provoking error conditions. The test_simplest_divergence_matrix function can be a good pattern to follow there for provoking errors.

Let's just remove the code for windows for now. I think the C code, certainly, should have a fixed set of sites provided as input which will simplify things. But let's leave that till next update.

&& tsk_memcmp(
mutation.derived_state, alleles[allele], allele_lengths[allele])
== 0) {
break;
Member

We should be covering this, really. Do we need an example with some extra mutations or something?

Contributor Author

I added a test with a couple of back-mutations, that covers it

all_samples->data[all_samples->size - 1]
= remainder_samples ? ~(all << remainder_samples) : all;
for (i = 0; i < all_samples->size - 1; i++) {
all_samples->data[i] = all;
Member

We should definitely be covering this - need a sample size > 2?

Contributor Author

I added a test with 35 samples, which is >32 samples, which covers it

if (tsk_bit_array_contains(
&bits_row, (tsk_bit_array_value_t) sample_index)) {
ret = TSK_ERR_DUPLICATE_SAMPLE;
goto out;
Member

We should be covering this

c/tskit/trees.c Outdated

ret = tsk_treeseq_check_sample_sets(
self, num_sample_sets, sample_set_sizes, sample_sets);
if (ret != 0) {
Member

And this - pass in some bad samples

stat_site = true;
}
/* It's an error to specify more than one mode */
if (stat_site + stat_branch > 1) {
Member

All these input errors should be covered

if (result_dim < 1) {
ret = TSK_ERR_BAD_RESULT_DIMS;
goto out;
}
Member

Since we're not sure about how the windows thing is going to work I vote we just assert(left_windows == NULL && right_windows == NULL); for now, and delete the uncovered code.

c/tskit/trees.c Outdated
result_size, result);
} else {
// TODO: branch statistics
ret = TSK_ERR_GENERIC;
Member

should be BAD_STAT_MODE, and should be covered

@jeromekelleher
Member

LGTM, happy to merge when you're satisfied with the code coverage.

@lkirk
Contributor Author

lkirk commented Sep 4, 2023

I think I've gotten all of the interesting test cases. It's annoying that I'm only able to partially hit this case and the one above it, but I think I'm satisfied with the outcome. Unfortunately, I'm slightly (2%) under the threshold for passing the coverage check. I think that's mostly to do with the number of out-of-memory errors that this code can throw. Let me know if you disagree.

@lkirk
Contributor Author

lkirk commented Sep 4, 2023

Re-squashed and ready

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Sep 4, 2023
@jeromekelleher
Member

@Mergifyio rebase

We also fully implement two-site statistics, mirroring the functionality
of the LdCalculator.

There are a number of tasks that need to be done before this is
production ready, but it is a start. We've opened a number of issues in
github for improving the work that was done for this PR and for
implementing the python API.

The general approach for the two-site statistics is to gather all of the
samples below each mutation over the whole tree sequence (we will be
implementing windows in the near future). To do this, we iterate over
each tree in the tree sequence once and perform a preorder traversal
starting at each mutation's node, storing the samples we encounter along
the way. We then perform another step that examines the parentage of
each mutation and removes samples that exist in child mutations. With
the samples for each allele, we perform intersections to get haplotype
weights, which are then passed to summary functions that fill out a
stats matrix.
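The gathering step described above can be sketched on a toy tree. Everything here is hypothetical for illustration (node labels, the `children` map, the mutation list); the real code traverses tskit's internal tree structures with bit arrays rather than Python sets:

```python
# Sketch of the per-mutation sample-gathering step described above, on a
# toy tree. Node names, the parent/children map, and the mutation list
# are hypothetical; real code would use tskit's tree structures.
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
samples = {2, 3, 4}               # sample (leaf) nodes
mutations = [(1, "T"), (4, "G")]  # (node, derived state); node 4 is below node 1

def samples_below(node):
    # preorder traversal collecting sample nodes in the subtree
    out, stack = set(), [node]
    while stack:
        u = stack.pop()
        if u in samples:
            out.add(u)
        stack.extend(children[u])
    return out

allele_samples = {node: samples_below(node) for node, _ in mutations}
# Remove samples captured by mutations on child branches: the sample below
# the mutation at node 4 carries "G", not the "T" from the mutation at node 1.
allele_samples[1] -= allele_samples[4]
print(allele_samples)  # {1: {3}, 4: {4}}
```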

We also implement a few happy-path tests that exercise the functionality
of sample sets and multi-allelic sites, and we cover all of the
non-memory-related error cases in the C API. We plan to do more in-depth
testing once the python API is implemented.

This PR also introduces the concept of a "bit array" in the form of
tsk_bit_array_t, which needs some further refinement. In essence, this
data structure stores all of the possible samples in a bit set where one
sample takes up one bit of information and storage space. This enables a
few operations that are critical to this code: 1) it's a relatively
memory-efficient method of storing sample identities; 2) we can
efficiently intersect the arrays (I believe much of this is
autovectorized by gcc); 3) we can use tricks like SWAR (SIMD within a
register) to count the number of samples in a set. It's important to note
that this is also where the algorithm spends ~50% of its time. It could
use more improvement in both the API design and the performance
characteristics.
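The bit-array idea can be sketched in a few lines. This is a conceptual illustration only — the word size, helper names, and popcount shortcut here are stand-ins, not the tsk_bit_array_t API:

```python
# Minimal sketch of the bit-array idea: each sample occupies one bit in
# an array of fixed-width words, which makes set intersection a per-word
# bitwise AND and set size a per-word popcount.
WORD_BITS = 32

def make_bit_array(num_samples, members):
    words = [0] * ((num_samples + WORD_BITS - 1) // WORD_BITS)
    for s in members:
        words[s // WORD_BITS] |= 1 << (s % WORD_BITS)
    return words

def intersect(a, b):
    # bitwise AND, word by word; in C this loop vectorizes well
    return [x & y for x, y in zip(a, b)]

def count(words):
    # per-word popcount (the C code uses a SWAR popcount instead)
    return sum(bin(w).count("1") for w in words)

a = make_bit_array(40, {0, 5, 33, 39})
b = make_bit_array(40, {5, 7, 39})
print(count(intersect(a, b)))  # 2 (samples 5 and 39 are in both sets)
```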
@mergify
Contributor

mergify bot commented Sep 4, 2023

rebase

✅ Branch has been successfully rebased

@jeromekelleher
Member

I'm not sure why mergify doesn't like this; just going to merge manually.

@jeromekelleher jeromekelleher merged commit d7acf1d into tskit-dev:main Sep 4, 2023
19 of 21 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Sep 4, 2023