Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Implement a two-locus C API for count statistics.
We also fully implement two-site statistics, mirroring the functionality of the LdCalculator. There are a number of tasks that need to be done before this is production ready, but it is a start. We've opened a number of issues in github for improving the work that was done for this PR and for implementing the python API. The general approach for the two-site statistics is to gather all of the samples below each mutation over the whole tree sequence (we will be implementing windows in the near future). To do this, we iterate over each tree in the tree sequence once and perform an in-order traversal starting at each mutation's node, storing the samples we encounter along the way. We then perform another step that examines the parentage of each mutation and removes samples that exist in child mutations. With the samples for each allele, we perform intersections to get haplotype weights, which are then passed to summary functions that fill out a stats matrix. We also implement a few happy-path tests that test the functionality of sample sets and multi-allelic sites. We plan to do more in-depth testing once the python API is implemented. This PR also introduces the concept of a "bit array" in the form of tsk_bit_array_t, which needs some further refinement. In essence, this data structure stores all of the possible samples in a bit set where one sample takes up one bit of information and storage space. This enables a few functions that are critical to this code: 1) it's a relatively memory efficient method of storing sample identities 2) we can efficiently intersect the arrays (I believe much of this is autovectorized by gcc) 3) we can use tricks like SWAR (SIMD within an array) to count the number of samples in a set. It's important to note that this is also where the algorithm spends ~50% of it's time. It can use more improvement in both the API design and the performance characteristics.
- Loading branch information