We also fully implement two-site statistics, mirroring the functionality
of the LdCalculator.
There are a number of tasks to complete before this is production
ready, but it is a start. We've opened a number of issues on GitHub for
improving the work done in this PR and for implementing the Python API.
The general approach for the two-site statistics is to gather all of the
samples below each mutation over the whole tree sequence (windows will
be implemented in the near future). To do this, we iterate over each
tree in the tree sequence once and perform an in-order traversal
starting at each mutation's node, storing the samples we encounter along
the way. We then perform a second pass that examines the parentage of
each mutation and removes samples that appear under child mutations.
With the samples for each allele, we compute set intersections to obtain
haplotype weights, which are then passed to summary functions that fill
out a stats matrix.
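As a minimal sketch of that last step, the haplotype weights for a pair of alleles can be derived from the two sample sets alone: the count of samples carrying both alleles is the size of the intersection, and the one-sided counts follow by subtraction. The names below (`haplotype_weights`, `hap_weights_t`) and the 64-sample bitmask representation are illustrative assumptions, not the PR's actual API:

```c
#include <stdint.h>

/* Illustrative only: sample sets for <= 64 samples packed into a
 * single word; the PR uses tsk_bit_array_t for arbitrary sizes. */
typedef struct {
    uint32_t w_AB; /* samples carrying allele A and allele B */
    uint32_t w_Ab; /* samples carrying A but not B */
    uint32_t w_aB; /* samples carrying B but not A */
} hap_weights_t;

static hap_weights_t
haplotype_weights(uint64_t set_a, uint64_t set_b)
{
    hap_weights_t w;

    /* GCC/Clang builtin; a portable popcount would also work here */
    w.w_AB = (uint32_t) __builtin_popcountll(set_a & set_b);
    w.w_Ab = (uint32_t) __builtin_popcountll(set_a) - w.w_AB;
    w.w_aB = (uint32_t) __builtin_popcountll(set_b) - w.w_AB;
    return w;
}
```

A summary function would then consume these weights (together with the sample-set sizes) to produce one entry of the stats matrix.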
We also implement a few happy-path tests covering sample sets and
multi-allelic sites, and we cover all of the non-memory-related error
cases in the C API. We plan to do more in-depth testing once the Python
API is implemented.
This PR also introduces the concept of a "bit array" in the form of
tsk_bit_array_t, which needs some further refinement. In essence, this
data structure stores the set of possible samples in a bit set where one
sample takes up one bit of information and storage space. This enables a
few properties that are critical to this code: 1) it's a relatively
memory-efficient method of storing sample identities; 2) we can
efficiently intersect the arrays (I believe much of this is
autovectorized by GCC); 3) we can use tricks like SWAR (SIMD within a
register) to count the number of samples in a set. It's important to
note that this is also where the algorithm spends ~50% of its time, so
it could use improvement in both API design and performance
characteristics.
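To make the three properties concrete, here is a stripped-down sketch of such a bit array with intersection and a SWAR population count. The type and function names are hypothetical stand-ins; the real tsk_bit_array_t API will differ in naming, word size, and error handling:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD 32

/* Hypothetical sketch; not the actual tsk_bit_array_t layout. */
typedef struct {
    uint32_t *words;
    size_t n_words;
} bit_array_t;

static void
ba_init(bit_array_t *a, size_t n_bits)
{
    a->n_words = (n_bits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    a->words = calloc(a->n_words, sizeof(*a->words));
}

static void
ba_add(bit_array_t *a, size_t bit)
{
    a->words[bit / BITS_PER_WORD] |= (uint32_t) 1 << (bit % BITS_PER_WORD);
}

static void
ba_intersect(const bit_array_t *a, const bit_array_t *b, bit_array_t *out)
{
    /* A plain elementwise AND loop like this is what compilers
     * such as GCC can readily autovectorize. */
    for (size_t i = 0; i < a->n_words; i++) {
        out->words[i] = a->words[i] & b->words[i];
    }
}

/* SWAR popcount: counts set bits in parallel within one register,
 * with no lookup table and no per-bit loop. */
static uint32_t
swar_popcount(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return (((v + (v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

static size_t
ba_count(const bit_array_t *a)
{
    size_t count = 0;
    for (size_t i = 0; i < a->n_words; i++) {
        count += swar_popcount(a->words[i]);
    }
    return count;
}
```

Since intersect-then-count dominates the runtime, this is also where most future performance work (wider words, hardware popcount instructions) would land.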