Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two Locus Statistics C API #2805

Merged
merged 1 commit into from
Sep 4, 2023

Commits on Sep 4, 2023

  1. Implement a two-locus C API for count statistics.

    We also fully implement two-site statistics, mirroring the functionality
    of the LdCalculator.
    
    There are a number of tasks that need to be done before this is
    production ready, but it is a start. We've opened a number of issues in
    github for improving the work that was done for this PR and for
    implementing the python API.
    
    The general approach for the two-site statistics is to gather all of the
    samples below each mutation over the whole tree sequence (we will be
    implementing windows in the near future). To do this, we iterate over
    each tree in the tree sequence once and perform an in-order traversal
    starting at each mutation's node, storing the samples we encounter along
    the way. We then perform another step that examines the parentage of
    each mutation and removes samples that exist in child mutations. With
    the samples for each allele, we perform intersections to get haplotype
    weights, which are then passed to summary functions that fill out a
    stats matrix.
    
    We also implement a few happy-path tests that test the functionality of
    sample sets and multi-allelic sites. We also cover all of the non
    memory-related error cases in the C api. We plan to do more in-depth
    testing once the python API is implemented.
    
    This PR also introduces the concept of a "bit array" in the form of
    tsk_bit_array_t, which needs some further refinement. In essence, this
    data structure stores all of the possible samples in a bit set where one
    sample takes up one bit of information and storage space. This enables a
    few functions that are critical to this code: 1) it's a relatively
    memory efficient method of storing sample identities 2) we can
    efficiently intersect the arrays (I believe much of this is
    autovectorized by gcc) 3) we can use tricks like SWAR (SIMD within an
    array) to count the number of samples in a set. It's important to note
    that this is also where the algorithm spends ~50% of it's time. It can
    use more improvement in both the API design and the performance
    characteristics.
    lkirk authored and jeromekelleher committed Sep 4, 2023
    Configuration menu
    Copy the full SHA
    7ec1ff3 View commit details
    Browse the repository at this point in the history