Skip to content

Commit

Permalink
Implement a two-locus C API for count statistics.
Browse files Browse the repository at this point in the history
We also fully implement two-site statistics, mirroring the functionality
of the LdCalculator.

There are a number of tasks that need to be done before this is
production ready, but it is a start. We've opened a number of issues in
github for improving the work that was done for this PR and for
implementing the python API.

The general approach for the two-site statistics is to gather all of the
samples below each mutation over the whole tree sequence (we will be
implementing windows in the near future). To do this, we iterate over
each tree in the tree sequence once and perform an in-order traversal
starting at each mutation's node, storing the samples we encounter along
the way. We then perform another step that examines the parentage of
each mutation and removes samples that exist in child mutations. With
the samples for each allele, we perform intersections to get haplotype
weights, which are then passed to summary functions that fill out a
stats matrix.

We also implement a few happy-path tests that test the functionality of
sample sets and multi-allelic sites. We plan to do more in-depth testing
once the python API is implemented.

This PR also introduces the concept of a "bit array" in the form of
tsk_bit_array_t, which needs some further refinement. In essence, this
data structure stores all of the possible samples in a bit set where one
sample takes up one bit of information and storage space. This enables a
few functions that are critical to this code: 1) it's a relatively
memory efficient method of storing sample identities 2) we can
efficiently intersect the arrays (I believe much of this is
autovectorized by gcc) 3) we can use tricks like SWAR (SIMD within an
array) to count the number of samples in a set. It's important to note
that this is also where the algorithm spends ~50% of it's time. It can
use more improvement in both the API design and the performance
characteristics.
  • Loading branch information
lkirk committed Sep 1, 2023
1 parent d6c9cf7 commit 5fdf850
Show file tree
Hide file tree
Showing 6 changed files with 1,380 additions and 1 deletion.
79 changes: 79 additions & 0 deletions c/tests/test_core.c
Original file line number Diff line number Diff line change
Expand Up @@ -526,6 +526,84 @@ test_avl_random(void)
validate_avl(sizeof(keys) / sizeof(*keys), keys);
}

static void
test_bit_arrays(void)
{
// NB: This test is only valid for the 32 bit implementation of bit arrays. If we
// were to change the chunk size of a bit array, we'd need to update these tests
tsk_bit_array_t arr;
tsk_bit_array_init(&arr, 90, 1);
for (tsk_bit_array_value_t i = 0; i < 20; i++) {
tsk_bit_array_add_bit(&arr, i);
}
tsk_bit_array_add_bit(&arr, 63);
tsk_bit_array_add_bit(&arr, 65);

// these assertions are only valid for 32-bit values
CU_ASSERT_EQUAL_FATAL(arr.data[0], 1048575);
CU_ASSERT_EQUAL_FATAL(arr.data[1], 2147483648);
CU_ASSERT_EQUAL_FATAL(arr.data[2], 2);

// verify our assumptions about bit array counting
CU_ASSERT_EQUAL_FATAL(tsk_bit_array_count(&arr), 22);

tsk_bit_array_free(&arr);

// create a length-2 array with 64 bit capacity
tsk_bit_array_init(&arr, 64, 2);
tsk_bit_array_t arr_row1, arr_row2;

// select the first and second row
tsk_bit_array_get_row(&arr, 0, &arr_row1);
tsk_bit_array_get_row(&arr, 1, &arr_row2);

// fill the first 50 bits of the first row
for (tsk_bit_array_value_t i = 0; i < 50; i++) {
tsk_bit_array_add_bit(&arr_row1, i);
}
// fill bits 20-40 of the second row
for (tsk_bit_array_value_t i = 20; i < 40; i++) {
tsk_bit_array_add_bit(&arr_row2, i);
}

// verify our assumptions about row selection
CU_ASSERT_EQUAL_FATAL(arr.data[0], 4294967295);
CU_ASSERT_EQUAL_FATAL(arr.data[1], 262143);
CU_ASSERT_EQUAL_FATAL(arr_row1.data[0], 4294967295);
CU_ASSERT_EQUAL_FATAL(arr_row1.data[1], 262143);

CU_ASSERT_EQUAL_FATAL(arr.data[2], 4293918720);
CU_ASSERT_EQUAL_FATAL(arr.data[3], 255);
CU_ASSERT_EQUAL_FATAL(arr_row2.data[0], 4293918720);
CU_ASSERT_EQUAL_FATAL(arr_row2.data[1], 255);

// subtract the second from the first row, store in first
tsk_bit_array_subtract(&arr_row1, &arr_row2);

// verify our assumptions about subtraction
CU_ASSERT_EQUAL_FATAL(arr_row1.data[0], 1048575);
CU_ASSERT_EQUAL_FATAL(arr_row1.data[1], 261888);

tsk_bit_array_t int_result;
tsk_bit_array_init(&int_result, 64, 1);

// their intersection should be zero
tsk_bit_array_intersect(&arr_row1, &arr_row2, &int_result);
CU_ASSERT_EQUAL_FATAL(int_result.data[0], 0);
CU_ASSERT_EQUAL_FATAL(int_result.data[1], 0);

// now, add them back together, storing back in a
tsk_bit_array_add(&arr_row1, &arr_row2);

// now, their intersection should be the subtracted chunk (20-40)
tsk_bit_array_intersect(&arr_row1, &arr_row2, &int_result);
CU_ASSERT_EQUAL_FATAL(int_result.data[0], 4293918720);
CU_ASSERT_EQUAL_FATAL(int_result.data[1], 255);

tsk_bit_array_free(&int_result);
tsk_bit_array_free(&arr);
}

static void
test_meson_version(void)
{
Expand Down Expand Up @@ -554,6 +632,7 @@ main(int argc, char **argv)
{ "test_avl_sequential", test_avl_sequential },
{ "test_avl_interleaved", test_avl_interleaved },
{ "test_avl_random", test_avl_random },
{ "test_bit_arrays", test_bit_arrays },
{ "test_meson_version", test_meson_version },
{ NULL, NULL },
};
Expand Down
Loading

0 comments on commit 5fdf850

Please sign in to comment.