Handle missing data in two-locus statistics #2831

lkirk · 2023-08-30T02:23:35Z

As mentioned here by @petrelharp during the review of #2805, we'd like a better treatment of missing data. As implemented, we compute $w_{AB}$, $w_{Ab}$, and $w_{aB}$, but use the total number of samples in the tree sequence as $n$. If there's missing data, this $n$ will not be correct. We should instead compute $n$ as $n=w_{AB}+w_{Ab}+w_{aB}+w_{ab}$ so that we properly account for missing data. This means that $n$ will be the minimum number of samples intersecting with the sample set at the left locus and the right locus.

This will require a bit of restructuring, because we'll need either to intersect all samples with the samples of the current valid tree on the left and right, or to seed the algorithm that propagates sample bit arrays across alleles.
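
To make the proposed definition of $n$ concrete, here's a rough sketch in plain Python. It is not the tskit implementation: `haplotype_weights`, the `MISSING = -1` coding, the biallelic 0/1 genotypes, and the convention that 1 maps to the `A`/`B` allele are all assumptions made for illustration. It skips any sample that is missing at either locus and then takes $n$ as the sum of the four haplotype weights.

```python
MISSING = -1  # assumption: missing genotypes are coded as -1


def haplotype_weights(left, right, sample_set):
    """Count two-locus haplotypes among samples with data at both loci.

    left, right: per-sample genotypes at the two sites (0, 1, or MISSING).
    sample_set: indices of the samples of interest.
    Returns (weights, n) where n = w_AB + w_Ab + w_aB + w_ab.
    """
    w = {"AB": 0, "Ab": 0, "aB": 0, "ab": 0}
    for s in sample_set:
        a, b = left[s], right[s]
        # A sample only contributes if it is non-missing at both loci.
        if a == MISSING or b == MISSING:
            continue
        key = ("A" if a == 1 else "a") + ("B" if b == 1 else "b")
        w[key] += 1
    n = sum(w.values())
    return w, n


# Hypothetical genotypes for six samples; samples 3 and 4 each have
# missing data at one of the two loci.
left = [1, 1, 0, 0, MISSING, 1]
right = [1, 0, 1, MISSING, 0, 1]
sample_set = range(6)

w, n = haplotype_weights(left, right, sample_set)
print(w, n, len(sample_set))
```

In this toy example the sample set has six samples, but only four have data at both loci, so normalising by the total sample count would understate every haplotype frequency.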

jeromekelleher · 2023-08-30T08:07:13Z

We also don't really handle missing data in the standard tree stats API, so I'd be happy to kick this (tricky) can down the road.
