Handle missing data in two-locus statistics #2831

Open
lkirk opened this issue Aug 30, 2023 · 1 comment
Comments

@lkirk
Contributor

lkirk commented Aug 30, 2023

As mentioned here by @petrelharp during the review of #2805, we'd like a better treatment of missing data. As implemented, we compute $w_{AB}$, $w_{Ab}$, and $w_{aB}$, but use the total number of samples in the tree sequence as $n$. If there's missing data, this $n$ will not be correct. We should instead compute $n=w_{AB}+w_{Ab}+w_{aB}+w_{ab}$ so that missing data is properly accounted for. With this definition, $n$ counts only the samples in the sample set that have non-missing data at both the left locus and the right locus.
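To make the difference concrete, here is a minimal NumPy sketch (not the tskit implementation) that counts the four haplotype weights for a pair of biallelic sites and compares $D$ computed with $n$ taken as the sum of the weights against $D$ computed with the total sample count. The function names, the toy genotype vectors, and the use of `-1` as the missing-data code are assumptions made for illustration.

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch


def haplotype_weights(left, right):
    """Count (w_AB, w_Ab, w_aB, w_ab) over samples non-missing at both sites.

    `left` and `right` are per-sample genotype vectors with 1 = derived
    (A / B), 0 = ancestral (a / b), and MISSING = no data.
    """
    left = np.asarray(left)
    right = np.asarray(right)
    valid = (left != MISSING) & (right != MISSING)
    l, r = left[valid], right[valid]
    w_AB = int(np.sum((l == 1) & (r == 1)))
    w_Ab = int(np.sum((l == 1) & (r == 0)))
    w_aB = int(np.sum((l == 0) & (r == 1)))
    w_ab = int(np.sum((l == 0) & (r == 0)))
    return w_AB, w_Ab, w_aB, w_ab


def d_stat(w_AB, w_Ab, w_aB, n):
    """D = p_AB - p_A * p_B, with frequencies taken relative to n."""
    p_AB = w_AB / n
    p_A = (w_AB + w_Ab) / n
    p_B = (w_AB + w_aB) / n
    return p_AB - p_A * p_B


# Toy data: sample 3 is missing at the right site, sample 4 at the left site.
left = [1, 1, 0, 0, MISSING, 1]
right = [1, 0, 1, MISSING, 0, 1]

weights = haplotype_weights(left, right)
w_AB, w_Ab, w_aB, w_ab = weights
n_valid = w_AB + w_Ab + w_aB + w_ab  # proposed definition of n
n_total = len(left)                  # total sample count, ignores missingness

d_valid = d_stat(w_AB, w_Ab, w_aB, n_valid)
d_total = d_stat(w_AB, w_Ab, w_aB, n_total)
```

In this toy case `n_valid` is 4 while `n_total` is 6, so the two values of $D$ differ even though the weights are identical.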

This will require a bit of restructuring: we will either need to intersect all samples with the samples of the current valid tree on the left and right, or we'll want to seed the algorithm that propagates sample bit arrays across alleles.
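As a rough illustration of the intersection option, the sketch below uses plain Python integers as sample bit arrays. It is not the C code in tskit. It assumes that a sample with missing data at a site appears in none of that site's allele bit arrays, so the union of the allele bit arrays gives the non-missing samples at that site. All names and the toy data are hypothetical.

```python
def bitset(indices):
    """Build an integer bit array with one bit set per sample index."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits


def popcount(bits):
    """Number of set bits, i.e. number of samples in the bit array."""
    return bin(bits).count("1")


def non_missing(alleles):
    """Union of all allele bit arrays at a site (assumed non-missing samples)."""
    mask = 0
    for bits in alleles.values():
        mask |= bits
    return mask


# Toy data for six samples: sample 4 is missing on the left, sample 3 on the right.
left_alleles = {"a": bitset([2, 3]), "A": bitset([0, 1, 5])}
right_alleles = {"b": bitset([1, 4]), "B": bitset([0, 2, 5])}
sample_set = bitset(range(6))

# Samples in the sample set with data at both loci.
valid = non_missing(left_alleles) & non_missing(right_alleles) & sample_set
n = popcount(valid)

# Haplotype weights, restricted to the valid samples.
w_AB = popcount(left_alleles["A"] & right_alleles["B"] & valid)
w_Ab = popcount(left_alleles["A"] & right_alleles["b"] & valid)
w_aB = popcount(left_alleles["a"] & right_alleles["B"] & valid)
w_ab = popcount(left_alleles["a"] & right_alleles["b"] & valid)
```

Because every weight is masked by `valid`, the four weights sum to `n` by construction, which is the property the proposed definition of $n$ relies on.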

@jeromekelleher
Member

We also don't really handle missing data in the standard tree stats API, so I'd be happy to kick this (tricky) can down the road.
