normalization for two-locus, multiallelic stats? #2816

Open
petrelharp opened this issue Aug 11, 2023 · 1 comment

Comments

@petrelharp
Contributor

petrelharp commented Aug 11, 2023

Over at #2805, @lkirk has implemented various two-locus stats, e.g. r^2, D, etc. The strategy for computing these is to sum something over all pairs of alleles (one allele from each of the two loci). As currently implemented, there is a "normalization" function, so that the value for a given pair of loci is

  \sum_i \sum_j F_{ij} W_{ij}

where F_{ij} is the summary function calculated for the pair of alleles, and W_{ij} is a weighting factor (the normalization).

@lkirk has proposed several weightings:

  1. uniform weighting (aka "total"); so W_{ij} = 1/(number of pairs of alleles)
  2. product of frequencies; so W_{ij} = p_i p_j, where p_i is the frequency of allele i
  3. haplotype frequency; so W_{ij} = p_{ij}, where p_{ij} is the frequency of the combination of alleles i and j.

To this let me add one more:
4. unweighted; so W_{ij} = 1.
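
To make the four options concrete, here is a minimal sketch in plain Python. It is not the code from #2805: the function names, the scheme labels other than "total", the haplotype-frequency table `hap`, and the choice of squared disequilibrium as the summary function F_{ij} are all assumptions made for illustration.

```python
def allele_freqs(hap):
    """Marginal allele frequencies from a table hap[i][j] of haplotype frequencies."""
    p_a = [sum(row) for row in hap]          # alleles at the first locus
    p_b = [sum(col) for col in zip(*hap)]    # alleles at the second locus
    return p_a, p_b


def d_squared(hap, p_a, p_b, i, j):
    """Illustrative summary function F_ij: squared disequilibrium for pair (i, j)."""
    return (hap[i][j] - p_a[i] * p_b[j]) ** 2


def weight(scheme, hap, p_a, p_b, i, j):
    """W_ij under each of the four schemes listed above (labels are hypothetical)."""
    if scheme == "total":            # (1) uniform over allele pairs
        return 1.0 / (len(p_a) * len(p_b))
    if scheme == "allele_product":   # (2) product of allele frequencies
        return p_a[i] * p_b[j]
    if scheme == "haplotype":        # (3) haplotype frequency
        return hap[i][j]
    if scheme == "unweighted":       # (4) no weighting
        return 1.0
    raise ValueError(f"unknown scheme: {scheme}")


def two_locus_stat(hap, summary, scheme):
    """Compute sum_i sum_j F_ij * W_ij for one pair of loci."""
    p_a, p_b = allele_freqs(hap)
    return sum(
        summary(hap, p_a, p_b, i, j) * weight(scheme, hap, p_a, p_b, i, j)
        for i in range(len(p_a))
        for j in range(len(p_b))
    )


# Toy biallelic example; the haplotype frequencies sum to 1.
hap = [[0.4, 0.1],
       [0.1, 0.4]]
results = {
    scheme: two_locus_stat(hap, d_squared, scheme)
    for scheme in ("total", "allele_product", "haplotype", "unweighted")
}
```

In this symmetric biallelic toy case the first three schemes agree, and "unweighted" is larger by a factor equal to the number of allele pairs; the schemes only start to differ once allele frequencies are unequal or there are more than two alleles.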

I suspect we don't actually need the weights at all. For either (2) or (3), we can just incorporate the weight into the summary function (and this is how the one-locus stats work). Uniform weighting seems like it has undesirable properties: for instance, adding a single new allele as a result of genotyping error could make the resulting value change by quite a lot. However, @lkirk reports (see this notebook) that using uniform weighting gets the right answer for some statistics, while other ones do not. I haven't dug into what's going on, so I am filing the issue for us to think about later.
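
A rough sketch of both points, again in plain Python with made-up names and squared disequilibrium standing in for the summary function (none of this is taken from #2805): folding the haplotype-frequency weight into the summary function gives the same value as weighting separately, and under uniform weighting a single rare extra allele shifts the result noticeably.

```python
def marginals(hap):
    """Allele frequencies at each locus from haplotype frequencies hap[i][j]."""
    return [sum(row) for row in hap], [sum(col) for col in zip(*hap)]


def stat(hap, summary, weight):
    """sum_i sum_j F_ij * W_ij, with F and W passed in as functions."""
    p_a, p_b = marginals(hap)
    return sum(
        summary(hap, p_a, p_b, i, j) * weight(hap, p_a, p_b, i, j)
        for i in range(len(p_a))
        for j in range(len(p_b))
    )


def d2(hap, p_a, p_b, i, j):
    """Assumed summary function: squared disequilibrium for allele pair (i, j)."""
    return (hap[i][j] - p_a[i] * p_b[j]) ** 2


def d2_times_hap(hap, p_a, p_b, i, j):
    """Same summary function with the haplotype-frequency weight folded in."""
    return hap[i][j] * d2(hap, p_a, p_b, i, j)


def w_hap(hap, p_a, p_b, i, j):
    return hap[i][j]


def w_one(hap, p_a, p_b, i, j):
    return 1.0


def w_uniform(hap, p_a, p_b, i, j):
    return 1.0 / (len(p_a) * len(p_b))


hap = [[0.4, 0.1],
       [0.1, 0.4]]

# Weight applied separately vs. folded into the summary function.
weighted = stat(hap, d2, w_hap)
folded = stat(hap, d2_times_hap, w_one)

# Add a rare third allele at the first locus (say, a genotyping error).
eps = 0.001
hap_err = [[0.4 - eps, 0.1],
           [0.1, 0.4],
           [eps, 0.0]]

uniform_before = stat(hap, d2, w_uniform)
uniform_after = stat(hap_err, d2, w_uniform)
hap_before = stat(hap, d2, w_hap)
hap_after = stat(hap_err, d2, w_hap)

print("uniform:  ", uniform_before, "->", uniform_after)
print("haplotype:", hap_before, "->", hap_after)
```

With these toy numbers the uniformly weighted value drops by roughly a third once the rare allele is added (the denominator goes from 4 allele pairs to 6), while the haplotype-weighted value barely moves.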

I guess my first question will be: in the example where we needed uniform weighting to get the right answer, can we just change the summary function so that "unweighted" gets the right answer?

And: @lkirk, please correct me if I've got some of this wrong!

@lkirk
Contributor

lkirk commented Aug 30, 2023

Thanks @petrelharp, all of this is accurate. I agree that we can reduce some complexity by incorporating normalization directly into the summary functions. I'd love to figure out a way to get around uniform weighting; we should discuss this in more detail.
