normalization for two-locus, multiallelic stats? #2816

petrelharp · 2023-08-11T18:34:45Z

Over at #2805, @lkirk has implemented various two-locus stats, e.g. r^2, D, etc. The strategy for computing these is to sum something over all pairs of alleles (one allele from each of the two loci). As currently implemented, there is a "normalization" function so that the value for a given pair of loci is

  \sum_i \sum_j F_{ij} W_{ij}

where F_{ij} is the summary function calculated for the pair of alleles, and W_{ij} is a weighting factor (the normalization).

@lkirk has proposed several weightings:

1. uniform weighting (aka "total"); so W_{ij} = 1/(number of pairs of alleles)
2. product of frequencies; so W_{ij} = p_i p_j, where p_i is the frequency of allele i
3. haplotype frequency; so W_{ij} = p_{ij}, where p_{ij} is the frequency of the combination of alleles i and j.

To this let me add one more:
4. unweighted; so W_{ij} = 1.
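
To make the comparison concrete, here is a minimal sketch of the four weightings in plain NumPy. It is not the code from #2805: `weighted_stat`, `d_squared`, and the toy haplotype frequencies are made up for illustration, and I'm assuming the summary function only needs the haplotype and allele frequencies.

```python
import numpy as np


def weighted_stat(hap_freqs, summary, weighting):
    """Return sum_i sum_j F_ij * W_ij for one pair of loci.

    hap_freqs[i, j] is the frequency of haplotypes carrying allele i at the
    first locus and allele j at the second locus (entries sum to 1).
    """
    p_ij = np.asarray(hap_freqs, dtype=float)
    p_i = p_ij.sum(axis=1, keepdims=True)  # allele frequencies, first locus
    p_j = p_ij.sum(axis=0, keepdims=True)  # allele frequencies, second locus
    F = summary(p_ij, p_i, p_j)
    weights = {
        "uniform": np.full(p_ij.shape, 1.0 / p_ij.size),  # 1 / (number of allele pairs)
        "product": p_i * p_j,
        "haplotype": p_ij,
        "unweighted": np.ones(p_ij.shape),
    }
    return float(np.sum(F * weights[weighting]))


def d_squared(p_ij, p_i, p_j):
    # Illustrative summary function: squared D for each pair of alleles.
    return (p_ij - p_i * p_j) ** 2


# Hypothetical two-allele-by-two-allele example.
hap_freqs = np.array([[0.4, 0.1], [0.1, 0.4]])
results = {
    w: weighted_stat(hap_freqs, d_squared, w)
    for w in ("uniform", "product", "haplotype", "unweighted")
}
```

Note that "unweighted" is always just "uniform" multiplied by the number of allele pairs, so the two differ only in how the result scales with the number of alleles.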

I suspect we don't actually need the weights at all. For either (2) or (3), we can just incorporate the weight into the summary function (this is how the one-locus stats work). Uniform weighting seems to have undesirable properties: for instance, adding a single new allele as a result of genotyping error could change the resulting value by quite a lot. However, @lkirk reports (see this notebook) that uniform weighting gives the right answer for some statistics but not for others. I haven't dug into what's going on, so I am filing this issue for us to think about later.
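
Here is a rough sketch of both points, again with made-up toy frequencies and hypothetical function names rather than the #2805 code. It folds the haplotype-frequency weight into the summary function, and then checks how much the uniform and unweighted values move when a rare third allele (standing in for a genotyping error) appears at the first locus.

```python
import numpy as np


def d_squared(p_ij):
    # Squared D for every pair of alleles, from the haplotype frequency table.
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    return (p_ij - p_i * p_j) ** 2


def haplotype_weighted(p_ij):
    # Weighting (3): sum of F_ij * p_ij with F_ij = D_ij^2.
    return float(np.sum(d_squared(p_ij) * p_ij))


def folded_summary(p_ij):
    # The same weight, moved inside the summary function.
    return d_squared(p_ij) * p_ij


def unweighted(p_ij, summary=d_squared):
    return float(np.sum(summary(p_ij)))


def uniform(p_ij, summary=d_squared):
    return float(np.mean(summary(p_ij)))


base = np.array([[0.4, 0.1], [0.1, 0.4]])
eps = 1e-3
# One haplotype class loses frequency eps to a spurious third allele at locus 1.
noisy = np.array([[0.4 - eps, 0.1], [0.1, 0.4], [eps, 0.0]])

# Folding the weight into the summary function gives the same value.
same = np.isclose(haplotype_weighted(base), unweighted(base, folded_summary))

# Relative change caused by the rare allele under each weighting.
change_uniform = abs(uniform(noisy) - uniform(base)) / uniform(base)
change_unweighted = abs(unweighted(noisy) - unweighted(base)) / unweighted(base)
```

In this toy case the uniform-weighted value drops by roughly a third, because the number of allele pairs goes from 4 to 6, while the unweighted sum changes by well under one percent.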

I guess my first question will be: in the example where we needed uniform weighting to get the right answer, can we just change the summary function so that "unweighted" gets the right answer?

And: @lkirk, please correct me if I've got some of this wrong!


lkirk · 2023-08-30T02:15:46Z

Thanks @petrelharp, all of this is accurate. I agree that we can reduce some complexity by incorporating normalization directly into the summary functions. I'd love to figure out a way to get around uniform weighting; we should discuss this in more detail.
