TreeSequence.f2 is not symmetric with multiallelic sites #2919
Comments
Hm, you're right. This was certainly not intentional. And, it's true for
Here's the issue: the definition we gave for f4 (and by extension, f3 and f2) is
This sounds symmetric under switching (a,b) with (c,d), and indeed it is for biallelic loci, since
is the same for biallelic loci as
and hence the same as
However, with more than two alleles, we can have b and d differing yet still neither agreeing with a and c. So, I see two options:
I'm in favor of (2) because:

However, if you know of a reason that the symmetrized version (or something else?) is a more natural estimator or something, then that's worth considering?

FYI, we've defined all the statistics in a way that makes sense with more-than-biallelic sites: the strategy taken is to treat each allele as a binary split between "the allele" and "all other alleles", and compute statistics in a way that sums a summary over these splits (see the paper for more discussion). I'm not aware of any other issues like this one that arise from multiallelic sites: for instance,
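To make the asymmetry concrete, here is a minimal sketch of the "sum a summary over per-allele splits" strategy described above. It is not tskit's implementation: the function name `f2_per_allele_sum` is hypothetical, it works on population allele frequencies rather than sample counts, and it assumes samples are drawn with replacement, so that for each allele the summary is P(a1 and a2 carry it, b1 and b2 do not) minus P(a1 and b2 carry it, a2 and b1 do not).

```python
import math


def f2_per_allele_sum(p, q):
    """Sum a per-allele f2-style summary over all alleles at one site.

    p[i], q[i] are the frequencies of allele i in populations A and B.
    Assumption (not taken from the tskit source): sampling with replacement,
    so the per-allele summary is
        p^2 (1-q)^2 - p (1-p) (1-q) q  =  p (1-q) (p - q).
    """
    return sum(pi * (1 - qi) * (pi - qi) for pi, qi in zip(p, q))


# Biallelic site: summing over both alleles gives (p - q)^2, which is symmetric.
p_bi, q_bi = (0.3, 0.7), (0.6, 0.4)
print(f2_per_allele_sum(p_bi, q_bi), f2_per_allele_sum(q_bi, p_bi))  # both approx. 0.09

# Triallelic site: A is fixed for allele 0, B is split between alleles 1 and 2.
p_tri, q_tri = (1.0, 0.0, 0.0), (0.0, 0.5, 0.5)
print(f2_per_allele_sum(p_tri, q_tri))  # 1.0
print(f2_per_allele_sum(q_tri, p_tri))  # 0.5
```

In the triallelic case two samples from B can differ from each other while neither agrees with A, which is the situation in which swapping the arguments changes the value.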
Hi Peter,

Thanks a lot for your thorough investigation and explanation of the issue. Perhaps a lot of the difference in intuition (and why this behavior of tskit was surprising to me) is that I tend to think of However, I think from the

In addition, most applications of In summary, I think the symmetric version is just the nicer, more convenient and, to me, more intuitive extension of

Ben
Thanks very much - this is compelling (but I'm not yet sure). I think we considered the However, I think that our current version, symmetrized (i.e., Thanks for helping get this right.

p.s. I ran across the thread in which we worked out the multiallelic thing; it would have been nice to have your input at the time!
The implementation of the $f_2$-statistic for data with multiallelic sites is not symmetric (as would be expected from the definition of the statistic). I narrowed it down to the following minimal example:
Digging a bit deeper, I found the issue is very likely due to the presence of a multiallelic site; coming from ms, I assumed msprime would generate infinite sites by default, when it does not. Using fixes the problem
As F-statistics are not commonly run (or even defined for) multiallelic sites, this might not be a great practical problem, but I decided to post it anyways since it appears like the code for many statistics is shared.
Other info:
The data set generated is
Versions used: