Add test_imputation.py file #2802
The following code produces a toy tree sequence with 101 individuals, which are then split into a ref. panel of 100 individuals and a query set of one individual; it then dumps the nodes, edges, sites, mutations, and individuals tables to separate text files. When I ran it, I got a ref. panel with 8 variable sites and a query with 4 variable sites. I think that should be good enough as a test case? I'll keep the code here for record-keeping purposes.
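A minimal sketch of the kind of script meant here (the original code block wasn't preserved in this thread, so the parameter values, rates, and seeds below are assumptions):

```python
import msprime

# Simulate a toy tree sequence with 101 haploid individuals.
ts = msprime.sim_ancestry(
    samples=101,
    ploidy=1,
    sequence_length=1_000_000,
    recombination_rate=1e-8,
    population_size=10_000,
    random_seed=1234,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1234)

# Split into a ref. panel of 100 individuals and a query set of one.
ref_panel_ts = ts.simplify(samples=ts.samples()[:100])
query_ts = ts.simplify(samples=ts.samples()[100:])

# Dump the nodes, edges, sites, mutations, and individuals tables of the
# reference panel to separate text files (and likewise for the query).
for name in ["nodes", "edges", "sites", "mutations", "individuals"]:
    with open(f"ref_panel_{name}.txt", "w") as f:
        ref_panel_ts.dump_text(**{name: f})
```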
Should we put those files in a new directory named, say, …?
For input to BEAGLE, we have two gzipped VCF files, which we don't need to keep here, I think.
On second thought, we might want to keep those VCF files, just to have a record of the input files to BEAGLE used to generate the forward and backward matrices.
The input tree sequences should be sufficiently reproducible, so we don't need to store the VCFs.
I'd like the examples we use for unit tests to be even smaller: 3 or 4 samples and < 10 sites.
It might be simpler to place the mutations "by hand" here on a small tree to start off with, like `Tree.generate_balanced`.
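For example, a minimal sketch of that approach (the site positions and mutation placements are arbitrary choices, not a proposed test case):

```python
import tskit

# Build a small balanced tree over 4 samples and add mutations "by hand".
tree = tskit.Tree.generate_balanced(4, span=10)
tables = tree.tree_sequence.dump_tables()

# One site per hand-placed mutation; positions, nodes, and states are arbitrary.
for position, node in [(2, 0), (5, 4), (8, 1)]:
    site_id = tables.sites.add_row(position=position, ancestral_state="0")
    tables.mutations.add_row(site=site_id, node=node, derived_state="1")

ts = tables.tree_sequence()
print(ts.genotype_matrix())
```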
For the fwd and bwd matrices, BEAGLE has scaled and shifted values, e.g. …
How about this?
When we get the fwd and bwd matrices, we want one run with all the sites and then another run with chip-like sites, right? If so, we probably want something like 9 variable sites, with 4 of them set to missing for the latter, I think.
Ah, actually, scrap the above comment. BEAGLE just computes the values for the sites in the target VCF file. The values for the in-between sites are interpolated thereafter.
I've run BEAGLE 4.1 using a ref. panel of 2 diploid individuals (BEAGLE doesn't take haploid individuals, it seems) and a query set consisting of 1 diploid individual. This toy dataset has 5 sites in the ref. panel and 4 sites in the query.

m = number of sites in ref. panel = 5
Forward prob. matrix of size (m, n) = (5, 4)
Total of log10(sum of forward prob.) of all sites in query: …
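For reference, a toy sketch of that summary statistic (the matrix values are placeholders, not BEAGLE's output):

```python
import numpy as np

# Placeholder forward matrix of shape (m, n) = (5, 4).
fwd = np.full((5, 4), 0.25)

# Total of log10(sum of forward probabilities) across the sites.
total = np.log10(fwd.sum(axis=1)).sum()
print(total)
```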
I'm storing everything in text and then converting to a Pandas dataframe or tree sequence.
Should we add an example with a simple genetic map? By default, BEAGLE 4.1 assumes that 1 Mb = 1 cM.
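For instance, a small sketch of writing a PLINK-format genetic map that makes the 1 Mb = 1 cM assumption explicit (I believe this is the format BEAGLE's map= argument expects, but treat that as an assumption; the positions are arbitrary):

```python
# Columns: chromosome, marker id, genetic position (cM), physical position (bp).
positions = [10_000, 250_000, 600_000, 900_000]  # arbitrary example positions
with open("toy.map", "w") as f:
    for pos in positions:
        cm = pos / 1e6  # 1 cM per Mb, matching BEAGLE 4.1's default
        f.write(f"1 . {cm:.6f} {pos}\n")
```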
We should probably also add the state probability matrix that combines the forward and backward probabilities.
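For the record, a minimal sketch of that combination step (the standard forward-backward posterior; the (sites, haplotypes) array convention is my assumption):

```python
import numpy as np

def state_probabilities(fwd, bwd):
    # Elementwise product of forward and backward values, renormalized
    # per site so that the state probabilities at each site sum to 1.
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)
```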
I've rerun BEAGLE using different query haplotypes, setting Ne to 10. BEAGLE sets Ne to 1 million by default. I don't fully understand why they set it so big, but I suspect it has to do with defining the switch probabilities with respect to the ratio between Ne and the ref. panel size. When Ne is set to 1 million and the ref. panel is as small as in the toy example, the switch probability is always 1; intuitively, this says that the ref. panel is so small that it is a very poor representation of the genetic diversity in a population with large Ne, so switching should always be favoured over mismatch. Below are the matrices.
The BEAGLE values in the backward matrix now vary and make more sense. I think we should next tweak the parametrization and/or mimic BEAGLE's scaling and shifting?
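To illustrate the saturation behaviour described above, here is a hedged sketch of an Ne-dependent switch probability. This is roughly the parametrization I understand BEAGLE 4.1 to use, but the exact formula should be checked against its source:

```python
import numpy as np

def switch_prob(gen_dist_cm, ne, n_ref_haps):
    # Tends to 1 when ne is much larger than the ref. panel size, which is
    # why the toy example with ne = 1e6 always favoured switching.
    return -np.expm1(-0.04 * ne * gen_dist_cm / n_ref_haps)

print(switch_prob(1.0, ne=1_000_000, n_ref_haps=4))  # ~1.0
print(switch_prob(1.0, ne=10, n_ref_haps=4))         # ~0.095
```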
In case I forget, I'm just jotting down some notes about BEAGLE 4.1's approach to calculating allele probabilities, which are used to get the MAP alleles at each imputed site of the query haplotypes.

h = number of haplotypes in ref. panel
…
I think I understand the difference between the state probability matrix and the allele probability matrix. The former is what you get from the forward-backward algorithm, and it contains the allele probabilities at the genotyped markers (excluding the ungenotyped markers that we want to impute). The latter contains allele probabilities interpolated using the allele probabilities of the flanking genotyped markers.

So, we should mimic what BEAGLE is doing, right? And interpolate the allele probabilities of the ungenotyped markers? So we are really implementing BEAGLE 4.1's algorithm in tskit.

Then, getting the MAP alleles at each ungenotyped marker site for each query haplotype is straightforward.
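As a sketch of what that would look like (linear interpolation weighted by map position; the function names and conventions are mine, not BEAGLE's):

```python
import numpy as np

def interpolate_allele_probs(p_left, p_right, pos_left, pos_right, pos):
    # Linear interpolation between the allele probabilities at the flanking
    # genotyped markers, weighted by genetic map position.
    w = (pos_right - pos) / (pos_right - pos_left)
    return w * p_left + (1 - w) * p_right

def map_allele(p_left, p_right, pos_left, pos_right, pos):
    # MAP allele at the ungenotyped marker for one query haplotype.
    probs = interpolate_allele_probs(p_left, p_right, pos_left, pos_right, pos)
    return int(np.argmax(probs))
```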
Great, thanks @szhan. Implementing BEAGLE 4.1's algorithm would be ideal, I guess - it's very good.
I've added a separate file (…).
Some notes about concepts used in the BEAGLE 4.1 algorithm.
There are nested indices to keep track of at each level of organisation, as far as I can tell.
Most of the implementation is there. I'm corresponding with Brian Browning to understand how reference haplotype segments are indexed.
In BEAGLE 4.1, the reference haplotype segments used to index the HMM state probability matrix are defined in the function … . Suppose that the first and last genotyped markers are not the first and last markers in the reference panel, respectively. Also suppose that the BEAGLE input parameters are set such that each segment is defined by two individual markers rather than aggregated markers. The following steps define the left (L) and right (R) indices of the segments:

…
It's simpler if we don't follow their way of normalizing the probability values in the matrices. We really just want to use Equation 1 in BB2016, but in their implementation they follow Equation 2.
It doesn't seem that BEAGLE uses the interpolated allele probabilities directly to get the imputed alleles. It computes the genotype probability (simply equal to the product of the probability of allele 0 and the probability of allele 1) and then gets the genotype with the highest probability, I think. Also, I'm not getting the scaling of the probabilities right...
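A sketch of that genotype step as I understand it (again, the names and conventions are mine):

```python
import numpy as np

def map_genotype(p_hap1, p_hap2):
    # Genotype probability as the product of the two haplotypes' allele
    # probabilities; return the most probable (allele1, allele2) pair.
    geno_probs = np.outer(p_hap1, p_hap2)
    return np.unravel_index(np.argmax(geno_probs), geno_probs.shape)
```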
We want to test some specific examples against the new backward matrix code in #2799. To be certain that we're interpreting the model correctly, we can output the forward and backward matrices from BEAGLE and compare them with the results we get from running the local LS HMM code.
We don't want to rely on msprime simulations producing the same results over time, so what we can do is run a small simulation and output it to text (using `ts.dump_text()`), storing the results in the file as examples. Then, we run BEAGLE on these same simulations with a few different parameter combinations and capture the forward and backward matrices, as well as the final imputation output. We can save these outputs into the file as Python arrays (or text, if that's simpler). Then we compare with our own output.
We can start with a simple example, but we should try and get a few awkward corner cases in there as well.
Note that we don't currently have a sensible high-level API for the LS model in tskit, so we should wrap the low-level API in functions locally in this file, for now.
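For instance, the local helpers could look something like this minimal numpy sketch of the Li & Stephens forward pass (a stand-in for wrapping the low-level API; the per-site rho/mu parametrization is an assumption, not tskit's interface):

```python
import numpy as np

def ls_forward(ref_haps, query, rho, mu):
    # ref_haps: (m sites, n haplotypes) 0/1 matrix; query: length-m vector.
    # rho, mu: per-site switch and mismatch probabilities.
    m, n = ref_haps.shape
    fwd = np.zeros((m, n))
    emit = np.where(ref_haps[0] == query[0], 1 - mu[0], mu[0])
    fwd[0] = emit / n
    for j in range(1, m):
        emit = np.where(ref_haps[j] == query[j], 1 - mu[j], mu[j])
        # Either stay on the same reference haplotype or switch uniformly.
        fwd[j] = emit * ((1 - rho[j]) * fwd[j - 1] + rho[j] * fwd[j - 1].sum() / n)
    return fwd
```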