-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to deal with unknown rates when running simulations #1604
Comments
In the flanks, it's pretty clear that we just have a rate of 0, so that we don't bother simulating any breakpoints, which will just be thrown away anyway. This is true regardless of the size of the flanks: it doesn't just have to be telemetric regions. So if we want to simulate 1Mb on one arm of chr1 in humans, we can just If, however, there's a missing rate in the middle of a chromosome, it's much more complicated. Firstly, note that this is likely to be actually pretty rare, because even across the centromere we usually have a valid and reasonable rate, calculated by looking at the probability of switching using one allele to the right and another to the left of the long centromeric region. The problem is that whatever rate we chose here will influence the linkage between regions to the left and to the right of the missing region. (At least) 3 rates could be used:
Finally, note that we probably don't want to alter the recombination map to invoice its side effect of masking out e.g. the centromeric region, because then we would lose the (accurate) recombination rate over the centromere. One possibility is to create a RateMap for gene conversion (see #1212) which has an unknown rate in the centromere. This is a bit ugly, e.g. in the case of no gene conversion we would implement a gene_conversion RateMap that consists of zeros and NaNs, simply to mask out the centromere. But it has the benefit of being biologically correct: we really don't know the GC rate in the centromere, which is why it gets excluded in the output TS. |
One thing to note about the centromeres. I agree we have accurate information about the recombination across them which we shouldn't lose, but we don't have accurate recombination data within them. Consider this example:
Suppose the centromere spans from 1000 to 1100. We have a rate associated with the interval, but it's really only measured from the variants at either end. In the middle, we just don't know. If we simulate with these rates, we'll get recombination breakpoints and trees within the centromere, which we don't know anything about. What we want to do is to mark the bit that we don't know about as missing data so that we don't simulate any recombination in that interval, but also so that we maintain the correct recombination rate across the centromere. I'm not quite sure how we'd do that, but that's the aim anyway. There's two reasons we want to make sure that recombination rate within the centromere is 0 when we run the simulation:
So, TLDR: I agree we should have the correct recombination rate across the centromere, but we should have zero recombination within the centromere. |
I'm not 100% convinced. It think it depends on whether there really is (biologically) recombination within the centromere, which I understand might be the case. If there is, then perhaps we might want to simulate it, even if (later on) the mutational process is such that we can't actually spot the recombination event. I guess the problem here is that there will be (unknown) levels of recombination within the centromere, but probably low near the middle and higher nearer the edges. Because we don't actually know the variation in rates inside the centromere, we're stuck between modelling it as a zero rate with a sudden step at (each?) end, or a constant rate across the whole thing. Neither is correct. But one might be a better approximation than another. And actually, the same applies for hard-to-sequence regions like the tandem rDNA repeats on the tips of the acrocentrics. Eventually I assume that T2T sequencing technology will mean that we can sequence across these repeat regions anyway, so we might want to think what the right thing to do here would be if we actually had the true sequence in those regions. |
If the rate in a recombination map is calculated from the genetic distance, there will be no rate=NaN regions (except perhaps the flanks), right? I don't see how missingness information in internal regions of the map would be incorporated into the recombination map file without breaking the (presumably desirable) relationship between the rates and the genetic distance. So the missingness info needs to be provided from an additional source anyhow. What if the user gave a missingness mask to |
#1599 introduced the idea of an "unknown" rate within a RateMap using NaNs. This solves a lot of problems, but it doesn't tell us how we should run the actual simulations, which will need some rate in the regions.
As @hyanwong points out, a sensible first pass here would be to raise an error if someone puts unknown values anywhere but the flanks (where we know how to deal with them).
@hyanwong, can you summarise the state of play otherwise please?
The text was updated successfully, but these errors were encountered: