RateMap API update #1599
I had the same thought in #1598, but used "from" and "to". I guess "left" and "right" are fine too. I slightly prefer "from" and "to" because I associate "left" and "right" with edges, but perhaps that's not a problem. Either is fine by me. |
Codecov Report
@@ Coverage Diff @@
## main #1599 +/- ##
==========================================
+ Coverage 91.19% 91.25% +0.05%
==========================================
Files 20 20
Lines 10123 10184 +61
Branches 2130 2134 +4
==========================================
+ Hits 9232 9293 +61
Misses 452 452
Partials 439 439
|
Note that I think we should use (at least) |
OK, thanks @hyanwong. Let's use |
28978e5 to 7043daf
This has morphed into a substantial update of the RateMap API, but a necessary one I think.

Unknown values

The idea is to generalise from the current approach of treating the flanking left and right regions specially using the Initially here we just mark the start and end of chromosomes read by The plan then would be to follow up with a parameter

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too. I guess the right interface for specifying that is something we need to think about.

Implement the mapping protocol

It seems like a good idea to implement the Mapping protocol where we view the RateMap as a mapping from Intervals to rates. I've made some changes in there to reflect this.

Pinging @hyanwong, @grahamgower, @petrelharp for thoughts here? |
Ha. I actually wondered if we should have some sort of marker for unknown flanks rather than using 0. But I hadn't thought about extending the idea to regions in the middle of the chromosome. I think this is a great idea. The only thing that would make me think otherwise is if we actually wanted to provide telomere/centromere functionality in a completely different way, by having a "mask" that we overlay onto the TS somehow. That would make it easier to "mask out" regions in general (e.g. for recombination, gene conversion, and mutations, as well as for statistical calculations), rather than having to implement something specific to recombination rate maps, then again to mutation rate maps, then again to stats calculations. I guess if |
It doesn't matter at the ends of the chromosomes. So you could easily implement the NaN behaviour to work for the ends of the chromosomes, and for the moment raise a NotImplementedError when there's a NaN in the middle of the chromosome (until we decide what the right behaviour is). Incidentally, if it's free recombination (which I suspect is the right thing), then that provides an easy way to create multiple chromosomes, assuming that we can add a NaN region of negligible width between one integer site and the next. |
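The interim behaviour suggested in the comment above could be sketched as follows. This is only an illustration, not the code in this PR: the function name `check_missing_flanks_only` and the plain-list representation of the rates are assumptions made for the example.

```python
import math


def check_missing_flanks_only(rate):
    """Return (start, stop) indices delimiting the known block of ``rate``.

    Hypothetical helper: NaN marks a missing rate. NaNs are accepted only
    as contiguous runs at either end of the map; a NaN in the interior
    raises NotImplementedError until the right semantics are decided.
    """
    n = len(rate)
    # Skip the leading run of missing values.
    start = 0
    while start < n and math.isnan(rate[start]):
        start += 1
    # Skip the trailing run of missing values.
    stop = n
    while stop > start and math.isnan(rate[stop - 1]):
        stop -= 1
    # Anything missing between the two flanks is not supported yet.
    for j in range(start, stop):
        if math.isnan(rate[j]):
            raise NotImplementedError(
                f"missing rate in interior interval {j} is not supported yet"
            )
    return start, stop
```

With this sketch, `[nan, 1e-8, 2e-8, nan]` is accepted and the known block is the half-open index range `(1, 3)`, while `[1.0, nan, 2.0]` raises.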
Ah, I see that my earlier text wasn't clear:
I meant to say "snip out the regions of the output tree sequence in the regions of the input recombination map that were marked unknown". This is basically what you're saying above, isn't it? So, by default we snip out any regions that are unknown from the output tree sequence before returning it to the user. This means that everything downstream works like it should - if the region was marked as unknown, then it becomes missing data in the output ts. In terms of combining different unknown regions from recomb and gene conversion, I'd imagine we just take the union. If the region is unknown for any of the processes, then it's missing data in the output. |
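Taking the union of the unknown regions from several maps, as described above, might look roughly like this. The helper `union_intervals` and the example interval lists are hypothetical; intervals are assumed to be half-open `(left, right)` tuples.

```python
def union_intervals(*interval_lists):
    """Merge lists of half-open (left, right) intervals into a sorted,
    non-overlapping union. Touching intervals are joined."""
    merged = []
    for left, right in sorted(iv for lst in interval_lists for iv in lst):
        if merged and left <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


# Made-up unknown regions for two processes on a sequence of length 100.
recomb_missing = [(0, 10), (90, 100)]
gc_missing = [(0, 5), (40, 60), (85, 100)]
missing = union_intervals(recomb_missing, gc_missing)
# missing == [(0, 10), (40, 60), (85, 100)]
```

The resulting list is in the form that could then be removed from the output tree sequence, for example with tskit's `TreeSequence.delete_intervals`, though whether the simulation should do this by default is exactly what is being discussed here.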
Right. This does sound sensible. One thing: in the stats documentation, it says
So if we intend this to work for stats, we should ensure the behaviour in the "empty" cases works as expected.
Yes, I guess so. What happens when a GC event extends into an unknown region? There may be various oddities here that we need to consider. And we should explain this all somehow to the user. It might be that when explaining how regions are "masked out" like this, some use-cases emerge that we haven't thought about. Having said which, I can't see a problem with replacing the current behaviour, invisibly, with the NaN behaviour. So perhaps better to do it now and iron out wrinkles later. |
I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp. |
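To put rough numbers on the cost argument above: the sketch below assumes that crossovers between adjacent bases are Poisson with the given per-bp rate, which is one reading of where the log(2) figure comes from and is not confirmed in this thread. The gap length and the comparison rate of 1e-8 are made-up values.

```python
import math

# Assumption: P(at least one crossover between adjacent bases) = 1 - exp(-rate).
# Setting this to 1/2 ("free recombination") gives rate = log(2).
free_rate = math.log(2)
p_break = 1 - math.exp(-free_rate)  # 0.5 up to floating-point error


def expected_breakpoints(rate, length):
    """Expected crossovers per lineage per generation across ``length`` bp."""
    return rate * length


gap = 1_000_000  # hypothetical 1 Mb missing region
at_free_rate = expected_breakpoints(free_rate, gap)  # roughly 6.9e5
at_typical_rate = expected_breakpoints(1e-8, gap)    # 0.01
```

Under these assumptions a single 1 Mb gap at the "free" rate generates several orders of magnitude more breakpoints than the same gap at 1e-8 per bp, which is the slowdown being described.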
Or, for recombination, we could fill in the "missing" region with the average rate for the time being? So that's three options! Either way, the "ends" are special, as it doesn't matter what we put there, and we might as well put zero, whatever the decision is for intermediate missing sections. |
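The "average rate" option could be sketched as below. This is an illustration only: `fill_missing_with_mean` is a hypothetical helper, and the choice of a span-weighted mean over the known intervals is an assumption about what "average" should mean here.

```python
import math


def fill_missing_with_mean(position, rate):
    """Replace NaN rates with the span-weighted mean of the known rates.

    ``position`` has len(rate) + 1 entries; interval j is
    [position[j], position[j + 1]).
    """
    known_span = 0.0
    known_mass = 0.0
    for left, right, r in zip(position[:-1], position[1:], rate):
        if not math.isnan(r):
            known_span += right - left
            known_mass += r * (right - left)
    if known_span == 0:
        raise ValueError("no known rates to average")
    mean = known_mass / known_span
    return [mean if math.isnan(r) else r for r in rate]


# Made-up map: a missing interval of width 10 between two known ones.
filled = fill_missing_with_mean([0, 10, 20, 40], [1.0, float("nan"), 2.0])
# The known mass is 1.0 * 10 + 2.0 * 20 = 50 over a span of 30,
# so the gap is filled with 50 / 30.
```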
Oh, I just thought. We do usually have a sensible recombination rate in centromeric sections, don't we (which might be different from the average)? That's because we can work it out using normal genetic mapping tools (rather than fine mapping stuff). In fact, the centromeric region is always present (with reliable, if quite averaged-out, values) in genetic maps. So this way of "snipping out" a region isn't quite right. Or at least, we might want an alternative way of marking that an intermediate (e.g. centromeric) region should be excluded from the resulting tree sequence, while still keeping the passed-in recombination rates in that region. Hmm, this is tricky. |
OK great - looks like we all think this is the right way to go. Regarding the correct default rate within unknown regions, let's make it zero for the initial pass, and defer discussion as to what the correct value is to an issue. (I think free recombination is probably the right option, but it's not as simple as filling in the array with log(2) or whatever. We want a one-base-pair segment of free recombination at the end of the unknown region.) I'll open an issue. |
As I mentioned above, I think we should probably raise an error for the moment if there are NaNs within the map (but not if there are NaNs at either end), until we decide what to do with these. After all, we don't support this at the moment anyway. |
And I've just realised the (seemingly hacky but) probably correct way to do centromeres. We keep the recombination map with its (correct) values in centromeric regions, but we set the gene conversion map (when it is implemented) to NaN there. After all, we definitely don't know the GC rate within those regions. Then when we take the union, the centromeric region will end up being deleted from the TS. This does mean that we need to check that recombination can occur normally, as specified by the recombination map, in regions marked as NaN by the GC map. And the interval deletion comes later. So it's mainly a question of doing things in the right order. The reason why the recombination_map is special is that its effects are not local, but across the whole simulated chromosome. So we need to be careful when rubbing out regions (which is the point about what rate to use in a snipped out section). |
64339fa to 1267795
- Include basic support for unknown intervals
- Implement Mapping protocol
- Add str, repr and HTML display

Closes tskit-dev#1590
1267795 to bcb3394
bcb3394 to ed90211
This isn't quite finished, but it could use some review please @hyanwong, @petrelharp, @grahamgower. The basic idea is that NaN values in the

Maybe we should change the terminology to "missing data" or "missing" rather than "unknown"? Missing data is really what we have, and dealing with that correctly is the crux. So, I guess we could use numpy mask arrays or something as the "right way" to deal with the various operations. |
"Missing" sounds good to me. I would be (slightly) disinclined to use "missing data", as we could easily have other data in those regions (say a gene_conversion rate, or genetic data if we are doing inference using the RateMap functionality), it's just that the rate is missing. As I said elsewhere, if the rate data is calculated from a cumulative measure, like recombination, then I don't think we will usually have missing rates in the middle of a sequence. But it's as well to allow for this possibility in the design for other cases when rate is calculated in a different manner. Otherwise, this all looks good to me. It's nice that the missing flanks of the map end up as empty regions of the tree sequence. We should have a word or phrase for an empty region in a tree sequence (one with no edges), that is subtly different from the case when we have edges but not any that are connected to samples (see tskit-dev/tskit#1285). As you can see, I've been using "empty regions" for this. |
P.s. I do think, in the long term, it would be tidier to have |
Yep, I agree. Minus the RecombinationMap baggage, obvs. |
- Bump matplotlib requirement

Closes tskit-dev#1419
f037f7b to 0bf985f
Nobody has complained about this, so I'm going to go ahead and merge. We can still tweak the semantics of things a bit, but I think the overall picture is the right way to go. |
Yes, I agree. Well done on pushing this through under time pressure! |
I'm coming late to this, sorry, but it LGTM. I'm curious though, is there a specific use case in mind that wants/needs the full generality of the mapping protocol?
Well, you get a lot of useful functionality for free and I find it helps to clarify the semantics. Since a rate map is a mapping from intervals to rate values, then it makes sense to formally treat them as such following the mapping semantics. Otherwise I've found that we end up adding in bits of functionality piecemeal via various dunder methods, and it can be quite hard to keep track of the semantics with respect to Python protocols. By grasping the nettle at the start, these questions are settled. |
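As a rough illustration of the "functionality for free" point, here is a minimal sketch of a rate map built on `collections.abc.Mapping`. The class `ToyRateMap` is a hypothetical stand-in, not the RateMap in this PR, and leaving NaN (missing) intervals out of the mapping is an assumption of the sketch.

```python
import math
from collections.abc import Mapping


class ToyRateMap(Mapping):
    """Hypothetical stand-in: maps (left, right) tuples to rates.

    Intervals whose rate is NaN are treated as missing and are not
    keys of the mapping (an assumption made for this sketch).
    """

    def __init__(self, position, rate):
        if len(position) != len(rate) + 1:
            raise ValueError("position must have one more entry than rate")
        self._rates = {
            (left, right): r
            for left, right, r in zip(position[:-1], position[1:], rate)
            if not math.isnan(r)
        }

    # Only these three methods are written by hand ...
    def __getitem__(self, interval):
        return self._rates[interval]

    def __iter__(self):
        return iter(self._rates)

    def __len__(self):
        return len(self._rates)


m = ToyRateMap([0, 10, 20, 30], [float("nan"), 0.1, 0.2])
# ... while `in`, get(), keys(), values(), items() and == come from Mapping.
```

Here `len(m)` is 2, `(10, 20) in m` is true, and `m.get((0, 10))` returns `None` because the first interval is missing.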
Thinking about how best to display this, it seemed a good idea to introduce some left and right arrays for the intervals. I considered start and end as well, but these are a bit overloaded with the map_start etc stuff already, so thought left and right are simplest. Plus, fits well with the tskit conventions.

What do you think @hyanwong?
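For concreteness, the left and right arrays are just the two shifted views of the breakpoint positions. The helper below is a hypothetical sketch using plain lists, not the implementation in this PR.

```python
def interval_arrays(position):
    """Return (left, right, span) lists for the intervals between breakpoints.

    Interval j is [left[j], right[j]), with left = position[:-1]
    and right = position[1:].
    """
    left = list(position[:-1])
    right = list(position[1:])
    span = [r - l for l, r in zip(left, right)]
    return left, right, span


left, right, span = interval_arrays([0, 10, 25, 100])
# left  == [0, 10, 25]
# right == [10, 25, 100]
# span  == [10, 15, 75]
```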