Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RateMap API update #1599

Merged
merged 3 commits into from
Mar 31, 2021
Merged

Conversation

jeromekelleher
Copy link
Member

Thinking about how best to display this it seemed a good idea to introduce some left and right arrays for the intervals. I considered start and end as well, but these are a bit overloaded with the map_start etc stuff already, so thought left and right are simplest. Plus, fits well with the tskit conventions.

What do you think @hyanwong?

┌─────────────────────┐
│left  │right  │  rate│
├─────────────────────┤
│0     │ 1     │   0.1│
│1     │ 2     │   0.2│
└─────────────────────┘

@jeromekelleher jeromekelleher marked this pull request as draft March 28, 2021 17:01
@hyanwong
Copy link
Member

I had the same thought in #1598, but used "from" and "to". I guess "left" and "right" are fine too. I slightly prefer "from" and "to" because I associate "left" and "right" with edges, but perhaps that's not a problem. Either is fine by me.

@hyanwong
Copy link
Member

Here's what mine looks like:

Screenshot 2021-03-28 at 18 07 08

@codecov
Copy link

codecov bot commented Mar 28, 2021

Codecov Report

Merging #1599 (f037f7b) into main (0b05c86) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head f037f7b differs from pull request most recent head 0bf985f. Consider uploading reports for the commit 0bf985f to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1599      +/-   ##
==========================================
+ Coverage   91.19%   91.25%   +0.05%     
==========================================
  Files          20       20              
  Lines       10123    10184      +61     
  Branches     2130     2134       +4     
==========================================
+ Hits         9232     9293      +61     
  Misses        452      452              
  Partials      439      439              
Flag Coverage Δ
C 91.25% <100.00%> (+0.05%) ⬆️
python 98.40% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
msprime/ancestry.py 98.69% <100.00%> (+0.04%) ⬆️
msprime/cli.py 98.16% <100.00%> (ø)
msprime/intervals.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b05c86...0bf985f. Read the comment docs.

@hyanwong
Copy link
Member

Note that I think we should use (at least) "{:.9g}" for the positions, which should display the full position in non-scientific notation up to 100Mb. Actually, I think I should probably use "{:.10g}" at a minimum.

@jeromekelleher
Copy link
Member Author

OK, thanks @hyanwong. Let's use left and right - I'll add the properties here and the text display, and then we can rebase your HTML version on this. I'll also add a mid property, which will be handy for plotting.

@jeromekelleher jeromekelleher changed the title Initial draft for text/html display RateMap API update Mar 29, 2021
@jeromekelleher
Copy link
Member Author

jeromekelleher commented Mar 29, 2021

This has morphed into a substantial update of the RateMap API, but a necessary one I think.

Unknown values

The idea is generalise from the current approach of treating the flanking left and right regions specially using the map_start and map_end values to introducing the idea of having "unknown" rates anywhere in the map. These are indicated by NaN values. This is much better because it allows us to deal with centromeres etc in a general way, without having to layer on more and more complexity.

Initially here we just mark the start and end of chromosomes read by read_hapmap and the flanks of slice as unknown, but we can imagine adding methods later (or others doing so) that could look at recombination maps and make decisions about which bits look fishy and should be marked "unknown".

The plan then would be to follow up with a parameter trim_unknown=True to sim_ancestry. This would then snip out the regions of the output tree sequence that are marked as unknown in the input recombination map, unless told not to. This would generally be quite useful, and, more fundamentally, is the correct thing to do.

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too. I guess the right interface for specifying that is something we need to think about.

Implement the mapping protocol

It seems like a good idea to implement the Mapping protocol where we view the RateMap as a mapping from Intervals to rates. I've made some changes in there to reflect this.

Pinging @hyanwong, @grahamgower, @petrelharp for thoughts here?

@hyanwong
Copy link
Member

Ha. I actually wondered if we should have some sort of marker for unknown flanks rather than using 0. But I hadn't though about extending the idea to regions in the middle of the chromosome. I think this is a great idea.

The only think that would make me think otherwise is if we actually wanted to provide telomere/centromere functionality in a completely different way, by having a "mask" that we overlay onto the TS somehow. That would make it easier to "mask out" regions in general (e.g. for recombination, gene conversion, and mutations, as well as for statistical calculations), rather than having to implement something specific to recombination rate maps, then again to mutation rate maps, then again to stats calculations.

I guess if sim_ancestry called delete_intervals on any regions that had NaN in the recombination map (or if there is NaN in the gene conversion map, when that gets implemented), then there won't be any edges in those regions for sim_mutate to use, and (hopefully) the stats calculations will work sensibly too, so maybe what @jeromekelleher is suggesting is akin to a mask anyway.

@hyanwong
Copy link
Member

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too. I guess the right interface for specifying that is something we need to think about.

It doesn't matter at the ends of the chromosomes. So you could easily implement the NaN behaviour to work for the ends of the chromosomes, and for the moment raise a NotImplementedError when there's a NaN in the middle of the chromosome (until we decide what the right behaviour is).

Incidentally, if it's free recombination (which I suspect is the right thing), then that provides an easy way to create multiple chromosomes, assuming that we can add a NaN region of negligible width between one integer site and the next.

@jeromekelleher
Copy link
Member Author

Ah, I see that my earlier text wasn't clear:

The plan then would be to follow up with a parameter trim_unknown=True to sim_ancestry. This would then snip out the regions of the input recombination map that were marked as unknown, unless told not to. This would generally be quite useful, and, more fundamentally, is the correct thing to do.

I meant to say "snip out the regions of the output tree sequence in the regions of the input recombination map that were marked unknown".

This is basically what you're saying above, isn't it? So, by default we snip out any regions that are unknown from the output tree sequence before returning it to the user. This means that everything downstream works like it should - if the region was marked as unknown, then it becomes missing data in the output ts.

In terms of combining different unknown regions from recomb and gene conversion, I'd imagine we just take the union. If the regions is unknown for any of the processes, then it's missing data in the output.

@hyanwong
Copy link
Member

hyanwong commented Mar 29, 2021

This is basically what you're saying above, isn't it? So, by default we snip out any regions that are unknown from the output tree sequence before returning it to the user. This means that everything downstream works like it should - if the region was marked as unknown, then it becomes missing data in the output ts.

Right. This does sound sensible. One thing: in the stats documentation, it says

For site statistics, any windows without any sites in them never get touched, so they will have a value of 0. For branch statistics, any windows with no branches will similarly remain 0. That said, you should not rely on the specific behavior of whether 0 or nan is returned for “empty” cases like these: it is subject to change.

So if we intend this to work for stats, we should ensure the behaviour in the "empty" cases works as expected.

In terms of combining different unknown regions from recomb and gene conversion, I'd imagine we just take the union. If the regions is unknown for any of the processes, then it's missing data in the output.

Yes, I guess so. What happens when a GC event extends into an unknown region? There may be various oddities here that we need to consider. And we should explain this all somehow to the user. It might be that when explaining how regions are "masked out" like this, some use-cases emerge that we haven't thought about.

Having said which, I can't see a problem with replacing the current behaviour, invisibly, with the NaN behaviour. So perhaps better to do it now and iron out wrinkles later.

@petrelharp
Copy link
Contributor

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too.

I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp.

@hyanwong
Copy link
Member

hyanwong commented Mar 29, 2021

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too.

I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp.

Or, for recombination, we could fill in the "missing" region with the average rate for the time being? So that's three options! Either way, the "ends" are special, as it doesn't matter what we put there, and we might as well put zero, whatever the decision is for intermediate missing sections.

@hyanwong
Copy link
Member

hyanwong commented Mar 29, 2021

I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp.

Oh, I just thought. We do usually have a sensible recombination rate in centromeric sections, don't we (which might be different from the average)? That's because we can work it out using normal genetic mapping tools (rather than fine mapping stuff). In fact, the centromeric region is always present (with reliable, if quite averaged-out, values) in genetic maps. So this way of "snipping out" a region isn't quite right. Or at least, we might want an alternative way of marking that an intermediate (e.g. centromeric) region should be excluded from the resulting tree sequence, while still keeping the passed-in recombination rates in that region.

Hmm, this is tricky.

@jeromekelleher
Copy link
Member Author

OK great - looks like we all think this is the right way to go. Regarding the correct default rate within unknown regions, let's make it zero for the initial pass, and defer discussion as to what the correct value is to an issue. (I think probably free recombination is probably the right option, but it's not as simple as filling in the array with log(2) or whatever. We want a one-base pair segment of free recombination at the end of the unknown region.)

I'll open an issue.

@hyanwong
Copy link
Member

hyanwong commented Mar 29, 2021

OK great - looks like we all think this is the right way to go. Regarding the correct default rate within unknown regions, let's make it zero for the initial pass, and defer discussion as to what the correct value is to an issue.

As I mentioned above, I think we should probably raise an error for the moment if there are NaNs within the map (but not if there are NaNs at either end), until we decide what to do with these. After all, we don't support this at the moment anyway.

@hyanwong
Copy link
Member

And I've just realised the (seemingly hacky but) probably correct way to do centromeres. We keep the recombination map with its (correct) values in centromeric regions, but we set the gene conversion map (when it is implemented) to NaN there. After all, we definitely don't know the GC rate within those regions. Then when we take the union, the centromeric region will be end up being deleted from the TS.

This does mean that we need to check that recombination can occur normally, as specified by the recombination map, in regions marked as NaN by the GC map. And the interval deletion comes later. So it's mainly a question of doing things in the right order.

The reason why the recombination_map is special is that its effects are not local, but across the whole simulated chromosome. So we need to be careful when rubbing out regions (which is the point about what rate to use in a snipped out section).

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Mar 29, 2021

Soungs good @hyanwong, but let's move that discussion to the issue. Thing will get lost, and this thread is getting distracted from the main point (actually adding the basic functionality). I've opened #1604.

@jeromekelleher jeromekelleher force-pushed the rate_map_show branch 3 times, most recently from 64339fa to 1267795 Compare March 30, 2021 15:02
- Include basic support for unknown intervals
- Implement Mapping protocol
- Add str, repr and HTML display

Closes tskit-dev#1590
msprime/intervals.py Outdated Show resolved Hide resolved
@jeromekelleher jeromekelleher marked this pull request as ready for review March 30, 2021 17:16
@jeromekelleher
Copy link
Member Author

jeromekelleher commented Mar 30, 2021

This isn't quite finished, but it could use some review please @hyanwong, @petrelharp, @grahamgower.

The basic idea is that NaN values in the rate array define "holes" of missing data in the map. The rest is following through on the consequences of this for the semantics of the different operations. Most of them are straightforward enough, I think. The tricky bit will be explaining that the numpy array interface doesn't mask out the missing intervals, but the mapping interface does. We'll also need to provide some interface for accessing the missing/non-missing via numpy arrays, but I think we can push that to future work. What we have here is enough for what we need in the first instance: marking the flanks as missing data and pushing that through to the simulation.

Maybe we should change the terminology to "missing data" or "missing" rather than "unknown"? Missing data is really what we have, and it's dealing with that correctly is the crux. So, I guess we could use numpy mask arrays or something as the "right way" to deal with the various operations.

docs/rate_maps.md Outdated Show resolved Hide resolved
docs/rate_maps.md Outdated Show resolved Hide resolved
@hyanwong
Copy link
Member

hyanwong commented Mar 31, 2021

"Missing" sounds good to me. I would be (slightly) disinclined to use "missing data", as we could easily have other data in those regions (say a gene_conversion rate, or genetic data if we are doing inference using the RateMap functionality), it's just that the rate is missing.

As I said elsewhere, if the rate data is calculated from a cumulative measure, like recombination, then I don't think we will usually have missing rates in the middle of a sequence. But it's as well to allow for this possibility in the design for other cases when rate is calculated in a different manner.

Otherwise, this all looks good to me. It's nice that the missing flanks of the map end up as empty regions of the tree sequence. We should have a word or phrase for an empty region in a tree sequence (one with no edges), that is subtly different from the case when we have edges but not any that are connected to samples (see tskit-dev/tskit#1285). As you can see, I've been using "empty regions" for this.

@hyanwong
Copy link
Member

hyanwong commented Mar 31, 2021

P.s. I do think, in the long term, it would be tidier to have intervals.py as part of tskit rather than msprime, so it might be useful to check that the RateMap code is as "msprime-independent" as possible (while, of course, allowing msprime to depend extensively on RateMap functionality). I do think this is the case at the moment, but it's useful to bear in the back of our minds.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Mar 31, 2021

P.s. I do think, in the long term, it would be tidier to have intervals.py as part of tskit rather than msprime, so it might be useful to check that the RateMap code is as "msprime-independent" as possible (while, of course, allowing msprime to depend extensively on RateMap functionality). I do think this is the case at the moment, but it's useful to bear in the back of our minds.

Yep, I agree. Minus the RecombinationMap baggage, obvs.

- Bump matplotlib requirement

Closes tskit-dev#1419
@jeromekelleher
Copy link
Member Author

Nobody has complained about this, so I'm going to go ahead and merge. We can still tweak the semantics of things a bit, but I think the overall picture is the right way to go.

@jeromekelleher jeromekelleher merged commit b0155eb into tskit-dev:main Mar 31, 2021
@jeromekelleher jeromekelleher deleted the rate_map_show branch March 31, 2021 14:29
@hyanwong
Copy link
Member

Nobody has complained about this, so I'm going to go ahead and merge. We can still tweak the semantics of things a bit, but I think the overall picture is the right way to go.

Yes, I agree. Well done on pushing this through under time pressure!

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm coming late to this, sorry, but it LGTM. I'm curious though, is there a specific use case in mind that wants/needs the full generality of the mapping protocol?

@jeromekelleher
Copy link
Member Author

I'm curious though, is there a specific use case in mind that wants/needs the full generality of the mapping protocol?

Well, you get a lot of useful functionality for free and I find it helps to clarify the semantics. Since a rate map is a mapping from intervals to rate values, then it makes sense to formally treat them as such following the mapping semantics. Otherwise I've found that we end up adding in bits of functionality piecemeal via various dunder methods, and it can be quite hard to keep track of the semantics with respect to Python protocols. By grasping the nettle at the start, these questions are settled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants