RateMap API update #1599

jeromekelleher · 2021-03-28T17:01:29Z

Thinking about how best to display this it seemed a good idea to introduce some left and right arrays for the intervals. I considered start and end as well, but these are a bit overloaded with the map_start etc stuff already, so thought left and right are simplest. Plus, fits well with the tskit conventions.

What do you think @hyanwong?

┌─────────────────────┐
│left  │right  │  rate│
├─────────────────────┤
│0     │ 1     │   0.1│
│1     │ 2     │   0.2│
└─────────────────────┘

hyanwong · 2021-03-28T17:04:42Z

I had the same thought in #1598, but used "from" and "to". I guess "left" and "right" are fine too. I slightly prefer "from" and "to" because I associate "left" and "right" with edges, but perhaps that's not a problem. Either is fine by me.

hyanwong · 2021-03-28T17:07:15Z

Here's what mine looks like:

codecov · 2021-03-28T17:07:26Z

Codecov Report

Merging #1599 (f037f7b) into main (0b05c86) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head f037f7b differs from pull request most recent head 0bf985f. Consider uploading reports for the commit 0bf985f to get more accurate results

@@            Coverage Diff             @@
##             main    #1599      +/-   ##
==========================================
+ Coverage   91.19%   91.25%   +0.05%     
==========================================
  Files          20       20              
  Lines       10123    10184      +61     
  Branches     2130     2134       +4     
==========================================
+ Hits         9232     9293      +61     
  Misses        452      452              
  Partials      439      439

Flag	Coverage Δ
C	`91.25% <100.00%> (+0.05%)`	⬆️
python	`98.40% <100.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
msprime/ancestry.py	`98.69% <100.00%> (+0.04%)`	⬆️
msprime/cli.py	`98.16% <100.00%> (ø)`
msprime/intervals.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b05c86...0bf985f. Read the comment docs.

hyanwong · 2021-03-28T17:10:11Z

Note that I think we should use (at least) "{:.9g}" for the positions, which should display the full position in non-scientific notation up to 100Mb. Actually, I think I should probably use "{:.10g}" at a minimum.

jeromekelleher · 2021-03-29T09:17:22Z

OK, thanks @hyanwong. Let's use left and right - I'll add the properties here and the text display, and then we can rebase your HTML version on this. I'll also add a mid property, which will be handy for plotting.

jeromekelleher · 2021-03-29T14:31:39Z

This has morphed into a substantial update of the RateMap API, but a necessary one I think.

Unknown values

The idea is generalise from the current approach of treating the flanking left and right regions specially using the map_start and map_end values to introducing the idea of having "unknown" rates anywhere in the map. These are indicated by NaN values. This is much better because it allows us to deal with centromeres etc in a general way, without having to layer on more and more complexity.

Initially here we just mark the start and end of chromosomes read by read_hapmap and the flanks of slice as unknown, but we can imagine adding methods later (or others doing so) that could look at recombination maps and make decisions about which bits look fishy and should be marked "unknown".

The plan then would be to follow up with a parameter trim_unknown=True to sim_ancestry. This would then snip out the regions of the output tree sequence that are marked as unknown in the input recombination map, unless told not to. This would generally be quite useful, and, more fundamentally, is the correct thing to do.

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too. I guess the right interface for specifying that is something we need to think about.

Implement the mapping protocol

It seems like a good idea to implement the Mapping protocol where we view the RateMap as a mapping from Intervals to rates. I've made some changes in there to reflect this.

Pinging @hyanwong, @grahamgower, @petrelharp for thoughts here?

hyanwong · 2021-03-29T14:49:40Z

Ha. I actually wondered if we should have some sort of marker for unknown flanks rather than using 0. But I hadn't though about extending the idea to regions in the middle of the chromosome. I think this is a great idea.

The only think that would make me think otherwise is if we actually wanted to provide telomere/centromere functionality in a completely different way, by having a "mask" that we overlay onto the TS somehow. That would make it easier to "mask out" regions in general (e.g. for recombination, gene conversion, and mutations, as well as for statistical calculations), rather than having to implement something specific to recombination rate maps, then again to mutation rate maps, then again to stats calculations.

I guess if sim_ancestry called delete_intervals on any regions that had NaN in the recombination map (or if there is NaN in the gene conversion map, when that gets implemented), then there won't be any edges in those regions for sim_mutate to use, and (hopefully) the stats calculations will work sensibly too, so maybe what @jeromekelleher is suggesting is akin to a mask anyway.

hyanwong · 2021-03-29T15:03:03Z

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too. I guess the right interface for specifying that is something we need to think about.

It doesn't matter at the ends of the chromosomes. So you could easily implement the NaN behaviour to work for the ends of the chromosomes, and for the moment raise a NotImplementedError when there's a NaN in the middle of the chromosome (until we decide what the right behaviour is).

Incidentally, if it's free recombination (which I suspect is the right thing), then that provides an easy way to create multiple chromosomes, assuming that we can add a NaN region of negligible width between one integer site and the next.

jeromekelleher · 2021-03-29T15:18:57Z

Ah, I see that my earlier text wasn't clear:

The plan then would be to follow up with a parameter trim_unknown=True to sim_ancestry. This would then snip out the regions of the input recombination map that were marked as unknown, unless told not to. This would generally be quite useful, and, more fundamentally, is the correct thing to do.

I meant to say "snip out the regions of the output tree sequence in the regions of the input recombination map that were marked unknown".

This is basically what you're saying above, isn't it? So, by default we snip out any regions that are unknown from the output tree sequence before returning it to the user. This means that everything downstream works like it should - if the region was marked as unknown, then it becomes missing data in the output ts.

In terms of combining different unknown regions from recomb and gene conversion, I'd imagine we just take the union. If the regions is unknown for any of the processes, then it's missing data in the output.

hyanwong · 2021-03-29T15:27:59Z

This is basically what you're saying above, isn't it? So, by default we snip out any regions that are unknown from the output tree sequence before returning it to the user. This means that everything downstream works like it should - if the region was marked as unknown, then it becomes missing data in the output ts.

Right. This does sound sensible. One thing: in the stats documentation, it says

For site statistics, any windows without any sites in them never get touched, so they will have a value of 0. For branch statistics, any windows with no branches will similarly remain 0. That said, you should not rely on the specific behavior of whether 0 or nan is returned for “empty” cases like these: it is subject to change.

So if we intend this to work for stats, we should ensure the behaviour in the "empty" cases works as expected.

In terms of combining different unknown regions from recomb and gene conversion, I'd imagine we just take the union. If the regions is unknown for any of the processes, then it's missing data in the output.

Yes, I guess so. What happens when a GC event extends into an unknown region? There may be various oddities here that we need to consider. And we should explain this all somehow to the user. It might be that when explaining how regions are "masked out" like this, some use-cases emerge that we haven't thought about.

Having said which, I can't see a problem with replacing the current behaviour, invisibly, with the NaN behaviour. So perhaps better to do it now and iron out wrinkles later.

petrelharp · 2021-03-29T15:34:09Z

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too.

I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp.

hyanwong · 2021-03-29T15:42:22Z

The one remaining question would be what recombination rate one should use in the unknown regions for simulation. Zero is the obvious and easy choice, but I guess you could argue it might be free recombination too.

I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp.

Or, for recombination, we could fill in the "missing" region with the average rate for the time being? So that's three options! Either way, the "ends" are special, as it doesn't matter what we put there, and we might as well put zero, whatever the decision is for intermediate missing sections.

hyanwong · 2021-03-29T15:53:43Z

I think zero is the right default: it's closer to correct for centromeres, and it results in less work. As implemented, "free recombination" would be a rate of log(2) per bp, right? which would tremendously slow down simulations if any missing gaps were longer than 1bp.

Oh, I just thought. We do usually have a sensible recombination rate in centromeric sections, don't we (which might be different from the average)? That's because we can work it out using normal genetic mapping tools (rather than fine mapping stuff). In fact, the centromeric region is always present (with reliable, if quite averaged-out, values) in genetic maps. So this way of "snipping out" a region isn't quite right. Or at least, we might want an alternative way of marking that an intermediate (e.g. centromeric) region should be excluded from the resulting tree sequence, while still keeping the passed-in recombination rates in that region.

Hmm, this is tricky.

jeromekelleher · 2021-03-29T16:01:49Z

OK great - looks like we all think this is the right way to go. Regarding the correct default rate within unknown regions, let's make it zero for the initial pass, and defer discussion as to what the correct value is to an issue. (I think probably free recombination is probably the right option, but it's not as simple as filling in the array with log(2) or whatever. We want a one-base pair segment of free recombination at the end of the unknown region.)

I'll open an issue.

hyanwong · 2021-03-29T16:07:38Z

OK great - looks like we all think this is the right way to go. Regarding the correct default rate within unknown regions, let's make it zero for the initial pass, and defer discussion as to what the correct value is to an issue.

As I mentioned above, I think we should probably raise an error for the moment if there are NaNs within the map (but not if there are NaNs at either end), until we decide what to do with these. After all, we don't support this at the moment anyway.

hyanwong · 2021-03-29T17:05:00Z

And I've just realised the (seemingly hacky but) probably correct way to do centromeres. We keep the recombination map with its (correct) values in centromeric regions, but we set the gene conversion map (when it is implemented) to NaN there. After all, we definitely don't know the GC rate within those regions. Then when we take the union, the centromeric region will be end up being deleted from the TS.

This does mean that we need to check that recombination can occur normally, as specified by the recombination map, in regions marked as NaN by the GC map. And the interval deletion comes later. So it's mainly a question of doing things in the right order.

The reason why the recombination_map is special is that its effects are not local, but across the whole simulated chromosome. So we need to be careful when rubbing out regions (which is the point about what rate to use in a snipped out section).

jeromekelleher · 2021-03-29T17:10:54Z

Soungs good @hyanwong, but let's move that discussion to the issue. Thing will get lost, and this thread is getting distracted from the main point (actually adding the basic functionality). I've opened #1604.

- Include basic support for unknown intervals - Implement Mapping protocol - Add str, repr and HTML display Closes tskit-dev#1590

msprime/intervals.py

jeromekelleher · 2021-03-30T17:19:56Z

This isn't quite finished, but it could use some review please @hyanwong, @petrelharp, @grahamgower.

The basic idea is that NaN values in the rate array define "holes" of missing data in the map. The rest is following through on the consequences of this for the semantics of the different operations. Most of them are straightforward enough, I think. The tricky bit will be explaining that the numpy array interface doesn't mask out the missing intervals, but the mapping interface does. We'll also need to provide some interface for accessing the missing/non-missing via numpy arrays, but I think we can push that to future work. What we have here is enough for what we need in the first instance: marking the flanks as missing data and pushing that through to the simulation.

Maybe we should change the terminology to "missing data" or "missing" rather than "unknown"? Missing data is really what we have, and it's dealing with that correctly is the crux. So, I guess we could use numpy mask arrays or something as the "right way" to deal with the various operations.

docs/rate_maps.md

hyanwong · 2021-03-31T12:53:23Z

"Missing" sounds good to me. I would be (slightly) disinclined to use "missing data", as we could easily have other data in those regions (say a gene_conversion rate, or genetic data if we are doing inference using the RateMap functionality), it's just that the rate is missing.

As I said elsewhere, if the rate data is calculated from a cumulative measure, like recombination, then I don't think we will usually have missing rates in the middle of a sequence. But it's as well to allow for this possibility in the design for other cases when rate is calculated in a different manner.

Otherwise, this all looks good to me. It's nice that the missing flanks of the map end up as empty regions of the tree sequence. We should have a word or phrase for an empty region in a tree sequence (one with no edges), that is subtly different from the case when we have edges but not any that are connected to samples (see tskit-dev/tskit#1285). As you can see, I've been using "empty regions" for this.

hyanwong · 2021-03-31T12:58:54Z

P.s. I do think, in the long term, it would be tidier to have intervals.py as part of tskit rather than msprime, so it might be useful to check that the RateMap code is as "msprime-independent" as possible (while, of course, allowing msprime to depend extensively on RateMap functionality). I do think this is the case at the moment, but it's useful to bear in the back of our minds.

jeromekelleher · 2021-03-31T13:46:23Z

P.s. I do think, in the long term, it would be tidier to have intervals.py as part of tskit rather than msprime, so it might be useful to check that the RateMap code is as "msprime-independent" as possible (while, of course, allowing msprime to depend extensively on RateMap functionality). I do think this is the case at the moment, but it's useful to bear in the back of our minds.

Yep, I agree. Minus the RecombinationMap baggage, obvs.

- Bump matplotlib requirement Closes tskit-dev#1419

jeromekelleher · 2021-03-31T14:21:17Z

Nobody has complained about this, so I'm going to go ahead and merge. We can still tweak the semantics of things a bit, but I think the overall picture is the right way to go.

hyanwong · 2021-03-31T14:31:25Z

Nobody has complained about this, so I'm going to go ahead and merge. We can still tweak the semantics of things a bit, but I think the overall picture is the right way to go.

Yes, I agree. Well done on pushing this through under time pressure!

grahamgower

I'm coming late to this, sorry, but it LGTM. I'm curious though, is there a specific use case in mind that wants/needs the full generality of the mapping protocol?

jeromekelleher · 2021-04-06T09:03:08Z

I'm curious though, is there a specific use case in mind that wants/needs the full generality of the mapping protocol?

Well, you get a lot of useful functionality for free and I find it helps to clarify the semantics. Since a rate map is a mapping from intervals to rate values, then it makes sense to formally treat them as such following the mapping semantics. Otherwise I've found that we end up adding in bits of functionality piecemeal via various dunder methods, and it can be quite hard to keep track of the semantics with respect to Python protocols. By grasping the nettle at the start, these questions are settled.

jeromekelleher marked this pull request as draft March 28, 2021 17:01

jeromekelleher force-pushed the rate_map_show branch from 28978e5 to 7043daf Compare March 29, 2021 14:21

jeromekelleher changed the title ~~Initial draft for text/html display~~ RateMap API update Mar 29, 2021

jeromekelleher mentioned this pull request Mar 29, 2021

How to deal with unknown rates when running simulations #1604

Open

jeromekelleher force-pushed the rate_map_show branch 3 times, most recently from 64339fa to 1267795 Compare March 30, 2021 15:02

Update the RateMap API.

3928078

- Include basic support for unknown intervals - Implement Mapping protocol - Add str, repr and HTML display Closes tskit-dev#1590

jeromekelleher mentioned this pull request Mar 30, 2021

HTML display for ratemap #1598

Closed

RateMap doc text

4dbefa6

hyanwong reviewed Mar 30, 2021

View reviewed changes

msprime/intervals.py Outdated Show resolved Hide resolved

jeromekelleher force-pushed the rate_map_show branch from 1267795 to bcb3394 Compare March 30, 2021 15:42

jeromekelleher mentioned this pull request Mar 30, 2021

RateMap doc text #1602

Closed

jeromekelleher force-pushed the rate_map_show branch from bcb3394 to ed90211 Compare March 30, 2021 17:15

jeromekelleher marked this pull request as ready for review March 30, 2021 17:16

hyanwong reviewed Mar 30, 2021

View reviewed changes

docs/rate_maps.md Outdated Show resolved Hide resolved

hyanwong reviewed Mar 30, 2021

View reviewed changes

docs/rate_maps.md Outdated Show resolved Hide resolved

jeromekelleher mentioned this pull request Mar 31, 2021

Limit rows returned in RateMap display #1607

Closed

Close up issues around new RatMap API.

0bf985f

- Bump matplotlib requirement Closes tskit-dev#1419

jeromekelleher force-pushed the rate_map_show branch from f037f7b to 0bf985f Compare March 31, 2021 14:17

jeromekelleher merged commit b0155eb into tskit-dev:main Mar 31, 2021

jeromekelleher deleted the rate_map_show branch March 31, 2021 14:29

hyanwong mentioned this pull request Mar 31, 2021

Don't draw empty flanks tskit-dev/tskit#1288

Merged

3 tasks

grahamgower reviewed Apr 6, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RateMap API update #1599

RateMap API update #1599

jeromekelleher commented Mar 28, 2021

hyanwong commented Mar 28, 2021

hyanwong commented Mar 28, 2021

codecov bot commented Mar 28, 2021 •

edited

Loading

hyanwong commented Mar 28, 2021

jeromekelleher commented Mar 29, 2021

jeromekelleher commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021

hyanwong commented Mar 29, 2021

jeromekelleher commented Mar 29, 2021

hyanwong commented Mar 29, 2021 •

edited

Loading

petrelharp commented Mar 29, 2021

hyanwong commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021 •

edited

Loading

jeromekelleher commented Mar 29, 2021

hyanwong commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021

jeromekelleher commented Mar 29, 2021 •

edited

Loading

jeromekelleher commented Mar 30, 2021 •

edited

Loading

hyanwong commented Mar 31, 2021 •

edited

Loading

hyanwong commented Mar 31, 2021 •

edited

Loading

jeromekelleher commented Mar 31, 2021 •

edited

Loading

jeromekelleher commented Mar 31, 2021

hyanwong commented Mar 31, 2021

grahamgower left a comment

jeromekelleher commented Apr 6, 2021

RateMap API update #1599

RateMap API update #1599

Conversation

jeromekelleher commented Mar 28, 2021

hyanwong commented Mar 28, 2021

hyanwong commented Mar 28, 2021

codecov bot commented Mar 28, 2021 • edited Loading

Codecov Report

hyanwong commented Mar 28, 2021

jeromekelleher commented Mar 29, 2021

jeromekelleher commented Mar 29, 2021 • edited Loading

Unknown values

Implement the mapping protocol

hyanwong commented Mar 29, 2021

hyanwong commented Mar 29, 2021

jeromekelleher commented Mar 29, 2021

hyanwong commented Mar 29, 2021 • edited Loading

petrelharp commented Mar 29, 2021

hyanwong commented Mar 29, 2021 • edited Loading

hyanwong commented Mar 29, 2021 • edited Loading

jeromekelleher commented Mar 29, 2021

hyanwong commented Mar 29, 2021 • edited Loading

hyanwong commented Mar 29, 2021

jeromekelleher commented Mar 29, 2021 • edited Loading

jeromekelleher commented Mar 30, 2021 • edited Loading

hyanwong commented Mar 31, 2021 • edited Loading

hyanwong commented Mar 31, 2021 • edited Loading

jeromekelleher commented Mar 31, 2021 • edited Loading

jeromekelleher commented Mar 31, 2021

hyanwong commented Mar 31, 2021

grahamgower left a comment

Choose a reason for hiding this comment

jeromekelleher commented Apr 6, 2021

codecov bot commented Mar 28, 2021 •

edited

Loading

jeromekelleher commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021 •

edited

Loading

hyanwong commented Mar 29, 2021 •

edited

Loading

jeromekelleher commented Mar 29, 2021 •

edited

Loading

jeromekelleher commented Mar 30, 2021 •

edited

Loading

hyanwong commented Mar 31, 2021 •

edited

Loading

hyanwong commented Mar 31, 2021 •

edited

Loading

jeromekelleher commented Mar 31, 2021 •

edited

Loading