Skip to content

Commit

Permalink
Extend discussion of rate maps.
Browse files Browse the repository at this point in the history
- Bump matplotlib requirement

Closes #1419
  • Loading branch information
jeromekelleher committed Mar 30, 2021
1 parent 4dbefa6 commit ed90211
Show file tree
Hide file tree
Showing 5 changed files with 242 additions and 30 deletions.
10 changes: 10 additions & 0 deletions docs/ancestry.md
Original file line number Diff line number Diff line change
Expand Up @@ -2101,6 +2101,16 @@ multiple models. See the {ref}`sec_logging` section for more
details.
:::

(sec_ancestry_unknown_rates)=

## Unknown rates

:::{todo}
Give an example of doing a simulation with unknown rates on the flanks
and show that the resulting tree sequence has missing data in these
regions.
:::

(sec_ancestry_errors)=

## Common errors
Expand Down
226 changes: 197 additions & 29 deletions docs/rate_maps.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,10 @@ kernelspec:
```{code-cell}
:tags: [remove-cell]
import msprime
from IPython.display import SVG
from matplotlib import pyplot as plt
import msprime
import numpy as np
from IPython.display import SVG
from matplotlib import pyplot as plt
```

(sec_rate_maps)=
Expand All @@ -31,23 +32,27 @@ method. For instance, {func}`.sim_ancestry` accepts a {class}`.RateMap` as the
as the `rate` parameter.

A rate map is defined as a set of constant rates which apply piecewise
over contiguous regions of the chromosome. An example is shown below, both as a
table and as a plot of the rate at different genomic positions.
over contiguous regions of the genome.
For example, here we create a simple rate map over a 30kb genome:

```{code-cell}
:tags: [hide-input]
ratemap = msprime.RateMap(
rate_map = msprime.RateMap(
position=[0, 4000, 9000, 11000, 20000, 30000],
rate=[0, 1, 6, 2, 1]
)
display(ratemap)
rate=[0, 0.1, 0.6, 0.2, 0.1]
)
rate_map
```
See the {ref}`sec_rate_maps_basics` section for details on these columns.
We can see how these rates vary along the genome by plotting:

```{code-cell}
:tags: [hide-input]
plt.figure(figsize=(10, 4))
plt.stairs(ratemap.rate, ratemap.position)
plt.stairs(rate_map.rate, rate_map.position)
plt.title("Example RateMap with 5 rates, and a hotspot around position 10000")
plt.xlabel("Genome position")
plt.ylabel("Rate")
plt.xlim(0,ratemap.sequence_length)
plt.xlim(0, rate_map.sequence_length)
plt.show()
```

Expand All @@ -58,15 +63,15 @@ plt.show()
## Quick reference

{class}`.RateMap`
: The RateMap class: an instance can be created from provided data
: Represent the varying rate of processes along the genome as a piecewise function.:

**Creating rate maps**

{meth}`.RateMap.uniform`
: Create a RateMap with a single rate over the entire chromosome

{meth}`.RateMap.read_hapmap`
: Read in a RateMap from a file in "hapmap" format (common in recombination maps)
: Read in a RateMap from a file in "HapMap" format (common in recombination maps)

{meth}`.RateMap.slice`
: Create a new RateMap by slicing out a portion of an existing RateMap
Expand All @@ -80,11 +85,123 @@ plt.show()
: The cumulative sum of rates over the entire map

{meth}`.RateMap.get_rate`
: Compute rate at a set of user-specified positions
: Compute rate at a set of positions

{meth}`.RateMap.get_cumulative_mass`
: Compute the cumulative rate at a set of user-specified positions
: Compute the cumulative rate at a set of positions


(sec_rate_maps_basics)=
## Working with RateMaps

In this section we go illustrate the ways in which we can interact
with a {class}`.RateMap` object. Throughout we'll use the example
map from the introduction, defined over a 30kb genome:

```{code-cell}
rate_map = msprime.RateMap(
position=[0, 4000, 9000, 11000, 20000, 30000],
rate=[0, 0.1, 0.6, 0.2, 0.1]
)
rate_map
```

:::{seealso}
See the {ref}`sec_rate_maps_creating` section for different ways of
creating {class}`.RateMap` instances with various properties.
:::

To create the {class}`.RateMap` instance we called the contructor
with two lists, ``position`` and ``rate``. These define the
intervals along the genome and the rates in each interval. Because
we are interested in nonoverlapping contiguous intervals, we can
define all the {math}`n` intervals with {math}`n + 1` positions
and correspoding {math}`n` rate values.

:::{important}
All intervals as **open-closed**; that is, the left coordinate
is inclusive and the right coordinate exclusive.
:::

Information about each of the intervals is shown in the summary
table above, one interval per row. So, the last interval has a
``left`` coordinate of 20000, a ``right`` coordinate of 30000,
a midpoint of 25000 (``mid``), has a ``span`` of 10000 base pairs
and maps to a rate of 0.1.

:::{seealso}
See the {ref}`sec_rate_maps_basics_array` for examples of how to
access the columns efficiently as numpy arrays.
:::

(sec_rate_maps_basics_mapping)=
### Mapping interface

The easiest way to access the rate information from a rate map is
via the *mapping interface*. We can use the Python square bracket
notation to get the rate at a particular point:

```{code-cell}
print("rate at position 4500 = ", rate_map[4500])
```

:::{seealso}
The {meth}`.RateMap.get_rate` method is a more efficient way to
compute the rate values at an array of locations along the genome.
:::

We can also iterate over the intervals in the map, much
as if it was a Python dictionary:

```{code-cell}
for midpoint, rate in rate_map.items():
print(midpoint, rate, sep="\t")
```

:::{note}
Note that the key here is the **midpoint** of each interval.
:::

Operations such as ``in`` tell us whether the a give position is
in the rate map:


```{code-cell}
print("Is position 10000 in the map?", 10000 in rate_map)
print("Is position -1 in the map?", -1 in rate_map)
```

(sec_rate_maps_basics_array)=
### Array interface

We can also access the information from the rate map using numpy
arrays. This is usually the preferred option when working with
rate maps, since they can be quite large.

```{code-cell}
print("interval left position :", rate_map.left)
print("interval right position:", rate_map.right)
print("interval midpoint :", rate_map.mid)
print("interval span :", rate_map.span)
print("interval rate :", rate_map.rate)
```

Please see the documentation for
{attr}`~.RateMap.left`,
{attr}`~.RateMap.right`,
{attr}`~.RateMap.mid`,
{attr}`~.RateMap.span`,
and {attr}`~.RateMap.rate` for more information.

The {attr}`~.RateMap.position` array is also defined, which is
equal to the ``left`` array with last element of the ``right``
array (i.e., the {attr}`~.RateMap.sequence_length`) appended to
it.

:::{todo}
Give an example of how we might do something useful with these
arrays, like plotting something, e.g.
:::

(sec_rate_maps_creating)=
## Creating rate maps
Expand All @@ -102,16 +219,75 @@ The most basic rate map is a uniform rate over the entire sequence. This can sim
generated by {meth}`.RateMap.uniform`:

```{code-cell}
msprime.RateMap.uniform(length=1e8, rate=0.5)
msprime.RateMap.uniform(sequence_length=1e8, rate=0.5)
```

### Reading from a text file

:::{todo}
Give an example of working with the {meth}`.RateMap.read_hapmap` method
:::

(sec_rate_maps_creating_hotspots)=
### Hotspots

:::{todo}
Give an example of making a RateMap with some recurrent hotspots.
:::

(sec_rate_maps_unknown)=

## Unknown intervals

There are often times when we do not know what the rate of a particular
process is along the genome. For example, in telomeric regions, we don't
really know what the rate of recombination is, and we need some way
of encoding this information.

:::{seealso}
See the {ref}`sec_ancestry_unknown_rates` section for an example
of running an ancestry simulation with unknown recombination rates,
and the properties of the resulting tree sequence.
:::

:::{warning}
The unknown intervals/missing data functionality is preliminary
and the semantics of various operations are subject to change.
:::

We mark sections of a rate map as missing data by assigning
setting the rate to a [NaN value](https://en.wikipedia.org/wiki/NaN).
Here, for example we create a rate map in which we have a
uniform rate except for the 10 base pairs at either end,
which are unknown:

```{code-cell}
rate_map = msprime.RateMap(position=[0, 10, 90, 100], rate=[np.nan, 0.1, np.nan])
rate_map
```

### Reading from a file
Values such as the :attr:`~.RateMap.mean_rate` ignore the regions of missing
data:

```{code-cell}
print("mean rate = ", rate_map.mean_rate)
```

Describe read_hapmap briefly here.
:::{todo}
Talk through the semantics of positions in the array being "missing".
That is, if we ask ``5 in rate_map`` here we'll get False, because
we don't have any value defined there. We're effectively defining
holes in the map.
:::

(sec_rate_maps_masking_and_slicing)=
### Masking and creating new maps via slicing

:::{todo}
Revise this section to take into account the new unknown rates
way of doing things.
:::

Often, the rates at the start and end of the chromosome (i.e. in the telomeric regions)
are unknown, and treated as zero. Therefore if there is a zero-rate region at the start
or end of the list of provided rates, an alternative start position and/or end position
Expand All @@ -132,7 +308,8 @@ do this:
import io
# First and last sections of human chr 1 build 36 genetic map taken from
# https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/
hapmap_chr1_snippet = io.StringIO("""chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
hapmap_chr1_snippet = io.StringIO("""\
chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
1 72434 0.0015000000 0
1 78032 0.0015000000 0.0000083970
1 554461 0.0015000000 0.0007230405
Expand All @@ -156,19 +333,10 @@ large_ratemap = msprime.RateMap.read_hapmap(hapmap_chr1_snippet)
sliced_ratemap = large_ratemap.slice(700e3, 740e3)
# Simulate only 40kb
ts = msprime.sim_ancestry(10, recombination_rate=sliced_ratemap)
# Only keep the trees in the sliced region. Then if we then mutate, we will not
# create mutations in the flanking regions
ts = ts.keep_intervals([[700e3, 740e3]])
```

Because the recombination rate is 0 in regions below position 700 000 and above position
740 000, `msprime` has not had to put effort into simulating recombination (and its
effect on ancestry) in these regions, so the simulation completes very quickly, even
though the resulting tree sequence has a `sequence_length` of 247 megabases.

(sec_rate_maps_creating_hotspots)=
### Hotspots

### Extracting values at specific positions

Document the used of `get_cumulative_mass` to convert physical to genetic coordinates.
2 changes: 2 additions & 0 deletions msprime/intervals.py
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,8 @@ def __getitem__(self, key):
if isinstance(key, numbers.Number):
index = self.find_index(key)
return self.rate[index]
# TODO we could implement numpy array indexing here and call
# to get_rate.
raise KeyError("Key {key} not in map")

def _display_table(self):
Expand Down
2 changes: 1 addition & 1 deletion requirements/CI-docs/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -7,4 +7,4 @@ jupyter-book==0.10.0
tskit==0.3.5
newick==1.0.0
PyGithub==1.53
matplotlib==3.3.0
matplotlib==3.4.0
32 changes: 32 additions & 0 deletions tests/test_intervals.py
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,38 @@ def test_interval_properties_all_known(self):
assert list(rate_map.mid) == [0.5, 1.5, 2.5]


class TestOperationsAllKnown:
examples = [
msprime.RateMap(position=[0, 1], rate=[0]),
msprime.RateMap(position=[0, 1], rate=[0.1]),
msprime.RateMap(position=[0, 1, 2], rate=[0.1, 0.2]),
msprime.RateMap(position=[0, 1, 2], rate=[0, 0.2]),
msprime.RateMap(position=[0, 1, 2], rate=[0.1, 1e-6]),
msprime.RateMap(position=range(100), rate=range(99)),
]

@pytest.mark.parametrize("rate_map", examples)
def test_get_rate_mid(self, rate_map):
rate = rate_map.get_rate(rate_map.mid)
assert len(rate) == len(rate_map)
for j in range(len(rate_map)):
assert rate[j] == rate_map[rate_map.mid[j]]

@pytest.mark.parametrize("rate_map", examples)
def test_get_rate_left(self, rate_map):
rate = rate_map.get_rate(rate_map.left)
assert len(rate) == len(rate_map)
for j in range(len(rate_map)):
assert rate[j] == rate_map[rate_map.left[j]]

@pytest.mark.parametrize("rate_map", examples)
def test_get_rate_right(self, rate_map):
rate = rate_map.get_rate(rate_map.right[:-1])
assert len(rate) == len(rate_map) - 1
for j in range(len(rate_map) - 1):
assert rate[j] == rate_map[rate_map.right[j]]


class TestFindIndex:
def test_one_interval(self):
rate_map = msprime.RateMap(position=[0, 10], rate=[0.1])
Expand Down

0 comments on commit ed90211

Please sign in to comment.