Extend discussion of rate maps.

- Bump matplotlib requirement Closes #1419
tskit-dev · Mar 30, 2021 · ed90211 · ed90211
1 parent 4dbefa6
commit ed90211
Show file tree

Hide file tree

Showing 5 changed files with 242 additions and 30 deletions.
diff --git a/docs/ancestry.md b/docs/ancestry.md
@@ -2101,6 +2101,16 @@ multiple models. See the {ref}`sec_logging` section for more
 details.
 :::
 
+(sec_ancestry_unknown_rates)=
+
+## Unknown rates
+
+:::{todo}
+Give an example of doing a simulation with unknown rates on the flanks
+and show that the resulting tree sequence has missing data in these
+regions.
+:::
+
 (sec_ancestry_errors)=
 
 ## Common errors

diff --git a/docs/rate_maps.md b/docs/rate_maps.md
@@ -14,9 +14,10 @@ kernelspec:
 ```{code-cell}
 :tags: [remove-cell]
 
-    import msprime
-    from IPython.display import SVG
-    from matplotlib import pyplot as plt
+import msprime
+import numpy as np
+from IPython.display import SVG
+from matplotlib import pyplot as plt
 ```
 
 (sec_rate_maps)=
@@ -31,23 +32,27 @@ method. For instance, {func}`.sim_ancestry` accepts a {class}`.RateMap` as the
 as the `rate` parameter.
 
 A rate map is defined as a set of constant rates which apply piecewise
-over contiguous regions of the chromosome. An example is shown below, both as a
-table and as a plot of the rate at different genomic positions.
+over contiguous regions of the genome.
+For example, here we create a simple rate map over a 30kb genome:
 
 ```{code-cell}
-:tags: [hide-input]
-ratemap = msprime.RateMap(
+rate_map = msprime.RateMap(
     position=[0, 4000, 9000, 11000, 20000, 30000],
-    rate=[0, 1, 6, 2, 1]
-    )
-display(ratemap)
+    rate=[0, 0.1, 0.6, 0.2, 0.1]
+)
+rate_map
+```
+See the {ref}`sec_rate_maps_basics` section for details on these columns.
+We can see how these rates vary along the genome by plotting:
 
+```{code-cell}
+:tags: [hide-input]
 plt.figure(figsize=(10, 4))
-plt.stairs(ratemap.rate, ratemap.position)
+plt.stairs(rate_map.rate, rate_map.position)
 plt.title("Example RateMap with 5 rates, and a hotspot around position 10000")
 plt.xlabel("Genome position")
 plt.ylabel("Rate")
-plt.xlim(0,ratemap.sequence_length)
+plt.xlim(0, rate_map.sequence_length)
 plt.show()
 ```
 
@@ -58,15 +63,15 @@ plt.show()
 ## Quick reference
 
 {class}`.RateMap`
-: The RateMap class: an instance can be created from provided data
+: Represent the varying rate of processes along the genome as a piecewise function.:
 
 **Creating rate maps**
 
 {meth}`.RateMap.uniform`
 : Create a RateMap with a single rate over the entire chromosome
 
 {meth}`.RateMap.read_hapmap`
-: Read in a RateMap from a file in "hapmap" format (common in recombination maps)
+: Read in a RateMap from a file in "HapMap" format (common in recombination maps)
 
 {meth}`.RateMap.slice`
 : Create a new RateMap by slicing out a portion of an existing RateMap
@@ -80,11 +85,123 @@ plt.show()
 : The cumulative sum of rates over the entire map
 
 {meth}`.RateMap.get_rate`
-: Compute rate at a set of user-specified positions
+: Compute rate at a set of positions
 
 {meth}`.RateMap.get_cumulative_mass`
-: Compute the cumulative rate at a set of user-specified positions
+: Compute the cumulative rate at a set of positions
+
+
+(sec_rate_maps_basics)=
+## Working with RateMaps
+
+In this section we go illustrate the ways in which we can interact
+with a {class}`.RateMap` object. Throughout we'll use the example
+map from the introduction, defined over a 30kb genome:
+
+```{code-cell}
+rate_map = msprime.RateMap(
+    position=[0, 4000, 9000, 11000, 20000, 30000],
+    rate=[0, 0.1, 0.6, 0.2, 0.1]
+)
+rate_map
+```
+
+:::{seealso}
+See the {ref}`sec_rate_maps_creating` section for different ways of
+creating {class}`.RateMap` instances with various properties.
+:::
+
+To create the {class}`.RateMap` instance we called the contructor
+with two lists, ``position`` and ``rate``. These define the
+intervals along the genome and the rates in each interval. Because
+we are interested in nonoverlapping contiguous intervals, we can
+define all the {math}`n` intervals with {math}`n + 1` positions
+and correspoding {math}`n` rate values.
+
+:::{important}
+All intervals as **open-closed**; that is, the left coordinate
+is inclusive and the right coordinate exclusive.
+:::
+
+Information about each of the intervals is shown in the summary
+table above, one interval per row. So, the last interval has a
+``left`` coordinate of 20000, a ``right`` coordinate of 30000,
+a midpoint of 25000 (``mid``), has a ``span`` of 10000 base pairs
+and maps to a rate of 0.1.
+
+:::{seealso}
+See the {ref}`sec_rate_maps_basics_array` for examples of how to
+access the columns efficiently as numpy arrays.
+:::
+
+(sec_rate_maps_basics_mapping)=
+### Mapping interface
+
+The easiest way to access the rate information from a rate map is
+via the *mapping interface*. We can use the Python square bracket
+notation to get the rate at a particular point:
+
+```{code-cell}
+print("rate at position 4500 = ", rate_map[4500])
+```
+
+:::{seealso}
+The {meth}`.RateMap.get_rate` method is a more efficient way to
+compute the rate values at an array of locations along the genome.
+:::
+
+We can also iterate over the intervals in the map, much
+as if it was a Python dictionary:
+
+```{code-cell}
+for midpoint, rate in rate_map.items():
+    print(midpoint, rate, sep="\t")
+```
+
+:::{note}
+Note that the key here is the **midpoint** of each interval.
+:::
+
+Operations such as ``in`` tell us whether the a give position is
+in the rate map:
+
+
+```{code-cell}
+print("Is position 10000 in the map?", 10000 in rate_map)
+print("Is position -1 in the map?", -1 in rate_map)
+```
+
+(sec_rate_maps_basics_array)=
+### Array interface
+
+We can also access the information from the rate map using numpy
+arrays. This is usually the preferred option when working with
+rate maps, since they can be quite large.
 
+```{code-cell}
+print("interval left position :", rate_map.left)
+print("interval right position:", rate_map.right)
+print("interval midpoint      :", rate_map.mid)
+print("interval span          :", rate_map.span)
+print("interval rate          :", rate_map.rate)
+```
+
+Please see the documentation for
+{attr}`~.RateMap.left`,
+{attr}`~.RateMap.right`,
+{attr}`~.RateMap.mid`,
+{attr}`~.RateMap.span`,
+and {attr}`~.RateMap.rate` for more information.
+
+The {attr}`~.RateMap.position` array is also defined, which is
+equal to the ``left`` array with last element of the ``right``
+array (i.e., the {attr}`~.RateMap.sequence_length`) appended to
+it.
+
+:::{todo}
+Give an example of how we might do something useful with these
+arrays, like plotting something, e.g.
+:::
 
 (sec_rate_maps_creating)=
 ## Creating rate maps
@@ -102,16 +219,75 @@ The most basic rate map is a uniform rate over the entire sequence. This can sim
 generated by {meth}`.RateMap.uniform`:
 
 ```{code-cell}
-msprime.RateMap.uniform(length=1e8, rate=0.5)
+msprime.RateMap.uniform(sequence_length=1e8, rate=0.5)
+```
+
+### Reading from a text file
+
+:::{todo}
+Give an example of working with the {meth}`.RateMap.read_hapmap` method
+:::
+
+(sec_rate_maps_creating_hotspots)=
+### Hotspots
+
+:::{todo}
+Give an example of making a RateMap with some recurrent hotspots.
+:::
+
+(sec_rate_maps_unknown)=
+
+## Unknown intervals
+
+There are often times when we do not know what the rate of a particular
+process is along the genome. For example, in telomeric regions, we don't
+really know what the rate of recombination is, and we need some way
+of encoding this information.
+
+:::{seealso}
+See the {ref}`sec_ancestry_unknown_rates` section for an example
+of running an ancestry simulation with unknown recombination rates,
+and the properties of the resulting tree sequence.
+:::
+
+:::{warning}
+The unknown intervals/missing data functionality is preliminary
+and the semantics of various operations are subject to change.
+:::
+
+We mark sections of a rate map as missing data by assigning
+setting the rate to a [NaN value](https://en.wikipedia.org/wiki/NaN).
+Here, for example we create a rate map in which we have a
+uniform rate except for the 10 base pairs at either end,
+which are unknown:
+
+```{code-cell}
+rate_map = msprime.RateMap(position=[0, 10, 90, 100], rate=[np.nan, 0.1, np.nan])
+rate_map
 ```
 
-### Reading from a file
+Values such as the :attr:`~.RateMap.mean_rate` ignore the regions of missing
+data:
+
+```{code-cell}
+print("mean rate = ", rate_map.mean_rate)
+```
 
-Describe read_hapmap briefly here.
+:::{todo}
+Talk through the semantics of positions in the array being "missing".
+That is, if we ask ``5 in rate_map`` here we'll get False, because
+we don't have any value defined there. We're effectively defining
+holes in the map.
+:::
 
 (sec_rate_maps_masking_and_slicing)=
 ### Masking and creating new maps via slicing
 
+:::{todo}
+Revise this section to take into account the new unknown rates
+way of doing things.
+:::
+
 Often, the rates at the start and end of the chromosome (i.e. in the telomeric regions)
 are unknown, and treated as zero. Therefore if there is a zero-rate region at the start
 or end of the list of provided rates, an alternative start position and/or end position
@@ -132,7 +308,8 @@ do this:
 import io
 # First and last sections of human chr 1 build 36 genetic map taken from
 # https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/
-hapmap_chr1_snippet = io.StringIO("""chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
+hapmap_chr1_snippet = io.StringIO("""\
+chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
 1 72434 0.0015000000 0
 1 78032 0.0015000000 0.0000083970
 1 554461 0.0015000000 0.0007230405
@@ -156,19 +333,10 @@ large_ratemap = msprime.RateMap.read_hapmap(hapmap_chr1_snippet)
 sliced_ratemap = large_ratemap.slice(700e3, 740e3)
 # Simulate only 40kb
 ts = msprime.sim_ancestry(10, recombination_rate=sliced_ratemap)
-# Only keep the trees in the sliced region. Then if we then mutate, we will not
-# create mutations in the flanking regions
-ts = ts.keep_intervals([[700e3, 740e3]])
 ```
 
 Because the recombination rate is 0 in regions below position 700 000 and above position
 740 000, `msprime` has not had to put effort into simulating recombination (and its
 effect on ancestry) in these regions, so the simulation completes very quickly, even
 though the resulting tree sequence has a `sequence_length` of 247 megabases.
 
-(sec_rate_maps_creating_hotspots)=
-### Hotspots
-
-### Extracting values at specific positions
-
-Document the used of `get_cumulative_mass` to convert physical to genetic coordinates.
diff --git a/msprime/intervals.py b/msprime/intervals.py
@@ -304,6 +304,8 @@ def __getitem__(self, key):
         if isinstance(key, numbers.Number):
             index = self.find_index(key)
             return self.rate[index]
+        # TODO we could implement numpy array indexing here and call
+        # to get_rate.
         raise KeyError("Key {key} not in map")
 
     def _display_table(self):

diff --git a/requirements/CI-docs/requirements.txt b/requirements/CI-docs/requirements.txt
@@ -7,4 +7,4 @@ jupyter-book==0.10.0
 tskit==0.3.5
 newick==1.0.0
 PyGithub==1.53
-matplotlib==3.3.0
+matplotlib==3.4.0
diff --git a/tests/test_intervals.py b/tests/test_intervals.py
@@ -125,6 +125,38 @@ def test_interval_properties_all_known(self):
         assert list(rate_map.mid) == [0.5, 1.5, 2.5]
 
 
+class TestOperationsAllKnown:
+    examples = [
+        msprime.RateMap(position=[0, 1], rate=[0]),
+        msprime.RateMap(position=[0, 1], rate=[0.1]),
+        msprime.RateMap(position=[0, 1, 2], rate=[0.1, 0.2]),
+        msprime.RateMap(position=[0, 1, 2], rate=[0, 0.2]),
+        msprime.RateMap(position=[0, 1, 2], rate=[0.1, 1e-6]),
+        msprime.RateMap(position=range(100), rate=range(99)),
+    ]
+
+    @pytest.mark.parametrize("rate_map", examples)
+    def test_get_rate_mid(self, rate_map):
+        rate = rate_map.get_rate(rate_map.mid)
+        assert len(rate) == len(rate_map)
+        for j in range(len(rate_map)):
+            assert rate[j] == rate_map[rate_map.mid[j]]
+
+    @pytest.mark.parametrize("rate_map", examples)
+    def test_get_rate_left(self, rate_map):
+        rate = rate_map.get_rate(rate_map.left)
+        assert len(rate) == len(rate_map)
+        for j in range(len(rate_map)):
+            assert rate[j] == rate_map[rate_map.left[j]]
+
+    @pytest.mark.parametrize("rate_map", examples)
+    def test_get_rate_right(self, rate_map):
+        rate = rate_map.get_rate(rate_map.right[:-1])
+        assert len(rate) == len(rate_map) - 1
+        for j in range(len(rate_map) - 1):
+            assert rate[j] == rate_map[rate_map.right[j]]
+
+
 class TestFindIndex:
     def test_one_interval(self):
         rate_map = msprime.RateMap(position=[0, 10], rate=[0.1])