RateMap doc text

tskit-dev · Mar 30, 2021 · 4dbefa6 · 4dbefa6
1 parent 3928078
commit 4dbefa6
Showing 1 changed file with 146 additions and 9 deletions.
diff --git a/docs/rate_maps.md b/docs/rate_maps.md
@@ -1,37 +1,174 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{code-cell}
+:tags: [remove-cell]
+
+    import msprime
+    from IPython.display import SVG
+    from matplotlib import pyplot as plt
+```
 
 (sec_rate_maps)=
 # Rate Maps
 
+It is often useful to express a variable rate of some sort along the chromosome.
+For example, most organisms have mutation rates and recombination rates which vary
+depending on genomic position. Rates of this sort can be stored as an instance
+of a {class}`.RateMap`, and then passed as a parameter to the relevant `msprime`
+method. For instance, {func}`.sim_ancestry` accepts a {class}`.RateMap` as the
+`recombination_rate` parameter and {func}`.sim_mutations` accepts a {class}`.RateMap`
+as the `rate` parameter.
+
+A rate map is defined as a set of constant rates which apply piecewise
+over contiguous regions of the chromosome. An example is shown below, both as a
+table and as a plot of the rate at different genomic positions.
 
-```{eval-rst}
-.. todo:: Need to fill out the content here following existing patterns.
+```{code-cell}
+:tags: [hide-input]
+ratemap = msprime.RateMap(
+    position=[0, 4000, 9000, 11000, 20000, 30000],
+    rate=[0, 1, 6, 2, 1]
+    )
+display(ratemap)
+
+plt.figure(figsize=(10, 4))
+plt.stairs(ratemap.rate, ratemap.position)
+plt.title("Example RateMap with 5 rates, and a hotspot around position 10000")
+plt.xlabel("Genome position")
+plt.ylabel("Rate")
+plt.xlim(0,ratemap.sequence_length)
+plt.show()
 ```
 
+---
+
+(sec_rate_maps_quickref)=
+
+## Quick reference
+
 {class}`.RateMap`
-: Vary rates along a genome
+: The RateMap class: an instance can be created from provided data
 
-**Creating RateMaps**
+**Creating rate maps**
 
 {meth}`.RateMap.uniform`
-: A RateMap with one rate
+: Create a RateMap with a single rate over the entire chromosome
+
+{meth}`.RateMap.read_hapmap`
+: Read in a RateMap from a file in "hapmap" format (common in recombination maps)
 
 {meth}`.RateMap.slice`
-: Slice out a portion of a RateMap
+: Create a new RateMap by slicing out a portion of an existing RateMap
 
 **Computing rates**
 
-{meth}`.RateMap.mean_rate`
-: Weighted average rate
+{attr}`.RateMap.mean_rate`
+: The weighted average rate
+
+{attr}`.RateMap.total_mass`
+: The cumulative sum of rates over the entire map
 
 {meth}`.RateMap.get_rate`
-: Compute rate at a position
+: Compute rate at a set of user-specified positions
+
+{meth}`.RateMap.get_cumulative_mass`
+: Compute the cumulative rate at a set of user-specified positions
 
 
 (sec_rate_maps_creating)=
 ## Creating rate maps
 
+A rate map can be created by initializing an instance of the {class}`.RateMap` class
+with a list of _n+1_ positions and _n_ rates. Alternatively, uniform rate maps can be
+created as described below, or rates read in from a text file. The last position of a
+rate map is taken as the sequence length: it is your responsibility to ensure that this
+is the same length as the simulated chromosome to which the rate is applied.
+
 (sec_rate_maps_creating_uniform)=
 ### Uniform rate maps
 
+The most basic rate map is a uniform rate over the entire sequence. This can simply be
+generated by {meth}`.RateMap.uniform`:
+
+```{code-cell}
+msprime.RateMap.uniform(length=1e8, rate=0.5)
+```
+
+### Reading from a file
+
+Describe read_hapmap briefly here.
+
+(sec_rate_maps_masking_and_slicing)=
+### Masking and creating new maps via slicing
+
+Often, the rates at the start and end of the chromosome (i.e. in the telomeric regions)
+are unknown, and treated as zero. Therefore if there is a zero-rate region at the start
+or end of the list of provided rates, an alternative start position and/or end position
+can be defined when creating the map such that some or all of these terminal zero-rate
+regions are "masked out" (i.e. not counted) when calculating the mean rate over the
+entire map.
+
+This sort of zeroing-out can also be handy if only a small region of the rate map is
+needed, for instance, because simulation is focussed on a small region of a larger
+chromosome. The {meth}`.RateMap.slice` method does this by setting user-specified
+start and end positions, overwriting the rate to the left of the start and to the right
+of the end with zero. The benefit to doing this is that the original coordinate system
+is still used. For example, if you wish to simulate a 40 kb section of human chromosome
+1 from position 0.7Mb, but keep the standard chromosome 1 genomic coordinates, you could
+do this:
+
+```{code-cell}
+import io
+# First and last sections of human chr 1 build 36 genetic map taken from
+# https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/
+hapmap_chr1_snippet = io.StringIO("""chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
+1 72434 0.0015000000 0
+1 78032 0.0015000000 0.0000083970
+1 554461 0.0015000000 0.0007230405
+1 554484 0.0015000000 0.0007230750
+1 555296 0.0015000000 0.0007242930
+1 558185 0.0015000000 0.0007286265
+1 558390 0.0010000000 0.0007289340
+1 711153 1.9757156611 0.0008816970
+1 713682 2.0415081787 0.0058782819
+1 713754 2.1264889217 0.0060252705
+1 718105 2.1260252999 0.0152776238
+1 719811 2.1911540342 0.0189046230
+1 728873 2.1938678585 0.0387608608
+1 730720 2.1951341342 0.0428129347
+1 740098 1.3377804252 0.0633989027
+1 742429 0.9035885389 0.0665172688
+1 743132 0.9037201735 0.0671524916  # Next 256440 lines snipped
+1 247184904 0.6238867396 278.0908415443
+1 247185273 0 278.0910717585""")
+large_ratemap = msprime.RateMap.read_hapmap(hapmap_chr1_snippet)
+sliced_ratemap = large_ratemap.slice(700e3, 740e3)
+# Simulate only 40kb
+ts = msprime.sim_ancestry(10, recombination_rate=sliced_ratemap)
+# Only keep the trees in the sliced region. Then if we then mutate, we will not
+# create mutations in the flanking regions
+ts = ts.keep_intervals([[700e3, 740e3]])
+```
+
+Because the recombination rate is 0 in regions below position 700 000 and above position
+740 000, `msprime` has not had to put effort into simulating recombination (and its
+effect on ancestry) in these regions, so the simulation completes very quickly, even
+though the resulting tree sequence has a `sequence_length` of 247 megabases.
+
 (sec_rate_maps_creating_hotspots)=
 ### Hotspots
+
+### Extracting values at specific positions
+
+Document the used of `get_cumulative_mass` to convert physical to genetic coordinates.