From 4dbefa6743b894468c0594add62dd8e9d33678a4 Mon Sep 17 00:00:00 2001 From: Yan Wong Date: Sun, 28 Mar 2021 22:17:58 +0100 Subject: [PATCH] RateMap doc text --- docs/rate_maps.md | 155 +++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 146 insertions(+), 9 deletions(-) diff --git a/docs/rate_maps.md b/docs/rate_maps.md index 8831d7bae..97f714d4d 100644 --- a/docs/rate_maps.md +++ b/docs/rate_maps.md @@ -1,37 +1,174 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.12 + jupytext_version: 1.9.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +```{code-cell} +:tags: [remove-cell] + + import msprime + from IPython.display import SVG + from matplotlib import pyplot as plt +``` (sec_rate_maps)= # Rate Maps +It is often useful to express a variable rate of some sort along the chromosome. +For example, most organisms have mutation rates and recombination rates which vary +depending on genomic position. Rates of this sort can be stored as an instance +of a {class}`.RateMap`, and then passed as a parameter to the relevant `msprime` +method. For instance, {func}`.sim_ancestry` accepts a {class}`.RateMap` as the +`recombination_rate` parameter and {func}`.sim_mutations` accepts a {class}`.RateMap` +as the `rate` parameter. + +A rate map is defined as a set of constant rates which apply piecewise +over contiguous regions of the chromosome. An example is shown below, both as a +table and as a plot of the rate at different genomic positions. -```{eval-rst} -.. todo:: Need to fill out the content here following existing patterns. +```{code-cell} +:tags: [hide-input] +ratemap = msprime.RateMap( + position=[0, 4000, 9000, 11000, 20000, 30000], + rate=[0, 1, 6, 2, 1] + ) +display(ratemap) + +plt.figure(figsize=(10, 4)) +plt.stairs(ratemap.rate, ratemap.position) +plt.title("Example RateMap with 5 rates, and a hotspot around position 10000") +plt.xlabel("Genome position") +plt.ylabel("Rate") +plt.xlim(0,ratemap.sequence_length) +plt.show() ``` +--- + +(sec_rate_maps_quickref)= + +## Quick reference + {class}`.RateMap` -: Vary rates along a genome +: The RateMap class: an instance can be created from provided data -**Creating RateMaps** +**Creating rate maps** {meth}`.RateMap.uniform` -: A RateMap with one rate +: Create a RateMap with a single rate over the entire chromosome + +{meth}`.RateMap.read_hapmap` +: Read in a RateMap from a file in "hapmap" format (common in recombination maps) {meth}`.RateMap.slice` -: Slice out a portion of a RateMap +: Create a new RateMap by slicing out a portion of an existing RateMap **Computing rates** -{meth}`.RateMap.mean_rate` -: Weighted average rate +{attr}`.RateMap.mean_rate` +: The weighted average rate + +{attr}`.RateMap.total_mass` +: The cumulative sum of rates over the entire map {meth}`.RateMap.get_rate` -: Compute rate at a position +: Compute rate at a set of user-specified positions + +{meth}`.RateMap.get_cumulative_mass` +: Compute the cumulative rate at a set of user-specified positions (sec_rate_maps_creating)= ## Creating rate maps +A rate map can be created by initializing an instance of the {class}`.RateMap` class +with a list of _n+1_ positions and _n_ rates. Alternatively, uniform rate maps can be +created as described below, or rates read in from a text file. The last position of a +rate map is taken as the sequence length: it is your responsibility to ensure that this +is the same length as the simulated chromosome to which the rate is applied. + (sec_rate_maps_creating_uniform)= ### Uniform rate maps +The most basic rate map is a uniform rate over the entire sequence. This can simply be +generated by {meth}`.RateMap.uniform`: + +```{code-cell} +msprime.RateMap.uniform(length=1e8, rate=0.5) +``` + +### Reading from a file + +Describe read_hapmap briefly here. + +(sec_rate_maps_masking_and_slicing)= +### Masking and creating new maps via slicing + +Often, the rates at the start and end of the chromosome (i.e. in the telomeric regions) +are unknown, and treated as zero. Therefore if there is a zero-rate region at the start +or end of the list of provided rates, an alternative start position and/or end position +can be defined when creating the map such that some or all of these terminal zero-rate +regions are "masked out" (i.e. not counted) when calculating the mean rate over the +entire map. + +This sort of zeroing-out can also be handy if only a small region of the rate map is +needed, for instance, because simulation is focussed on a small region of a larger +chromosome. The {meth}`.RateMap.slice` method does this by setting user-specified +start and end positions, overwriting the rate to the left of the start and to the right +of the end with zero. The benefit to doing this is that the original coordinate system +is still used. For example, if you wish to simulate a 40 kb section of human chromosome +1 from position 0.7Mb, but keep the standard chromosome 1 genomic coordinates, you could +do this: + +```{code-cell} +import io +# First and last sections of human chr 1 build 36 genetic map taken from +# https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/ +hapmap_chr1_snippet = io.StringIO("""chr position COMBINED_rate(cM/Mb) Genetic_Map(cM) +1 72434 0.0015000000 0 +1 78032 0.0015000000 0.0000083970 +1 554461 0.0015000000 0.0007230405 +1 554484 0.0015000000 0.0007230750 +1 555296 0.0015000000 0.0007242930 +1 558185 0.0015000000 0.0007286265 +1 558390 0.0010000000 0.0007289340 +1 711153 1.9757156611 0.0008816970 +1 713682 2.0415081787 0.0058782819 +1 713754 2.1264889217 0.0060252705 +1 718105 2.1260252999 0.0152776238 +1 719811 2.1911540342 0.0189046230 +1 728873 2.1938678585 0.0387608608 +1 730720 2.1951341342 0.0428129347 +1 740098 1.3377804252 0.0633989027 +1 742429 0.9035885389 0.0665172688 +1 743132 0.9037201735 0.0671524916 # Next 256440 lines snipped +1 247184904 0.6238867396 278.0908415443 +1 247185273 0 278.0910717585""") +large_ratemap = msprime.RateMap.read_hapmap(hapmap_chr1_snippet) +sliced_ratemap = large_ratemap.slice(700e3, 740e3) +# Simulate only 40kb +ts = msprime.sim_ancestry(10, recombination_rate=sliced_ratemap) +# Only keep the trees in the sliced region. Then if we then mutate, we will not +# create mutations in the flanking regions +ts = ts.keep_intervals([[700e3, 740e3]]) +``` + +Because the recombination rate is 0 in regions below position 700 000 and above position +740 000, `msprime` has not had to put effort into simulating recombination (and its +effect on ancestry) in these regions, so the simulation completes very quickly, even +though the resulting tree sequence has a `sequence_length` of 247 megabases. + (sec_rate_maps_creating_hotspots)= ### Hotspots + +### Extracting values at specific positions + +Document the used of `get_cumulative_mass` to convert physical to genetic coordinates. \ No newline at end of file