Skip to content

Commit

Permalink
RateMap doc text
Browse files Browse the repository at this point in the history
  • Loading branch information
hyanwong authored and jeromekelleher committed Mar 30, 2021
1 parent 3928078 commit 4dbefa6
Showing 1 changed file with 146 additions and 9 deletions.
155 changes: 146 additions & 9 deletions docs/rate_maps.md
Original file line number Diff line number Diff line change
@@ -1,37 +1,174 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.12
jupytext_version: 1.9.1
kernelspec:
display_name: Python 3
language: python
name: python3
---

```{code-cell}
:tags: [remove-cell]
import msprime
from IPython.display import SVG
from matplotlib import pyplot as plt
```

(sec_rate_maps)=
# Rate Maps

It is often useful to express a variable rate of some sort along the chromosome.
For example, most organisms have mutation rates and recombination rates which vary
depending on genomic position. Rates of this sort can be stored as an instance
of a {class}`.RateMap`, and then passed as a parameter to the relevant `msprime`
method. For instance, {func}`.sim_ancestry` accepts a {class}`.RateMap` as the
`recombination_rate` parameter and {func}`.sim_mutations` accepts a {class}`.RateMap`
as the `rate` parameter.

A rate map is defined as a set of constant rates which apply piecewise
over contiguous regions of the chromosome. An example is shown below, both as a
table and as a plot of the rate at different genomic positions.

```{eval-rst}
.. todo:: Need to fill out the content here following existing patterns.
```{code-cell}
:tags: [hide-input]
ratemap = msprime.RateMap(
position=[0, 4000, 9000, 11000, 20000, 30000],
rate=[0, 1, 6, 2, 1]
)
display(ratemap)
plt.figure(figsize=(10, 4))
plt.stairs(ratemap.rate, ratemap.position)
plt.title("Example RateMap with 5 rates, and a hotspot around position 10000")
plt.xlabel("Genome position")
plt.ylabel("Rate")
plt.xlim(0,ratemap.sequence_length)
plt.show()
```

---

(sec_rate_maps_quickref)=

## Quick reference

{class}`.RateMap`
: Vary rates along a genome
: The RateMap class: an instance can be created from provided data

**Creating RateMaps**
**Creating rate maps**

{meth}`.RateMap.uniform`
: A RateMap with one rate
: Create a RateMap with a single rate over the entire chromosome

{meth}`.RateMap.read_hapmap`
: Read in a RateMap from a file in "hapmap" format (common in recombination maps)

{meth}`.RateMap.slice`
: Slice out a portion of a RateMap
: Create a new RateMap by slicing out a portion of an existing RateMap

**Computing rates**

{meth}`.RateMap.mean_rate`
: Weighted average rate
{attr}`.RateMap.mean_rate`
: The weighted average rate

{attr}`.RateMap.total_mass`
: The cumulative sum of rates over the entire map

{meth}`.RateMap.get_rate`
: Compute rate at a position
: Compute rate at a set of user-specified positions

{meth}`.RateMap.get_cumulative_mass`
: Compute the cumulative rate at a set of user-specified positions


(sec_rate_maps_creating)=
## Creating rate maps

A rate map can be created by initializing an instance of the {class}`.RateMap` class
with a list of _n+1_ positions and _n_ rates. Alternatively, uniform rate maps can be
created as described below, or rates read in from a text file. The last position of a
rate map is taken as the sequence length: it is your responsibility to ensure that this
is the same length as the simulated chromosome to which the rate is applied.

(sec_rate_maps_creating_uniform)=
### Uniform rate maps

The most basic rate map is a uniform rate over the entire sequence. This can simply be
generated by {meth}`.RateMap.uniform`:

```{code-cell}
msprime.RateMap.uniform(length=1e8, rate=0.5)
```

### Reading from a file

Describe read_hapmap briefly here.

(sec_rate_maps_masking_and_slicing)=
### Masking and creating new maps via slicing

Often, the rates at the start and end of the chromosome (i.e. in the telomeric regions)
are unknown, and treated as zero. Therefore if there is a zero-rate region at the start
or end of the list of provided rates, an alternative start position and/or end position
can be defined when creating the map such that some or all of these terminal zero-rate
regions are "masked out" (i.e. not counted) when calculating the mean rate over the
entire map.

This sort of zeroing-out can also be handy if only a small region of the rate map is
needed, for instance, because simulation is focussed on a small region of a larger
chromosome. The {meth}`.RateMap.slice` method does this by setting user-specified
start and end positions, overwriting the rate to the left of the start and to the right
of the end with zero. The benefit to doing this is that the original coordinate system
is still used. For example, if you wish to simulate a 40 kb section of human chromosome
1 from position 0.7Mb, but keep the standard chromosome 1 genomic coordinates, you could
do this:

```{code-cell}
import io
# First and last sections of human chr 1 build 36 genetic map taken from
# https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/
hapmap_chr1_snippet = io.StringIO("""chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
1 72434 0.0015000000 0
1 78032 0.0015000000 0.0000083970
1 554461 0.0015000000 0.0007230405
1 554484 0.0015000000 0.0007230750
1 555296 0.0015000000 0.0007242930
1 558185 0.0015000000 0.0007286265
1 558390 0.0010000000 0.0007289340
1 711153 1.9757156611 0.0008816970
1 713682 2.0415081787 0.0058782819
1 713754 2.1264889217 0.0060252705
1 718105 2.1260252999 0.0152776238
1 719811 2.1911540342 0.0189046230
1 728873 2.1938678585 0.0387608608
1 730720 2.1951341342 0.0428129347
1 740098 1.3377804252 0.0633989027
1 742429 0.9035885389 0.0665172688
1 743132 0.9037201735 0.0671524916 # Next 256440 lines snipped
1 247184904 0.6238867396 278.0908415443
1 247185273 0 278.0910717585""")
large_ratemap = msprime.RateMap.read_hapmap(hapmap_chr1_snippet)
sliced_ratemap = large_ratemap.slice(700e3, 740e3)
# Simulate only 40kb
ts = msprime.sim_ancestry(10, recombination_rate=sliced_ratemap)
# Only keep the trees in the sliced region. Then if we then mutate, we will not
# create mutations in the flanking regions
ts = ts.keep_intervals([[700e3, 740e3]])
```

Because the recombination rate is 0 in regions below position 700 000 and above position
740 000, `msprime` has not had to put effort into simulating recombination (and its
effect on ancestry) in these regions, so the simulation completes very quickly, even
though the resulting tree sequence has a `sequence_length` of 247 megabases.

(sec_rate_maps_creating_hotspots)=
### Hotspots

### Extracting values at specific positions

Document the used of `get_cumulative_mass` to convert physical to genetic coordinates.

0 comments on commit 4dbefa6

Please sign in to comment.