Skip to content

Commit c3d5f27

Browse files
committed
Fill out the forward simulation tutorial
See extensive discussion at tskit-dev#14
1 parent c61a151 commit c3d5f27

File tree

1 file changed

+350
-10
lines changed

1 file changed

+350
-10
lines changed

forward_sims.md

+350-10
Original file line numberDiff line numberDiff line change
@@ -11,26 +11,366 @@ kernelspec:
1111
name: python3
1212
---
1313

14+
```{currentmodule} tskit
15+
```
16+
1417
(sec_tskit_forward_simulations)=
1518

16-
# _Building a forward simulator_
19+
# Building a forward simulator
1720

18-
% remove underscores in title when tutorial is complete or near-complete
21+
This tutorial shows how tskit can be used to build your own forward-time, tree sequence simulator from scratch.
22+
The simulator will use the discrete-time Wright-Fisher model, and track individuals
23+
along with their genomes, storing inherited genomic regions as well as the full pedigree.
1924

20-
This tutorial shows how tskit can be used to
21-
build your own forwards-in-time tree sequence recording simulator from scratch.
25+
The code in this tutorial is broken into separate functions for clarity and
26+
to make it easier to modify for your own purposes; a simpler and substantially
27+
condensed forward-simulator is coded as a single function at the top of the
28+
{ref}`sec_completing_forwards_simulations` tutorial.
2229

2330
:::{note}
2431
If you are simply trying to obtain a tree sequence which is
25-
the result of a forwards-in-time simulation, this can be done by using one of the
26-
highly capable forwards-in-time genetic simulators that already exist, such as
32+
the result of a forward-time simulation, this can be done by using one of the
33+
highly capable forward-time genetic simulators that already exist, such as
2734
[SLiM](https://messerlab.org/slim/) or [fwdpy11](https://github.com/molpopgen/fwdpy11).
2835
Documentation and tutorials for these tools exist on their respective websites. This
29-
tutorial focusses instead on illustrating the general principles that lie behind such
36+
tutorial is instead intended to illustrate the general principles that lie behind such
3037
simulators.
3138
:::
3239

33-
:::{todo}
34-
Add details on building a forward simulator (see issue
35-
[#14](https://github.com/tskit-dev/tutorials/issues/14))
40+
We will focus on the case of diploids, in which each {ref}`individual<sec_nodes_or_individuals>` contains 2 genomes,
41+
but the concepts used here generalize to any ploidy, if you are willing to do the book-keeping.
42+
The individuals themselves are not strictly necessary for representing genetic genealogies
43+
(it's the genomes which are important), but they are needed during the simulation,
44+
and so we record them in the resulting output for completeness.
45+
46+
## Definitions
47+
48+
Before we can make any progress, we require a few definitions.
49+
50+
A *node* represents a genome at a point in time (often we imagine this as the "birth time" of the genome).
51+
It can be described by a tuple, `(id, time)`, where `id` is a unique integer,
52+
and `time` reflects the birth time of that `id`. When generating a tree sequence,
53+
this will be stored in a row of the {ref}`sec_node_table_definition`.
54+
55+
A *diploid individual* is a group of two nodes. During simulation,
56+
a simple and efficient grouping assigns sequential pairs of node IDs to an individual.
57+
It can be helpful (but not strictly necessary) to store individuals within the tree sequence
58+
as rows of the the {ref}`sec_individual_table_definition` (a node can then be assigned to an
59+
individual by storing that individual's id in the appropriate row of the node table).
60+
61+
An *edge* reflects a transmission event between nodes. An edge is a tuple `(Left, Right, Parent, Child)`
62+
whose meaning is "Child genome $C$ inherited the genomic interval $[L, R)$ from $Parent genome $P$".
63+
In a tree sequence this is stored in a row of the {ref}`sec_edge_table_definition`.
64+
65+
The *time*, in the discrete-time Wright-Fisher (WF) model which we will simulate, is measured in
66+
integer generations. To match the tskit notion of time, we record time in *generations ago*:
67+
i.e. for a simple simulation of G generations, we start the simulation at generation $G-1$
68+
and count down until we reach generation 0 (the current-day).
69+
70+
The *population* consists of $N$ diploid individuals ($2N$ nodes) at a particular time $t$.
71+
At the start, the population will have no known ancestry, but subsequently
72+
each individual will be formed by choosing (at random) two parent individuals from the
73+
population in the previous generation.
74+
75+
## Approach
76+
77+
We will generate edges, nodes, and individuals forwards in time, adding them to the relevant `tskit` tables.
78+
To aid efficiency, we will also see how to "simplify" the tables into the minimal set of nodes and edges that
79+
describe the history of the sample. Finally, these tables can be exported into an immutable
80+
{class}`TreeSequence` for storing or analysis.
81+
82+
### Setup
83+
First, we'll import the necessary libraries and define some general parameters.
84+
The [numpy](https://numpy.org/doc/stable/) library will be used to produce random numbers.
85+
86+
```{code-cell} ipython3
87+
import tskit
88+
import numpy as np
89+
90+
random_seed = 7
91+
random = np.random.default_rng(random_seed) # A random number generator for general use
92+
93+
sequence_length = 50_000 # 50 Kb
94+
```
95+
96+
## Simulating a population
97+
98+
Our simulated population can be though of as set of $N$ numerical IDs corresponding to *individuals*.
99+
Each individual ID will also be associated with a pair of numbers corresponding to node (i.e. genome) IDs.
100+
For efficiency we keep this mapping in a Python dictionary, e.g. `{indivID_x:(nodeID_a, nodeID_b), indivID_y:(nodeID_c, nodeID_d), ...}`.
101+
The required IDs are created when adding rows to the
102+
{meth}`individual<IndividualTable.add_row>` and {meth}`node<NodeTable.add_row>` tables.
103+
104+
We will split the simulation into a few, relatively simple functions,
105+
defining how we choose parents and genomes for children in the next generation.
106+
The `chose_parents()` function below simply chooses $N$ pairs of parents at random.
107+
108+
```{code-cell} ipython3
109+
def chose_parents(pop):
110+
"Return a list of randomly chosen pairs of individual IDs from a population."
111+
ids = np.array([i for i in pop.keys()], dtype=np.int32) # array of individual IDs
112+
return [random.choice(ids, 2, replace=True) for _ in range(len(pop))]
113+
114+
def make_individuals(tables, num_individuals):
115+
"Return a list of new individual IDs; used to initialise a new population."
116+
return [tables.individuals.add_row() for _ in range(num_individuals)]
117+
118+
def make_individuals_from_population(tables, pop):
119+
"Return a list of (new ID, parent IDs) tuples of individual IDs"
120+
# note that specifying parents in add_row() is optional & simply stores the pedigree
121+
return [(tables.individuals.add_row(parents=ids), ids) for ids in chose_parents(pop)]
122+
123+
flags = tskit.NODE_IS_SAMPLE # during the simulation, flag up all new nodes as samples
124+
def make_nodes(tables, time, i, ploidy=2):
125+
"Return (usually 2) new nodes: the genomes associated with an individual `i`"
126+
return [tables.nodes.add_row(flags, time, individual=i) for _ in range(ploidy)]
127+
128+
def init_population(tables, time, N) -> dict:
129+
"Return a new population of N individuals, each with 2 genomes"
130+
return {i: make_nodes(tables, time, i) for i in make_individuals(tables, N)}
131+
132+
def next_population(tables, time, prev_pop, recombination_rate) -> dict:
133+
"As for `init_population`, but base the new population on a previous one"
134+
pop = {}
135+
for i, parents in make_individuals_from_population(tables, prev_pop):
136+
child_nodes = make_nodes(tables, time, i)
137+
pop[i] = child_nodes
138+
139+
# Now add the inheritance paths, so we can track the genetic genealogy
140+
for child_node, parent in zip(child_nodes, parents):
141+
parent_nodes = prev_pop[parent]
142+
# the add_inheritance_paths() function below is yet to be defined
143+
add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate)
144+
return pop
145+
```
146+
147+
:::{note}
148+
For simplicity, the code above assumes any parent can be a mother or a father (i.e. this is a hermaphrodite species).
149+
It also allows the same parent to be chosed as a mother and as a father (i.e. "selfing" is allowed),
150+
which gives simpler theoretical results. This is easy to change if required.
151+
:::
152+
153+
Our forward-time simulator simply involves repeatedly running the `next_population()` routine,
154+
replacing the old population with the new one. For efficiency reasons, `tskit` has strict requirements
155+
for the order of edges in the edge table, so we need to {meth}`~TableCollection.sort` the tables before we output the final tree sequence.
156+
157+
```{code-cell} ipython3
158+
def forward_WF_sim(num_diploids, generations, recombination_rate=0, random_seed=7):
159+
"""
160+
Run a forward-time Wright Fisher simulation of a diploid population, returning
161+
a tree sequence representing the genetic genealogy of the simulated genomes.
162+
"""
163+
global random
164+
random = np.random.default_rng(random_seed)
165+
tables = tskit.TableCollection(sequence_length)
166+
tables.time_units = "generations" # optional, but helpful when plotting
167+
168+
pop = init_population(tables, generations, num_diploids) # initial population
169+
while generations > 0:
170+
generations = generations - 1
171+
pop = next_population(tables, generations, pop, recombination_rate)
172+
173+
tables.sort()
174+
return tables.tree_sequence()
175+
```
176+
177+
### Inheritance without recombination
178+
179+
The final piece of the simulation is to define the `add_inheritance_paths()` function,
180+
which saves the inheritance paths in the {ref}`sec_edge_table_definition`.
181+
For reference, the simplest case (a small focal region in which there is no recombination)
182+
can be coded as follows:
183+
184+
```{code-cell} ipython3
185+
def add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate):
186+
"Add inheritance paths from a randomly chosen parent genome to the child genome."
187+
assert recombination_rate == 0
188+
left, right = [20_000, 21_000] # only define inheritance in this focal region
189+
inherit_from = random.integers(2) # randomly choose the 1st or the 2nd parent node
190+
tables.edges.add_row(left, right, parent_nodes[inherit_from], child_node)
191+
```
192+
193+
### Inheritance with recombination
194+
195+
Recombination adds complexity to the inheritance paths from a child to its parents.
196+
The function below selects a set of "breakpoints" along the genome,
197+
and points the first edge (from zero to the first breakpoint) to one parental genome,
198+
the second edge (from the first to the second breakpoint) to the other parent genome,
199+
and so on up to the end of the sequence. More biologically realistic recombination
200+
models could be substituted into this function.
201+
202+
Note that real recombination rates are usually such that they result in relatively
203+
few breakpoints per chromosome (in humans, around 1 or 2).
204+
205+
```{code-cell} ipython3
206+
def add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate):
207+
"Add paths from parent genomes to the child genome, with crossover recombination."
208+
L = tables.sequence_length
209+
num_recombinations = random.poisson(recombination_rate * L)
210+
breakpoints = random.uniform(0, L, size=num_recombinations)
211+
breakpoints = np.concatenate(([0], np.unique(breakpoints), [L]))
212+
inherit_from = random.integers(2) # starting parental genome
213+
214+
# iterate over pairs of ([0, b1], [b1, b2], [b2, b3], ... [bN, L])
215+
for left, right in zip(breakpoints[:-1], breakpoints[1:]):
216+
tables.edges.add_row(
217+
left, right, parent_genomes[inherit_from], child_genome)
218+
inherit_from = 1 - inherit_from # switch to other parent genome
219+
```
220+
221+
:::{note}
222+
Above, breakpoint positions occur on a continuous line (i.e. "infinite breakpoint positions"),
223+
to match population genetic theory. It is relatively easy to alter this to
224+
allos recombinations only at integer positions
225+
:::
226+
227+
### Basic examples
228+
229+
Now we can test the `forward_WF_sim()` function for a single generation with a small
230+
population size of 6 diploids, and print out the resulting tree sequence. For simplicity,
231+
we will set the recombination rate to zero for now.
232+
233+
```{code-cell} ipython3
234+
ts = forward_WF_sim(6, generations=1, recombination_rate=0)
235+
ts.draw_svg(y_axis=True, size=(500, 200))
236+
```
237+
238+
It looks like it is working correctly: all 12 genomes (6 diploids) in the current generation at time=0 trace back to a
239+
genome in the initial generation at time=1. Note that not all individuals in the initial generation have passed on genetic
240+
material at this genomic position (they appear as isolated nodes at the top of the plot).
241+
242+
Now let's simulate for a longer time period, and set a few helpful plotting parameters.
243+
244+
:::{note}
245+
By convention we plot the most recent generation at the bottom of the plot
246+
(i.e. perversely, each "tree" has leaves towards the bottom, and roots at the top)
36247
:::
248+
249+
```{code-cell} ipython3
250+
ts = forward_WF_sim(6, generations=20, recombination_rate=0)
251+
252+
graphics_params = {
253+
"y_axis": True,
254+
"y_label": f"Time ({ts.time_units} ago)",
255+
"y_ticks": {i: 'Current' if i==0 else str(i) for i in range(21)},
256+
}
257+
ts.draw_svg(size=(1200, 400), **graphics_params)
258+
```
259+
260+
This is starting to look like a real genealogy! Clearly, however, there are many
261+
"extinct" lineages that have not made it to the current day.
262+
263+
## Simplification
264+
265+
The key to efficent forward-time genealogical simulation is the process of {ref}`sec_simplification`,
266+
which can reduce much of the complexity shown in the tree above.
267+
Typically, we want to remove all the lineages that do not contribute to the current day genomes.
268+
We do this via the {meth}`~tskit.TreeSequence.simplify` method, specifying that only the nodes
269+
in the current generation are "samples".
270+
271+
```{code-cell} ipython3
272+
current_day_genomes = ts.samples(time=0)
273+
simplified_ts = ts.simplify(current_day_genomes, keep_unary=True, filter_nodes=False)
274+
simplified_ts.draw_svg(size=(600, 400), **graphics_params)
275+
```
276+
277+
### Removing unreferenced nodes
278+
279+
We just simplified with `filter_nodes=False`, meaning that the tree sequence retained
280+
all nodes even after simplification, even those that are no longer part of
281+
the genealogy. By default (if `filter_nodes` is not specified), these nodes are removed,
282+
which changes the node IDs.
283+
284+
```{code-cell} ipython3
285+
simplified_ts = ts.simplify(current_day_genomes, keep_unary=True)
286+
simplified_ts.draw_svg(size=(600, 300), **graphics_params)
287+
```
288+
289+
You can see that the list of nodes passed to {meth}`~tskit.TreeSequence.simplify` (i.e. the current-day genomes)
290+
have become the first nodes in the table, numbered from 0..11;
291+
the remaining (internal) nodes have been renumbered from youngest to oldest.
292+
293+
### Removing more nodes
294+
295+
The `keep_unary=True` parameter meant that we kept intermediate ("unary") nodes,
296+
even those that do not not represent branch-points in the tree.
297+
Often these are also unneeded, and by default we remove those too, although
298+
this will mean that we lose track of the pedigree of the individuals
299+
(which is stored in the parents column of the {ref}`sec_individual_table_definition`).
300+
Since we are removing more nodes, the node IDs of non-samples will again change.
301+
302+
```{code-cell} ipython3
303+
simplified_ts = ts.simplify(current_day_genomes)
304+
simplified_ts.draw_svg(size=(400, 300), y_axis=True)
305+
```
306+
307+
This is now looking much more like a "normal" genetic genealogy (a "gene tree"),
308+
in which all the sample genomes trace back to a single common ancestor.
309+
310+
## Recombination
311+
312+
If we pass a non-zero recombination rate to the `forward_WF_sim()` function, different regions
313+
of the genome may have different ancestries. This results in multiple trees along the genome.
314+
315+
```{code-cell} ipython3
316+
rho = 1e-7
317+
ts = forward_WF_sim(6, generations=50, recombination_rate=rho)
318+
print(f"A recombination rate of {rho} has created {ts.num_trees} trees over {ts.sequence_length} bp")
319+
```
320+
321+
Here's how the full (unsimplified) genealogy looks:
322+
323+
```{code-cell} ipython3
324+
graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50]
325+
ts.draw_svg(size=(1000, 400), time_scale="log_time", **graphics_params)
326+
```
327+
328+
Because we are showing the extinct lineages and the recombinations within them, this plot is rather confusing.
329+
Again, the act of simplification allows us to to reduce the genealogy to something more managable,
330+
even with many generations. This is useful both for analysis and for visualization:
331+
332+
```{code-cell} ipython3
333+
ts = forward_WF_sim(6, generations=100, recombination_rate=rho)
334+
simplified_ts = ts.simplify(ts.samples(time=0))
335+
graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50]
336+
simplified_ts.draw_svg(size=(1000, 300), time_scale="log_time", **graphics_params)
337+
```
338+
339+
## Multiple roots
340+
341+
If we run the simulation for fewer generations, or have a larger population size,
342+
we are not guaranteed that the genomes will share a common ancestor within the timeframe of our simulation.
343+
Even with no recombination (so that all regions of the genome share the same pattern of
344+
inheritance) the "tree" will consist of multiple unlinked topologies. In `tskit` this is described as a
345+
tree with {ref}`multiple roots<sec_data_model_tree_roots>`.
346+
347+
```{code-cell} ipython3
348+
ts = forward_WF_sim(10, generations=20)
349+
simplified_ts = ts.simplify(ts.samples(time=0))
350+
graphics_params["y_ticks"] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
351+
simplified_ts.draw_svg(size=(800, 200), **graphics_params)
352+
```
353+
354+
When a forward-simulated tree has multiple roots, it can be useful to retain relevant lineages
355+
all the way back to the start of the simulation. This can be done using the `keep_input_roots` option:
356+
357+
```{code-cell} ipython3
358+
simplified_ts = ts.simplify(ts.samples(time=0), keep_input_roots=True)
359+
simplified_ts.draw_svg(size=(800, 300), **graphics_params)
360+
```
361+
362+
Since the trees have not all coalesced, the simulation will be failing to capture the entire genetic diversity within the sample.
363+
Moreover, the larger the population, the longer the time needed to ensure that the full genealogy is captured.
364+
For large models, time period required for forward simulations to ensure full coalescence can be prohibitive.
365+
366+
A powerful way to get around this problem is *recapitation*, in which an alternative technique,
367+
such as backward-time coalescent simulation is used to to fill in the "head" of the tree sequence.
368+
In other words, we use a fast backward-time simulator such as `msprime` to simulate the genealogy of the oldest nodes in the simplified tree sequence.
369+
This is described in the {ref}`sec_completing_forwards_simulations` tutorial.
370+
371+
## More complex forward-simulations
372+
373+
The next tutorial shows the principles behind more complex simulations,
374+
including e.g. regular simplification during the simulation,
375+
adding mutations, and adding metadata.
376+
It also details several extra tips and tricks we have learned when building forward simulators.

0 commit comments

Comments
 (0)