@@ -11,26 +11,366 @@ kernelspec:
11
11
name : python3
12
12
---
13
13
14
+ ``` {currentmodule} tskit
15
+ ```
16
+
14
17
(sec_tskit_forward_simulations)=
15
18
16
- # _ Building a forward simulator _
19
+ # Building a forward simulator
17
20
18
- % remove underscores in title when tutorial is complete or near-complete
21
+ This tutorial shows how tskit can be used to build your own forward-time, tree sequence simulator from scratch.
22
+ The simulator will use the discrete-time Wright-Fisher model, and track individuals
23
+ along with their genomes, storing inherited genomic regions as well as the full pedigree.
19
24
20
- This tutorial shows how tskit can be used to
21
- build your own forwards-in-time tree sequence recording simulator from scratch.
25
+ The code in this tutorial is broken into separate functions for clarity and
26
+ to make it easier to modify for your own purposes; a simpler and substantially
27
+ condensed forward-simulator is coded as a single function at the top of the
28
+ {ref}` sec_completing_forwards_simulations ` tutorial.
22
29
23
30
:::{note}
24
31
If you are simply trying to obtain a tree sequence which is
25
- the result of a forwards-in -time simulation, this can be done by using one of the
26
- highly capable forwards-in -time genetic simulators that already exist, such as
32
+ the result of a forward -time simulation, this can be done by using one of the
33
+ highly capable forward -time genetic simulators that already exist, such as
27
34
[ SLiM] ( https://messerlab.org/slim/ ) or [ fwdpy11] ( https://github.com/molpopgen/fwdpy11 ) .
28
35
Documentation and tutorials for these tools exist on their respective websites. This
29
- tutorial focusses instead on illustrating the general principles that lie behind such
36
+ tutorial is instead intended to illustrate the general principles that lie behind such
30
37
simulators.
31
38
:::
32
39
33
- :::{todo}
34
- Add details on building a forward simulator (see issue
35
- [ #14 ] ( https://github.com/tskit-dev/tutorials/issues/14 ) )
40
+ We will focus on the case of diploids, in which each {ref}` individual<sec_nodes_or_individuals> ` contains 2 genomes,
41
+ but the concepts used here generalize to any ploidy, if you are willing to do the book-keeping.
42
+ The individuals themselves are not strictly necessary for representing genetic genealogies
43
+ (it's the genomes which are important), but they are needed during the simulation,
44
+ and so we record them in the resulting output for completeness.
45
+
46
+ ## Definitions
47
+
48
+ Before we can make any progress, we require a few definitions.
49
+
50
+ A * node* represents a genome at a point in time (often we imagine this as the "birth time" of the genome).
51
+ It can be described by a tuple, ` (id, time) ` , where ` id ` is a unique integer,
52
+ and ` time ` reflects the birth time of that ` id ` . When generating a tree sequence,
53
+ this will be stored in a row of the {ref}` sec_node_table_definition ` .
54
+
55
+ A * diploid individual* is a group of two nodes. During simulation,
56
+ a simple and efficient grouping assigns sequential pairs of node IDs to an individual.
57
+ It can be helpful (but not strictly necessary) to store individuals within the tree sequence
58
+ as rows of the the {ref}` sec_individual_table_definition ` (a node can then be assigned to an
59
+ individual by storing that individual's id in the appropriate row of the node table).
60
+
61
+ An * edge* reflects a transmission event between nodes. An edge is a tuple ` (Left, Right, Parent, Child) `
62
+ whose meaning is "Child genome $C$ inherited the genomic interval $[ L, R)$ from $Parent genome $P$".
63
+ In a tree sequence this is stored in a row of the {ref}` sec_edge_table_definition ` .
64
+
65
+ The * time* , in the discrete-time Wright-Fisher (WF) model which we will simulate, is measured in
66
+ integer generations. To match the tskit notion of time, we record time in * generations ago* :
67
+ i.e. for a simple simulation of G generations, we start the simulation at generation $G-1$
68
+ and count down until we reach generation 0 (the current-day).
69
+
70
+ The * population* consists of $N$ diploid individuals ($2N$ nodes) at a particular time $t$.
71
+ At the start, the population will have no known ancestry, but subsequently
72
+ each individual will be formed by choosing (at random) two parent individuals from the
73
+ population in the previous generation.
74
+
75
+ ## Approach
76
+
77
+ We will generate edges, nodes, and individuals forwards in time, adding them to the relevant ` tskit ` tables.
78
+ To aid efficiency, we will also see how to "simplify" the tables into the minimal set of nodes and edges that
79
+ describe the history of the sample. Finally, these tables can be exported into an immutable
80
+ {class}` TreeSequence ` for storing or analysis.
81
+
82
+ ### Setup
83
+ First, we'll import the necessary libraries and define some general parameters.
84
+ The [ numpy] ( https://numpy.org/doc/stable/ ) library will be used to produce random numbers.
85
+
86
+ ``` {code-cell} ipython3
87
+ import tskit
88
+ import numpy as np
89
+
90
+ random_seed = 7
91
+ random = np.random.default_rng(random_seed) # A random number generator for general use
92
+
93
+ sequence_length = 50_000 # 50 Kb
94
+ ```
95
+
96
+ ## Simulating a population
97
+
98
+ Our simulated population can be though of as set of $N$ numerical IDs corresponding to * individuals* .
99
+ Each individual ID will also be associated with a pair of numbers corresponding to node (i.e. genome) IDs.
100
+ For efficiency we keep this mapping in a Python dictionary, e.g. ` {indivID_x:(nodeID_a, nodeID_b), indivID_y:(nodeID_c, nodeID_d), ...} ` .
101
+ The required IDs are created when adding rows to the
102
+ {meth}` individual<IndividualTable.add_row> ` and {meth}` node<NodeTable.add_row> ` tables.
103
+
104
+ We will split the simulation into a few, relatively simple functions,
105
+ defining how we choose parents and genomes for children in the next generation.
106
+ The ` chose_parents() ` function below simply chooses $N$ pairs of parents at random.
107
+
108
+ ``` {code-cell} ipython3
109
+ def chose_parents(pop):
110
+ "Return a list of randomly chosen pairs of individual IDs from a population."
111
+ ids = np.array([i for i in pop.keys()], dtype=np.int32) # array of individual IDs
112
+ return [random.choice(ids, 2, replace=True) for _ in range(len(pop))]
113
+
114
+ def make_individuals(tables, num_individuals):
115
+ "Return a list of new individual IDs; used to initialise a new population."
116
+ return [tables.individuals.add_row() for _ in range(num_individuals)]
117
+
118
+ def make_individuals_from_population(tables, pop):
119
+ "Return a list of (new ID, parent IDs) tuples of individual IDs"
120
+ # note that specifying parents in add_row() is optional & simply stores the pedigree
121
+ return [(tables.individuals.add_row(parents=ids), ids) for ids in chose_parents(pop)]
122
+
123
+ flags = tskit.NODE_IS_SAMPLE # during the simulation, flag up all new nodes as samples
124
+ def make_nodes(tables, time, i, ploidy=2):
125
+ "Return (usually 2) new nodes: the genomes associated with an individual `i`"
126
+ return [tables.nodes.add_row(flags, time, individual=i) for _ in range(ploidy)]
127
+
128
+ def init_population(tables, time, N) -> dict:
129
+ "Return a new population of N individuals, each with 2 genomes"
130
+ return {i: make_nodes(tables, time, i) for i in make_individuals(tables, N)}
131
+
132
+ def next_population(tables, time, prev_pop, recombination_rate) -> dict:
133
+ "As for `init_population`, but base the new population on a previous one"
134
+ pop = {}
135
+ for i, parents in make_individuals_from_population(tables, prev_pop):
136
+ child_nodes = make_nodes(tables, time, i)
137
+ pop[i] = child_nodes
138
+
139
+ # Now add the inheritance paths, so we can track the genetic genealogy
140
+ for child_node, parent in zip(child_nodes, parents):
141
+ parent_nodes = prev_pop[parent]
142
+ # the add_inheritance_paths() function below is yet to be defined
143
+ add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate)
144
+ return pop
145
+ ```
146
+
147
+ :::{note}
148
+ For simplicity, the code above assumes any parent can be a mother or a father (i.e. this is a hermaphrodite species).
149
+ It also allows the same parent to be chosed as a mother and as a father (i.e. "selfing" is allowed),
150
+ which gives simpler theoretical results. This is easy to change if required.
151
+ :::
152
+
153
+ Our forward-time simulator simply involves repeatedly running the ` next_population() ` routine,
154
+ replacing the old population with the new one. For efficiency reasons, ` tskit ` has strict requirements
155
+ for the order of edges in the edge table, so we need to {meth}` ~TableCollection.sort ` the tables before we output the final tree sequence.
156
+
157
+ ``` {code-cell} ipython3
158
+ def forward_WF_sim(num_diploids, generations, recombination_rate=0, random_seed=7):
159
+ """
160
+ Run a forward-time Wright Fisher simulation of a diploid population, returning
161
+ a tree sequence representing the genetic genealogy of the simulated genomes.
162
+ """
163
+ global random
164
+ random = np.random.default_rng(random_seed)
165
+ tables = tskit.TableCollection(sequence_length)
166
+ tables.time_units = "generations" # optional, but helpful when plotting
167
+
168
+ pop = init_population(tables, generations, num_diploids) # initial population
169
+ while generations > 0:
170
+ generations = generations - 1
171
+ pop = next_population(tables, generations, pop, recombination_rate)
172
+
173
+ tables.sort()
174
+ return tables.tree_sequence()
175
+ ```
176
+
177
+ ### Inheritance without recombination
178
+
179
+ The final piece of the simulation is to define the ` add_inheritance_paths() ` function,
180
+ which saves the inheritance paths in the {ref}` sec_edge_table_definition ` .
181
+ For reference, the simplest case (a small focal region in which there is no recombination)
182
+ can be coded as follows:
183
+
184
+ ``` {code-cell} ipython3
185
+ def add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate):
186
+ "Add inheritance paths from a randomly chosen parent genome to the child genome."
187
+ assert recombination_rate == 0
188
+ left, right = [20_000, 21_000] # only define inheritance in this focal region
189
+ inherit_from = random.integers(2) # randomly choose the 1st or the 2nd parent node
190
+ tables.edges.add_row(left, right, parent_nodes[inherit_from], child_node)
191
+ ```
192
+
193
+ ### Inheritance with recombination
194
+
195
+ Recombination adds complexity to the inheritance paths from a child to its parents.
196
+ The function below selects a set of "breakpoints" along the genome,
197
+ and points the first edge (from zero to the first breakpoint) to one parental genome,
198
+ the second edge (from the first to the second breakpoint) to the other parent genome,
199
+ and so on up to the end of the sequence. More biologically realistic recombination
200
+ models could be substituted into this function.
201
+
202
+ Note that real recombination rates are usually such that they result in relatively
203
+ few breakpoints per chromosome (in humans, around 1 or 2).
204
+
205
+ ``` {code-cell} ipython3
206
+ def add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate):
207
+ "Add paths from parent genomes to the child genome, with crossover recombination."
208
+ L = tables.sequence_length
209
+ num_recombinations = random.poisson(recombination_rate * L)
210
+ breakpoints = random.uniform(0, L, size=num_recombinations)
211
+ breakpoints = np.concatenate(([0], np.unique(breakpoints), [L]))
212
+ inherit_from = random.integers(2) # starting parental genome
213
+
214
+ # iterate over pairs of ([0, b1], [b1, b2], [b2, b3], ... [bN, L])
215
+ for left, right in zip(breakpoints[:-1], breakpoints[1:]):
216
+ tables.edges.add_row(
217
+ left, right, parent_genomes[inherit_from], child_genome)
218
+ inherit_from = 1 - inherit_from # switch to other parent genome
219
+ ```
220
+
221
+ :::{note}
222
+ Above, breakpoint positions occur on a continuous line (i.e. "infinite breakpoint positions"),
223
+ to match population genetic theory. It is relatively easy to alter this to
224
+ allos recombinations only at integer positions
225
+ :::
226
+
227
+ ### Basic examples
228
+
229
+ Now we can test the ` forward_WF_sim() ` function for a single generation with a small
230
+ population size of 6 diploids, and print out the resulting tree sequence. For simplicity,
231
+ we will set the recombination rate to zero for now.
232
+
233
+ ``` {code-cell} ipython3
234
+ ts = forward_WF_sim(6, generations=1, recombination_rate=0)
235
+ ts.draw_svg(y_axis=True, size=(500, 200))
236
+ ```
237
+
238
+ It looks like it is working correctly: all 12 genomes (6 diploids) in the current generation at time=0 trace back to a
239
+ genome in the initial generation at time=1. Note that not all individuals in the initial generation have passed on genetic
240
+ material at this genomic position (they appear as isolated nodes at the top of the plot).
241
+
242
+ Now let's simulate for a longer time period, and set a few helpful plotting parameters.
243
+
244
+ :::{note}
245
+ By convention we plot the most recent generation at the bottom of the plot
246
+ (i.e. perversely, each "tree" has leaves towards the bottom, and roots at the top)
36
247
:::
248
+
249
+ ``` {code-cell} ipython3
250
+ ts = forward_WF_sim(6, generations=20, recombination_rate=0)
251
+
252
+ graphics_params = {
253
+ "y_axis": True,
254
+ "y_label": f"Time ({ts.time_units} ago)",
255
+ "y_ticks": {i: 'Current' if i==0 else str(i) for i in range(21)},
256
+ }
257
+ ts.draw_svg(size=(1200, 400), **graphics_params)
258
+ ```
259
+
260
+ This is starting to look like a real genealogy! Clearly, however, there are many
261
+ "extinct" lineages that have not made it to the current day.
262
+
263
+ ## Simplification
264
+
265
+ The key to efficent forward-time genealogical simulation is the process of {ref}` sec_simplification ` ,
266
+ which can reduce much of the complexity shown in the tree above.
267
+ Typically, we want to remove all the lineages that do not contribute to the current day genomes.
268
+ We do this via the {meth}` ~tskit.TreeSequence.simplify ` method, specifying that only the nodes
269
+ in the current generation are "samples".
270
+
271
+ ``` {code-cell} ipython3
272
+ current_day_genomes = ts.samples(time=0)
273
+ simplified_ts = ts.simplify(current_day_genomes, keep_unary=True, filter_nodes=False)
274
+ simplified_ts.draw_svg(size=(600, 400), **graphics_params)
275
+ ```
276
+
277
+ ### Removing unreferenced nodes
278
+
279
+ We just simplified with ` filter_nodes=False ` , meaning that the tree sequence retained
280
+ all nodes even after simplification, even those that are no longer part of
281
+ the genealogy. By default (if ` filter_nodes ` is not specified), these nodes are removed,
282
+ which changes the node IDs.
283
+
284
+ ``` {code-cell} ipython3
285
+ simplified_ts = ts.simplify(current_day_genomes, keep_unary=True)
286
+ simplified_ts.draw_svg(size=(600, 300), **graphics_params)
287
+ ```
288
+
289
+ You can see that the list of nodes passed to {meth}` ~tskit.TreeSequence.simplify ` (i.e. the current-day genomes)
290
+ have become the first nodes in the table, numbered from 0..11;
291
+ the remaining (internal) nodes have been renumbered from youngest to oldest.
292
+
293
+ ### Removing more nodes
294
+
295
+ The ` keep_unary=True ` parameter meant that we kept intermediate ("unary") nodes,
296
+ even those that do not not represent branch-points in the tree.
297
+ Often these are also unneeded, and by default we remove those too, although
298
+ this will mean that we lose track of the pedigree of the individuals
299
+ (which is stored in the parents column of the {ref}` sec_individual_table_definition ` ).
300
+ Since we are removing more nodes, the node IDs of non-samples will again change.
301
+
302
+ ``` {code-cell} ipython3
303
+ simplified_ts = ts.simplify(current_day_genomes)
304
+ simplified_ts.draw_svg(size=(400, 300), y_axis=True)
305
+ ```
306
+
307
+ This is now looking much more like a "normal" genetic genealogy (a "gene tree"),
308
+ in which all the sample genomes trace back to a single common ancestor.
309
+
310
+ ## Recombination
311
+
312
+ If we pass a non-zero recombination rate to the ` forward_WF_sim() ` function, different regions
313
+ of the genome may have different ancestries. This results in multiple trees along the genome.
314
+
315
+ ``` {code-cell} ipython3
316
+ rho = 1e-7
317
+ ts = forward_WF_sim(6, generations=50, recombination_rate=rho)
318
+ print(f"A recombination rate of {rho} has created {ts.num_trees} trees over {ts.sequence_length} bp")
319
+ ```
320
+
321
+ Here's how the full (unsimplified) genealogy looks:
322
+
323
+ ``` {code-cell} ipython3
324
+ graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50]
325
+ ts.draw_svg(size=(1000, 400), time_scale="log_time", **graphics_params)
326
+ ```
327
+
328
+ Because we are showing the extinct lineages and the recombinations within them, this plot is rather confusing.
329
+ Again, the act of simplification allows us to to reduce the genealogy to something more managable,
330
+ even with many generations. This is useful both for analysis and for visualization:
331
+
332
+ ``` {code-cell} ipython3
333
+ ts = forward_WF_sim(6, generations=100, recombination_rate=rho)
334
+ simplified_ts = ts.simplify(ts.samples(time=0))
335
+ graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50]
336
+ simplified_ts.draw_svg(size=(1000, 300), time_scale="log_time", **graphics_params)
337
+ ```
338
+
339
+ ## Multiple roots
340
+
341
+ If we run the simulation for fewer generations, or have a larger population size,
342
+ we are not guaranteed that the genomes will share a common ancestor within the timeframe of our simulation.
343
+ Even with no recombination (so that all regions of the genome share the same pattern of
344
+ inheritance) the "tree" will consist of multiple unlinked topologies. In ` tskit ` this is described as a
345
+ tree with {ref}` multiple roots<sec_data_model_tree_roots> ` .
346
+
347
+ ``` {code-cell} ipython3
348
+ ts = forward_WF_sim(10, generations=20)
349
+ simplified_ts = ts.simplify(ts.samples(time=0))
350
+ graphics_params["y_ticks"] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
351
+ simplified_ts.draw_svg(size=(800, 200), **graphics_params)
352
+ ```
353
+
354
+ When a forward-simulated tree has multiple roots, it can be useful to retain relevant lineages
355
+ all the way back to the start of the simulation. This can be done using the ` keep_input_roots ` option:
356
+
357
+ ``` {code-cell} ipython3
358
+ simplified_ts = ts.simplify(ts.samples(time=0), keep_input_roots=True)
359
+ simplified_ts.draw_svg(size=(800, 300), **graphics_params)
360
+ ```
361
+
362
+ Since the trees have not all coalesced, the simulation will be failing to capture the entire genetic diversity within the sample.
363
+ Moreover, the larger the population, the longer the time needed to ensure that the full genealogy is captured.
364
+ For large models, time period required for forward simulations to ensure full coalescence can be prohibitive.
365
+
366
+ A powerful way to get around this problem is * recapitation* , in which an alternative technique,
367
+ such as backward-time coalescent simulation is used to to fill in the "head" of the tree sequence.
368
+ In other words, we use a fast backward-time simulator such as ` msprime ` to simulate the genealogy of the oldest nodes in the simplified tree sequence.
369
+ This is described in the {ref}` sec_completing_forwards_simulations ` tutorial.
370
+
371
+ ## More complex forward-simulations
372
+
373
+ The next tutorial shows the principles behind more complex simulations,
374
+ including e.g. regular simplification during the simulation,
375
+ adding mutations, and adding metadata.
376
+ It also details several extra tips and tricks we have learned when building forward simulators.
0 commit comments