The population can be thought of as a set of numerical IDs corresponding to *individuals*. Each individual ID is also associated with a pair of numbers corresponding to node (i.e. genome) IDs. For efficiency we keep this mapping in a Python dictionary, e.g. `{indivID_X:(nodeID_a, nodeID_b), indivID_Y:(nodeID_c, nodeID_d), ...}`. The required IDs are created by adding rows to the {meth}`individual<IndividualTable.add_row>` and {meth}`node<NodeTable.add_row>` tables.

We will split the simulation into a few relatively simple functions. The only random element comes from choosing a pair of parents for each new child we want to create in the next generation, which we encapsulate in a function called `chose_parent_pairs()`.

```{code-cell} ipython3
def chose_parent_pairs(population):
    """
    Return a list of randomly chosen pairs of individual IDs from a population.
    The list is of the same length as the size of the population
    """
    ids = np.array(list(population.keys()), dtype=np.int32)
    # NB: `random` is assumed to be a NumPy random Generator defined earlier
    # in the tutorial (e.g. via np.random.default_rng), not the stdlib module
    return [
        random.choice(ids, 2, replace=True)  # To disallow selfing set replace=False
        for _ in range(len(population))
    ]

def make_individuals(tables, num_individuals):
    """
    Return a list of new individual IDs, used when initialising a new population
    """
    return [tables.individuals.add_row() for _ in range(num_individuals)]
    tables.time_units = "generations"  # optional, but helpful when plotting

    population = init_population(tables, generations, num_diploids)  # initial population
    while generations > 0:
        generations = generations - 1
        population = next_population(tables, generations, population)

    tables.sort()
    return tables.tree_sequence()
```

Now we can test `simple_diploid_sim()` for a single generation and a population size of 6, and plot the resulting tree sequence:

```{code-cell} ipython3
ts = simple_diploid_sim(6, generations=1)
ts.draw_svg(y_axis=True, size=(500, 200))
```

It looks like it is working correctly: all 12 genomes (6 diploids) in the current generation, at time=0, trace back to a genome in the initial generation at time=1. Note that not all individuals in the initial generation have passed on genetic material at this genomic position (they appear as isolated nodes at the top of the plot).

Now let's simulate for a longer time period, and set a few helpful plotting parameters.

:::{note}
By convention we plot the most recent generation at the bottom of the plot (i.e. perversely, each "tree" has leaves towards the bottom, and roots at the top).
:::

```{code-cell} ipython3
ts = simple_diploid_sim(6, generations=20)

graphics_params = {
    "y_axis": True,
    "y_label": f"Time ({ts.time_units} ago)",
    # "y_ticks": {i: 'Current' if i==0 else str(i) for i in range(16)},
}
ts.draw_svg(size=(1200, 350), **graphics_params)
```

This is starting to look like a real genealogy! But you can see that there are a lot of potentially "redundant" lineages that have not made it to the current day.

## Simplification

The key to efficient forward-time genealogical simulation is the process of [simplification](https://tskit.dev/tutorials/simplification.html), which can reduce much of the complexity shown in the tree above. Typically, we want to remove all the lineages that do not contribute to the current-day genomes. We do this via the {meth}`~tskit.TreeSequence.simplify` method, specifying that only the nodes in the current generation are "samples".

We just simplified with `filter_nodes=False`, meaning that the tree sequence retained all nodes even after simplification. However, many nodes are no longer part of the genealogy; removing them means we can store fewer nodes (although it will change the node IDs).

Note that the nodes passed to `simplify` (i.e. the current-day genomes) have become the first nodes in the table, numbered from 0 to 11, and the remaining nodes have been renumbered from youngest to oldest.

### Extra node removal

The `keep_unary=True` parameter meant that we kept intermediate ("unary") nodes, even those that do not represent branch-points in the tree. Often these are also unneeded, and by default we remove those too; this will mean that the node IDs of older nodes will change again.

It is relatively easy to modify the simulation code to allow recombination. All we need to do is to redefine the `add_inheritance_paths()` function, so that the child inherits a mosaic of the two genomes present in each parent.

Below is a redefined function which selects a set of "breakpoints" along the genome. It allocates a first edge, from zero to breakpoint 1, pointing to one parent genome, and a second edge, from breakpoint 1 to the next breakpoint (or to the end of the sequence), pointing to the other parent genome. If there is a second breakpoint, a third edge is created from breakpoint 2 to the next breakpoint, pointing back to the initial parent genome, and so forth, up to the end of the sequence. Biologically, recombination rates usually result in a relatively small number of breakpoints per chromosome (in humans, around 1 or 2).

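
Separately from the tutorial's own `add_inheritance_paths()`, the alternating scheme just described can be illustrated in plain Python. This is a minimal sketch under stated assumptions: `mosaic_segments()` is a hypothetical helper, the breakpoint positions and sequence length are made up, and each returned `(left, right, genome_index)` tuple stands in for one edge that would point to `parent_genomes[genome_index]`.

```python
import random


def mosaic_segments(breakpoints, sequence_length, first_genome=0):
    """
    Return (left, right, genome_index) tuples covering [0, sequence_length),
    switching between the two parental genomes (0 and 1) at each breakpoint.
    """
    segments = []
    inherit_from = first_genome
    left = 0.0
    for right in sorted(breakpoints) + [sequence_length]:
        if right > left:  # skip zero-length segments
            segments.append((left, right, inherit_from))
        inherit_from = 1 - inherit_from  # switch to other parent genome
        left = right
    return segments


# Two fixed breakpoints give three segments, starting and ending on genome 0
fixed = mosaic_segments([30.0, 70.0], 100.0)
print(fixed)

# Breakpoints drawn uniformly in continuous space (assumed seed and count)
rng = random.Random(1)
breaks = [rng.uniform(0, 100.0) for _ in range(2)]
randomised = mosaic_segments(breaks, 100.0, first_genome=rng.randrange(2))
print(randomised)
```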
:::{note}
Here we choose breakpoint positions in continuous space ("infinite breakpoint positions"), to match population genetic theory, although it is relatively easy to alter this to place recombinations at integer positions.
        inherit_from = 1 - inherit_from  # switch to other parent genome


# Simulate a few generations, for testing
ts = simple_diploid_sim(6, generations=5)  # Now includes recombination
ts  # Show the tree sequence
```

You can see that recombination has led to more than one tree along the genome. Here's how the full (unsimplified) genealogy looks:

```{code-cell} ipython3
ts.draw_svg(size=(1000, 300), **graphics_params)
```

This is rather confusing to visualise, and will get even worse if we simulate more generations. However, even with more generations, the act of simplification allows us to reduce the genealogy to something more manageable, both for analysis and for visualisation:

```{code-cell} ipython3
# Carry out for even more generations, but simplify the resulting ts
You can see that some of these trees still have multiple roots. In other words, 1000 generations is not long enough to capture the ancestry back to a single common ancestor (i.e. to ensure "full coalescence" of all local trees). If the local trees have not all coalesced, then the simulation will fail to capture the entire genetic diversity within the sample. Moreover, the larger the population, the longer the time needed to ensure that the full genealogy is captured. For large models, the time period required for forward simulations to ensure full coalescence can be prohibitive.

A powerful way to get around this problem is *recapitation*, in which an alternative technique, such as backward-in-time coalescent simulation, is used to fill in the "head" of the tree sequence. In other words, we use a fast backward-time simulator such as `msprime` to simulate the genealogy of the oldest nodes in the simplified tree sequence. To see how this is done, consult the [recapitation tutorial].

## More complex forward-simulations
The next tutorial shows the principles behind more complex simulations, e.g. including regular simplification during the simulation, adding mutations, and adding metadata. It also details several extra tips and tricks we have learned when building forward simulators.