Fill out the forward simulation tutorial

hyanwong · hyanwong · commit 611a8eec69de · 2023-09-28T11:56:05.000+01:00
See extensive discussion at tskit-dev#14
diff --git a/forward_sims.md b/forward_sims.md
@@ -13,12 +13,10 @@ kernelspec:
 
 (sec_tskit_forward_simulations)=
 
-# _Building a forward simulator_
-
-% remove underscores in title when tutorial is complete or near-complete
+# Building a forward simulator
 
 This tutorial shows how tskit can be used to
-build your own forwards-in-time tree sequence recording simulator from scratch.
+build your own forwards-in-time tree-sequence-recording simulator from scratch.
 
 :::{note}
 If you are simply trying to obtain a tree sequence which is
@@ -30,7 +28,306 @@ tutorial focusses instead on illustrating the general principles that lie behind
 simulators.
 :::
 
-:::{todo}
-Add details on building a forward simulator (see issue
-[#14](https://github.com/tskit-dev/tutorials/issues/14))
+We will focus on the case of diploids, in which each {ref}`individual<sec_nodes_or_individuals>` contains 2 genomes,
+but the concepts used here generalize to any ploidy, if you are willing to do the book-keeping.
+The individuals themselves are not strictly necessary for representing genetic genealogies
+(it's the genomes which are important), but they are needed during the simulation,
+and so we record them in the resulting output for completeness.
+
+## Definitions
+
+Before we can make any progress, we require a few definitions.
+
+A *node* represents a genome at a point in time (often we imagine this as the "birth time" of the genome).
+It can be described by a tuple, `(id, time)`, where `id` is a unique integer,
+and `time` reflects the birth time of that `id`. When generating a tree sequence,
+this will be stored in a row of the {ref}`sec_node_table_definition`.
+
+A *diploid individual* is a group of two nodes. During simulation,
+a simple and efficient grouping assigns sequential pairs of node IDs to an individual.
+It can be helpful (but not strictly necessary) to store individuals within the tree sequence
+as rows of the the {ref}`sec_individual_table_definition` (a node can then be assigned to an
+individual by storing that individual's id in the appropriate row of the node table).
+
+An *edge* reflects a transmission event between nodes.  An edge is a tuple `(Left, Right, Parent, Child)`
+whose meaning is "Parent genome $P$ passed on the genomic interval $[L, R)$ to child genome $C$".
+In a tree sequence this is stored in a row of the {ref}`sec_edge_table_definition`.
+
+The *time*, in the discrete-time Wright-Fisher (WF) model which we will simulate, is measured in
+integer generations. To match the tskit notion of time, we record time in *generations ago*:
+i.e. for a simple simulation of G generations, we start the simulation at generation $G-1$
+and count down until we reach generation 0 (the current-day).
+
+The *population* consists of $N$ diploid individuals ($2N$ nodes) at a particular time $t$.
+At the start, the population will have no known ancestry, but subsequently,
+each individual will be formed by choosing (at random) two parent individuals from the population in the previous generation.
+
+
+## Approach
+
+We will generate edges, nodes, and individuals forwards in time, adding them to the relevant `tskit` tables.
+To aid efficiency, we will also see how to "simplify" the tables into the minimal set of nodes and edges that
+describe the history of the sample. Finally, these tables can be exported into an immutable
+{class}`TreeSequence` for storing or analysis.
+
+### Setup
+First, we'll import the necessary libraries and define some general parameters.
+The [numpy](https://numpy.org/doc/stable/) library will be used to produce random numbers.
+
+```{code-cell} ipython3
+import tskit
+import numpy as np
+
+random_seed = 123
+random = np.random.default_rng(random_seed)  # A random number generator for general use
+
+sequence_length = 50_000  # 50 Kb
+```
+
+## Simulation without recombination
+
+We'll start by simulating a small region of a larger genome, assuming the region is small enough that there is no recombination.
+First we describe how one of the child's genomes is created from the pair of genomes present in one of its parents.
+With no recombination, we can simply pick one the pair at random, and save the inheritance paths in the [edge table](edge-table).
+We can then call the same function for the genome inherited from the other parent
+
+```{code-cell} ipython3
+focal_region = [20_000, 21_000]
+def add_inheritance_paths(tables, parent_genomes, child_genome):
+    "Add inheritance paths from a randomly chosen parent genome to the child genome"
+    left, right = focal_region  # only define inheritance in this focal region
+    inherit_from = random.integers(2)  # randomly chose 0 or 1
+    tables.edges.add_row(left, right, parent_genomes[inherit_from], child_genome)
+```
+
+The population can be though of as set of numerical IDs corresponding to *individuals*. Each individual ID will also be associated with a pair of numbers corresponding to node (i.e. genome) IDs. For efficiency we keep this mapping in a Python dictionary, e.g. `{indivID_X:(nodeID_a, nodeID_b), indivID_Y:(nodeID_c, nodeID_d), ...}`. The required IDs are created by adding rows to the {meth}`individual<IndividualTable.add_row>` and {meth}`node<NodeTable.add_row>` tables.
+
+We will split the simulation into a few relatively simple functions. The only random element comes from chosing a pair of parents for each new child we want to create in the next generation, which we encapsulate in a function called `chose_parent_pairs()`
+
+```{code-cell} ipython3
+def chose_parent_pairs(population):
+    """
+    Return a list of randomly chosen pairs of individual IDs from a population.
+    The list is of the same length as the size of the population
+    """
+    ids = np.array([i for i in population.keys()], dtype=np.int32)
+    return [
+        random.choice(ids, 2, replace=True)  # To disallow selfing set replace=False
+        for _ in range(len(population))
+    ]
+
+def make_individuals(tables, num_individuals):
+    """
+    Return a list of new individual IDs, used when initialise a new population
+    """
+    return [tables.individuals.add_row()  for _ in range(num_individuals)]
+
+def make_individuals_from_pop(tables, prev_population):
+    """
+    Return a list of (individual ID, parent individual IDs) tuples,
+    where parent IDs have been chosen from an existing population.
+    """
+    return [
+        (tables.individuals.add_row(parents=parent_IDs), parent_IDs)
+        for parent_IDs in chose_parent_pairs(prev_population)
+    ]
+
+def make_genomes(tables, time, individual_id, ploidy=2):
+    """
+    Return a set of (usually 2) genomes associated with an individual
+    """
+    return [
+        tables.nodes.add_row(tskit.NODE_IS_SAMPLE, time, individual=individual_id)
+        for _ in range(ploidy)
+    ]
+    
+def init_population(tables, time, diploid_population_size) -> dict:
+    return {
+        new_ind: make_genomes(tables, time, new_ind)
+        for new_ind in make_individuals(tables, diploid_population_size)
+    }
+
+def next_population(tables, time, prev_population) -> dict:
+    population = {}
+    for new_ind, parent_inds in make_individuals_from_pop(tables, prev_population):
+        child_genomes = make_genomes(tables, time, new_ind)
+        population[new_ind] = child_genomes
+
+        # Now add the inheritance paths, so we can track the genetic genealogy
+        for child_genome, parent_individual in zip(child_genomes, parent_inds):
+            parent_genomes = prev_population[parent_individual]
+            add_inheritance_paths(tables, parent_genomes, child_genome)
+    return population
+```
+
+:::{note}
+For simplicity, the code above assumes any parent can be a mother or a father (i.e. this is a hermaphrodite species).
+It also allows the same parent to be chosed as a mother and as a father (i.e. "selfing" is allowed), which gives simpler theoretical results.
+This is easy to change if required.
+:::
+
+Our forward-in-time simulator simply involves repeatedly running the `next_population()` routine,
+replacing the old population with the new one. For efficiency reasons, `tskit` has strict requirements
+for the order of edges in the edge table, so we need to {meth}`~TableCollection.sort` the tables before we output the final tree sequence.
+
+```{code-cell} ipython3
+def simple_diploid_sim(num_diploids, generations, random_seed=123) -> tskit.TreeSequence:
+    random = np.random.default_rng(random_seed) 
+    tables = tskit.TableCollection(sequence_length)
+    tables.time_units = "generations"  # optional, but helpful when plotting
+
+    population = init_population(tables, generations, num_diploids)  # initial population
+    while generations > 0:
+        generations = generations - 1
+        population = next_population(tables, generations, population)
+
+    tables.sort()
+    return tables.tree_sequence()
+```
+
+Now we can test it for a single generation and a population size of 6, and print out the resulting tree sequence:
+
+```{code-cell} ipython3
+ts = simple_diploid_sim(6, generations=1)
+ts.draw_svg(y_axis=True, size=(500, 200))
+```
+
+It looks like it is working correctly: all 12 genomes (6 diploids) in the current generation at time=0, trace back to a 
+genome in the initial generation at time=1. Note that not all individuals in the initial generation have passed on genetic
+material at this genomic position (they appear as isolated nodes at the top of the plot).
+
+Now let's simulate for a longer time period, and set a few helpful plotting parameters.
+
+:::{note}
+By convention we plot the most recent generation at the bottom of the plot (i.e. perversely, each "tree" has leaves towards the bottom, and roots at the top)
 :::
+
+```{code-cell} ipython3
+ts = simple_diploid_sim(6, generations=20)
+
+graphics_params = {
+    "y_axis": True,
+    "y_label": f"Time ({ts.time_units} ago)",
+    #"y_ticks": {i: 'Current' if i==0 else str(i) for i in range(16)},
+}
+ts.draw_svg(size=(1200, 350), **graphics_params)
+```
+
+This is starting to look like a real genealogy! But you can see that there are a lot of potentially
+"redundant" lineages that have not made it to the current day.
+
+## Simplification
+
+The key to efficent forward-time genealogical simulation is the process of [simplification]((https://tskit.dev/tutorials/simplification.html)), which can reduce much of the complexity shown in the tree above. Typically, we want to remove all the lineages that do not contribute to the current day genomes. We do this via the :meth:`~tskit.TreeSequence.simplify` method, specifying that only the nodes in the current generation are "samples".
+
+```{code-cell} ipython3
+current_day_genomes = ts.samples(time=0)
+simplified_ts = ts.simplify(current_day_genomes, keep_unary=True, filter_nodes=False)
+simplified_ts.draw_svg(size=(600, 300), **graphics_params)
+```
+
+### ID changes
+
+We just simplified with `filter_nodes=False`, meaning that the tree sequence retained all nodes even after simplification. However, many nodes are not longer part of the genealogy; removing them means we can store fewer nodes (although it will change the node IDs).
+
+```{code-cell} ipython3
+simplified_ts = ts.simplify(current_day_genomes, keep_unary=True)
+simplified_ts.draw_svg(size=(600, 300), **graphics_params)
+```
+
+Note that the list of nodes passed to `simplify` (i.e. the current-day genomes)
+have become the first nodes in the table, numbered from 0..11,
+and the remaining nodes have been renumbered from youngest to oldest.
+
+### Extra node removal
+
+The `keep_unary=True` parameter meant that we kept intermediate ("unary") nodes, even those that do not not represent branch-points in the tree. Often these are also unneeded, and by default we remove those too; this will mean that the node IDs of older nodes will change again
+
+```{code-cell} ipython3
+simplified_ts = ts.simplify(current_day_genomes)
+simplified_ts.draw_svg(size=(400, 300), y_axis=True)
+```
+
+This is now looking much more like a "normal" genetic genealogy (a "gene tree"),
+in which all the sample genomes trace back to a single common ancestor.
+
+## Multiple roots
+
+If we run the simulation for fewer generations, we are not guaranteed to create
+genomes that share a common ancestor within the timeframe of our simulation. Even when
+simplified the result doesn't look quite like a normal "tree", as it contains several
+unlinked topologies. In `tskit` we call this a single tree with {ref}`multiple roots<sec_data_model_tree_roots>`.
+
+```{code-cell} ipython3
+ts = simple_diploid_sim(6, generations=5)
+simplified_ts = ts.simplify(ts.samples(time=0))
+simplified_ts.draw_svg(size=(700, 200), y_axis=True)
+```
+
+When a forward-simulated tree has multiple roots, it can be useful to retain relevant lineages
+all the way back to the start of the simulation. This can be done using the `keep_input_roots` option:
+
+```{code-cell} ipython3
+simplified_ts = ts.simplify(ts.samples(time=0), keep_input_roots=True)
+simplified_ts.draw_svg(size=(700, 200), y_axis=True)
+```
+
+## Recombination
+
+It is relatively easy to modify the simulation code to allow recombination. All we need to do is to redefine the `add_inheritance_paths()` function, so that the child inherits a mosaic of the two genomes present in each parent.
+
+Below is a redefined function which selects a set of "breakpoints" along the genome. It then allocates the first edge from zero to breakpoint 1 pointing it to one parent genome, and then allocates a second edge from breakpoint 1 onwards pointing to the other parent genome. If there is a second breakpoint, a third edge is created from breakpoint 2 to the next breakpoint that points back to the initial parent genome, and so forth, up to the end of the sequence. Biologically, recombination rates are such that they usually result in a relatively small number of breakpoints per chromosome (in humans, around 1 or 2).
+
+:::{note}
+Here we chose breakpoint positions in continuous space ("infinite breakpoint positions"), to match population genetic theory, although it is relatively easy to alter this to recombinations at integer positions
+:::
+
+
+```{code-cell} ipython3
+recombination_rate = 5e-7
+
+def add_inheritance_paths(tables, parent_genomes, child_genome):
+    "Add paths from parent genomes to the child genome, with crossover recombination"
+    L = tables.sequence_length
+    num_recombinations = random.poisson(recombination_rate * L)
+    breakpoints = random.uniform(0, L, size=num_recombinations)
+    breakpoints = np.concatenate(([0], np.unique(breakpoints), [L]))
+    inherit_from = random.integers(2)  # starting parental genome
+
+    # iterate over pairs of ([0, b1], [b1, b2], [b2, b3], ... [bN, L])
+    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
+        tables.edges.add_row(
+            left, right, parent_genomes[inherit_from], child_genome)
+        inherit_from = 1 - inherit_from  # switch to other parent genome
+
+
+# Simulate a few generations, for testing
+ts = simple_diploid_sim(6, generations=5)  # Now includes recombination
+ts  # Show the tree sequence
+```
+
+You can see that recombination has lead to more than one tree along the genome. Here's how the full (unsimplified) genealogy looks:
+
+```{code-cell} ipython3
+ts.draw_svg(size=(1000, 300), **graphics_params)
+```
+
+This is rather confusing to visualise, and will get even worse if we simulate more generations. However, even with more generations, the act of simplification allows us to to reduce the genealogy to something more managable, both for analysis and for visualization:
+
+```{code-cell} ipython3
+# Carry out for even more generations, but simplify the resulting ts
+ts = simple_diploid_sim(6, generations=50)
+simplified_ts = ts.simplify(ts.samples(time=0), keep_input_roots=True)
+graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50]
+simplified_ts.draw_svg(size=(1000, 300), time_scale="log_time", **graphics_params)
+```
+
+## Ensuring coalescence
+
+You can see that some of these strees still have multiple roots. In other words, 1000 generations is not long enough to capture the ancestry back to a single common ancestor (i.e. to ensure "full coalescence" of all local trees). If the local trees have not all coalesced, then the simulation will be failing to capture the entire genetic diversity within the sample. Moreover, the larger the populations, the longer the time needed to ensure that the full genealogy is captured. For large models, time period required for forward simulations to ensure full coalescence can be prohibitive.
+
+A powerful way to get around this problem is *recapitation*, in which an alternative technique, such as backward-in-time coalescent simulation is used to to fill in the "head" of the tree sequence. In other words, we use a fast backward-time simulator such as `msprime` to simulate the genealogy of the oldest nodes in the simplified tree sequence. To see how this is done, consult the [recapitation tutorial].
+
+## More complex forward-simulations
+
+The next tutorial shows the principles behind more complex simulations, e.g. including regular simplification during the simulation, adding mutations, and adding metadata. It also details several extra tips and tricks we have learned when building forward simulators.