Fill out the forward simulation tutorial

hyanwong · hyanwong · commit c3d5f279b0ac · 2023-09-28T18:47:12.000+01:00
See extensive discussion at tskit-dev#14
diff --git a/forward_sims.md b/forward_sims.md
@@ -11,26 +11,366 @@ kernelspec:
   name: python3
 ---
 
+```{currentmodule} tskit
+```
+
 (sec_tskit_forward_simulations)=
 
-# _Building a forward simulator_
+# Building a forward simulator
 
-% remove underscores in title when tutorial is complete or near-complete
+This tutorial shows how tskit can be used to build your own forward-time, tree sequence simulator from scratch.
+The simulator will use the discrete-time Wright-Fisher model, and track individuals
+along with their genomes, storing inherited genomic regions as well as the full pedigree.
 
-This tutorial shows how tskit can be used to
-build your own forwards-in-time tree sequence recording simulator from scratch.
+The code in this tutorial is broken into separate functions for clarity and
+to make it easier to modify for your own purposes; a simpler and substantially
+condensed forward-simulator is coded as a single function at the top of the
+{ref}`sec_completing_forwards_simulations` tutorial.
 
 :::{note}
 If you are simply trying to obtain a tree sequence which is
-the result of a forwards-in-time simulation, this can be done by using one of the
-highly capable forwards-in-time genetic simulators that already exist, such as
+the result of a forward-time simulation, this can be done by using one of the
+highly capable forward-time genetic simulators that already exist, such as
 [SLiM](https://messerlab.org/slim/) or [fwdpy11](https://github.com/molpopgen/fwdpy11).
 Documentation and tutorials for these tools exist on their respective websites. This
-tutorial focusses instead on illustrating the general principles that lie behind such
+tutorial is instead intended to illustrate the general principles that lie behind such
 simulators.
 :::
 
-:::{todo}
-Add details on building a forward simulator (see issue
-[#14](https://github.com/tskit-dev/tutorials/issues/14))
+We will focus on the case of diploids, in which each {ref}`individual<sec_nodes_or_individuals>` contains 2 genomes,
+but the concepts used here generalize to any ploidy, if you are willing to do the book-keeping.
+The individuals themselves are not strictly necessary for representing genetic genealogies
+(it's the genomes which are important), but they are needed during the simulation,
+and so we record them in the resulting output for completeness.
+
+## Definitions
+
+Before we can make any progress, we require a few definitions.
+
+A *node* represents a genome at a point in time (often we imagine this as the "birth time" of the genome).
+It can be described by a tuple, `(id, time)`, where `id` is a unique integer,
+and `time` reflects the birth time of that `id`. When generating a tree sequence,
+this will be stored in a row of the {ref}`sec_node_table_definition`.
+
+A *diploid individual* is a group of two nodes. During simulation,
+a simple and efficient grouping assigns sequential pairs of node IDs to an individual.
+It can be helpful (but not strictly necessary) to store individuals within the tree sequence
+as rows of the {ref}`sec_individual_table_definition` (a node can then be assigned to an
+individual by storing that individual's id in the appropriate row of the node table).
+
+An *edge* reflects a transmission event between nodes.  An edge is a tuple `(Left, Right, Parent, Child)`
+whose meaning is "Child genome $C$ inherited the genomic interval $[L, R)$ from Parent genome $P$".
+In a tree sequence this is stored in a row of the {ref}`sec_edge_table_definition`.
+
+The *time*, in the discrete-time Wright-Fisher (WF) model which we will simulate, is measured in
+integer generations. To match the tskit notion of time, we record time in *generations ago*:
+i.e. for a simple simulation of $G$ generations, we start the simulation at generation $G$
+and count down until we reach generation 0 (the current-day).
+
+The *population* consists of $N$ diploid individuals ($2N$ nodes) at a particular time $t$.
+At the start, the population will have no known ancestry, but subsequently
+each individual will be formed by choosing (at random) two parent individuals from the
+population in the previous generation.
+
+## Approach
+
+We will generate edges, nodes, and individuals forwards in time, adding them to the relevant `tskit` tables.
+To aid efficiency, we will also see how to "simplify" the tables into the minimal set of nodes and edges that
+describe the history of the sample. Finally, these tables can be exported into an immutable
+{class}`TreeSequence` for storing or analysis.
+
+### Setup
+First, we'll import the necessary libraries and define some general parameters.
+The [numpy](https://numpy.org/doc/stable/) library will be used to produce random numbers.
+
+```{code-cell} ipython3
+import tskit
+import numpy as np
+
+random_seed = 7
+random = np.random.default_rng(random_seed)  # A random number generator for general use
+
+sequence_length = 50_000  # 50 Kb
+```
+
+## Simulating a population
+
+Our simulated population can be thought of as a set of $N$ numerical IDs corresponding to *individuals*.
+Each individual ID will also be associated with a pair of numbers corresponding to node (i.e. genome) IDs.
+For efficiency we keep this mapping in a Python dictionary, e.g. `{indivID_x:(nodeID_a, nodeID_b), indivID_y:(nodeID_c, nodeID_d), ...}`.
+The required IDs are created when adding rows to the
+{meth}`individual<IndividualTable.add_row>` and {meth}`node<NodeTable.add_row>` tables.
+
+We will split the simulation into a few, relatively simple functions,
+defining how we choose parents and genomes for children in the next generation.
+The `chose_parents()` function below simply chooses $N$ pairs of parents at random.
+
+```{code-cell} ipython3
+def chose_parents(pop):
+    "Return a list of randomly chosen pairs of individual IDs from a population."
+    ids = np.array([i for i in pop.keys()], dtype=np.int32)  # array of individual IDs
+    return [random.choice(ids, 2, replace=True) for _ in range(len(pop))]
+
+def make_individuals(tables, num_individuals):
+    "Return a list of new individual IDs; used to initialise a new population."
+    return [tables.individuals.add_row() for _ in range(num_individuals)]
+
+def make_individuals_from_population(tables, pop):
+    "Return a list of (new ID, parent IDs) tuples of individual IDs"
+    # note that specifying parents in add_row() is optional & simply stores the pedigree
+    return [(tables.individuals.add_row(parents=ids), ids) for ids in chose_parents(pop)]
+
+flags = tskit.NODE_IS_SAMPLE  # during the simulation, flag up all new nodes as samples
+def make_nodes(tables, time, i, ploidy=2):
+    "Return (usually 2) new nodes: the genomes associated with an individual `i`"
+    return [tables.nodes.add_row(flags, time, individual=i) for _ in range(ploidy)]
+    
+def init_population(tables, time, N) -> dict:
+    "Return a new population of N individuals, each with 2 genomes"
+    return {i: make_nodes(tables, time, i) for i in make_individuals(tables, N)}
+
+def next_population(tables, time, prev_pop, recombination_rate) -> dict:
+    "As for `init_population`, but base the new population on a previous one"
+    pop = {}
+    for i, parents in make_individuals_from_population(tables, prev_pop):
+        child_nodes = make_nodes(tables, time, i)
+        pop[i] = child_nodes
+
+        # Now add the inheritance paths, so we can track the genetic genealogy
+        for child_node, parent in zip(child_nodes, parents):
+            parent_nodes = prev_pop[parent]
+            # the add_inheritance_paths() function below is yet to be defined
+            add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate)
+    return pop
+```
+
+:::{note}
+For simplicity, the code above assumes any parent can be a mother or a father (i.e. this is a hermaphrodite species).
+It also allows the same parent to be chosen as a mother and as a father (i.e. "selfing" is allowed),
+which gives simpler theoretical results. This is easy to change if required.
+:::
+
+Our forward-time simulator simply involves repeatedly running the `next_population()` routine,
+replacing the old population with the new one. For efficiency reasons, `tskit` has strict requirements
+for the order of edges in the edge table, so we need to {meth}`~TableCollection.sort` the tables before we output the final tree sequence.
+
+```{code-cell} ipython3
+def forward_WF_sim(num_diploids, generations, recombination_rate=0, random_seed=7):
+    """
+    Run a forward-time Wright Fisher simulation of a diploid population, returning
+    a tree sequence representing the genetic genealogy of the simulated genomes.
+    """
+    global random
+    random = np.random.default_rng(random_seed) 
+    tables = tskit.TableCollection(sequence_length)
+    tables.time_units = "generations"  # optional, but helpful when plotting
+
+    pop = init_population(tables, generations, num_diploids)  # initial population
+    while generations > 0:
+        generations = generations - 1
+        pop = next_population(tables, generations, pop, recombination_rate)
+
+    tables.sort()
+    return tables.tree_sequence()
+```
+
+### Inheritance without recombination
+
+The final piece of the simulation is to define the `add_inheritance_paths()` function,
+which saves the inheritance paths in the {ref}`sec_edge_table_definition`.
+For reference, the simplest case (a small focal region in which there is no recombination)
+can be coded as follows:
+
+```{code-cell} ipython3
+def add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate):
+    "Add inheritance paths from a randomly chosen parent genome to the child genome."
+    assert recombination_rate == 0
+    left, right = [20_000, 21_000]  # only define inheritance in this focal region
+    inherit_from = random.integers(2)  # randomly choose the 1st or the 2nd parent node
+    tables.edges.add_row(left, right, parent_nodes[inherit_from], child_node)
+```
+
+### Inheritance with recombination
+
+Recombination adds complexity to the inheritance paths from a child to its parents.
+The function below selects a set of "breakpoints" along the genome,
+and points the first edge (from zero to the first breakpoint) to one parental genome,
+the second edge (from the first to the second breakpoint) to the other parent genome,
+and so on up to the end of the sequence. More biologically realistic recombination
+models could be substituted into this function.
+
+Note that real recombination rates are usually such that they result in relatively
+few breakpoints per chromosome (in humans, around 1 or 2).
+
+```{code-cell} ipython3
+def add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate):
+    "Add paths from parent genomes to the child genome, with crossover recombination."
+    L = tables.sequence_length
+    num_recombinations = random.poisson(recombination_rate * L)
+    breakpoints = random.uniform(0, L, size=num_recombinations)
+    breakpoints = np.concatenate(([0], np.unique(breakpoints), [L]))
+    inherit_from = random.integers(2)  # starting parental genome
+
+    # iterate over pairs of ([0, b1], [b1, b2], [b2, b3], ... [bN, L])
+    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
+        tables.edges.add_row(
+            left, right, parent_genomes[inherit_from], child_genome)
+        inherit_from = 1 - inherit_from  # switch to other parent genome
+```
+
+:::{note}
+Above, breakpoint positions occur on a continuous line (i.e. "infinite breakpoint positions"),
+to match population genetic theory. It is relatively easy to alter this to
+allow recombinations only at integer positions.
+:::
+
+### Basic examples
+
+Now we can test the `forward_WF_sim()` function for a single generation with a small
+population size of 6 diploids, and print out the resulting tree sequence. For simplicity,
+we will set the recombination rate to zero for now.
+
+```{code-cell} ipython3
+ts = forward_WF_sim(6, generations=1, recombination_rate=0)
+ts.draw_svg(y_axis=True, size=(500, 200))
+```
+
+It looks like it is working correctly: all 12 genomes (6 diploids) in the current generation at time=0 trace back to a 
+genome in the initial generation at time=1. Note that not all individuals in the initial generation have passed on genetic
+material at this genomic position (they appear as isolated nodes at the top of the plot).
+
+Now let's simulate for a longer time period, and set a few helpful plotting parameters.
+
+:::{note}
+By convention we plot the most recent generation at the bottom of the plot
+(i.e. perversely, each "tree" has leaves towards the bottom, and roots at the top)
 :::
+
+```{code-cell} ipython3
+ts = forward_WF_sim(6, generations=20, recombination_rate=0)
+
+graphics_params = {
+    "y_axis": True,
+    "y_label": f"Time ({ts.time_units} ago)",
+    "y_ticks": {i: 'Current' if i==0 else str(i) for i in range(21)},
+}
+ts.draw_svg(size=(1200, 400), **graphics_params)
+```
+
+This is starting to look like a real genealogy! Clearly, however, there are many
+"extinct" lineages that have not made it to the current day.
+
+## Simplification
+
+The key to efficient forward-time genealogical simulation is the process of {ref}`sec_simplification`,
+which can reduce much of the complexity shown in the tree above.
+Typically, we want to remove all the lineages that do not contribute to the current day genomes.
+We do this via the {meth}`~tskit.TreeSequence.simplify` method, specifying that only the nodes
+in the current generation are "samples".
+
+```{code-cell} ipython3
+current_day_genomes = ts.samples(time=0)
+simplified_ts = ts.simplify(current_day_genomes, keep_unary=True, filter_nodes=False)
+simplified_ts.draw_svg(size=(600, 400), **graphics_params)
+```
+
+### Removing unreferenced nodes
+
+We just simplified with `filter_nodes=False`, meaning that the tree sequence retained
+all nodes after simplification, including those that are no longer part of
+the genealogy. By default (if `filter_nodes` is not specified), these nodes are removed,
+which changes the node IDs.
+
+```{code-cell} ipython3
+simplified_ts = ts.simplify(current_day_genomes, keep_unary=True)
+simplified_ts.draw_svg(size=(600, 300), **graphics_params)
+```
+
+You can see that the list of nodes passed to {meth}`~tskit.TreeSequence.simplify` (i.e. the current-day genomes)
+have become the first nodes in the table, numbered from 0..11;
+the remaining (internal) nodes have been renumbered from youngest to oldest.
+
+### Removing more nodes
+
+The `keep_unary=True` parameter meant that we kept intermediate ("unary") nodes,
+even those that do not represent branch-points in the tree.
+Often these are also unneeded, and by default we remove those too, although
+this will mean that we lose track of the pedigree of the individuals
+(which is stored in the parents column of the {ref}`sec_individual_table_definition`).
+Since we are removing more nodes, the node IDs of non-samples will again change. 
+
+```{code-cell} ipython3
+simplified_ts = ts.simplify(current_day_genomes)
+simplified_ts.draw_svg(size=(400, 300), y_axis=True)
+```
+
+This is now looking much more like a "normal" genetic genealogy (a "gene tree"),
+in which all the sample genomes trace back to a single common ancestor.
+
+## Recombination
+
+If we pass a non-zero recombination rate to the `forward_WF_sim()` function, different regions
+of the genome may have different ancestries. This results in multiple trees along the genome.
+
+```{code-cell} ipython3
+rho = 1e-7
+ts = forward_WF_sim(6, generations=50, recombination_rate=rho)
+print(f"A recombination rate of {rho} has created {ts.num_trees} trees over {ts.sequence_length} bp")
+```
+
+Here's how the full (unsimplified) genealogy looks:
+
+```{code-cell} ipython3
+graphics_params["y_ticks"] = [0, 10, 20, 30, 40, 50]
+ts.draw_svg(size=(1000, 400), time_scale="log_time", **graphics_params)
+```
+
+Because we are showing the extinct lineages and the recombinations within them, this plot is rather confusing.
+Again, the act of simplification allows us to reduce the genealogy to something more manageable,
+even with many generations. This is useful both for analysis and for visualization:
+
+```{code-cell} ipython3
+ts = forward_WF_sim(6, generations=100, recombination_rate=rho)
+simplified_ts = ts.simplify(ts.samples(time=0))
+graphics_params["y_ticks"] = [0, 10, 20, 30, 40, 50]
+simplified_ts.draw_svg(size=(1000, 300), time_scale="log_time", **graphics_params)
+```
+
+## Multiple roots
+
+If we run the simulation for fewer generations, or have a larger population size, 
+we are not guaranteed that the genomes will share a common ancestor within the timeframe of our simulation.
+Even with no recombination (so that all regions of the genome share the same pattern of
+inheritance) the "tree" will consist of multiple unlinked topologies. In `tskit` this is described as a
+tree with {ref}`multiple roots<sec_data_model_tree_roots>`.
+
+```{code-cell} ipython3
+ts = forward_WF_sim(10, generations=20)
+simplified_ts = ts.simplify(ts.samples(time=0))
+graphics_params["y_ticks"] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
+simplified_ts.draw_svg(size=(800, 200), **graphics_params)
+```
+
+When a forward-simulated tree has multiple roots, it can be useful to retain relevant lineages
+all the way back to the start of the simulation. This can be done using the `keep_input_roots` option:
+
+```{code-cell} ipython3
+simplified_ts = ts.simplify(ts.samples(time=0), keep_input_roots=True)
+simplified_ts.draw_svg(size=(800, 300), **graphics_params)
+```
+
+Since the trees have not all coalesced, the simulation fails to capture the full genetic diversity within the sample.
+Moreover, the larger the population, the longer the time needed to ensure that the full genealogy is captured.
+For large models, the time period required for forward simulations to ensure full coalescence can be prohibitive.
+
+A powerful way to get around this problem is *recapitation*, in which an alternative technique,
+such as backward-time coalescent simulation, is used to fill in the "head" of the tree sequence.
+In other words, we use a fast backward-time simulator such as `msprime` to simulate the genealogy of the oldest nodes in the simplified tree sequence.
+This is described in the {ref}`sec_completing_forwards_simulations` tutorial.
+
+## More complex forward-simulations
+
+The next tutorial shows the principles behind more complex simulations,
+including e.g. regular simplification during the simulation,
+adding mutations, and adding metadata.
+It also details several extra tips and tricks we have learned when building forward simulators.
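
The note in the diff above says that recombination can easily be restricted to integer positions. What follows is a minimal standalone sketch of one way that might look; it is not part of this commit. The helper names `integer_breakpoints()` and `inheritance_intervals()` are hypothetical, the sketch uses Python's standard `random` module rather than numpy, and it assumes that each junction between adjacent integer positions is an independent crossover site with probability equal to the recombination rate. Each `(left, right, parent)` tuple corresponds to one `tables.edges.add_row(left, right, parent_genomes[parent], child_genome)` call in the tutorial's `add_inheritance_paths()`.

```python
import random


def integer_breakpoints(sequence_length, recombination_rate, rng):
    """Hypothetical helper: crossover positions restricted to integers.

    Assumption: each junction at integer position 1 .. L-1 is an independent
    crossover site with probability `recombination_rate`.
    """
    L = int(sequence_length)
    return [x for x in range(1, L) if rng.random() < recombination_rate]


def inheritance_intervals(breakpoints, sequence_length, first_parent):
    """Return (left, right, parent_index) tuples covering [0, L).

    The parent index (0 or 1) switches to the other parental genome at
    every breakpoint, as in the tutorial's crossover model.
    """
    bounds = [0, *sorted(set(breakpoints)), sequence_length]
    intervals = []
    parent = first_parent
    for left, right in zip(bounds[:-1], bounds[1:]):
        intervals.append((left, right, parent))
        parent = 1 - parent  # switch to the other parental genome
    return intervals


rng = random.Random(7)  # seeded for reproducibility
bps = integer_breakpoints(50_000, 1e-4, rng)
for left, right, parent in inheritance_intervals(bps, 50_000, rng.randrange(2)):
    print(left, right, parent)
```

Keeping the interval construction separate from the random draw means the alternation logic can be checked deterministically, independent of the recombination model that is plugged in.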