diff --git a/forward_sims.md b/forward_sims.md index c90b186d..b0cef078 100644 --- a/forward_sims.md +++ b/forward_sims.md @@ -11,26 +11,476 @@ kernelspec: name: python3 --- +```{currentmodule} tskit +``` + (sec_tskit_forward_simulations)= -# _Building a forward simulator_ +# Building a forward simulator -% remove underscores in title when tutorial is complete or near-complete +This tutorial shows how tskit can be used to build your own forward-time, tree sequence simulator from scratch. +The simulator will use the discrete-time Wright-Fisher model, and track individuals +along with their genomes, storing inherited genomic regions as well as the full pedigree. -This tutorial shows how tskit can be used to -build your own forwards-in-time tree sequence recording simulator from scratch. +The code in this tutorial is broken into separate functions for clarity and +to make it easier to modify for your own purposes; a simpler and substantially +condensed forward-simulator is coded as a single function at the top of the +{ref}`sec_completing_forwards_simulations` tutorial. :::{note} If you are simply trying to obtain a tree sequence which is -the result of a forwards-in-time simulation, this can be done by using one of the -highly capable forwards-in-time genetic simulators that already exist, such as +the result of a forward-time simulation, this can be done by using one of the +highly capable forward-time genetic simulators that already exist, such as [SLiM](https://messerlab.org/slim/) or [fwdpy11](https://github.com/molpopgen/fwdpy11). Documentation and tutorials for these tools exist on their respective websites. This -tutorial focusses instead on illustrating the general principles that lie behind such +tutorial is instead intended to illustrate the general principles that lie behind such simulators. ::: -:::{todo} -Add details on building a forward simulator (see issue -[#14](https://github.com/tskit-dev/tutorials/issues/14)) +We will focus on the case of diploids, in which each {ref}`individual` contains 2 genomes, +but the concepts used here generalize to any ploidy, if you are willing to do the book-keeping. +The individuals themselves are not strictly necessary for representing genetic genealogies +(it's the genomes which are important), but they are needed during the simulation, +and so we record them in the resulting output for completeness. + +## Definitions + +Before we can make any progress, we require a few definitions. + +A *node* represents a genome at a point in time (often we imagine this as the "birth time" of the genome). +It can be described by a tuple, `(id, time)`, where `id` is a unique integer, +and `time` reflects the birth time of that `id`. When generating a tree sequence, +this will be stored in a row of the {ref}`sec_node_table_definition`. + +A *diploid individual* is a group of two nodes. During simulation, +a simple and efficient grouping assigns sequential pairs of node IDs to an individual. +It can be helpful (but not strictly necessary) to store individuals within the tree sequence +as rows of the the {ref}`sec_individual_table_definition` (a node can then be assigned to an +individual by storing that individual's id in the appropriate row of the node table). + +An *edge* reflects a transmission event between nodes. An edge is a tuple `(Left, Right, Parent, Child)` +whose meaning is "Child genome $C$ inherited the genomic interval $[L, R)$ from $Parent genome $P$". +In a tree sequence this is stored in a row of the {ref}`sec_edge_table_definition`. + +The *time*, in the discrete-time Wright-Fisher (WF) model which we will simulate, is measured in +integer generations. To match the tskit notion of time, we record time in *generations ago*: +i.e. for a simple simulation of G generations, we start the simulation at generation $G-1$ +and count down until we reach generation 0 (the current-day). + +The *population* consists of $N$ diploid individuals ($2N$ nodes) at a particular time $t$. +At the start, the population will have no known ancestry, but subsequently +each individual will be formed by choosing (at random) two parent individuals from the +population in the previous generation. + +## Approach + +We will generate edges, nodes, and individuals forwards in time, adding them to the relevant `tskit` tables. +To aid efficiency, we will also see how to "simplify" the tables into the minimal set of nodes and edges that +describe the history of the sample. Finally, these tables can be exported into an immutable +{class}`TreeSequence` for storing or analysis. + +### Setup +First, we'll import the necessary libraries and define some general parameters. +The [numpy](https://numpy.org/doc/stable/) library will be used to produce random numbers. + +```{code-cell} ipython3 +import tskit +import numpy as np + +random_seed = 2 +random = np.random.default_rng(random_seed) # A random number generator for general use + +L = 50_000 # The sequence length = 50 Kb +``` + +### Core steps + + The core simulation function generates a new population by repeating the following steps: + 1. Pick a pair of parent individuals at random from the population in the previous generation. + 2. Create a new child individual with two genomes (an individual ID will be created + by {meth}`adding a row to the individual table` + and two nodes IDs will be created by + {meth}`adding two rows to the node table`. If desired, + we can also provide the parent IDs when adding to the individual table, which + will result in storing the entire genealogical pedigree. + 3. Add the inheritance paths for each child genome using an `add_inheritance_paths()` + function (defined later). + + We will also define a simpler function, involving just step 2, to create the initial + population. + + For convenience, the population will be stored in a Python dictionary that maps + the individual ID to the IDs of its two genomes. Note that in lower-level + languages such as C, rather than use a mapping of individual ID to two genomes, + we could instead rely on the fact that genome IDs are allocated sequentially, and + pick a random pair of genomes from the previous population directly, + without using individual IDs. + + +```{code-cell} ipython3 +# For visualising unsimplified tree sequences, it can help to flag all nodes as samples +default_node_flags = tskit.NODE_IS_SAMPLE + +def make_diploid(tables, time, parent_individuals=None) -> tuple[int, tuple[int, int]]: + """ + Make an individual and its diploid genomes by adding to tables, returning the IDs. + Specifying parent_individuals is optional but results in the pedigree being stored. + """ + individual_id = tables.individuals.add_row(parents=parent_individuals) + return individual_id, ( + tables.nodes.add_row(default_node_flags, time, individual=individual_id), + tables.nodes.add_row(default_node_flags, time, individual=individual_id), + ) + +def new_population(tables, time, prev_pop, recombination_rate) -> dict[int, tuple[int, int]]: + pop = {} # fill with individual_ID: (maternal_genome_ID, paternal_genome_ID) + + # Cache the list of individual IDs in the previous population, for efficiency + prev_individuals = np.array([i for i in prev_pop.keys()], dtype=np.int32) + + for _ in range(len(prev_pop)): + # 1. Pick two individual parent IDs at random, `replace=True` allows selfing + mother_and_father = random.choice(prev_individuals, 2, replace=True) + + # 2. Get 1 new individual ID + 2 new node IDs + child_id, child_genomes = make_diploid(tables, time, mother_and_father) + pop[child_id] = child_genomes # store the genome IDs + + # 3. Add inheritance paths to both child genomes + for child_genome, parent_individual in zip(child_genomes, mother_and_father): + parent_genomes = prev_pop[parent_individual] + add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate) + return pop + +def initialise_population(tables, time, size) -> dict[int, tuple[int, int]]: + # Just return a dictionary by repeating step 2 above + return dict(make_diploid(tables, time) for _ in range(size)) +``` + +:::{note} +For simplicity, the code above assumes any parent can be a mother or a father (i.e. this is a hermaphrodite species). +It also allows the same parent to be chosen as a mother and as a father (i.e. "selfing" is allowed), +which gives simpler theoretical results. This is easy to change if required. +::: + +Our forward-time simulator simply involves repeatedly running the `next_population()` routine, +replacing the old population with the new one. For efficiency reasons, `tskit` has strict requirements +for the order of edges in the edge table, so we need to {meth}`~TableCollection.sort` the tables before we output the final tree sequence. + +```{code-cell} ipython3 +def forward_WF(num_diploids, seq_len, generations, recombination_rate=0, random_seed=2): + """ + Run a forward-time Wright Fisher simulation of a diploid population, returning + a tree sequence representing the genetic genealogy of the simulated genomes. + """ + global random + random = np.random.default_rng(random_seed) + tables = tskit.TableCollection(seq_len) + tables.time_units = "generations" # optional, but helpful when plotting + + pop = initialise_population(tables, generations, num_diploids) + while generations > 0: + generations = generations - 1 + pop = new_population(tables, generations, pop, recombination_rate) + + tables.sort() + return tables.tree_sequence() +``` + +### Inheritance without recombination + +The final piece of the simulation is to define the `add_inheritance_paths()` function, +which saves the inheritance paths in the {ref}`sec_edge_table_definition`. +For reference, the simplest case (a small focal region in which there is no recombination) +can be coded as follows: + +```{code-cell} ipython3 +def add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate): + "Add inheritance paths from a randomly chosen parent genome to the child genome." + assert recombination_rate == 0 + left, right = [20_000, 21_000] # only define inheritance in this focal region + inherit_from = random.integers(2) # randomly choose the 1st or the 2nd parent node + tables.edges.add_row(left, right, parent_nodes[inherit_from], child_node) +``` + +### Inheritance with recombination + +Recombination adds complexity to the inheritance paths from a child to its parents, +that the child inherits a mosaic of the two genomes present in each parent. + +The exact details of the mosaic will depend on the model of recombination you +wish to implement. For instance, a simple model such as that in the +{ref}`sec_completing_forwards_simulations` tutorial might assume exactly one +crossover per chromosome. A complex model might allow not just multiple +crossovers with e.g. recombination "hotspots", but also non-crossover events +such as {ref}`msprime:sec_ancestry_gene_conversion`. + +Below is a redefined `add_inheritance_paths()` function of intermediate complexity, +which models recombination as a uniform Poisson process along a genomic interval. +We generate a set of "breakpoints" along the genome, then allocates edges from the +child genome to one or the other genome from the parent, with left and right positions +reflecting the breakpoint positions. Note that real recombination rates are usually +such that they result in relatively +few breakpoints per chromosome (in humans, around 1 or 2). + +```{code-cell} ipython3 +def add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate): + "Add paths from parent genomes to the child genome, with crossover recombination." + L = tables.sequence_length + num_recombinations = random.poisson(recombination_rate * L) + breakpoints = random.uniform(0, L, size=num_recombinations) + breakpoints = np.concatenate(([0], np.unique(breakpoints), [L])) + inherit_from = random.integers(2) # starting parental genome + + # iterate over pairs of ([0, b1], [b1, b2], [b2, b3], ... [bN, L]) + for left, right in zip(breakpoints[:-1], breakpoints[1:]): + tables.edges.add_row( + left, right, parent_genomes[inherit_from], child_genome) + inherit_from = 1 - inherit_from # switch to other parent genome +``` + +:::{note} +Above, breakpoint positions occur on a continuous line (i.e. "infinite breakpoint positions"), +to match population genetic theory. It is relatively easy to alter this to +allow recombinations only at integer positions ::: + +### Basic examples + +Now we can test the `forward_WF()` function for a single generation with a small +population size of 6 diploids, and print out the resulting tree sequence. For simplicity, +we will omit recombination for now. + +```{code-cell} ipython3 +ts = forward_WF(6, L, generations=1) +ts.draw_svg(y_axis=True, size=(500, 200)) +``` + +It looks like it is working correctly: all 12 genomes (6 diploids) in the current generation at time=0 trace back to a +genome in the initial generation at time=1. Note that not all individuals in the initial generation have passed on genetic +material at this genomic position (they appear as isolated nodes at the top of the plot). + +Now let's simulate for a longer time period, and set a few helpful plotting parameters. + +:::{note} +By convention we plot the most recent generation at the bottom of the plot +(i.e. perversely, each "tree" has leaves towards the bottom, and roots at the top) +::: + +```{code-cell} ipython3 +ts = forward_WF(6, L, generations=20) + +graphics_params = { + "y_axis": True, + "y_label": f"Time ({ts.time_units} ago)", + "y_ticks": {i: 'Current' if i==0 else str(i) for i in range(21)}, +} +ts.draw_svg(size=(1200, 400), **graphics_params) +``` + +This is starting to look like a real genealogy! Clearly, however, there are many +"extinct" lineages that have not made it to the current day. + +(sec_tskit_forward_simulations_simplification)= + +## Simplification + +The key to efficent forward-time genealogical simulation is the process of {ref}`sec_simplification`, +which can reduce much of the complexity shown in the tree above. +Typically, we want to remove all the lineages that do not contribute to the current day genomes. +We do this via the {meth}`~tskit.TreeSequence.simplify` method, specifying that only the nodes +in the current generation are "samples". + +```{code-cell} ipython3 +current_day_genomes = ts.samples(time=0) +simplified_ts = ts.simplify(current_day_genomes, keep_unary=True, filter_nodes=False) +simplified_ts.draw_svg(size=(600, 400), **graphics_params) +``` + +(sec_tskit_forward_simulations_simplification_id_changes)= + +### ID changes + +We just simplified with `filter_nodes=False`, meaning that the tree sequence retained +all nodes even after simplification, even those that are no longer part of +the genealogy. By default (if `filter_nodes` is not specified), these nodes are removed, +which changes the node IDs. + +```{code-cell} ipython3 +simplified_ts = ts.simplify(current_day_genomes, keep_unary=True) +simplified_ts.draw_svg(size=(600, 300), **graphics_params) +``` + +You can see that the list of nodes passed to {meth}`~tskit.TreeSequence.simplify` (i.e. the current-day genomes) +have become the first nodes in the table, numbered from 0..11; +the remaining (internal) nodes have been renumbered from youngest to oldest. + +### Removing intermediate nodes + +The `keep_unary=True` parameter meant that we kept intermediate ("unary") nodes, +even those that do not not represent branch-points in the tree. +Often these are also unneeded, and by default we remove those too, although +this will mean that we lose track of the pedigree of the individuals +(which is stored in the parents column of the {ref}`sec_individual_table_definition`). +Since we are removing more nodes, the node IDs of non-samples will again change. + +```{code-cell} ipython3 +simplified_ts = ts.simplify(current_day_genomes) +simplified_ts.draw_svg(size=(400, 300), y_axis=True) +``` + +This is now looking much more like a "normal" genetic genealogy (a "gene tree"), +in which all the sample genomes trace back to a single common ancestor. + +## Recombination + +If we pass a non-zero recombination rate to the `forward_WF()` function, different regions +of the genome may have different ancestries. This results in multiple trees along the genome. + +```{code-cell} ipython3 +rho = 1e-7 +ts = forward_WF(6, L, generations=50, recombination_rate=rho) +print(f"A recombination rate of {rho} has created {ts.num_trees} trees over {ts.sequence_length} bp") +``` + +Here's how the full (unsimplified) genealogy looks (labels omitted for clarity): + +```{code-cell} ipython3 +graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50] +ts.draw_svg(size=(1000, 400), node_labels={}, symbol_size=2, **graphics_params) +``` + +Because we are showing the extinct lineages and the recombinations associated with them, +this plot has many trees and nodes, and is rather confusing. +However, {ref}`sec_tskit_forward_simulations_simplification` can be equally applied to a +recombinant genealogy, and will reduce the genealogy to something more managable for +analysis and visualization. + +```{code-cell} ipython3 +ts = forward_WF(6, L, generations=100, recombination_rate=rho) +simplified_ts = ts.simplify(ts.samples(time=0)) +graphics_params["y_ticks"] = [0, 10, 20, 30, 40 ,50] +simplified_ts.draw_svg(size=(1000, 300), **graphics_params) +``` + + +### Larger simulations + +So far we have only simulated a relatively small population size (12 genomes / 6 diploids) +for a short time. We can easily simulate for longer times, and increase the population size. +In this case, simplification +can have a dramatic effect on the disk storage and memory required: + +```{code-cell} +from datetime import datetime + +population_size = 100 +gens = 1000 # ten times the population size +L = 500_000 # ten times the sequence length + +start = datetime.now() +large_ts = forward_WF(population_size, L, gens, recombination_rate=rho) +print( + f"Simulated {population_size} individuals ({population_size * 2} genomes)", + f"for {gens} generations in {(datetime.now() - start).seconds:.1f} seconds", +) +``` + +```{code-cell} +print(f"Full tree sequence including dead lineages: {large_ts.nbytes/1024/1024:.2f} MB") +current_day_genomes = large_ts.samples(time=0) +simplified_ts = large_ts.simplify(current_day_genomes, keep_input_roots=True) +print( + f"Tree sequence of current-day individuals: {simplified_ts.nbytes/1024/1024:.2f} MB,", + f"{simplified_ts.num_trees} trees." +) +print( + "Simplification has reduced the size by a factor of", + f"{large_ts.nbytes / simplified_ts.nbytes:.2f}" +) +``` + +The most obvious improvement when simulating genealogies in forward time +is therefore to carry out regular simplification steps *during* the simulation, +rather than just at the end. This is +described in the next tutorial: + + +(sec_forward_simulations_ensuring_coalescence)= + +## Ensuring coalescence + +Even though the simulation above was run for hundreds of generations, there are still +trees with multiple roots in some regions of the genome. 1000 generations is not therefore +long enough to capture the ancestry back to a single common ancestor +(i.e. to ensure "full coalescence" of all local trees): + +```{code-cell} +from matplotlib import pyplot as plt +plt.figure(figsize=(7, 2)) +plt.stairs( + [tree.num_roots for tree in simplified_ts.trees()], + simplified_ts.breakpoints(as_array=True), + baseline=None, +) +plt.xlabel("Genome position (bp)") +plt.ylabel("Number of roots") +plt.show() +``` + +Most of our trees have one root, but a handful have two. +What is happening is that even though we simulated for $10N$ generations, +and the theoretical expectation for the time to the most recent common ancestor +is $E[tMRCA]=4N$, the variance is pretty big, such that some marginal +trees are not completely coalesced. Let's take a look at the distribution of +tMRCAs using `msprime` to carry out an equivalent backward-time simulation +(which is much faster, so we can replicate it, say, 1000 times). + +```{code-cell} +import msprime + +tmrca=[] +sequence_length = 50_000 +rho = 5e-7 +for sim in msprime.sim_ancestry( + 100, + sequence_length=sequence_length, + recombination_rate=rho, + random_seed=42, + num_replicates=1000, +): + for tree in sim.trees(): + tmrca.append(tree.time(tree.root)) + +plt.figure(figsize=(7, 3)) +plt.hist(tmrca, bins=30, density=True) +plt.xlabel("tMRCA (units of N generations)") +plt.ylabel("Proportion of trees") +plt.show() +``` + +So, some fraction of local trees have very large tMRCAs! And of course +the larger the population, the longer the time needed to ensure full coalescence. +Hence for large forward simulation models, the number of generations required +to ensure full coalescence can be prohibitive. + +Uncoalesced local trees matter, because they result in some pairs of +samples which are not connected, and so the genetic distance betweem +them is undefined (another way to imagine this is that we cannot spot +mutations that occur above the local roots of the tree). Hence the +simulation is not capturing the full genetic diversity within the sample. + +A powerful way to get around this problem is *recapitation*, +in which an alternative technique, such as backward-in-time coalescent simulation, +is used to to fill in the "head" of the tree sequence. In other words, +we can use a fast backward-time simulator such as `msprime` to simulate the +prior genealogy of the oldest nodes in the simplified tree sequence. +Details are described in the {ref}`sec_completing_forwards_simulations` +tutorial. + +## More complex forward-simulations + +The next tutorial shows the principles behind more complex simulations, +including e.g. regular simplification during the simulation, +adding mutations, and adding metadata. +It also details several extra tips and tricks we have learned when building forward simulators. diff --git a/more_forward_sims.md b/more_forward_sims.md new file mode 100644 index 00000000..932c63c1 --- /dev/null +++ b/more_forward_sims.md @@ -0,0 +1,245 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.12 + jupytext_version: 1.9.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +```{currentmodule} tskit +``` + +(sec_tskit_more_forward_simulations)= + +# _Advanced forward simulations_ + +% remove underscores in title when tutorial is complete or near-complete + +:::{todo} +Add further details on building a forward simulator +(see issue [#14](https://github.com/tskit-dev/tutorials/issues/14)) +::: + +In the {ref}`previous tutorial`, we developed the following +basic forward-time Wright-Fisher (WF) simulator (refer back to that tutorial for a +detailed run through of the code): + +```{code-cell} +import tskit +import numpy as np + +random_seed = 6 +random = np.random.default_rng(random_seed) # A random number generator for general use + +L = 50_000 # The sequence length: 50 Kb + +def add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate): + "Add paths from parent genomes to the child genome, with crossover recombination" + L = tables.sequence_length + num_recombinations = random.poisson(recombination_rate * L) + breakpoints = random.integers(0, L - 1, size=num_recombinations) + break_pos, counts = np.unique(breakpoints, return_counts=True) + crossovers = break_pos[counts % 2 == 1] # no crossover if e.g. 2 breaks at same pos + left_positions = np.insert(crossovers, 0, 0) + right_positions = np.append(crossovers, L) + + inherit_from = random.integers(2) + for left, right in zip(left_positions, right_positions): + tables.edges.add_row( + left, right, parent_genomes[inherit_from], child_genome) + inherit_from = 1 - inherit_from # switch to other parent genome + +def make_diploid(tables, time, parent_individuals=None): + individual_id = tables.individuals.add_row(parents=parent_individuals) + return individual_id, ( + tables.nodes.add_row(time=time, individual=individual_id), + tables.nodes.add_row(time=time, individual=individual_id), + ) + +def new_population(tables, time, prev_pop, recombination_rate): + pop = {} + prev_individuals = np.array([i for i in prev_pop.keys()], dtype=np.int32) + for _ in range(len(prev_pop)): + mother_and_father = random.choice(prev_individuals, 2, replace=True) + child_id, child_genomes = make_diploid(tables, time, mother_and_father) + pop[child_id] = child_genomes # store the genome IDs + for child_genome, parent_individual in zip(child_genomes, mother_and_father): + parent_genomes = prev_pop[parent_individual] + add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate) + return pop + +def initialise_population(tables, time, size) -> dict: + return dict(make_diploid(tables, time) for _ in range(size)) + +def forward_WF(num_diploids, seq_len, generations, recombination_rate=0, random_seed=7): + global random + random = np.random.default_rng(random_seed) + tables = tskit.TableCollection(seq_len) + tables.time_units = "generations" + + pop = initialise_population(tables, generations, num_diploids) + while generations > 0: + generations = generations - 1 + pop = new_population(tables, generations, pop, recombination_rate) + + tables.sort() + return tables.tree_sequence() +``` + +## Repeated simplification + +We will now implement the WF simulation above but with simplification happening at +repeated intervals. It is helpful to think of this as regular "garbage collection", +as what we're really doing is getting rid of extinct lineages while also "trimming" +extant lineages down to a minimal representation. + +:::{caution} +Regular garbage collection forces us to reckon with the fact that simplification +{ref}`changes the node IDs `. +We therefore need to remap any node (and individual) IDs that are used outside of +`tskit`. In the implementation described here, those IDs are stored in the `pop` +variable. +::: + +```{code-cell} +def simplify_tables(tables, samples, pop) -> dict[int, tuple[int, int]]: + """ + Simplify the tables with respect to the given samples, returning a + population dict in which individual and nodes are remapped. + """ + tables.sort() + node_map = tables.simplify(samples, keep_input_roots=True) + + nodes_individual = tables.nodes.individual + remapped_pop = {} + for node1, node2 in pop.values(): + node1, node2 = node_map[[node1, node2]] # remap + assert nodes_individual[node1] == nodes_individual[node2] # sanity check + remapped_pop[nodes_individual[node1]] = (node1, node2) + return remapped_pop + + +def forward_WF( + num_diploids, + seq_len, + generations, + recombination_rate=None, + simplification_interval=None, # default to simplifying only at end + show=None, +): + tables = tskit.TableCollection(seq_len) + tables.time_units = "generations" # optional, but helpful when plotting + if simplification_interval is None: + simplification_interval = generations + simplify_mod = generations % simplification_interval + + pop = initialise_population(tables, generations, num_diploids) + while generations > 0: + generations = generations - 1 + pop = new_population(tables, generations, pop, recombination_rate) + if generations > 0 and generations % simplification_interval == simplify_mod: + current_nodes = [u for nodes in pop.values() for u in nodes] + pop = simplify_tables(tables, current_nodes, pop) + if show: + print("Simplified", generations, "generations before end") + + pop = simplify_tables(tables, [u for nodes in pop.values() for u in nodes], pop) + if show: + print("Final simplification") + return tables.tree_sequence() + +ts = forward_WF(6, L, generations=100, simplification_interval=25, show=True) +ts.draw_svg(size=(800, 200)) +``` + +### Invariance to simplification interval +A critical concept to keep in mind is that the simulation itself is the only random component. +The simplification algorithm is deterministic given a set of (nodes, edges) satisfying +`tskit`'s sorting requirements. Therefore, the results of our new `forward_WF()` function +must be the same for all simplification intervals + +:::{note} +This invariance property only holds in some cases. +We discuss this in more detail below when we add in mutation. +::: + +```{code-cell} +ts = diploid_sim(10, 500, simplification_interval=1, random_seed=42) + +# Iterate over a range of odd and even simplification intervals. +print("Testing invariance to simplification interval") +test_intervals = list(range(2, 500, 33)) +for i in test_intervals: + # Make sure each new sim starts with same random seed! + ts_test = diploid_sim(10, 500, simplification_interval=i, random_seed=42) + assert ts.equals(ts_test, ignore_provenance=True) +print(f"Intervals {test_intervals} passed") +``` + +:::{tip} +Testing your own code using loops like the one above is a very +good way to identify subtle bugs in book-keeping. +::: + +### Summary + +* Simplifying during a simulation changes IDs in the tree sequence tables, so we need to remap +entities that store any of these IDs between generations. +* Our code to carry out simplification gets called both during the simulation and at the end. +It's therefore worth encapsulating it into a class or function for easier code re-use and testing. + +#### Technical notes + +We have found that it is possible to write a simulation where the results differ +by simplification interval, but appear correct in distribution. +By this we mean that looking at distributions of numbers of mutations, their frequencies, etc., +match predictions from analytical theory. However, our experience is that such simulations +contain bugs and that the summaries being used for testing are too crude to catch them. +For example, they may affect the variance in a subtle way that would require millions +of simulations to catch. Often what is going on is that parent/offspring relationships +are not being properly recorded, resulting in lineages that either persist too long or +not long enough. (In other words, the variance in offspring number per diploid is no +longer what it should be, meaning you've changed the effective population size.) +Thus, please make sure you get the **same** `tskit` tables out of a simulation for +any simplification interval. + + +## Mutations + +In this section, we will add mutation to our simulation. Mutations will occur according to the +infinitely-many sites model, which means that a new mutation cannot arise at a currently-mutated +position. $\theta = 4N\mu$ is the scaled mutation rate, and is equal to twice the expected number +of new mutations per generation. The parameter $\mu$ is the expected number of new mutations +per gamete, per generation. Mutation positions will be uniformly distributed along the genome. + +Adding mutations changes the complexity of the simulation quite a bit, because now we must +add to and simplify [site tables](sec_edge_table_definition) and +[mutation tables](sec_mutation_table_definition) instances. We might also +want to add *metadata* to the sites or mutations, recording details such as +the selection coefficient of a mutation, or the type of mutation (e.g., synonymous vs. non-synonymous). + +We will write a mutation function here which we will re-use in future examples. + +:::{note} +We will be treating mutations as neutral. Doing so is odd, as one big +selling point of `tskit` is the ability to skip the tracking of neutral mutations +in forward simulations. However, tracking neutral mutations plus metadata is the +same as tracking selected mutations and their metadata, and being able to do neat +things like put your selected mutations onto a figure of the genealogy +is one of several possible use cases. +::: + +:::{todo} +The rest of this tutorial is still under construction, and needs porting from +[this workbook](https://github.com/tskit-dev/tutorials/blob/main/old-content/notebooks/wfforward.ipynb). +This will primarily deal with sites and mutations (and mutational metadata). +We could also include details on selection, if that seems sensible. + +The section in that workbook on "Starting with a prior history" should be put in +the {ref}`sec_completing_forwards_simulations` tutorial. +::: \ No newline at end of file