tskit-dev · gtsambos · Jul 20, 2022
diff --git a/simplification.md b/simplification.md
@@ -24,9 +24,100 @@ def create_notebook_data():
 
 (sec_simplification)=
 
-# _Simplification_
+# Simplification
 % remove underscores in title when tutorial is complete or near-complete
 
+**Georgia Tsambos**
+
 :::{todo}
 Create content. See https://github.com/tskit-dev/tutorials/issues/52
 :::
+
+Consider two of the most common 'subset'-type operations we might want to perform on genomic datasets:
+
+ - Look at the data for a subset of the samples in the original dataset.
+ - Look at sequence or variant information at the specific sites that vary within that subsample.
+
+`simplify` is the tree sequence equivalent of these operations,
+but it is also more flexible.
+More generally, `simplify` lets you prune away information in the tree sequence that is irrelevant to your particular application.
+
+```{code-cell}
+import tskit
+import msprime
+```
+
+## An example dataset
+
+To demonstrate `simplify` in action, we'll simulate a scenario involving three modern-day populations, `SMALL`, `BIG` and `ADMIX`, and one ancestral population `ANC`:
+
+```{code-cell}
+demography = msprime.Demography()
+demography.add_population(name="SMALL", initial_size=2000)
+demography.add_population(name="BIG", initial_size=5000)
+demography.add_population(name="ADMIX", initial_size=2000)
+demography.add_population(name="ANC", initial_size=5000)
+demography.add_admixture(
+    time=100, derived="ADMIX", ancestral=["SMALL", "BIG"],
+    proportions=[0.5, 0.5])
+demography.add_population_split(
+    time=1000, derived=["SMALL", "BIG"], ancestral="ANC")
+```
+
+Our simulated tree sequence will contain genomic information for 100 diploid individuals from each of the three contemporary populations:
+
+```{code-cell}
+ts = msprime.sim_ancestry(
+    samples={"SMALL": 100, "BIG": 100, "ADMIX": 100},
+    demography=demography,
+    random_seed=2432,
+    sequence_length=5e7,
+    recombination_rate=1e-8
+)
+ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=6151)
+```
+
+Before moving on, have a quick look at the number of elements and the overall size of this tree sequence. Applying `simplify` will reduce many of these numbers.
+
+```{code-cell}
+ts
+```
+
+## The basic syntax
+
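+In its simplest form, `simplify` takes a list of the sample node IDs that you wish to keep in the new, smaller tree sequence (if you omit the list, all existing samples are kept).
+Suppose we want to create a tree sequence holding the coalescent history of only those samples from population 'ADMIX' (which has population ID 2, because populations are numbered from 0 in the order they were added to the demography):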
+
+```{code-cell}
+tss = ts.simplify(ts.samples(population=2))
+tss
+```
+
+We now have a smaller tree sequence holding just those 200 sample chromosomes from population 'ADMIX'.
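+
+If you would rather not hard-code the numeric population ID, one option is to look the population up by name first.
+The sketch below is a stand-alone illustration under some assumptions: the toy demography, the variable names and the `population_id` helper are all made up for this example (the helper is not part of `tskit`), and it relies on `msprime` storing each population's name under the `"name"` key of its metadata.
+
+```{code-cell}
+import msprime
+
+# Toy demography (hypothetical, and much smaller than the one above).
+toy_demography = msprime.Demography()
+toy_demography.add_population(name="A", initial_size=1000)
+toy_demography.add_population(name="B", initial_size=1000)
+toy_demography.add_population(name="AB", initial_size=1000)
+toy_demography.add_population_split(time=500, derived=["A", "B"], ancestral="AB")
+toy_ts = msprime.sim_ancestry(
+    samples={"A": 5, "B": 5},
+    demography=toy_demography,
+    sequence_length=1e5,
+    recombination_rate=1e-8,
+    random_seed=1,
+)
+
+
+def population_id(ts, name):
+    """Hypothetical helper: return the ID of the population called `name`."""
+    for population in ts.populations():
+        if population.metadata["name"] == name:
+            return population.id
+    raise ValueError(f"no population named {name!r}")
+
+
+# Keep only the sample nodes from population "B".
+b_samples = toy_ts.samples(population=population_id(toy_ts, "B"))
+toy_tss = toy_ts.simplify(b_samples)
+print(toy_ts.num_samples, "->", toy_tss.num_samples)
+```
+
+Note that node IDs change during simplification: the retained samples are renumbered from 0 in the simplified tree sequence, in the order in which they were passed to `simplify`.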
+
+Although the simplified tree sequence has fewer nodes and edges and a smaller total file size,
+the difference is not drastic.
+For instance, the total number of nodes has fallen by less than half (from 48663 to 30390),
+even though we have kept just a third of our original samples.
+This is because samples share much of their ancestry: many ancestral nodes are needed no matter which samples we keep, so the size of a tree sequence grows sub-linearly with the number of samples.
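+
+To put numbers on this, the short calculation below compares the fraction of samples kept with the fraction of nodes kept. It is plain arithmetic on the node counts quoted above and the sample counts implied by the simulation (600 sample nodes in total, 200 of them from 'ADMIX'); the variable names are just for illustration.
+
+```{code-cell}
+# Counts taken from the text above.
+nodes_before, nodes_after = 48663, 30390
+samples_before, samples_after = 600, 200
+
+sample_fraction = samples_after / samples_before
+node_fraction = nodes_after / nodes_before
+
+print(f"samples kept: {sample_fraction:.1%}")
+print(f"nodes kept:   {node_fraction:.1%}")
+```
+
+Keeping one third of the samples retains roughly 62% of the nodes.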
+
+Note that the numbers of sites and mutations are also reduced.
+This is because `simplify` has removed all the mutations on the edges that were pruned away, and then all of the sites that were left without any mutations (we'll see how to make `simplify` behave differently later on).
+The mutations that remain in this smaller tree sequence are those
+that are inherited by at least one of the admixed samples.
+Most of these are carried by *some, but not all* of the admixed samples, and so produce *variation* within this subsample; a mutation inherited by every admixed sample is still ancestral to the samples, so by default it is retained rather than removed.
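+
+The default treatment of sites can be checked on a small stand-alone example. The sketch below uses a toy single-population simulation (the parameters, seeds and variable names are arbitrary choices for illustration) and compares the default behaviour with `filter_sites=False`, the `simplify` argument that keeps sites even when no mutations remain at them.
+
+```{code-cell}
+import msprime
+
+# Toy simulation: 20 diploid individuals, i.e. 40 sample nodes.
+toy_ts = msprime.sim_ancestry(
+    samples=20,
+    population_size=1000,
+    sequence_length=1e5,
+    recombination_rate=1e-8,
+    random_seed=3,
+)
+toy_ts = msprime.sim_mutations(toy_ts, rate=1e-7, random_seed=4)
+
+subset = toy_ts.samples()[:6]  # an arbitrary handful of sample nodes
+
+default = toy_ts.simplify(subset)
+all_sites = toy_ts.simplify(subset, filter_sites=False)
+
+print("sites:    ", toy_ts.num_sites, default.num_sites, all_sites.num_sites)
+print("mutations:", toy_ts.num_mutations, default.num_mutations, all_sites.num_mutations)
+```
+
+With `filter_sites=False` the site count matches the original tree sequence, while the number of mutations is the same as in the default output: only the handling of emptied sites differs.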
+
+## When is `simplify` most useful?
+
+Suppose that there are certain calculations that we only wish to perform on the subsample of admixed genomes.
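+
+As a minimal sketch of this situation, the example below computes branch-mode genetic diversity for a subset of samples in two ways: on the full tree sequence, restricting the statistic with `sample_sets`, and on a tree sequence that has first been simplified down to that subset. The toy simulation, its parameters and the variable names are assumptions made for illustration only.
+
+```{code-cell}
+import math
+
+import msprime
+
+toy_ts = msprime.sim_ancestry(
+    samples=20,
+    population_size=1000,
+    sequence_length=1e5,
+    recombination_rate=1e-8,
+    random_seed=5,
+)
+subset = toy_ts.samples()[:10]
+toy_tss = toy_ts.simplify(subset)
+
+# Same statistic, computed on the full and on the simplified tree sequence.
+full = float(toy_ts.diversity(sample_sets=subset, mode="branch"))
+simplified = float(toy_tss.diversity(mode="branch"))
+
+print(full, simplified, math.isclose(full, simplified, rel_tol=1e-6))
+```
+
+Both routes describe the same genealogy for the same samples, so the two values should agree up to floating point error; the simplified tree sequence simply has fewer nodes and edges to work through.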
+
+:::{todo}
+Think about this a bit more. Main overall 'applications' are (1) making tree sequences nicer and more regular ('simpler'), (2) pruning away irrelevant stuff so that other operations run more quickly/efficiently, and (3) forward-time sims (though this is a specific case of (2)). How much quicker/more efficient depends on the scaling of the operations -- probably a big difference for something like ibd_segments, less for tree stats stuff.
+:::
+
+## Keeping unary nodes
+
+## Keeping invariant sites
+
+
+
+