Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of a simplification tute. #188

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 92 additions & 1 deletion simplification.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,100 @@ def create_notebook_data():

(sec_simplification)=

# _Simplification_
# Simplification
% remove underscores in title when tutorial is complete or near-complete

**Georgia Tsambos**

:::{todo}
Create content. See https://github.com/tskit-dev/tutorials/issues/52
:::

Consider two of the most common 'subset'-type operations we might want to perform on genomic datasets:

- Look at the data for a subset of the samples in the original dataset.
- Look at sequence or variant information at the specific sites that vary within that subsample.

`simplify` is the tree sequence version of these operations,
but it is also more flexible than this.
Essentially, `simplify` allows you to prune away certain types of information in the tree sequence that may be irrelevant in your particular application.

```{code-cell}
import tskit
import msprime
```

## An example dataset

To demonstrate `simplify` in action, we'll simulate a scenario involving three modern-day populations, `SMALL`, `BIG` and `ADMIX`, and one ancestral population `ANC`:

```{code-cell}
demography = msprime.Demography()
demography.add_population(name="SMALL", initial_size=2000)
demography.add_population(name="BIG", initial_size=5000)
demography.add_population(name="ADMIX", initial_size=2000)
demography.add_population(name="ANC", initial_size=5000)
demography.add_admixture(
time=100, derived="ADMIX", ancestral=["SMALL", "BIG"],
proportions=[0.5, 0.5])
demography.add_population_split(
time=1000, derived=["SMALL", "BIG"], ancestral="ANC")
```

Our simulated tree sequence will contain genomic information for 100 diploid individuals from each of the three contemporary populations:

```{code-cell}
ts = msprime.sim_ancestry(
samples={"SMALL": 100, "BIG": 100, "ADMIX" : 100},
demography=demography,
random_seed=2432,
sequence_length=5e7,
recombination_rate=1e-8
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=6151)
```
Before moving on, have a quick look at the number of elements and overall size of this tree sequence. We'll see that applying `simplify` will reduce many of these.

```{code-cell}
ts
```

## The basic syntax

At minimum, `simplify` requires a list of sample IDs that you wish to include in the new, smaller tree sequence.
Suppose we wanted to create a tree sequence holding the coalescent history of only those samples from population 'ADMIX' (which has a population label of 2):

```{code-cell}
tss = ts.simplify(ts.samples(2))
tss
```

We now have a smaller tree sequence holding just those 200 sample chromosomes from population 'ADMIX'.

Although there are fewer edges and nodes in this newer, simplified tree sequence, and the total file size is smaller,
it's not a drastic difference.
For instance, we have reduced the total number of nodes by less than half (from 48663 to 30390),
even though we are looking at just a third of our original samples.
This demonstrates the sub-linearity and efficiency of tree sequence structures.

Note that the numbers of sites and mutations are also reduced.
This is because `simplify` has removed all the mutations on the edges that were pruned away, and all of the corresponding sites (we'll see how to make `simplify` behave differently later on).
The only mutations that remain in this smaller tree sequence are those
that are inherited by *some, but not all* of the admixed samples.
That is, the mutations that produced *variation* within this subsample.

## When is `simplify` most useful?

Suppose that there are certain calculations that we only wish to perform on the subsample of admixed genomes.

:::{todo}
Think about this a bit more. Main overall 'applications' are (1) making tree sequences nicer and more regular ('simpler'), (2) pruning away irrelevant stuff so that other operations run more quickly/efficiently, and (3) forward-time sims (though this is a specific case of (2)). How much quicker/more efficient depends on the scaling of the operations -- probably a big difference for something like ibd_segments, less for tree stats stuff.
:::

## Keeping unary nodes

## Keeping invariant sites