From 22dd1a56929159f5684fa3e21b7b7d78ee8fc91f Mon Sep 17 00:00:00 2001 From: Gregor Gorjanc Date: Fri, 20 Sep 2024 07:41:36 +0100 Subject: [PATCH] Minor formatting what_is.md, terminology_and_concepts.md, getting_started.md --- getting_started.md | 81 +++++++++++++++++++++---------------- terminology_and_concepts.md | 58 ++++++++++++-------------- what_is.md | 14 +++---- 3 files changed, 80 insertions(+), 73 deletions(-) diff --git a/getting_started.md b/getting_started.md index 672903ab..0c25bcd8 100644 --- a/getting_started.md +++ b/getting_started.md @@ -18,20 +18,21 @@ kernelspec: # Getting started with {program}`tskit` You've run some simulations or inference methods, and you now have a -{class}`TreeSequence` object; what now? This tutorial is aimed +{class}`TreeSequence` object; what now? This tutorial is aimed for users who are new to {program}`tskit` and would like to get some basic tasks completed. We'll look at five fundamental things you might -need to do: +want to do: {ref}`process trees`, -{ref}`sites & mutations`, and -{ref}`genotypes`, +{ref}`process sites & mutations`, +{ref}`extract genotypes`, {ref}`compute statistics`, and {ref}`save or export data`. Throughout, we'll also provide pointers to where you can learn more. + :::{note} The examples in this -tutorial are all written using the {ref}`sec_python_api`, but it's also possible to +tutorial are all written using the {ref}`tskit:sec_python_api`, but it's also possible to {ref}`use R `, or access the API in other languages, notably {ref}`C` and [Rust](https://github.com/tskit-dev/tskit-rust). ::: @@ -43,7 +44,6 @@ twenty diploid individuals. To make it a bit more interesting, we'll simulate th of a {ref}`selective sweep` in the middle of the chromosome, then throw some neutral mutations onto the resulting tree sequence. - ```{code-cell} ipython3 import msprime @@ -61,7 +61,7 @@ ts = msprime.sim_ancestry( recombination_rate=1e-8, random_seed=1234, # only needed for repeatabilty ) -# Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs +# Optionally add finite-site mutations to the ts using the default Jukes & Cantor model, creating SNPs ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321) ts ``` @@ -250,13 +250,16 @@ for nmuts, count in enumerate(np.bincount(num_muts)): (sec_processing_genotypes)= -## Processing genotypes +## Extracting genotypes At each site, the sample nodes will have a particular allelic state (or be flagged as {ref}`tskit:sec_data_model_missing_data`). The -{meth}`TreeSequence.variants` method gives access to the -full variation data. For efficiency, the {attr}`~Variant.genotypes` -at a site are returned as a [numpy](https://numpy.org) array of integers: +{meth}`TreeSequence.variants` method gives access to +full variation data - possible allelic states and node genotypes at each site. +Since nodes are by definition haploid, their genotype at a site is their allelic state +at the site. For efficiency, its attribute {attr}`~Variant.genotypes` +at a site returns a [numpy](https://numpy.org) array of integers +representing allelic states: ```{code-cell} ipython3 import numpy as np @@ -275,11 +278,9 @@ Tree sequences are optimised to look at all samples at one site, then all sample adjacent site, and so on along the genome. It is much less efficient look at all the sites for a single sample, then all the sites for the next sample, etc. In other words, you should generally iterate over sites, not samples. Nevertheless, all the alleles for -a single sample can be obtained via the -{meth}`TreeSequence.haplotypes` method. +a single sample can be obtained via the {meth}`TreeSequence.haplotypes` method. ::: - To find the actual allelic states at a site, you can refer to the {attr}`~Variant.alleles` provided for each {class}`Variant`: the genotype value is an index into this list. Here's one way to print them out; for @@ -287,8 +288,6 @@ clarity this example also prints out the IDs of both the sample nodes (i.e. the and the diploid {ref}`individuals ` in which each sample node resides. - - ````{code-cell} ipython3 samp_ids = ts.samples() print(" ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids])) @@ -309,7 +308,6 @@ such as indels, leading to allelic states which need not be one of these 4 lette even be a single letter. ::: - (sec_tskit_getting_started_compute_statistics)= ## Computing statistics @@ -353,17 +351,17 @@ is the spectrum for a section of the tree sequence between 5 and 5.5Mb, which we created by deleting the regions outside that interval using {meth}`TreeSequence.keep_intervals`. Unsurprisingly, as we noted when looking at the trees, there's a far higher proportion of singletons in -the region of the sweep. +the region of the sweep compared to the entire genome. (sec_tskit_getting_started_compute_statistics_windowing)= ### Windowing It is often useful to see how statistics vary in different genomic regions. This is done -by calculating them in {ref}`tskit:sec_stats_windows` along the genome. For this, +by calculating them in {ref}`windows` along the genome. For this, let's look at a single statistic, the genetic {meth}`~TreeSequence.diversity` (π). As a -site statistic this measures the average number of genetic differences between two -randomly chosen samples, whereas as a branch length statistic it measures the average +site statistic, it measures the average number of genetic differences between two +randomly chosen samples, whereas as a branch statistic, it measures the average branch length between them. We'll plot how the value of π changes using 10kb windows, plotting the resulting diversity between positions 4 and 6 Mb: @@ -380,15 +378,15 @@ ax1.set_yscale("log") ax1.set_ylim(1e-6, 1e-3) ax2.stairs(ts.diversity(windows=windows, mode="branch"), windows/1_000, baseline=None) ax2.set_xlabel("Genome position (kb)") -ax2.set_title("Branch-length-based calculation") +ax2.set_title("Branch-based calculation") ax2.set_xlim(4e3, 6e3) ax2.set_yscale("log") plt.show() ``` There's a clear drop in diversity in the region of the selective sweep. And as expected, -the statistic based on branch-lengths gives a less noisy signal. - +the statistic based on branch lengths gives a less noisy signal than the statistic based +on randomly occuring mutations at sites. (sec_tskit_getting_started_exporting_data)= @@ -434,7 +432,7 @@ print(small_ts.as_nexus(precision=3, include_alignments=False)) ### VCF -The standard way of interchanging genetic variation data is the Variant Call Format, +The standard way of interchanging genetic variation data is the Variant Call Format (VCF), for which tskit has basic support: ```{code-cell} ipython3 @@ -442,9 +440,8 @@ import sys small_ts.write_vcf(sys.stdout) ``` -The write_vcf method takes a file object as a parameter; to get it to write out to the -notebook here we ask it to write to stdout. - +The {meth}`TreeSequence.write_vcf` method takes a file object as a parameter; +to get it to write out to the notebook here we ask it to write to stdout. (sec_tskit_getting_started_exporting_data_allel)= @@ -466,8 +463,8 @@ print(h.n_variants, h.n_haplotypes) h ``` -Sckit.allel has a wide-ranging and efficient suite of tools for working with genotype -data, so should provide anything that's needed. For example, it gives us an +{program}`scikit-allel` has a wide-ranging and efficient suite of tools for working +with genotype data for most of users' needs. For example, it gives us an another way to compute the pairwise diversity statistic (that we calculated {ref}`above` using the native {meth}`TreeSequence.diversity` method): @@ -477,6 +474,23 @@ ac = h.count_alleles() allel.mean_pairwise_difference(ac) ``` +Comparative call with tskit is: + +```{code-cell} ipython3 +# Per site (not span-normalised to match scikit-allel) +small_ts.diversity(windows="sites", span_normalise=False) +# Per site (span-normalised to account for variable span of windows defined by sites) +small_ts.diversity(windows="sites") + +# Per whole genome +small_ts.diversity(span_normalise=False) +# ... equivalent call to the above +sum(small_ts.diversity(windows="sites", span_normalise=False)) +small_ts.diversity() +``` + +See {ref}`windows` and {ref}`windows` +for details about the ``span_normalise=False``. (sec_tskit_getting_started_key_points)= @@ -496,15 +510,15 @@ in rough order of importance: * {ref}`sec_terminology_nodes` (i.e. genomes) can belong to {ref}`individuals`. For example, sampling a diploid individual results in an {class}`Individual` object which - possesses two distinct {ref}`sample nodes`. + links to two distinct {ref}`sample nodes`. * Key tree sequence methods * {meth}`~TreeSequence.samples()` returns an array of node IDs specifying the nodes that are marked as samples * {meth}`~TreeSequence.node` returns the node object for a given integer node ID * {meth}`~TreeSequence.trees` iterates over all the trees * {meth}`~TreeSequence.sites` iterates over all the sites - * {meth}`~TreeSequence.variants` iterates over all the sites with their genotypes - and alleles + * {meth}`~TreeSequence.variants` iterates over all the sites with their list of + unique alleles and sample/node? genotypes * {meth}`~TreeSequence.simplify()` reduces the number of sample nodes in the tree sequence to a specified subset * {meth}`~TreeSequence.keep_intervals()` (or its complement, @@ -519,4 +533,3 @@ in rough order of importance: sequence, for example {meth}`~TreeSequence.allele_frequency_spectrum`, {meth}`~TreeSequence.diversity`, and {meth}`~TreeSequence.Fst`; these can also be calculated in windows along the genome. - diff --git a/terminology_and_concepts.md b/terminology_and_concepts.md index 360b587d..6e672150 100644 --- a/terminology_and_concepts.md +++ b/terminology_and_concepts.md @@ -48,8 +48,6 @@ def basics(): ) tables.mutations.time = np.full_like(tables.mutations.time, tskit.UNKNOWN_TIME) tables.tree_sequence().dump("data/basics.trees") - - def create_notebook_data(): basics() @@ -71,18 +69,18 @@ concepts behind {program}`tskit`, the tree sequence toolkit. ::::{margin} :::{note} -See {ref}`sec_intro_downloading_datafiles` to run this tutorial on your own computer +See {ref}`sec_intro_downloading_datafiles` to run this tutorial on your own computer. ::: :::: A tree sequence is a data structure which describes a set of correlated evolutionary trees, together with some associated data that specifies, for example, -the location of mutations in the tree sequence. More technically, a tree sequence -stores a biological structure known as an "Ancestral Recombination Graph", or ARG. +the location of mutations in the genome. More technically, a tree sequence +stores a population genetics object known as an Ancestral Recombination Graph (ARG). -Below are the most important {ref}`terms and concepts ` -that you'll encounter in these tutorials, but first we'll {func}`~tskit.load` a tree -sequence from a `.trees` file using the +Below are the most important terms and concepts that you'll encounter in these tutorials. +A concise glossary of these terms and concepts is available at {ref}`here `. +But first we'll {func}`~tskit.load` a tree sequence from a `.trees` file using the {ref}`tskit:sec_python_api` (which will be used in the rest of this tutorial): ```{code-cell} ipython3 @@ -97,7 +95,7 @@ ts = tskit.load("data/basics.trees") :::{note} {ref}`Workarounds` exist to represent a multi-chromosome genome as a tree -sequence, but are not covered here +sequence, but are not covered here. ::: :::: @@ -126,13 +124,12 @@ ts.draw_svg( ) ``` -Each tree records the lines of descent along which a piece of DNA has been -inherited (ignore for the moment the red symbols, which represent a mutation). +Each tree records the lines of descent along which a piece of DNA has been inherited. For example, the first tree tells us that DNA from ancestral genome 7 duplicated to produce two lineages, which ended up in genomes 1 and 4, both of which exist in the current population. In fact, since this pattern is seen in all trees, these particular lines of inheritance were taken by all the DNA in this 1000 base pair genome. - +The red symbol is a mutation, which we will describe later. (sec_terminology_nodes)= @@ -148,7 +145,6 @@ an *internal node*, representing an ancestor in which a single DNA sequence was duplicated (in forwards-time terminology) or in which multiple sequences coalesced (in backwards-time terminology). - (sec_terminology_nodes_samples)= #### Sample nodes @@ -163,7 +159,6 @@ labelled $0..5$, and also 6 non-sample nodes, labelled $6..11$, in the tree sequ print("There are", ts.num_nodes, "nodes, of which", ts.num_samples, "are sample nodes") ``` - (sec_terminology_edges)= ### Edges @@ -175,10 +170,9 @@ three trees in the example above has a branch from node 7 to node 1, but those t branches represent just a single edge. Each edge is associated with a parent node ID and a child node ID. The time of the parent -node must be -strictly greater than the time of the child node, and the difference in these times is -sometimes referred to as the "length" of the edge. Since trees in a tree sequence are -usually taken to represent marginal trees along a genome, as well as the time dimension +node must be strictly greater than the time of the child node, and the difference in these times +is sometimes referred to as the "length" of the edge. Since trees in a tree sequence are +usually taken to represent local trees along a genome, as well as the time dimension each edge also has a genomic _span_, defined by a *left* and a *right* position along the genome. There are 15 edges in the tree sequence above. Here's an example of one of them: @@ -240,9 +234,6 @@ children_of_7 = first_tree.children(7) print("Node 7's parent is", parent_of_7, "and childen are", children_of_7, "in the first tree") ``` - - - (sec_terminology_individuals_and_populations)= ### Individuals and populations @@ -332,6 +323,8 @@ homozygous for "T", Bob is homozygous for "G", and Cat is heterozygous "T/G". In other words the ancestral state and the details of any mutations at that site, when coupled with the tree topology at the site {attr}`~Site.position`, is sufficient to define the allelic state possessed by each sample. +See description for {attr}`~Mutation.parent` on how tskit handles multiple mutations along +a path in a tree. Note that even though the genome is 1000 base pairs long, the tree sequence only contains a single site, because we usually only bother defining *variable* sites in a tree @@ -340,7 +333,6 @@ that genomic location). It is perfectly possible to have a site with no mutation (or silent mutations) --- i.e. a "monomorphic" site --- but such sites are not normally used in further analysis. - (sec_terminology_provenance)= ### Provenance @@ -354,7 +346,6 @@ call to msprime that produced it, and the second the call to provenance entries are sufficient to exactly recreate the tree sequence, but this is not always possible. - (sec_concepts)= ## Concepts @@ -385,17 +376,18 @@ with 3 or more children in a particular tree (these are known as *polytomies*). ### Tree changes, ancestral recombinations, and SPRs The process of recombination usually results in trees along a genome where adjacent -trees differ by only a few "tree edit" or SPR (subtree-prune-and-regraft) operations. +trees differ by only a few "tree edit" or subtree-prune-and-regraft (SPR) operations. The result is a tree sequence in which very few edges {ref}`change from tree to tree`. This is the underlying reason that `tskit` is so efficient, and is well illustrated in the example tree sequence above. In this (simulated) tree sequence, each tree differs from the next by a single SPR. -The subtree defined by node 7 in the first tree has been pruned and regrafted onto the -branch between 0 and 10, to create the second tree. The second and third trees have the -same topology, but differ because their ultimate coalesence happened in a different -ancestor (easy to spot in a simulation, but hard to detect in real data). This is also +The subtree defined by node 7 in the first tree has been pruned (away from node 11) and +regrafted onto the branch between 0 and 9, to create the second tree. +The second and third trees have the same topology, +but differ because their ultimate coalesence happened in a different ancestor +(easy to spot in a simulation, but hard to detect in real data). This is also caused by a single SPR: looking at the second tree, either the subtree below node 8 or the subtree below node 9 must have been pruned and regrafted higher up on the same lineage to create the third tree. Because this is a fully {ref}`simplified` @@ -409,7 +401,7 @@ positions (an "infinite sites" model of breakpoints), then the number of trees i sequence equals the number of ancestral recombination events plus one. If recombinations can occur at the same physical position (e.g. if the genome is treated as a set of discrete integer positions, as in the simulation that created this tree sequence) then -moving from one tree to the next in a tree sequence might require multiple SPRs if +moving from one tree to the next in a tree sequence might require multiple SPRs if there are multiple, overlaid ancestral recombination events. (sec_concepts_args)= @@ -418,11 +410,11 @@ there are multiple, overlaid ancestral recombination events. ::::{margin} :::{note} -There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in a local tree. +There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in all local trees. ::: :::: -The term "Ancestral Recombination Graph", or ARG, is commonly used to describe a genetic +The term Ancestral Recombination Graph (ARG), is commonly used to describe a genetic genealogy. In particular, many (but not all) authors use it to mean a genetic genealogy in which details of the position and potentially the timing of all recombination and common ancestor events are explictly stored. For clarity @@ -438,7 +430,7 @@ which omits these extra nodes. This is for two main reasons: 2. The number of recombination and non-coalescing common ancestor events in the genealogy quickly grows to dominate the total number of nodes in the tree sequence, without actually contributing to the mutations inherited by the samples. - In other words, these nodes are redundant to the storing of genome data. + In other words, these nodes are redundant to the storing of genomic data. Therefore, compared to a full ARG, you can think of a simplified tree sequence as storing the trees *created by* recombination events, rather than attempting to record the @@ -450,6 +442,8 @@ way to put it: > whereas a [simplified] tree sequence encodes the outcome of those events" > ([Kelleher _et al._, 2019](https://doi.org/10.1534/genetics.120.303253)) +[Wong _et al._, 2024](https://doi.org/10.1093/genetics/iyae100) +review this topic in detail. ### Tables diff --git a/what_is.md b/what_is.md index faf89a6c..6fce9e48 100644 --- a/what_is.md +++ b/what_is.md @@ -79,7 +79,7 @@ they can be created by [evolutionary simulation](https://tskit.dev/software/#sim or by [inferring genealogies from empirical DNA data](https://tskit.dev/software/#infer). :::{margin} Key point -Tree sequences are used to encode and analyse large genetic datasets +Tree sequences are used to encode and analyse large genetic datasets. ::: Tree sequences provide an efficient way of storing @@ -226,7 +226,7 @@ branches of the trees, and the positions of the twelve variable sites associated mutations are shown along the X axis. :::{margin} Key point -Mutation on trees are the source of genetic variation +Mutation on trees are the source of genetic variation. ::: The trees inform us that, for example, the final mutation (at position 987) is inherited @@ -266,7 +266,7 @@ ts.draw_svg( ``` :::{margin} Key point -Tree sequences are efficient because they don't store each tree separately +Tree sequences are efficient because they don't store each tree separately. ::: A branch can be shared by many adjacent trees, but is stored as a single edge in the tree @@ -314,9 +314,9 @@ plt.show() ::::{margin} :::{note} -The genetic genealogy is sometimes referred to as an ancestral recombination graph, -or ARG, and one way to think of tskit tree sequence is as a way -to store various different sorts of ARGs (see the {ref}`ARG tutorial`) +The genetic genealogy is sometimes referred to as an ancestral recombination graph (ARG), +and one way to think of tskit tree sequence is as a way +to store various different sorts of ARGs (see the {ref}`ARG tutorial`). ::: :::: @@ -428,7 +428,7 @@ multiple correlated trees along a genome. ```{margin} Key point Most genetic calculations involve iterating over trees, which is highly efficient in -{program}`tskit` +{program}`tskit`. ``` For example, statistical measures of genetic variation can be thought of as a calculation