Commit

Merge pull request #274 from hyanwong/arg-wording
Update terminology to reflect ARG paper
jeromekelleher authored Jul 25, 2024
2 parents 88195b7 + 88f4a42 commit 126cbd5
Showing 3 changed files with 47 additions and 24 deletions.
35 changes: 24 additions & 11 deletions args.md
@@ -24,9 +24,9 @@ parent to child nodes. Therefore a succinct tree sequence is equivalent to a
[directed graph](https://en.wikipedia.org/wiki/Directed_graph),
which is additionally annotated with genomic positions such that at each
position, a path through the edges exists which defines a tree. This graph
interpretation of a tree sequence is tightly connected to the concept of
interpretation of a tree sequence maps very closely to the concept of
an "ancestral recombination graph" (or ARG). See
[this preprint](https://www.biorxiv.org/content/10.1101/2023.11.03.565466v1) for further details.
[this preprint](https://www.biorxiv.org/content/10.1101/2023.11.03.565466v2) for further details.
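
The graph interpretation described above can be illustrated with a minimal sketch. It assumes a toy edge list with half-open genomic intervals; the `Edge` record, `local_tree`, and `path_to_root` are hypothetical names used for illustration only and are not part of the tskit API.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """Hypothetical edge record for illustration; not the tskit edge table."""
    left: float    # inclusive start of the genomic interval
    right: float   # exclusive end of the genomic interval
    parent: int
    child: int


# Toy graph on a genome of length 10: samples 0-2, ancestors 3-5.
EDGES = [
    Edge(0, 10, 3, 0),
    Edge(0, 10, 3, 1),
    Edge(0, 5, 4, 2),
    Edge(0, 5, 4, 3),
    Edge(5, 10, 5, 2),
    Edge(5, 10, 5, 3),
]


def local_tree(edges, position):
    """Return a {child: parent} map built from the edges covering `position`."""
    parents = {}
    for edge in edges:
        if edge.left <= position < edge.right:
            if edge.child in parents:
                # A tree allows at most one parent per node at any position.
                raise ValueError(f"node {edge.child} has two parents at {position}")
            parents[edge.child] = edge.parent
    return parents


def path_to_root(parents, node):
    """Follow parent pointers from `node` up to the root of the local tree."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


tree_left = local_tree(EDGES, 2.0)
tree_right = local_tree(EDGES, 7.0)
print(path_to_root(tree_left, 2))   # path from sample 2 at position 2.0
print(path_to_root(tree_right, 2))  # path from sample 2 at position 7.0
```

The same set of edges yields a different tree on each side of position 5, which is the sense in which the annotated graph "defines a tree" at every position.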

## Full ARGs

@@ -39,12 +39,16 @@ graph structure defined by that process, see e.g.

The term "ARG" is [often used](https://doi.org/10.1086%2F508901) to refer to
a structure consisting of nodes and edges that describe the genetic genealogy of a set
of sampled chromosomes which have evolved via a process of genetic inheritance combined
with recombination. ARGs may contain not just nodes corresponding to genetic
coalescence, but also additional nodes that correspond e.g. to recombination events.
These "full ARGs" can be stored and analysed in
of sampled chromosomes which have evolved via a process of inheritance combined
with recombination. We use the term "full ARG" for a commonly-described type of
ARG that contains not just nodes that involve coalescence of ancestral material,
but also additional non-coalescent nodes. These nodes correspond to
recombination events, and common ancestor events that are not associated with
coalescence in any of the local trees. Full ARGs can be stored and analysed in
[tskit](https://tskit.dev) like any other tree sequence. A full ARG can be generated using
{func}`msprime:msprime.sim_ancestry` with the `record_full_arg=True` option, as described
{func}`msprime:msprime.sim_ancestry` by specifying `coalescing_segments_only=False` along with
`additional_nodes = msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT`
(or the equivalent `record_full_arg=True`) as described
{ref}`in the msprime docs<msprime:sec_ancestry_full_arg>`:

```{code-cell}
@@ -58,8 +62,12 @@ parameters = {
"random_seed": 333,
}
ts_arg = msprime.sim_ancestry(**parameters, record_full_arg=True, discrete_genome=False)
# NB: the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome)
ts_arg = msprime.sim_ancestry(
**parameters,
discrete_genome=False, # the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome)
coalescing_segments_only=False, # setting record_full_arg=True is equivalent to these last 2 parameters
additional_nodes=msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT,
)
print('Simulated a "full ARG" under the Hudson model:')
print(
@@ -282,7 +290,12 @@ its simplified version:
```{code-cell}
large_sim_parameters = parameters.copy()
large_sim_parameters["sequence_length"] *= 1000
large_ts_arg = msprime.sim_ancestry(**large_sim_parameters, record_full_arg=True)
large_ts_arg = msprime.sim_ancestry(
**large_sim_parameters,
discrete_genome=False, # not technically needed, as we aren't calculating likelihoods
coalescing_segments_only=False,
additional_nodes=msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT,
)
large_ts = large_ts_arg.simplify()
print(
@@ -478,6 +491,6 @@ Show how KwARG output can be converted to tskit form.
:::

:::{todo}
Implement conversion between the 2 RE node version and the 1 RE node version
Implement conversion between the _msprime_ 2 RE node version and the more conventional 1 RE node version. See https://github.com/tskit-dev/msprime/issues/1942 for extensive discussion on the advantages / disadvantages of using 2 nodes vs 1 node-with-metadata.
:::

32 changes: 21 additions & 11 deletions terminology_and_concepts.md
@@ -416,28 +416,38 @@ there are multiple, overlaid ancestral recombination events.

### Tree sequences and ARGs

Much of the literature on ancestral inference concentrates on the Ancestral Recombination
Graph, or ARG, in which details of the position and potentially the timing of
recombination events are explicitly stored. Although a tree sequence *can* represent such
an ARG, by incorporating nodes that represent recombination events (see the
{ref}`sec_args` tutorial), this is not normally done for two reasons:
::::{margin}
:::{note}
There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in a local tree.
:::
::::
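
The distinction drawn in the note above can be sketched in a few lines. This is a toy illustration under stated assumptions: edges are plain `(left, right, parent, child)` tuples with half-open intervals, and `classify_nodes` is a hypothetical helper, not a tskit or msprime function.

```python
from collections import defaultdict

# Toy edges as (left, right, parent, child) with half-open intervals.
EDGES = [
    (0, 10, 4, 0),
    (0, 10, 4, 1),   # node 4: children 0 and 1 overlap on [0, 10)
    (0, 5, 5, 2),
    (5, 10, 5, 3),   # node 5: children 2 and 3 never share a position
    (0, 10, 6, 4),   # node 6: only one child
]


def classify_nodes(edges):
    """Label each parent node by how its child edges overlap along the genome."""
    spans_by_parent = defaultdict(list)
    for left, right, parent, child in edges:
        spans_by_parent[parent].append((left, right, child))

    labels = {}
    for parent, spans in spans_by_parent.items():
        distinct_children = {child for _, _, child in spans}
        # Coalescence needs two different children present at the same position.
        coalescent = any(
            a[2] != b[2] and max(a[0], b[0]) < min(a[1], b[1])
            for i, a in enumerate(spans)
            for b in spans[i + 1:]
        )
        if coalescent:
            labels[parent] = "coalescent"
        elif len(distinct_children) >= 2:
            labels[parent] = "common ancestor only"
        else:
            labels[parent] = "single child"
    return labels


print(classify_nodes(EDGES))
```

Here node 5 is a common ancestor of nodes 2 and 3, yet it has only one child in every local tree, so it never appears as a coalescence.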

The term "Ancestral Recombination Graph", or ARG, is commonly used to describe a genetic
genealogy. In particular, many (but not all) authors use it to mean a genetic
genealogy in which details of the position and potentially the timing of all
recombination and common ancestor events are explicitly stored. For clarity
we refer to this sort of genetic genealogy as a "full ARG". Succinct tree sequences can
represent many different sorts of ARGs, including "full ARGs", by incorporating extra
non-coalescent nodes (see the {ref}`sec_args` tutorial). However, tree sequences are
often shown and stored in {ref}`fully simplified<sec_simplification>` form,
which omits these extra nodes. This is for two main reasons:

1. Many recombination events are undetectable from sequence data, and even if they are
detectable, they can be logically impossible to place in the genealogy (as in the
second SPR example above).
2. The number of recombination events in the genealogy can grow to dominate the total
number of nodes in the total tree sequence, without actually contributing to the
realised sequences in the samples. In other words, recombination nodes are redundant
to the storing of genome data.
2. The number of recombination and non-coalescing common ancestor events in the genealogy
quickly grows to dominate the total number of nodes in the tree sequence,
without actually contributing to the mutations inherited by the samples.
In other words, these nodes are redundant to the storing of genome data.

Therefore, compared to an ARG, you can think of a standard tree sequence as simply
Therefore, compared to a full ARG, you can think of a simplified tree sequence as
storing the trees *created by* recombination events, rather than attempting to record the
recombination events themselves. The actual recombination events can sometimes be
inferred from these trees but, as we have seen, it's not always possible. Here's another
way to put it:

> "an ARG encodes the events that occurred in the history of a sample,
> whereas a tree sequence encodes the outcome of those events"
> whereas a [simplified] tree sequence encodes the outcome of those events"
> ([Kelleher _et al._, 2019](https://doi.org/10.1534/genetics.120.303253))

4 changes: 2 additions & 2 deletions what_is.md
@@ -307,8 +307,8 @@ plt.show()
::::{margin}
:::{note}
The genetic genealogy is sometimes referred to as an ancestral recombination graph,
or ARG, and there are {ref}`close similarities<sec_concepts_args>` between ARGs
and tree sequences (see the {ref}`ARG tutorial<sec_args>`)
or ARG, and one way to think of a tskit tree sequence is as a way
to store various different sorts of ARGs (see the {ref}`ARG tutorial<sec_args>`)
:::
::::

