You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In #2948@tforest is working on adding time windows to statistics. How this should work for mode="branch" is clear; however, it is not clear for mode="site". (It doesn't make sense to have time windows for mode="node", btw.) The reason is that site mode sums over all alleles (thus mimicing what happens with real data); and if there's been multiple mutations, there could be more than one mutation at different times that led to the same allele.
So, what we'd really like to do, and what people probably mostly imagine is happening with mode="site", is to sum over all mutations, rather than alleles. This would be essentially equivalent to mode="branch", but measuring branch area by counting how many mutations are on it instead of doing span times length. So - we're proposing adding a new mode, called "mutation" to statistics, that does this.
In a bit more detail. Recall that to compute a statistic we have some weights and a summary function, $f()$. For a node $n$ let $w_T(n)$ be the total weight of all samples below the node in tree $T$. Then the branch-mode statistic is $$\sum_T |T| \sum_n \ell_T(n) f(w_T(n)) ,$$
where $|T|$ is the span of the tree and $\ell_T(n)$ is the length of the branch above $n$ in tree $T$. And, if we define $w_s(a)$ to be the total weight of all samples carrying allele $a$ at site $s$, then the site-mode stat is $$\sum_s \sum_a f(w_s(a)) .$$
We can rewrite this as a sum over trees, pedantically, as $$\sum_T \sum_{s \in T} \sum_{a \in s} f(\sum_{d_s(m) = a} w_T(n_m)),$$
where the sums are over, respectively, trees; sites in that tree; and alleles at that site; while "$\sum_{d_s(m) = a}$" means "the sum over all mutations at site $s$ whose derived allele is $a$"; and $n_m$ is the node for mutation $m$. This looks complicated but it's just what you'd imagine. So, we're proposing that mutation-mode just doesn't group together the distinct mutations producing the same allele: $$\sum_T \sum_{s \in T} \sum_{a \in s} \sum_{d_s(m) = a} f(w_T(n_m)) .$$
If polarised=False, then the sum would also include "the root", i.e., include a term for $f(\bar w - \sum_m w_T(n_m))$, as if there were a mutation to the ancestral state at the root as well. (And, maybe the stat should be polarised by default, unlike other stats?)
There have been some other requests for (essentially) this, mostly around divergence. By trying to compute exactly-what-you-get-from-sequences (in mode="site") we have in a sense removed the advantage of the tree sequence that lets us distinguish one from multiple mutations (in principle).
Finally: the branch stat is the expected value of the site stat, given the trees, under infinite sites neutral mutations. This mode would remove that last caveat: the branch stat is the expected value of the mutation stat, as long as mutations are neutral (and Poisson).
The text was updated successfully, but these errors were encountered:
In #2948 @tforest is working on adding time windows to statistics. How this should work for
mode="branch"
is clear; however, it is not clear formode="site"
. (It doesn't make sense to have time windows formode="node"
, btw.) The reason is that site mode sums over all alleles (thus mimicing what happens with real data); and if there's been multiple mutations, there could be more than one mutation at different times that led to the same allele.So, what we'd really like to do, and what people probably mostly imagine is happening with
mode="site"
, is to sum over all mutations, rather than alleles. This would be essentially equivalent tomode="branch"
, but measuring branch area by counting how many mutations are on it instead of doing span times length. So - we're proposing adding a new mode, called"mutation"
to statistics, that does this.In a bit more detail. Recall that to compute a statistic we have some weights and a summary function,$f()$ . For a node $n$ let $w_T(n)$ be the total weight of all samples below the node in tree $T$ . Then the branch-mode statistic is
$$\sum_T |T| \sum_n \ell_T(n) f(w_T(n)) ,$$ $|T|$ is the span of the tree and $\ell_T(n)$ is the length of the branch above $n$ in tree $T$ . And, if we define $w_s(a)$ to be the total weight of all samples carrying allele $a$ at site $s$ , then the site-mode stat is
$$\sum_s \sum_a f(w_s(a)) .$$
$$\sum_T \sum_{s \in T} \sum_{a \in s} f(\sum_{d_s(m) = a} w_T(n_m)),$$ $s$ whose derived allele is $a$ "; and $n_m$ is the node for mutation $m$ . This looks complicated but it's just what you'd imagine. So, we're proposing that mutation-mode just doesn't group together the distinct mutations producing the same allele:
$$\sum_T \sum_{s \in T} \sum_{a \in s} \sum_{d_s(m) = a} f(w_T(n_m)) .$$ $f(\bar w - \sum_m w_T(n_m))$ , as if there were a mutation to the ancestral state at the root as well. (And, maybe the stat should be polarised by default, unlike other stats?)
where
We can rewrite this as a sum over trees, pedantically, as
where the sums are over, respectively, trees; sites in that tree; and alleles at that site; while "$\sum_{d_s(m) = a}$" means "the sum over all mutations at site
If
polarised=False
, then the sum would also include "the root", i.e., include a term forThere have been some other requests for (essentially) this, mostly around divergence. By trying to compute exactly-what-you-get-from-sequences (in
mode="site"
) we have in a sense removed the advantage of the tree sequence that lets us distinguish one from multiple mutations (in principle).Finally: the branch stat is the expected value of the site stat, given the trees, under infinite sites neutral mutations. This mode would remove that last caveat: the branch stat is the expected value of the mutation stat, as long as mutations are neutral (and Poisson).
The text was updated successfully, but these errors were encountered: