Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phasing singletons with expectation propagation #374

Closed
nspope opened this issue May 3, 2024 · 41 comments
Closed

Phasing singletons with expectation propagation #374

nspope opened this issue May 3, 2024 · 41 comments

Comments

@nspope
Copy link
Contributor

nspope commented May 3, 2024

It'll frequently be the case that singletons are not phased. Singleton variants aren't used by tsinfer, but are crucial for correct dating. Here's how we'll handle unphased singletons:

  1. Split edges leading to the same individual into "phase blocks" -- e.g. the intersection of intervals for these edges. For example, over some region of the genome the pair of edges leading to the same individual might have coordinates $[l_i, r_i]$ and $[l_j, r_j]$. The phase block for these two edges would have coordinates $[\max(l_i, l_j), \min(r_i, r_j)]$.

  2. Use an expectation propagation update by phase block rather than separately for the two constituent edges. For example, say we have a phase block where the parents of the constituent edges have ages $t_i$ and $t_j$. The children of the edges are both present-day haplotypes (of age 0). If the phase block has (mutation-weighted) span $\mu$ and contains $y$ singleton mutations, then the total mutational target is $\lambda_{ij} = \mu (t_i + t_j)$ and the mutation likelihood is $y \sim \mathrm{Poisson}(\lambda_{ij})$.

So, if we have variational posteriors $t_i \sim \mathrm{Gamma}(\alpha_i, \beta_i), t_j \sim \mathrm{Gamma}(\alpha_j, \beta_j)$, the partition function used in the expectation propagation update for the phase block is,
$$\int_0^{\infty} \int_0^{\infty} t_i^{\alpha_i - 1} e^{-\beta_i t_i} t_j^{\alpha_j - 1} e^{-\beta_j t_j} (t_i + t_j)^{y} e^{-\mu (t_i + t_j)} dt_i dt_j = \mathcal{B}(\alpha_i, \alpha_j) \frac{\Gamma(\alpha_i + \alpha_j + y)}{(\beta_i + \mu)^{\alpha_i + \alpha_j + y}} {_2} F_1 (\alpha_j, \alpha_i + \alpha_j + y; \alpha_i + \alpha_j; 1 - \frac{\beta_j + \mu}{\beta_i + \mu})$$ and we get the updated moments of $t_i, t_j$ by differentiating this wrt $\alpha, \beta$ as usual.

  1. Finally, we will want to phase individual mutations (e.g. for normalisation, or to date the mutations). Let $z_{i}$ be an indicator that the mutation is on the branch leading to parent $i$. Conditional on node ages, $\mathbb{P}(z_i = 1 \mid t_i, t_j) = \frac{t_i}{t_i + t_j}$. We have variational posteriors for node ages rather than point estimates, so the unconditional probability is $\mathbb{P}(z_i = 1) \propto \int_0^\infty \int_0^\infty \mathbb{P}(z_i = 1 \mid t_i, t_j) t_i^{\alpha_i - 1} e^{-\beta_i t_i} t_j^{\alpha_j - 1} e^{-\beta_j t_j} dt_i dt_j$. This integral has a very similar looking solution to the partition function above (e.g. with integer perturbations to the parameters).

This gets substantially more complicated if the individual in question is not in the present day (especially if the age of the individual is unknown). So for now we'll have to assume that mutations on ancient samples are phased (I think these are often pseudohaploid sequences anyways).

@nspope
Copy link
Contributor Author

nspope commented May 3, 2024

I think the big API question is whether we want singletons to be present in the input tree sequence or not. If the former, they could be rephased (e.g. the output might swap node labels for some mutations). If the latter, the question is how to pass them into tsdate and how to return them (e.g. should they just be used for dating? Should they be added to the mutation table?)

In general we'll have to assume singletons from a particular individual are all phased or all unphased.

@hyanwong
Copy link
Member

hyanwong commented May 3, 2024

I don't think the standard tsdate should mess with anything other than the node and mutation times (and metadata). There's a cleanliness to making sure that the dating algorithm doesn't mess with the number of rows in the tables.

But if a pair of same-length arrays (individual_ids, singleton_position) is passed in, it would, I guess, be possible to pass out an array of node IDs (and times?) of the same length, specify which nodes are the most likely for the singleton at each position (or I guess you could pass out a probability of the singleton being in the first vs the second node).

You could then have a separate tsdate.util.XXX function to add these singletons to the resulting tree sequence.

@nspope
Copy link
Contributor Author

nspope commented May 3, 2024

Hmm ... so this wouldn't be usable from the command-line, then? It seems more clunky to me to have a series of functions that have to be called in a particular order rather than a single invocation, with a flag phase_singletons for example.

I also feel this way about split_disjoint_nodes, which also modifies tables, in that this seems to universally improve the performance of the dating algorithm. So it should really be done by default, not through a separate routine which will (probably) be skipped by many users.

@hyanwong
Copy link
Member

hyanwong commented May 3, 2024

this wouldn't be usable from the command-line, then

That's the same for input, right? I think you would input and output the data from/to files or from/to stdin/stdout? But maybe Jerome has a better idea?

it should really be done by default, not through a separate routine

There is a "tsdate.pre_process()" routine, which I think is how we should incorporate split_disjoint_nodes. We should recommend that this is done before dating.

@hyanwong
Copy link
Member

hyanwong commented May 3, 2024

this wouldn't be usable from the command-line, then

Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data), so incorporating singletons via the CLI will always involve messing with extra files. It doesn't seem too bad to me to output extra files in this case too.

@hyanwong
Copy link
Member

hyanwong commented May 3, 2024

Oh, and one last thought: I suppose that you could have a wrapper function that called all this stuff in the right order. So tsdate.date() is guaranteed not to change the topology, just the times, but e.g. tsdate.date_with_pre_and_post_processing(...) wraps all the commonly used steps together. We'd need to think of a better name, of course.

@nspope
Copy link
Contributor Author

nspope commented May 3, 2024

I seem to be unusually fixated on this particular API detail, so I'll work on the lower-level stuff for the moment and we can revisit once we've got an idea of how all this works in practice. But:

Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data)

I think this is where I'm hung up -- you don't need to supply additional data if the singletons are already in the TS. If they're mapped to a edge, then they're implicitly phased. The phasing could be wrong, of course, so it's worth having a routine to correct the mapping. This seems much cleaner to me than having multiple input/output streams and utility functions, which are confusing to use and maintain.

Maybe a compromise here is to assume that only the mutations in the TS are going to be used for dating. So, singletons are included in the TS, regardless of whether they're well-phased or randomly assigned to a branch. But there's a flag to date which makes the dating algorithm agnostic to phase of singleton branches. However the dating algorithm doesn't modify the tables-- there would be a separate utility that could adjust the phase of singletons based on the node times.

The reason I don't like a separate input/output stream for singleton mutations is because in general, we won't be able to support a mix of phased and unphased mutations in the same individual, with a principled algorithm like the one above. What do you do if the input TS was subset down from tsinfer output, so that there's singletons already in there? Throw an error, in which case we'd need another utility function to strip singleton mutations? Silently assume that they're unphased and lump them in with the additional input stream of singleton mutations? IMO it's nicer to assume the data is wrapped in a single input, and have it so that additional arguments modify the behavior of the algorithm rather than augment the input.

I suppose that you could have a wrapper function that called all this stuff in the right order

This is a nice idea!

@hyanwong
Copy link
Member

hyanwong commented May 4, 2024

If they're mapped to a edge, then they're implicitly phased.

I guess this is the main problem I have. I'm not convinced we should be adding unphased things to the tree sequence by forcing a random phasing. Or at least, we would need to think hard about how unphased stuff should be represented in a tree sequence. I think @jeromekelleher and/or @petrelharp should comment here though. There might be a way to put a mutation in a table without an associated node (perhaps tskit.NULL in the node field?) so that it doesn't actually change the genotypes. I think this is a much bigger issue though.

@jeromekelleher
Copy link
Member

How about we make the internal functions first, and decide about the external API later? The internal function would be based on numpy arrays as input and output, taking an input tree sequence as immutable? So, write the absolute minimal core piece of functionality in a way that is convenient and non-destructive with tests, and then we can talk about the external API more fruitfully I think.

I agree with @nspope that it should be easy to do the right things here from a CLI invocation.

@hyanwong
Copy link
Member

hyanwong commented May 7, 2024

That seems like an excellent plan @jeromekelleher

@nspope
Copy link
Contributor Author

nspope commented Jun 12, 2024

Finally got this working... first taking a look at ages of singleton mutations. This is the usual equillibrium population, human-ish parameters setting. We have to use the mutation posterior in this case. I'm comparing against the branch midpoint on the true trees to avoid randomness that we can't hope to infer.

Correct vs phase-agnostic, with true topologies:

and with inferred topologies:

Looks pretty much identical!

@nspope
Copy link
Contributor Author

nspope commented Jun 12, 2024

For a baseline, here's the original algorithm but with a random phase assigned to singletons:

True trees,

Inferred trees,

@hyanwong
Copy link
Member

Wow. This is very neat, and convincing too. Does adding unphased singletons make the algorithm more numerically unstable?

@nspope
Copy link
Contributor Author

nspope commented Jun 12, 2024

I think I've got the numerics under control (previous issues were a bug). But we'll see, I think the GEL 40K is a good litmus test

@petrelharp
Copy link

Holy shit, that looks great.

@nspope
Copy link
Contributor Author

nspope commented Jun 13, 2024

Whoops, mistake on my part. These definitely looked too good to be true. Here's the correct comparison,

zigzag sim phased singles
zigzag sim unphased singles
zigzag sim randphased singles

Still pretty good. While we integrate over phase to get mutation age, we can also get MAP estimates of what the phase should be. This gets it right ~80% of the time in the example above.

@duncanMR
Copy link

duncanMR commented Jun 18, 2024

To test the dating performance on singletons with new phase-agnostic algorithm, I used the 1500 individual subset I made of the 1kgp-chr20p data. That way, I could use the singletons zarr from here which has a column for the true phase. I first had to fix an oversight: there were ~300 sites with two alternatives alleles that are both singletons, so I just kept one of them in each case.

import tsdate
import tszip
import zipfile
import os
import xarray as xr
import pandas as pd
import numpy as np

zip_path = 'data/1kgp_bal1500_singletons.zarr.zip'
output_path = os.path.splitext(zip_path)[0]
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    print(f'Unzipping to {output_path}')
    zip_ref.extractall(output_path)
    ds = xr.open_dataset(output_path, engine='zarr')

ts = tszip.decompress('data/1kgp_bal1500-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)

#Keep just one allele for multi-allelic singletons
dupe_pos = ds['position'].values
duplicate_mask = ~pd.Series(dupe_pos).duplicated(keep='first').values
assert(np.sum(duplicate_mask) == len(np.unique(dupe_pos)))
duplicate_mask_xarray = xr.DataArray(duplicate_mask,
                                     dims=['variants'],
                                     name='duplicate_position_mask')
ds.update({'duplicate_position_mask': duplicate_mask_xarray})

#construct arrays for adding singletons
dedupe_ds = ds.where(ds.duplicate_position_mask, drop=True)
stn_pos = dedupe_ds['position'].values
stn_sample_id = dedupe_ds['sample_id'].values
stn_ancestral = dedupe_ds['allele'][:,0].values
stn_derived = dedupe_ds['allele'][:,1].values

#map sample_ids to individual ids
sample_id_dict = {}
for individual in ts.individuals():
    sample_id = individual.metadata['sample_id']
    sample_id_dict[sample_id] = individual.id
stn_individ = np.array([sample_id_dict[sample_id] for sample_id in stn_sample_id])

unphased_ts = tsdate.phasing.insert_unphased_singletons(ts, 
                                                position = stn_pos,
                                                individual = stn_individ,
                                                ancestral_state = stn_ancestral,
                                                derived_state = stn_derived
                                                )

There were 303 339 singletons to add to the ts. The insert_phased_singletons function only took a few seconds on my laptop, so performance doesn't seem to be an issue @nspope. One odd issue I had was that split_disjoint_nodes failed if I did it after adding the singletons, hence why I simplified and split disjoint nodes first.

Click to reveal split_disjoint_nodes error
# This is unphased_ts run without simplification or splitting disjoint nodes of the ts at the beginning
unphased_ts = unphased_ts.simplify()
unphased_ts = tsdate.util.split_disjoint_nodes(unphased_ts)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[16], [line 2](vscode-notebook-cell:?execution_count=16&line=2)
      [1](vscode-notebook-cell:?execution_count=16&line=1) unphased_ts = unphased_ts.simplify()
----> [2](vscode-notebook-cell:?execution_count=16&line=2) unphased_ts = tsdate.util.split_disjoint_nodes(unphased_ts)

File ~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:550, in split_disjoint_nodes(ts, record_provenance)
    [541](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:541) node_is_sample = np.bitwise_and(ts.nodes_flags, tskit.NODE_IS_SAMPLE).astype(bool)
    [542](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:542) edges_parent, edges_child, nodes_order, split_nodes = _split_disjoint_nodes(
    [543](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:543)     ts.edges_parent,
    [544](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:544)     ts.edges_child,
   (...)
    [547](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:547)     node_is_sample,
    [548](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:548) )
--> [550](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:550) mutations_node = _relabel_mutations_node(
    [551](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:551)     ts.mutations_node,
    [552](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:552)     ts.sites_position[ts.mutations_site],
    [553](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:553)     nodes_order,
    [554](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:554)     edges_parent,
    [555](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:555)     edges_child,
    [556](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:556)     ts.edges_left,
    [557](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:557)     ts.edges_right,
    [558](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:558)     ts.indexes_edge_insertion_order,
    [559](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:559)     ts.indexes_edge_removal_order,
    [560](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:560) )
...
--> [513](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:513)     assert nodes_map[mutations_node[m]] != tskit.NULL
    [514](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:514)     output[m] = nodes_map[mutations_node[m]]
    [515](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:515)     m += 1

AssertionError:

It seems to be a problem with the node id assigned to the singletons that are added, but I haven't looked into it beyond that. Anyway, dating with singletons_phased=False was fast on my laptop in this case. The main issue is that 60% of the singletons are removed due to having 'invalid posteriors', which is higher than I expected:

import logging
logging.getLogger().setLevel(logging.INFO)

unphased_dated_ts = tsdate.date(unphased_ts, mutation_rate=1.29e-08, singletons_phased=False)
2024-06-18 11:39:13,481 [1640] INFO     root: Extracted mutations in 0.09639501571655273 seconds
2024-06-18 11:39:13,977 [1640] INFO     root: Found 119208 unphased singleton mutations
2024-06-18 11:39:13,978 [1640] INFO     root: Split unphased singleton edges into 1048995 blocks
2024-06-18 11:39:13,980 [1640] INFO     root: Phased singletons in 0.4944934844970703 seconds
2024-06-18 11:39:56,218 [1640] INFO     root: Skipped 0 edges with invalid factors
2024-06-18 11:39:56,219 [1640] INFO     root: Calculated node posteriors in 41.7481415271759 seconds
2024-06-18 11:39:56,630 [1640] INFO     root: Skipped 184131 mutations with invalid posteriors
2024-06-18 11:39:56,631 [1640] INFO     root: Calculated mutation posteriors in 0.40827155113220215 seconds
2024-06-18 11:39:56,654 [1640] INFO     root: Switched phase of 36525 singletons
2024-06-18 11:40:10,803 [1640] INFO     root: Timescale rescaled in 14.147191524505615 seconds
2024-06-18 11:40:10,874 [1640] INFO     root: Modified ages of 938 nodes to satisfy constraints
2024-06-18 11:40:10,919 [1640] INFO     root: Set ages of 184131 nonsegregating mutations to root times.
2024-06-18 11:40:10,945 [1640] INFO     root: Constrained node ages in 0.08892679214477539 seconds
2024-06-18 11:40:10,956 [1640] INFO     root: Set metadata schema on NodeTable
2024-06-18 11:40:14,065 [1640] INFO     root: Set metadata schema on MutationTable
2024-06-18 11:40:18,648 [1640] INFO     root: Inserted node and mutation metadata in 7.70140266418457 seconds
2024-06-18 11:40:20,790 [1640] INFO     root: Sorted tree sequence in 2.140110731124878 second

I then adapted the convenience function for adding singletons to add singletons with the true phases (0/1) that I stored in the zarr:

Click to reveal code to add phased singletons
import tskit


def insert_phased_singletons(
    ts,
    position,
    individual,
    ancestral_state,
    derived_state,
    phase,
):
    """
    Insert phased singletons into the tree sequence.
    :param tskit.TreeSequence ts: the tree sequence to add singletons to
    :param np.ndarray position: the position of the variants
    :param np.ndarray individual: the individual id in which the variant occurs
    :param np.ndarray ancestral_state: the ancestral state of the variant
    :param np.ndarray derived_state: the derived state of the variant
    :param np.ndarray phase: the phase of the variant (0 or 1)
    :returns: A copy of the tree sequence with singletons inserted
    """
    tables = ts.dump_tables()
    sites_id = {p: i for i, p in enumerate(ts.sites_position)}
    for pos, ind, ref, alt, phase in zip(
        position, individual, ancestral_state, derived_state, phase
    ):
        if pos in sites_id:
            if ref != ts.site(sites_id[pos]).ancestral_state:
                raise ValueError(
                    f"Existing site at position {pos} has a different ancestral state"
                )
            muts = ts.site(sites_id[pos]).mutations
            set_time = len(muts) and np.isfinite(muts[0].time)
        else:
            sites_id[pos] = tables.sites.add_row(position=pos, ancestral_state=ref)
            set_time = False
        # TODO: more efficient to do in bulk?
        site = sites_id[pos]
        node = ts.individual(ind).nodes[phase]
        time = ts.nodes_time[node] if set_time else tskit.UNKNOWN_TIME
        tables.mutations.add_row(
            site=site,
            node=node,
            time=time,
            derived_state=alt,
        )
    tables.sort()
    tables.build_index()
    tables.compute_mutation_parents()
    return tables.tree_sequence()


phased_ts = insert_phased_singletons(
    ts,
    position=stn_pos,
    individual=stn_individ,
    ancestral_state=stn_ancestral,
    derived_state=stn_derived,
    phase=stn_phase,
)
phased_ts

Dating with singletons_phased=True is still fast, but again 60% of singletons are skipped:

phased_dated_ts = tsdate.date(phased_ts, mutation_rate=1.29e-08, singletons_phased=True)
2024-06-18 12:03:42,365 [1640] INFO     root: Extracted mutations in 0.09936237335205078 seconds
2024-06-18 12:03:42,523 [1640] INFO     root: Found 0 unphased singleton mutations
2024-06-18 12:03:42,525 [1640] INFO     root: Split unphased singleton edges into 0 blocks
2024-06-18 12:03:42,526 [1640] INFO     root: Phased singletons in 0.1573958396911621 seconds
2024-06-18 12:04:23,682 [1640] INFO     root: Skipped 0 edges with invalid factors
2024-06-18 12:04:23,684 [1640] INFO     root: Calculated node posteriors in 40.6736102104187 seconds
2024-06-18 12:04:24,023 [1640] INFO     root: Skipped 184131 mutations with invalid posteriors
2024-06-18 12:04:24,024 [1640] INFO     root: Calculated mutation posteriors in 0.33773255348205566 seconds
2024-06-18 12:04:24,027 [1640] INFO     root: Switched phase of 0 singletons
2024-06-18 12:04:37,877 [1640] INFO     root: Timescale rescaled in 13.847219467163086 seconds
2024-06-18 12:04:37,942 [1640] INFO     root: Modified ages of 936 nodes to satisfy constraints
2024-06-18 12:04:37,975 [1640] INFO     root: Set ages of 184131 nonsegregating mutations to root times.
2024-06-18 12:04:37,996 [1640] INFO     root: Constrained node ages in 0.07012248039245605 seconds
2024-06-18 12:04:37,998 [1640] INFO     root: Set metadata schema on NodeTable
2024-06-18 12:04:41,254 [1640] INFO     root: Set metadata schema on MutationTable
2024-06-18 12:04:45,927 [1640] INFO     root: Inserted node and mutation metadata in 7.9303083419799805 seconds
2024-06-18 12:04:48,069 [1640] INFO     root: Sorted tree sequence in 2.1404826641082764 seconds

Finally, I used evaluations.mutations_time to compare the times of singletons with true phase (x) to the phase-agnostic singleton times:

1500_subset_singletons

I then generated a random Bernoulli vector to use for the phase and ran dating with singletons_phased=True (in hindsight I realise I could have just used the arbitarily assigned phases from insert_unphased_singletons instead):

1500_subset_singletons_random

Click to reveal code to reproduce the plots above:
from tsdate import evaluation

#True phase vs phase-agnostic
evaluation.mutations_time(
    phased_dated_ts,
    unphased_dated_ts,
    use_posterior=False,
    min_freq=None,
    max_freq=1,
    plotpath='results/1500_subset_singletons.png',
    title='1500 subset singletons: true phase vs phase-agnostic dating',
    subtending_node=False,
)

#True phase vs random phase
np.random.seed(42)
stn_phase_random = np.random.randint(2, size=len(stn_pos))
random_phase_ts = insert_phased_singletons(
    ts,
    position=stn_pos,
    individual=stn_individ,
    ancestral_state=stn_ancestral,
    derived_state=stn_derived,
    phase=stn_phase_random,
)
random_phase_dated_ts = tsdate.date(random_phase_ts, mutation_rate=1.29e-08, singletons_phased=True)

evaluation.mutations_time(
    phased_dated_ts,
    random_phase_dated_ts,
    use_posterior=False,
    min_freq=None,
    max_freq=1,
    plotpath='results/1500_subset_singletons_random.png',
    title='1500 subset singletons: true phase vs random phase',
    subtending_node=False,
)

Qualitatively, they seem similar to the simulation results, just with a tonne more variance. @nspope, is the high proportion of variants with invalid posteriors also surprising to you? If you'd like to investigate it further, here is the TS with the correctly phased singletons added. Let me know if there are other files you want:

1kgp_bal1500_phased_singletons.trees.zip

I'm going to test it in GEL next.

@duncanMR
Copy link

Just remembered that I can't install from GitHub in GEL anymore, so I unfortunately can't test it until they approve the Airlock transfer of this branch of tsdate, annoyingly! I just submitted the request, so it should be sorted within two days.

@nspope
Copy link
Contributor Author

nspope commented Jun 18, 2024

Thanks @duncanMR, very helpful and a lot to chew on.

is the high proportion of variants with invalid posteriors also surprising to you?

could you check the count of nonsegregating mutations (above the root)? ideally those'll be the only ones where we can't assign a posterior.

@duncanMR
Copy link

duncanMR commented Jun 18, 2024

Thanks @duncanMR, very helpful and a lot to chew on.

is the high proportion of variants with invalid posteriors also surprising to you?

could you check the count of nonsegregating mutations (above the root)? ideally those'll be the only ones where we can't assign a posterior.

I see: the number of mutations above root matches the number with invalid posteriors. So it's the root-to-sample edges problem rearing it's head again! It's concerning how many there are here.

@nspope
Copy link
Contributor Author

nspope commented Jun 18, 2024

it's the root-to-sample edges problem

Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset?

@duncanMR
Copy link

it's the root-to-sample edges problem

Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset?

Woops, just realised my mistake. The problem is simply that many of the singletons I extracted fall outside the range of site positions in the TS, because I extracted them directly from the zarr and forgot to limit the positions by the site density mask. So mutations were being placed onto nodes in the first and last null trees in the TS that cover the range of the genome that's been masked out. So it's nothing to worry about, sorry for the false alarm!

@hyanwong
Copy link
Member

Aha - of course! Easy mistake to make. We should catch this and error out, I think.

@nspope
Copy link
Contributor Author

nspope commented Jun 18, 2024

We should catch this and error out, I think

Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know).

@nspope
Copy link
Contributor Author

nspope commented Jun 18, 2024

@duncanMR two things that occurred to me -- we should be using mutation posteriors here; as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age-- an upper bound on mutation age may be more of interest if we're focusing on old singletons

@hyanwong
Copy link
Member

We should catch this and error out, I think

Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know).

No, I mean we should error out if if singletons are being added into regions where there are no edges

@hyanwong
Copy link
Member

hyanwong commented Jun 18, 2024

I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site

@duncanMR
Copy link

@duncanMR two things that occurred to me -- we should be using mutation posteriors here; as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age-- an upper bound on mutation age may be more of interest if we're focusing on old singletons

Cool, I'll look into that today. I've also got tsdate through airlock so I'll test in GEL today too.

I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site

Yeah, above-the-root mutations are extremely rare in practice: I've only seen a handful of them in the largest tree sequences. Hopefully we can figure out a solution in the long-term with my DPhil project.

@duncanMR
Copy link

When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.

@duncanMR
Copy link

@nspope Adding singletons is failing for chr17p in GEL. It's a bit annoying that it didn't show up in the 1kgp data, since I can't share a reproducible example. It looks like a numerical issue: a mutation phase was generated outside [0,1]. I can try and set up a debugger to find out what the problematic phase is.

image

image

@nspope
Copy link
Contributor Author

nspope commented Jun 19, 2024

@duncanMR thanks Duncan. I can probably just clamp values to that range, if it's in the range of fp error. Would you mind inserting a print statement to see how far out of wack it is?

@duncanMR
Copy link

It turns out that it's not an issue where the phase is within fp error of 0 or 1. After updating my singletons filtering procedure, I now have 1 million singletons to add to the ~1million variants already in the chr17p TS. Of those, there are three singletons which for some reason have invalid posteriors, but they aren't above the root this time and aren't in null trees. They are on short branches at the bottom of trees.

I first tried removing the three sites, but bizarrely, five more sites popped up with invalid posteriors. The quick fix was easy: the bug is that the invalid posteriors cause there to be three nan values in the self.mutation_phase array, which then get passed into reallocate_unphased and trigger the assert statements. All you have to do is check for nan values and skip them. For some reason, this problem doesn't occur with invalid posteriors due to mutations being placed in null trees. So it seems to work fine now:

image

@nspope I will add some comments to the PR where I've had issues so far.

@hyanwong
Copy link
Member

When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.

We don't ignore multiallelics in tsdate though, do we? True, they aren't used the help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely.

@duncanMR
Copy link

duncanMR commented Jun 21, 2024

When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone.

We don't ignore multiallelics in tsdate though, do we? True, they aren't used the help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely.

My impression was that we exclude all multi-allelic sites instead of just keeping one allele at random, but we can check with Ben later.

@hyanwong
Copy link
Member

We don't need to keep one at random: tsinfer can place non-inference sites back into the tree sequence using parsimony. They will then be used by tsdate.

It might be that multiallelics are discarded by the snakemake pipeline before tsinfer actually sees them, and are not placed back into the tree sequence using parsimony. There is an argument that perhaps we should (or if not, we should adjust the mutation rate to take into account the fact that some sites have been discarded)

@duncanMR
Copy link

That's a good point. We definitely aren't placing any multi-allelics by parsimony at the moment (there is one mutation per site) but I don't know if they are available in the zarr to be added later. If that's the case, then I think adding back all the multi-allelics (not just singletons) makes sense.

@hyanwong
Copy link
Member

I suspect they are in the zarr but masked out. @benjeffery will know. We can ask at 4.30 today.

@duncanMR
Copy link

After adding and dating the unphased singletons to 1kgp-chr20p, the upper bound estimates (parent time) of singleton ages have a much lighter tail than posterior means of higher frequency variants. Seems reasonable to me:

image

Click to reveal code to reproduce the plot above:
import numpy as np
import tsdate
import logging
import tszip
import pandas as pd

#generated elsewhere
singletons_df = pd.read_csv('data/unphased_chr20p_singletons.csv')

ts = tszip.decompress('data/1kgp_all-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)

singletons_df.drop_duplicates(subset=['position'], keep='first', inplace=True)
left = min(ts.sites_position)
right = max(ts.sites_position)
singletons_df = singletons_df[(singletons_df.position >= left) & (singletons_df.position <= right)]

stn_pos = singletons_df['position'].values
stn_sample_id = singletons_df['sample_id'].values
stn_ancestral = singletons_df['ref'].values
stn_derived = singletons_df['alt'].values

sample_id_dict = {}
for individual in ts.individuals():
    sample_id = individual.metadata['sample_id']
    sample_id_dict[sample_id] = individual.id

stn_individ = np.array([sample_id_dict[sample_id] for sample_id in stn_sample_id])

#insert singletons and date
singletons_ts = tsdate.phasing.insert_unphased_singletons(ts, 
                                                position = stn_pos,
                                                individual = stn_individ,
                                                ancestral_state = stn_ancestral,
                                                derived_state = stn_derived
                                                )
logging.getLogger().setLevel(logging.INFO)
singletons_dated_ts = tsdate.date(singletons_ts, mutation_rate=1.29e-08, singletons_phased=False)

stn_site_id = np.where(np.isin(ts.sites_position, stn_pos))[0]
assert(len(stn_site_id) == len(stn_pos))

stn_node = np.full(len(stn_site_id), -1)
stn_post_time = np.full(len(stn_site_id), np.nan)
stn_parent_time = np.full(len(stn_site_id), np.nan)
tree = ts.first()

for index, site_id in enumerate(stn_site_id):
    site = ts.site(site_id)
    pos = site.position
    assert(len(site.mutations) == 1)
    mut = site.mutations[0]
    stn_post_time[index] = mut.metadata['mn']
    child = mut.node
    tree.seek(pos)
    parent = tree.parent(mut.node)
    stn_parent_time[index] = tree.time(parent)

non_stn_pos = ts.sites_position[~np.isin(ts.sites_position, stn_pos)]
non_stn_site_id = np.where(np.isin(ts.sites_position, non_stn_pos))[0]
non_stn_post_time = np.full(len(non_stn_site_id), np.nan)

for index, site_id in enumerate(non_stn_site_id):
    site = ts.site(site_id)
    pos = site.position
    assert(len(site.mutations) == 1)
    mut = site.mutations[0]
    non_stn_post_time[index] = mut.metadata['mn']

def compute_cdf(data):
    sorted_data = np.sort(data)
    cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    return sorted_data, cdf

stn_post_time_sorted, stn_post_time_cdf = compute_cdf(stn_post_time)
stn_parent_time_sorted, stn_parent_time_cdf = compute_cdf(stn_parent_time)
non_stn_post_time_sorted, non_stn_post_time_cdf = compute_cdf(non_stn_post_time)

plt.plot(stn_post_time_sorted, stn_post_time_cdf, label='singletons: posterior time')
plt.plot(stn_parent_time_sorted, stn_parent_time_cdf, label='singletons: parent time')
plt.plot(non_stn_post_time_sorted, non_stn_post_time_cdf, label='all other variants: posterior time')

plt.xscale('log')
plt.title('CDFs for singletons vs other variants in 1kgp-chr20p-all')
plt.xlabel('mutation time')
plt.ylabel('CDF')
plt.legend()
plt.grid(True)
plt.show()

I'm going to look at the 1kgp annotations dataframe next.

@duncanMR
Copy link

In case anyone needs them, I fixed the problem with the singletons being placed outside the range of the TS. Here is the undated TS with phased singletons added:

1kgp_bal1500_phased_singletons.trees.zip

The dated TS is too big for Github so you can get it here: https://drive.google.com/drive/folders/1vn5xMlBBXTB-_sosX9LpaQ_45E36UVoa?usp=drive_link

@hyanwong
Copy link
Member

hyanwong commented Jun 24, 2024

Using a rough analysis, I reckon our current singleton phasing approach misses about 25% of the information we could incorporate if we also added haplotype lengths. See #5 (comment)

I don't think this is enough to warrant modifying out existing approach, but something like this might make an interesting observation in the paper discussion.

@hyanwong
Copy link
Member

Closing in favour of #446

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants