-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Phasing singletons with expectation propagation #374
Comments
I think the big API question is whether we want singletons to be present in the input tree sequence or not. If the former, they could be rephased (e.g. the output might swap node labels for some mutations). If the latter, the question is how to pass them into tsdate and how to return them (e.g. should they just be used for dating? Should they be added to the mutation table?) In general we'll have to assume singletons from a particular individual are all phased or all unphased. |
I don't think the standard tsdate should mess with anything other than the node and mutation times (and metadata). There's a cleanliness to making sure that the dating algorithm doesn't mess with the number of rows in the tables. But if a pair of same-length arrays You could then have a separate |
Hmm ... so this wouldn't be usable from the command-line, then? It seems more clunky to me to have a series of functions that have to be called in a particular order rather than a single invocation, with a flag I also feel this way about split_disjoint_nodes, which also modifies tables, in that this seems to universally improve the performance of the dating algorithm. So it should really be done by default, not through a separate routine which will (probably) be skipped by many users. |
That's the same for input, right? I think you would input and output the data from/to files or from/to stdin/stdout? But maybe Jerome has a better idea?
There is a "tsdate.pre_process()" routine, which I think is how we should incorporate |
Just to add, you need a way to input the data that is independent of the tree sequence (which assumes phased data), so incorporating singletons via the CLI will always involve messing with extra files. It doesn't seem too bad to me to output extra files in this case too. |
Oh, and one last thought: I suppose that you could have a wrapper function that called all this stuff in the right order. So |
I seem to be unusually fixated on this particular API detail, so I'll work on the lower-level stuff for the moment and we can revisit once we've got an idea of how all this works in practice. But:
I think this is where I'm hung up -- you don't need to supply additional data if the singletons are already in the TS. If they're mapped to a edge, then they're implicitly phased. The phasing could be wrong, of course, so it's worth having a routine to correct the mapping. This seems much cleaner to me than having multiple input/output streams and utility functions, which are confusing to use and maintain. Maybe a compromise here is to assume that only the mutations in the TS are going to be used for dating. So, singletons are included in the TS, regardless of whether they're well-phased or randomly assigned to a branch. But there's a flag to The reason I don't like a separate input/output stream for singleton mutations is because in general, we won't be able to support a mix of phased and unphased mutations in the same individual, with a principled algorithm like the one above. What do you do if the input TS was subset down from tsinfer output, so that there's singletons already in there? Throw an error, in which case we'd need another utility function to strip singleton mutations? Silently assume that they're unphased and lump them in with the additional input stream of singleton mutations? IMO it's nicer to assume the data is wrapped in a single input, and have it so that additional arguments modify the behavior of the algorithm rather than augment the input.
This is a nice idea! |
I guess this is the main problem I have. I'm not convinced we should be adding unphased things to the tree sequence by forcing a random phasing. Or at least, we would need to think hard about how unphased stuff should be represented in a tree sequence. I think @jeromekelleher and/or @petrelharp should comment here though. There might be a way to put a mutation in a table without an associated node (perhaps tskit.NULL in the node field?) so that it doesn't actually change the genotypes. I think this is a much bigger issue though. |
How about we make the internal functions first, and decide about the external API later? The internal function would be based on numpy arrays as input and output, taking an input tree sequence as immutable? So, write the absolute minimal core piece of functionality in a way that is convenient and non-destructive with tests, and then we can talk about the external API more fruitfully I think. I agree with @nspope that it should be easy to do the right things here from a CLI invocation. |
That seems like an excellent plan @jeromekelleher |
Finally got this working... first taking a look at ages of singleton mutations. This is the usual equillibrium population, human-ish parameters setting. We have to use the mutation posterior in this case. I'm comparing against the branch midpoint on the true trees to avoid randomness that we can't hope to infer. Correct vs phase-agnostic, with true topologies: and with inferred topologies: Looks pretty much identical! |
Wow. This is very neat, and convincing too. Does adding unphased singletons make the algorithm more numerically unstable? |
I think I've got the numerics under control (previous issues were a bug). But we'll see, I think the GEL 40K is a good litmus test |
Holy shit, that looks great. |
To test the dating performance on singletons with new phase-agnostic algorithm, I used the 1500 individual subset I made of the 1kgp-chr20p data. That way, I could use the singletons zarr from here which has a column for the true phase. I first had to fix an oversight: there were ~300 sites with two alternatives alleles that are both singletons, so I just kept one of them in each case. import tsdate
import tszip
import zipfile
import os
import xarray as xr
import pandas as pd
import numpy as np
zip_path = 'data/1kgp_bal1500_singletons.zarr.zip'
output_path = os.path.splitext(zip_path)[0]
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
print(f'Unzipping to {output_path}')
zip_ref.extractall(output_path)
ds = xr.open_dataset(output_path, engine='zarr')
ts = tszip.decompress('data/1kgp_bal1500-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)
#Keep just one allele for multi-allelic singletons
dupe_pos = ds['position'].values
duplicate_mask = ~pd.Series(dupe_pos).duplicated(keep='first').values
assert(np.sum(duplicate_mask) == len(np.unique(dupe_pos)))
duplicate_mask_xarray = xr.DataArray(duplicate_mask,
dims=['variants'],
name='duplicate_position_mask')
ds.update({'duplicate_position_mask': duplicate_mask_xarray})
#construct arrays for adding singletons
dedupe_ds = ds.where(ds.duplicate_position_mask, drop=True)
stn_pos = dedupe_ds['position'].values
stn_sample_id = dedupe_ds['sample_id'].values
stn_ancestral = dedupe_ds['allele'][:,0].values
stn_derived = dedupe_ds['allele'][:,1].values
#map sample_ids to individual ids
sample_id_dict = {}
for individual in ts.individuals():
sample_id = individual.metadata['sample_id']
sample_id_dict[sample_id] = individual.id
stn_individ = np.array([sample_id_dict[sample_id] for sample_id in stn_sample_id])
unphased_ts = tsdate.phasing.insert_unphased_singletons(ts,
position = stn_pos,
individual = stn_individ,
ancestral_state = stn_ancestral,
derived_state = stn_derived
) There were 303 339 singletons to add to the ts. The Click to reveal split_disjoint_nodes error# This is unphased_ts run without simplification or splitting disjoint nodes of the ts at the beginning
unphased_ts = unphased_ts.simplify()
unphased_ts = tsdate.util.split_disjoint_nodes(unphased_ts)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[16], [line 2](vscode-notebook-cell:?execution_count=16&line=2)
[1](vscode-notebook-cell:?execution_count=16&line=1) unphased_ts = unphased_ts.simplify()
----> [2](vscode-notebook-cell:?execution_count=16&line=2) unphased_ts = tsdate.util.split_disjoint_nodes(unphased_ts)
File ~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:550, in split_disjoint_nodes(ts, record_provenance)
[541](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:541) node_is_sample = np.bitwise_and(ts.nodes_flags, tskit.NODE_IS_SAMPLE).astype(bool)
[542](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:542) edges_parent, edges_child, nodes_order, split_nodes = _split_disjoint_nodes(
[543](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:543) ts.edges_parent,
[544](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:544) ts.edges_child,
(...)
[547](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:547) node_is_sample,
[548](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:548) )
--> [550](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:550) mutations_node = _relabel_mutations_node(
[551](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:551) ts.mutations_node,
[552](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:552) ts.sites_position[ts.mutations_site],
[553](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:553) nodes_order,
[554](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:554) edges_parent,
[555](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:555) edges_child,
[556](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:556) ts.edges_left,
[557](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:557) ts.edges_right,
[558](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:558) ts.indexes_edge_insertion_order,
[559](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:559) ts.indexes_edge_removal_order,
[560](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:560) )
...
--> [513](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:513) assert nodes_map[mutations_node[m]] != tskit.NULL
[514](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:514) output[m] = nodes_map[mutations_node[m]]
[515](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/duncan/trees/tsinfer/~/trees/tsinfer/env/lib/python3.11/site-packages/tsdate/util.py:515) m += 1
AssertionError: It seems to be a problem with the node id assigned to the singletons that are added, but I haven't looked into it beyond that. Anyway, dating with import logging
logging.getLogger().setLevel(logging.INFO)
unphased_dated_ts = tsdate.date(unphased_ts, mutation_rate=1.29e-08, singletons_phased=False)
I then adapted the convenience function for adding singletons to add singletons with the true phases (0/1) that I stored in the zarr: Click to reveal code to add phased singletonsimport tskit
def insert_phased_singletons(
ts,
position,
individual,
ancestral_state,
derived_state,
phase,
):
"""
Insert phased singletons into the tree sequence.
:param tskit.TreeSequence ts: the tree sequence to add singletons to
:param np.ndarray position: the position of the variants
:param np.ndarray individual: the individual id in which the variant occurs
:param np.ndarray ancestral_state: the ancestral state of the variant
:param np.ndarray derived_state: the derived state of the variant
:param np.ndarray phase: the phase of the variant (0 or 1)
:returns: A copy of the tree sequence with singletons inserted
"""
tables = ts.dump_tables()
sites_id = {p: i for i, p in enumerate(ts.sites_position)}
for pos, ind, ref, alt, phase in zip(
position, individual, ancestral_state, derived_state, phase
):
if pos in sites_id:
if ref != ts.site(sites_id[pos]).ancestral_state:
raise ValueError(
f"Existing site at position {pos} has a different ancestral state"
)
muts = ts.site(sites_id[pos]).mutations
set_time = len(muts) and np.isfinite(muts[0].time)
else:
sites_id[pos] = tables.sites.add_row(position=pos, ancestral_state=ref)
set_time = False
# TODO: more efficient to do in bulk?
site = sites_id[pos]
node = ts.individual(ind).nodes[phase]
time = ts.nodes_time[node] if set_time else tskit.UNKNOWN_TIME
tables.mutations.add_row(
site=site,
node=node,
time=time,
derived_state=alt,
)
tables.sort()
tables.build_index()
tables.compute_mutation_parents()
return tables.tree_sequence()
phased_ts = insert_phased_singletons(
ts,
position=stn_pos,
individual=stn_individ,
ancestral_state=stn_ancestral,
derived_state=stn_derived,
phase=stn_phase,
)
phased_ts Dating with phased_dated_ts = tsdate.date(phased_ts, mutation_rate=1.29e-08, singletons_phased=True)
Finally, I used I then generated a random Bernoulli vector to use for the phase and ran dating with Click to reveal code to reproduce the plots above:from tsdate import evaluation
#True phase vs phase-agnostic
evaluation.mutations_time(
phased_dated_ts,
unphased_dated_ts,
use_posterior=False,
min_freq=None,
max_freq=1,
plotpath='results/1500_subset_singletons.png',
title='1500 subset singletons: true phase vs phase-agnostic dating',
subtending_node=False,
)
#True phase vs random phase
np.random.seed(42)
stn_phase_random = np.random.randint(2, size=len(stn_pos))
random_phase_ts = insert_phased_singletons(
ts,
position=stn_pos,
individual=stn_individ,
ancestral_state=stn_ancestral,
derived_state=stn_derived,
phase=stn_phase_random,
)
random_phase_dated_ts = tsdate.date(random_phase_ts, mutation_rate=1.29e-08, singletons_phased=True)
evaluation.mutations_time(
phased_dated_ts,
random_phase_dated_ts,
use_posterior=False,
min_freq=None,
max_freq=1,
plotpath='results/1500_subset_singletons_random.png',
title='1500 subset singletons: true phase vs random phase',
subtending_node=False,
) Qualitatively, they seem similar to the simulation results, just with a tonne more variance. @nspope, is the high proportion of variants with invalid posteriors also surprising to you? If you'd like to investigate it further, here is the TS with the correctly phased singletons added. Let me know if there are other files you want: 1kgp_bal1500_phased_singletons.trees.zip I'm going to test it in GEL next. |
Just remembered that I can't install from GitHub in GEL anymore, so I unfortunately can't test it until they approve the Airlock transfer of this branch of tsdate, annoyingly! I just submitted the request, so it should be sorted within two days. |
Thanks @duncanMR, very helpful and a lot to chew on.
could you check the count of nonsegregating mutations (above the root)? ideally those'll be the only ones where we can't assign a posterior. |
I see: the number of mutations above root matches the number with invalid posteriors. So it's the root-to-sample edges problem rearing it's head again! It's concerning how many there are here. |
Oh -- I think I missed the discussion of this previously. What is going on? Are these not simply mutations that are segregating in the full data but fixed in the subset? |
Woops, just realised my mistake. The problem is simply that many of the singletons I extracted fall outside the range of site positions in the TS, because I extracted them directly from the zarr and forgot to limit the positions by the site density mask. So mutations were being placed onto nodes in the first and last null trees in the TS that cover the range of the genome that's been masked out. So it's nothing to worry about, sorry for the false alarm! |
Aha - of course! Easy mistake to make. We should catch this and error out, I think. |
Do you think this is a problem @hyanwong? It's not unusual to have some "above-the-root" mutations introduced by tsinfer (why, I don't know). |
@duncanMR two things that occurred to me -- we should be using mutation posteriors here; as that takes a weighted average of the two branch midpoints (posterior mean) rather than always choosing the longest branch (MAP estimate). I'd also be interested in seeing the parent node age-- an upper bound on mutation age may be more of interest if we're focusing on old singletons |
No, I mean we should error out if if singletons are being added into regions where there are no edges |
I assume the "above the root" mutations are added by parsimony, when e.g. we force the wrong ancestral state for a triallelic site |
Cool, I'll look into that today. I've also got tsdate through airlock so I'll test in GEL today too.
Yeah, above-the-root mutations are extremely rare in practice: I've only seen a handful of them in the largest tree sequences. Hopefully we can figure out a solution in the long-term with my DPhil project. |
When adding singletons in GeL that Sam extracted, there are sites where the alternative allele (e.g. A) is a singleton, but there is another alternate allele (e.g. T) at the same site that is at a higher frequency and already in the TS. Do we want to exclude these singletons from being added? It seems a bit inconsistent to exclude multi-allelics from inference, but allow multi-allelics if one of the alleles is a singleton. There are quite a few: 9250 sites have this problem in chr17p alone. |
@nspope Adding singletons is failing for chr17p in GEL. It's a bit annoying that it didn't show up in the 1kgp data, since I can't share a reproducible example. It looks like a numerical issue: a mutation phase was generated outside [0,1]. I can try and set up a debugger to find out what the problematic phase is. |
@duncanMR thanks Duncan. I can probably just clamp values to that range, if it's in the range of fp error. Would you mind inserting a print statement to see how far out of wack it is? |
It turns out that it's not an issue where the phase is within fp error of 0 or 1. After updating my singletons filtering procedure, I now have 1 million singletons to add to the ~1million variants already in the chr17p TS. Of those, there are three singletons which for some reason have invalid posteriors, but they aren't above the root this time and aren't in null trees. They are on short branches at the bottom of trees. I first tried removing the three sites, but bizarrely, five more sites popped up with invalid posteriors. The quick fix was easy: the bug is that the invalid posteriors cause there to be three @nspope I will add some comments to the PR where I've had issues so far. |
We don't ignore multiallelics in tsdate though, do we? True, they aren't used the help tsinfer inference, but once they are placed back on the ts by parsimony, they can be used for dating. I don't know if we do this for the GEL data though: it could be that Ben excludes them from the pipeline entirely. |
My impression was that we exclude all multi-allelic sites instead of just keeping one allele at random, but we can check with Ben later. |
We don't need to keep one at random: tsinfer can place non-inference sites back into the tree sequence using parsimony. They will then be used by tsdate. It might be that multiallelics are discarded by the snakemake pipeline before tsinfer actually sees them, and are not placed back into the tree sequence using parsimony. There is an argument that perhaps we should (or if not, we should adjust the mutation rate to take into account the fact that some sites have been discarded) |
That's a good point. We definitely aren't placing any multi-allelics by parsimony at the moment (there is one mutation per site) but I don't know if they are available in the zarr to be added later. If that's the case, then I think adding back all the multi-allelics (not just singletons) makes sense. |
I suspect they are in the zarr but masked out. @benjeffery will know. We can ask at 4.30 today. |
After adding and dating the unphased singletons to 1kgp-chr20p, the upper bound estimates (parent time) of singleton ages have a much lighter tail than posterior means of higher frequency variants. Seems reasonable to me: Click to reveal code to reproduce the plot above:import numpy as np
import tsdate
import logging
import tszip
import pandas as pd
#generated elsewhere
singletons_df = pd.read_csv('data/unphased_chr20p_singletons.csv')
ts = tszip.decompress('data/1kgp_all-chr20p-filterNton23-truncate-0-0-0-mm0-post-processed.trees.tsz')
ts = ts.simplify()
ts = tsdate.util.split_disjoint_nodes(ts)
singletons_df.drop_duplicates(subset=['position'], keep='first', inplace=True)
left = min(ts.sites_position)
right = max(ts.sites_position)
singletons_df = singletons_df[(singletons_df.position >= left) & (singletons_df.position <= right)]
stn_pos = singletons_df['position'].values
stn_sample_id = singletons_df['sample_id'].values
stn_ancestral = singletons_df['ref'].values
stn_derived = singletons_df['alt'].values
sample_id_dict = {}
for individual in ts.individuals():
sample_id = individual.metadata['sample_id']
sample_id_dict[sample_id] = individual.id
stn_individ = np.array([sample_id_dict[sample_id] for sample_id in stn_sample_id])
#insert singletons and date
singletons_ts = tsdate.phasing.insert_unphased_singletons(ts,
position = stn_pos,
individual = stn_individ,
ancestral_state = stn_ancestral,
derived_state = stn_derived
)
logging.getLogger().setLevel(logging.INFO)
singletons_dated_ts = tsdate.date(singletons_ts, mutation_rate=1.29e-08, singletons_phased=False)
stn_site_id = np.where(np.isin(ts.sites_position, stn_pos))[0]
assert(len(stn_site_id) == len(stn_pos))
stn_node = np.full(len(stn_site_id), -1)
stn_post_time = np.full(len(stn_site_id), np.nan)
stn_parent_time = np.full(len(stn_site_id), np.nan)
tree = ts.first()
for index, site_id in enumerate(stn_site_id):
site = ts.site(site_id)
pos = site.position
assert(len(site.mutations) == 1)
mut = site.mutations[0]
stn_post_time[index] = mut.metadata['mn']
child = mut.node
tree.seek(pos)
parent = tree.parent(mut.node)
stn_parent_time[index] = tree.time(parent)
non_stn_pos = ts.sites_position[~np.isin(ts.sites_position, stn_pos)]
non_stn_site_id = np.where(np.isin(ts.sites_position, non_stn_pos))[0]
non_stn_post_time = np.full(len(non_stn_site_id), np.nan)
for index, site_id in enumerate(non_stn_site_id):
site = ts.site(site_id)
pos = site.position
assert(len(site.mutations) == 1)
mut = site.mutations[0]
non_stn_post_time[index] = mut.metadata['mn']
def compute_cdf(data):
sorted_data = np.sort(data)
cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
return sorted_data, cdf
stn_post_time_sorted, stn_post_time_cdf = compute_cdf(stn_post_time)
stn_parent_time_sorted, stn_parent_time_cdf = compute_cdf(stn_parent_time)
non_stn_post_time_sorted, non_stn_post_time_cdf = compute_cdf(non_stn_post_time)
plt.plot(stn_post_time_sorted, stn_post_time_cdf, label='singletons: posterior time')
plt.plot(stn_parent_time_sorted, stn_parent_time_cdf, label='singletons: parent time')
plt.plot(non_stn_post_time_sorted, non_stn_post_time_cdf, label='all other variants: posterior time')
plt.xscale('log')
plt.title('CDFs for singletons vs other variants in 1kgp-chr20p-all')
plt.xlabel('mutation time')
plt.ylabel('CDF')
plt.legend()
plt.grid(True)
plt.show() I'm going to look at the 1kgp annotations dataframe next. |
In case anyone needs them, I fixed the problem with the singletons being placed outside the range of the TS. Here is the undated TS with phased singletons added: 1kgp_bal1500_phased_singletons.trees.zip The dated TS is too big for Github so you can get it here: https://drive.google.com/drive/folders/1vn5xMlBBXTB-_sosX9LpaQ_45E36UVoa?usp=drive_link |
Using a rough analysis, I reckon our current singleton phasing approach misses about 25% of the information we could incorporate if we also added haplotype lengths. See #5 (comment) I don't think this is enough to warrant modifying out existing approach, but something like this might make an interesting observation in the paper discussion. |
Closing in favour of #446 |
It'll frequently be the case that singletons are not phased. Singleton variants aren't used by tsinfer, but are crucial for correct dating. Here's how we'll handle unphased singletons:
Split edges leading to the same individual into "phase blocks" -- e.g. the intersection of intervals for these edges. For example, over some region of the genome the pair of edges leading to the same individual might have coordinates$[l_i, r_i]$ and $[l_j, r_j]$ . The phase block for these two edges would have coordinates $[\max(l_i, l_j), \min(r_i, r_j)]$ .
Use an expectation propagation update by phase block rather than separately for the two constituent edges. For example, say we have a phase block where the parents of the constituent edges have ages$t_i$ and $t_j$ . The children of the edges are both present-day haplotypes (of age 0). If the phase block has (mutation-weighted) span $\mu$ and contains $y$ singleton mutations, then the total mutational target is $\lambda_{ij} = \mu (t_i + t_j)$ and the mutation likelihood is $y \sim \mathrm{Poisson}(\lambda_{ij})$ .
So, if we have variational posteriors$t_i \sim \mathrm{Gamma}(\alpha_i, \beta_i), t_j \sim \mathrm{Gamma}(\alpha_j, \beta_j)$ , the partition function used in the expectation propagation update for the phase block is,
$$\int_0^{\infty} \int_0^{\infty} t_i^{\alpha_i - 1} e^{-\beta_i t_i} t_j^{\alpha_j - 1} e^{-\beta_j t_j} (t_i + t_j)^{y} e^{-\mu (t_i + t_j)} dt_i dt_j = \mathcal{B}(\alpha_i, \alpha_j) \frac{\Gamma(\alpha_i + \alpha_j + y)}{(\beta_i + \mu)^{\alpha_i + \alpha_j + y}} {_2} F_1 (\alpha_j, \alpha_i + \alpha_j + y; \alpha_i + \alpha_j; 1 - \frac{\beta_j + \mu}{\beta_i + \mu})$$ and we get the updated moments of $t_i, t_j$ by differentiating this wrt $\alpha, \beta$ as usual.
This gets substantially more complicated if the individual in question is not in the present day (especially if the age of the individual is unknown). So for now we'll have to assume that mutations on ancient samples are phased (I think these are often pseudohaploid sequences anyways).
The text was updated successfully, but these errors were encountered: