Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update ibd matrix based on a relevant subset of ancestors #2

Open
bguo068 opened this issue Dec 8, 2022 · 1 comment
Open

update ibd matrix based on a relevant subset of ancestors #2

bguo068 opened this issue Dec 8, 2022 · 1 comment

Comments

@bguo068
Copy link
Owner

bguo068 commented Dec 8, 2022

Adjacent trees are highly correlated, with the difference being a few deleted/added edges. From one tree to the next, we only need to update the IBD matrix between pairs of leaves of subtrees rooted at different child nodes of the parent node of a deleted edge.
From the Kelleher et al 2016 paper, fig4, we can sort the edge table by the 'right' coordinate and obtain which edges are deleted. We collect all 'parent' nodes of the delete edges and only update IBD matrix elements due to the changes of these nodes.

This should also work when we are only compare trees at two adjacent sampling points. In this case, we can multiple local trees between the sampling points, we update the IBD matrix related to parents nodes that have changed at any tree iteration between the two sampling points.

@bguo068
Copy link
Owner Author

bguo068 commented Dec 8, 2022

import msprime
import numpy as np
import pandas as pd

ts = msprime.sim_ancestry(
    samples=1000, # diploids
    recombination_rate=1e-8,
    sequence_length=100 * 1e6,
    population_size=10_000,
)

# turn edge table to pandas dataframe
d = ts.tables.edges.asdict()
d.pop('metadata')
d.pop("metadata_offset")
d.pop("metadata_schema")
et = pd.DataFrame(d)

# sort edge table according to 'right' (the order of edges being deleted)
# during tree iteration
et1=et.sort_values(['right', 'parent'])

# get a list tree end positions
ends = np.sort(et.right.unique())

# a total number of relevant parent nodes between sampling points
# separated by 30 local trees. This is equivalent to 0.01 cM step size.
sz_lst = []
for i in range(len(ends) - 40):
    sz = et1.loc[(et1.right>=ends[i]) & (et1.right<ends[i+30]) ,'parent'].unique().size
    sz_lst.append(sz)
print(np.unique(sz_lst, return_counts=True))

# 
# (array([39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
#         56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
#         73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]),
#  array([    3,     3,     3,    10,    18,    56,    79,   124,   235,
#           397,   692,  1081,  1572,  2269,  2981,  4115,  5378,  7095,
#          8614, 10861, 12873, 14811, 16791, 18457, 19653, 20746, 20892,
#         20197, 19215, 17390, 15852, 13518, 11430,  9234,  7270,  5298,
#          3999,  2819,  1859,  1197,   713,   436,   226,   121,    44,
#            20,     4,     1]))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant