Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat : add reciprocal monophyly function to Tree class #2974

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Silloky
Copy link

@Silloky Silloky commented Jul 24, 2024

Description

This PR adds a method to the Tree class in the Python API : is_recip_monophylectic().
It tests the current tree for reciprocal monophyly, a population genomics phenomenon as described here.

The key algorithm can be summarized like this : if each population's MRCA's children are the same as the all the population's leaves, all populations are reciprocally monophyletic.

PR Checklist:

  • Tests that fully cover new/changed functionality (tried, but I was faced with ModuleNotFoundError: No module named '_tskit' when importing tskit from my cloned fork)
  • Documentation including tutorial content if appropriate. (code comments done, should I add to the official docs ?)

Also, this is only implemented in Python, not C, so someone needs to do that as I do not know C.

Copy link

codecov bot commented Jul 24, 2024

Codecov Report

Attention: Patch coverage is 6.66667% with 14 lines in your changes missing coverage. Please review.

Project coverage is 90.45%. Comparing base (beafeba) to head (3bafdbf).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2974      +/-   ##
==========================================
+ Coverage   89.62%   90.45%   +0.82%     
==========================================
  Files          29       26       -3     
  Lines       30170    21619    -8551     
  Branches     5867     4372    -1495     
==========================================
- Hits        27041    19555    -7486     
+ Misses       1790     1160     -630     
+ Partials     1339      904     -435     
Flag Coverage Δ
c-tests 86.20% <ø> (ø)
lwt-tests ?
python-c-tests ?
python-tests 98.82% <6.66%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
python/tskit/trees.py 98.20% <6.66%> (-0.59%) ⬇️

... and 3 files with indirect coverage changes

@benjeffery
Copy link
Member

ModuleNotFoundError: No module named '_tskit' have you run make in the python directory?

@jeromekelleher
Copy link
Member

See the docs here for details on building python extension module: https://tskit.dev/tskit/docs/stable/development.html#id1

@jeromekelleher
Copy link
Member

jeromekelleher commented Jul 24, 2024

Here's a quick sketch of another approach to doing this @Silloky:

import msprime

demography = msprime.Demography()
demography.add_population(name="A", initial_size=1_000)
demography.add_population(name="B", initial_size=1_000)
demography.add_population(name="C", initial_size=10_000)
demography.add_population_split(time=10000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(samples={"A": 2, "B": 3}, demography=demography, random_seed=12)
print(ts)
print(ts.draw_text())

for pop_id in range(ts.num_populations):
    pop_samples = ts.samples(population=pop_id)
    if len(pop_samples) == 0:
        continue
    for tree in ts.trees(tracked_samples=pop_samples):
        mrca = tree.mrca(*pop_samples) # Inefficient
        monophyletic = len(pop_samples) == tree.num_tracked_samples(mrca)
        print(pop_id, mrca, monophyletic)                                                           

Here, we're computing a true/false value for whether a population (or more generally any sample subset) is monophyletic in each tree. We do this using the num_tracked_samples, which is efficiently computed for each node in the C layer. The only inefficient bit here is how we're computing the MRCA of all the samples (but in most cases would be fine).

I think a good API for this would be a TreeSequence method that looked something like this:

def is_monophyletic(self, samples):
     """
     Returns a boolean array of length ts.num_trees which is True if the corresponding tree is monophyletic
     for the given set of samples. That is, if ``u`` is the MRCA of all the specified samples in a given tree, no 
     other samples are present in the subtree rooted at u.
     """
     # Code that looks like the above.

How does this look to you? If the API matches your needs, then we can go ahead with implementation and testing, and then think about how we can do the MRCA more efficiently (if we need to)?

@Silloky
Copy link
Author

Silloky commented Jul 28, 2024

Hi, thank you very much for your answer.
Indeed, this looks more promising and definitely more efficient.
I don't really have time now to look into it sadly... but I will try to implement your suggestions in August.
Have a nice day and a nice summer !

@benjeffery
Copy link
Member

Hi @Silloky! Are you still planning to work on this?

@Silloky
Copy link
Author

Silloky commented Sep 29, 2024

Hi @benjeffery,
Yes, I am ! I do realise it's been a while, but I've been quite busy...
I plan to implement it in the next few weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants