Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

from_ms raises dendropy.utility.error.UltrametricityError #30

Open
grahamgower opened this issue Jul 10, 2021 · 6 comments
Open

from_ms raises dendropy.utility.error.UltrametricityError #30

grahamgower opened this issue Jul 10, 2021 · 6 comments

Comments

@grahamgower
Copy link
Member

$ ms 3 1 -T -r 5 6 -seeds 1 2 3
ms 3 1 -T -r 5 6 -seeds 1 2 3
1 2 3

//
[2](2:0.293,(1:0.214,3:0.214):0.079);
[4](3:0.214,(1:0.077,2:0.077):0.138);
$ ms 3 1 -T -r 5 6 -seeds 1 2 3 | python -c "import tsconvert, sys; tsconvert.from_ms(sys.stdin.read())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/grg/.local/lib/python3.9/site-packages/tsconvert/newick.py", line 79, in from_ms
    tree.calc_node_ages()
  File "/home/grg/.local/lib/python3.9/site-packages/dendropy/datamodel/treemodel.py", line 5672, in calc_node_ages
    raise error.UltrametricityError(
dendropy.utility.error.UltrametricityError: Tree is not ultrametric within threshold of 1e-05: 0.0010000000000000286.
Encountered in subtree of node <Node object at 0x7f9376295760: 'None' (None)> (edge length of
None):

    (3:0.214,(1:0.077,2:0.077):0.138)

Age of children:
-   <Node object at 0x7f9376295790: 'None' (<Taxon 0x7f9376295c70 '3'>)>: has age of 0.0 and edge length of 0.214, resulting in parent node age of 0.214
-   <Node object at 0x7f9376295730: 'None' (None)>: has age of 0.077 and edge length of 0.138, resulting in parent node age of 0.21500000000000002

It seems the precision of the branch lengths are not sufficient. The "precision" -p parameter has no effect on this.

$ ms 3 1 -T -r 5 6 -seeds 1 2 3 -p 10
ms 3 1 -T -r 5 6 -seeds 1 2 3 -p 10
1 2 3

//
[2](2:0.293,(1:0.214,3:0.214):0.079);
[4](3:0.214,(1:0.077,2:0.077):0.138);
@jeromekelleher
Copy link
Member

Sigh - I think this basically means conversion will never work for ms. Unless we have sufficient precision on the branch lengths the whole strategy fails because we rely on being able to uniquely identify nodes by their times. If we only have three digits of precision, then that's not going to work.

@grahamgower
Copy link
Member Author

Yeah, it does seem problematic. Annoyingly, the ms sources have this hardcoded to 3 decimal places. In most cases, I think mspms -p xx is a pretty reasonable substitute, which applies the precision to all output. Diff below for when you really need to test with ms. But as you say, there's no solution if you already have a ton of ms output laying around and want to do something useful with it.

$ diff -u ms.c.bak ms.c
--- ms.c.bak    2021-07-12 08:28:08.726221170 +0200
+++ ms.c        2021-07-12 08:28:12.682762440 +0200
@@ -919,7 +919,7 @@
        double time ;

    if( descl[noden] == -1 ) {
-       printf("%d:%5.3lf", noden+1, (ptree+ ((ptree+noden)->abv))->time  - (ptree+noden )->time ); /* adna */
+       printf("%d:%5.13lf", noden+1, (ptree+ ((ptree+noden)->abv))->time  - (ptree+noden )->time ); /* adna */
        }
    else{
        printf("(");
@@ -929,7 +929,7 @@
        if( (ptree+noden)->abv == 0 ) printf(");\n");
        else {
          time = (ptree + (ptree+noden)->abv )->time - (ptree+noden)->time ;
-         printf("):%5.3lf", time );
+         printf("):%5.13lf", time );
          }
         }
 }

@benjeffery
Copy link
Member

we rely on being able to uniquely identify nodes by their times

Maybe I'm missing something unique regarding converting these ms newicks, but the new newick module based parser makes no such constraint on node times, it should be possible to use it to parse these?

@benjeffery benjeffery added this to the Initial 0.1 release milestone Jul 18, 2021
@jeromekelleher
Copy link
Member

This is a separate issue to the ultrametricity thing @benjeffery.

The assumption we're making about nodes (in the tree sequence sense) in ms output @benjeffery is that they're uniquely identified by their times. There can only be one coalescence event at one time in the coalescent, so this is a good assumption (for ms at least, probably not for other programs using ms output). If the branch lenghts aren't output with high precision, then this won't work. We'll have no way of identifying the tree sequence nodes in different trees, so we'd end up with a JBOT (if we supported it).

@benjeffery
Copy link
Member

Yuck! I get it now. Not much point in putting any of effort into from_ms then.

@grahamgower
Copy link
Member Author

Well, ms output may be problematic, but there are a range of other programs that output ms format. In particular, mspms -p 12 ... seems to work just fine. I guess that scrm output would also be usable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants