Importing large newick files #36

We seem to take an inordinately long time to read large newick files. If we are claiming to be useful for phylogeneticists with big trees (like those produced from Covid studies), I think we should be able to do this a lot more quickly. Here's an example on my laptop with a 30MB tree of 2.3 million samples from https://files.opentreeoflife.org/synthesis/opentree13.4/output/labelled_supertree/index.html (see labelled_supertree.tre): it takes 16 minutes.

It would be a good showcase to be able to do this in a few seconds (should be possible with disk IO at 30MB/s, right?).

Comments
P.S. I'm not sure the code above is right, though. From the docs it seems like …
What's the timing for ingesting this into R (using ape)?
I haven't dared try it yet. Dendropy only takes 3 minutes now (it has been a while since I tested it, and both my computer and the parser seem to be faster): …
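(The harness behind that number isn't shown above; something like the following, with a hypothetical `big_tree.nwk` path, is the kind of measurement meant:)

```python
import time
import dendropy

start = time.perf_counter()
# Parse the whole newick file into a dendropy Tree object.
tree = dendropy.Tree.get(path="big_tree.nwk", schema="newick")
print(f"dendropy: {time.perf_counter() - start:.1f} s, "
      f"{len(tree.leaf_nodes())} leaves")
```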
8 seconds in R!! That's more like it.
Dendropy is slow. We could consider doing a numba-based newick parser in tsconvert, but it would probably be quite icky.
Indeed. But still 5 times faster than what tsconvert currently uses, apparently.
Yes, it's not obvious. I wonder how R does it so quickly? Is there C code underlying it? I'm half inclined, when I have time, to write a basic parser in C that only deals with the very simplest newick format (non-quoted strings, perhaps even only allowing numbers as node identifiers), just to see how fast we can get.
Yes, and it's very ugly and tricky C. Writing a fast and general newick parser is much harder than you think...
I can believe that. But I also think it would be easy to write a non-general one that only accepted digits, …
(Also, perhaps I could just steal the R one and list it as demonstration code somewhere. Or use the Nexus Class Library: https://github.com/mtholder/ncl)
Another data point, using the C++ Nexus Class Library (NCL): …
Part of the reason I don't want to get into this is that it's supposed to be a one-off cost. Effort would be much better spent in convincing people that it's worth doing the conversion (i.e., with applications) than making the conversion faster. I really don't want to spend a month working on newick parsers, and that's what it would entail.
Yes, ISWYM. On the other hand, my use case is that each time I run my pipeline it is with a slightly different version of a large newick tree, so it's not at all one-off for me. Having said that, the other stages of the pipeline completely dwarf the time spent reading in the file (bunzipping a set of 3GB compressed files, for a start).
I guess the issue is that if you're doing some analysis over many trees/trials with a pipeline where you have to dump to newick, then it isn't a one-off cost. Looking at the R parser, they do a load of sanitisation on the string in R: https://github.com/cran/ape/blob/10898aebdf6661a0b81ba21bf24969336b544a60/R/read.tree.R#L10
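(For comparison, a rough Python analogue of that kind of pre-pass; the function name is made up and the regexes are my guess at the equivalent cleanup, not a port of ape's code:)

```python
import re

def sanitise_newick(s: str) -> str:
    # Guessing at ape-style cleanup: newick comments live in square
    # brackets, and whitespace is insignificant outside quoted labels
    # (quoting is ignored in this sketch).
    s = re.sub(r"\[[^\]]*\]", "", s)  # drop [comments]
    return re.sub(r"\s+", "", s)      # drop all whitespace
```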
Yep, e.g. for Covid trees, I assume that you might want to read the latest daily update to the tree before doing analysis. |
We're slower than we need to be, as we currently load the whole tree and then walk it to build the tskit nodes and edges. I think we could get "good enough" by doing a Python/numba implementation as JK suggests, following the R example of sanitising and then parsing.
So I couldn't help having a little play with this to see how Python would perform. Converting the newick string to a uint8 byte array and then processing with …

I don't expect the label or branch-length processing to make this much slower, maybe a couple of multiples.
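(The code from that experiment wasn't captured in this thread, so here is a minimal sketch of the uint8-plus-numba idea under stated assumptions: topology only, a restricted grammar with unquoted labels, no comments, and valid input; labels and branch lengths are skipped rather than stored.)

```python
import numpy as np
from numba import njit

OPEN, CLOSE, COMMA, SEMI = 40, 41, 44, 59  # ord("("), ord(")"), ord(","), ord(";")

@njit
def parse_topology(buf):
    # One pass over the uint8 buffer, building a parent array.
    # Internal nodes are numbered in "(" order, leaves in the order
    # their labels appear; parent[root] stays -1.
    n = buf.shape[0]
    parent = np.full(n, -1, dtype=np.int64)  # over-allocated, trimmed below
    stack = np.empty(n, dtype=np.int64)      # currently open internal nodes
    depth = 0
    n_nodes = 0
    i = 0
    while i < n:
        c = buf[i]
        if c == OPEN:  # start of a new internal node
            if depth > 0:
                parent[n_nodes] = stack[depth - 1]
            stack[depth] = n_nodes
            depth += 1
            n_nodes += 1
            i += 1
        elif c == CLOSE:  # close the current internal node
            depth -= 1
            i += 1
            # skip the internal node's label/branch length, if any
            while i < n and buf[i] != OPEN and buf[i] != CLOSE \
                    and buf[i] != COMMA and buf[i] != SEMI:
                i += 1
        elif c == COMMA:
            i += 1
        elif c == SEMI:
            break
        else:  # start of a leaf label: record the leaf, skip to a delimiter
            parent[n_nodes] = stack[depth - 1]
            n_nodes += 1
            while i < n and buf[i] != OPEN and buf[i] != CLOSE \
                    and buf[i] != COMMA and buf[i] != SEMI:
                i += 1
    return parent[:n_nodes]

buf = np.frombuffer(b"((1:0.1,2:0.2)4:0.3,3:0.4);", dtype=np.uint8)
print(parse_topology(buf))  # [-1  0  1  1  0]
```

The same pass could record the label and branch-length spans into side arrays, which is presumably where the "couple of multiples" above would go.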
This is interesting, using pure Python (no …)
Very nice @benjeffery, that's impressive! I guess if we're avoiding creating a bunch of objects along the way and creating numpy arrays instead, there is a significant gain to be made. I'm happy to help get this into tsconvert.
I have a slightly odd request: could we also have a method to import the newick into a table collection, rather than a tree sequence? This would allow zero-length (and even negative!) "branch lengths", which would otherwise violate the TS requirements. Then I can go through and spot these weird edge cases, turn them into epsilons, and leave comments in the node metadata.
I think that's a good thing to have!
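(A rough sketch of what returning tables instead of a tree sequence might look like; the function name and the parent/time inputs are hypothetical, and node times are assumed to have already been computed from the branch lengths:)

```python
import numpy as np
import tskit

def tables_from_parent_array(parent, time, sequence_length=1.0):
    # Build an unvalidated TableCollection: zero or negative branch
    # lengths survive here and can be patched up (nudged by an epsilon,
    # with a note in the node metadata) before tables.tree_sequence(),
    # which is where tskit's requirements are actually enforced.
    parent = np.asarray(parent)
    is_leaf = np.ones(len(parent), dtype=bool)
    is_leaf[parent[parent >= 0]] = False  # anything that is a parent isn't a leaf
    tables = tskit.TableCollection(sequence_length=sequence_length)
    for u in range(len(parent)):
        tables.nodes.add_row(
            flags=tskit.NODE_IS_SAMPLE if is_leaf[u] else 0, time=time[u]
        )
    for child in range(len(parent)):
        if parent[child] >= 0:
            tables.edges.add_row(0, sequence_length, parent[child], child)
    return tables
```

Once the odd lengths have been fixed, `tables.sort()` followed by `tables.tree_sequence()` is where the tree sequence requirements would get checked.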