Importing large newick files #36

Open
hyanwong opened this issue Mar 25, 2022 · 20 comments

Comments

@hyanwong
Member

hyanwong commented Mar 25, 2022

We seem to take an inordinately long time to read large newick files. If we are claiming to be useful for phylogeneticists with big trees (like those produced from COVID studies), I think we should be able to do this a lot more quickly. Here's an example on my laptop of a 30 MB tree with 2.3 million samples from https://files.opentreeoflife.org/synthesis/opentree13.4/output/labelled_supertree/index.html (see labelled_supertree.tre): it takes 16 minutes.

>>> start = datetime.datetime.now(); ts=tsconvert.newick.read("labelled_supertree.tre"); print(datetime.datetime.now() - start)
0:16:08.770515

It would be a good showcase to be able to do this in a few seconds (it should be possible with disk I/O at 30 MB/s, right?)

@hyanwong
Member Author

P.S. I'm not sure the code above is right, though. From the docs it seems like tsconvert.from_newick takes a string, which wouldn't be sensible here. But tsconvert.newick.read("labelled_supertree.tre") returns a list of node objects, not a tree sequence.
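For the record, something like this is presumably the intended usage, assuming tsconvert.from_newick takes the newick string itself rather than a file path:

# Read the whole file into memory and hand the string to from_newick.
import tsconvert

with open("labelled_supertree.tre") as f:
    ts = tsconvert.from_newick(f.read())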

@jeromekelleher
Member

What's the timing for ingesting this into R (using ape)?

@hyanwong
Member Author

hyanwong commented Mar 25, 2022

Haven't dared try it yet. Dendropy only takes 3 minutes now (it has been a while since I tested it, and both my computer and the parser seem to be faster):

>>> start = datetime.datetime.now(); tree = dendropy.Tree.get(schema="newick", path="labelled_supertree.tre"); print(datetime.datetime.now() - start)
0:03:05.524834

@hyanwong
Member Author

What's the timing for ingesting this into R (using ape)?

8 seconds in R!! That's more like it.

ptm <- proc.time(); read.tree(file="/Users/yan/Downloads/opentree13.4_tree/labelled_supertree/labelled_supertree.tre"); proc.time() - ptm

Phylogenetic tree with 2392042 tips and 291478 internal nodes.

Tip labels:
  ott2, ott6, ott22687, ott471450, ott16, ott471456, ...
Node labels:
  ott93302, ott304358, mrcaott2ott3973, mrcaott2ott276, mrcaott2ott148, mrcaott2ott142555, ...

Unrooted; no branch lengths.
   user  system elapsed 
  8.549   0.276   9.241 

@jeromekelleher
Member

Dendropy is slow.

We could consider doing a numba-based newick parser in tsconvert, but it would probably be quite icky.

@hyanwong
Member Author

Dendropy is slow.

Indeed. But still 5 times faster than what tsconvert currently uses, apparently.

We could consider doing a numba-based newick parser in tsconvert, but it would probably be quite icky.

Yes, it's not obvious.

I wonder how R does it so quickly? Is there C code underlying it?

I'm half inclined, when I have time, to write a basic parser in C that only deals with the very simplest newick format (non-quoted strings, perhaps even only allowing numbers as node identifiers), just to see how fast we can get.

@jeromekelleher
Member

I wonder how R does it so quickly? Is there C code underlying it?

Yes, and it's very ugly and tricky C. Writing a fast and general newick parser is much harder than you think...

@hyanwong
Member Author

hyanwong commented Mar 25, 2022

I can believe that. But I also think it would be easy to write a non-general one that only accepted digits and the characters ( ) , : ;.
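For illustration, a single-pass parser over that restricted grammar could look something like the Python sketch below (a C version would follow the same structure). The name parse_simple_newick is hypothetical, and the sketch assumes digit-only labels, float branch lengths, and no quoting:

# Minimal sketch: digit labels, float branch lengths and ( ) , : ; only.
def parse_simple_newick(s):
    labels, parents, lengths = [], [], []
    stack = []       # indices of the currently open internal nodes
    current = None   # index of the most recently completed node
    i, n = 0, len(s)

    def new_node():
        labels.append("")
        parents.append(stack[-1] if stack else -1)
        lengths.append(0.0)
        return len(labels) - 1

    while i < n:
        c = s[i]
        if c == "(":
            stack.append(new_node())   # open a new internal node
            i += 1
        elif c == ",":
            current = None
            i += 1
        elif c == ")":
            current = stack.pop()      # close the innermost internal node
            i += 1
        elif c == ":":
            j = i + 1
            while j < n and (s[j].isdigit() or s[j] in ".eE+-"):
                j += 1
            lengths[current] = float(s[i + 1:j])
            i = j
        elif c == ";":
            break
        elif c.isdigit():
            if current is None:        # a digit after ( or , starts a leaf
                current = new_node()
            j = i
            while j < n and s[j].isdigit():
                j += 1
            labels[current] = s[i:j]   # after ) this labels the closed node
            i = j
        else:
            i += 1                     # skip anything else, e.g. whitespace
    return labels, parents, lengths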

@hyanwong
Member Author

(also, perhaps I could just steal the R one, and list it as demonstration code somewhere. Or use the Nexus Class Library https://github.com/mtholder/ncl)

@hyanwong
Member Author

Another data point, using the C++ Nexus Class Library (NCL):

> ptm <- proc.time(); a <- rncl(file="/Users/yan/Downloads/opentree13.4_tree/labelled_supertree/labelled_supertree.tre", file.format="newick"); proc.time() - ptm
   user  system elapsed 
 29.554   1.576  35.070 

@jeromekelleher
Member

Part of the reason I don't want to get into this is that it's supposed to be a one-off cost. Effort would be much better spent on convincing people that the conversion is worth doing (i.e., with applications) than on making the conversion faster.

I really don't want to spend a month working on newick parsers, and that's what it would entail.

@hyanwong
Member Author

Yes, I see what you mean. On the other hand, my use case is that each time I run my pipeline it is with a slightly different version of a large newick tree, so it's not at all a one-off for me.

Having said that, the other stages of the pipeline completely dwarf the time spent reading in the file (bunzipping a set of 3 GB compressed files, for a start).

@benjeffery
Member

I guess the issue is that if you're doing some analysis over many trees/trials with a pipeline where you have to dump to newick, then it isn't a one-off cost. Looking at the R parser, they do a load of sanitisation on the string in R: https://github.com/cran/ape/blob/10898aebdf6661a0b81ba21bf24969336b544a60/R/read.tree.R#L10
and then pass over to a C routine for the actual parsing: https://github.com/cran/ape/blob/390386e67f9ff6cd8e6e523b7c43379a1551c565/src/tree_build.c#L348
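Schematically, the sanitisation step is of this flavour (a simplified Python analogue for illustration, not what ape actually runs):

# Simplified analogue of a sanitisation pass: strip bracketed newick
# comments and whitespace so the parser only sees structural characters.
import re

def sanitise_newick(s):
    s = re.sub(r"\[[^\]]*\]", "", s)   # drop [...] comments
    return re.sub(r"\s+", "", s)       # drop whitespace and newlines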

@hyanwong
Member Author

Yep; for example, for COVID trees I assume that you might want to read the latest daily update to the tree before doing analysis.

@benjeffery
Member

We're slower than we need to be as we currently load the whole tree and then walk it to build the tskit nodes and edges. I think we could get "good enough" by doing a Python/numba implementation as JK suggests, following the R example of sanitising and then parsing.
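To make that concrete, here is a rough, hypothetical sketch of going straight from a parsed parent-index array (e.g. the output of a single-pass parser like the one sketched above) into a tskit TableCollection, with no intermediate tree objects. Branch lengths are ignored and node times are just depths from the root, which is enough for a topology-only tree like the OpenTree one:

# Hypothetical: build tskit tables directly from a parent-index array.
import numpy as np
import tskit

def tables_from_parents(parents):
    parents = np.asarray(parents, dtype=np.int64)  # parents[root] == -1
    n = parents.shape[0]
    depth = np.zeros(n, dtype=np.int64)
    for i in range(n):                 # parsers emit parents before children,
        if parents[i] != -1:           # so a single forward pass suffices
            depth[i] = depth[parents[i]] + 1
    time = (depth.max() - depth).astype(np.float64)  # parents strictly older
    is_leaf = np.ones(n, dtype=bool)
    is_leaf[parents[parents != -1]] = False

    tables = tskit.TableCollection(sequence_length=1)
    for i in range(n):
        flags = tskit.NODE_IS_SAMPLE if is_leaf[i] else 0
        tables.nodes.add_row(flags=flags, time=time[i])
        if parents[i] != -1:
            tables.edges.add_row(left=0, right=1, parent=int(parents[i]), child=i)
    tables.sort()                      # satisfy the edge ordering requirements
    return tables.tree_sequence()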

@benjeffery
Member

benjeffery commented Mar 29, 2022

So I couldn't help having a little play with this to see how Python would perform. Converting the newick string to a uint8 byte array and then processing with numba: https://gist.github.com/benjeffery/fe87b1253dc16c10103c4031f9fd2b31
This is a very rough proof-of-concept that doesn't handle labels or branch lengths, as I didn't want to commit much time to it (a stripped-down sketch of the idea is shown after the timings below).
Here are the timings:

  • tsconvert.from_newick (DendroPy): 13 min
  • newick.load (pure-Python library, doesn't make a tree sequence): 12 min
  • R (ape library, doesn't make a tree sequence): 8 s (on Yan's machine)
  • My proof-of-concept (including making and integrity-checking the tree sequence): 5 s

I don't expect the label or branch-length processing to make this much slower, maybe a couple of multiples.
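For anyone curious about the shape of the approach (this is not the gist itself), the byte-array/numba pattern is roughly: pull the file into a uint8 numpy array and do flat single passes over it inside an njit-compiled function. The toy example below just counts internal nodes, leaves and nesting depth, and assumes a sanitised, single-tree file:

# Toy illustration of the uint8 + numba scanning pattern.
import numpy as np
from numba import njit

OPEN, CLOSE, COMMA = ord("("), ord(")"), ord(",")

@njit
def count_nodes(buf):
    n_internal = 0
    n_leaves = 0
    depth = 0
    max_depth = 0
    prev = 0
    for i in range(buf.shape[0]):
        c = buf[i]
        if c == OPEN:
            n_internal += 1
            depth += 1
            if depth > max_depth:
                max_depth = depth
        elif c == CLOSE:
            depth -= 1
        # in sanitised newick, a leaf label starts right after '(' or ','
        if (prev == OPEN or prev == COMMA) and c != OPEN:
            n_leaves += 1
        prev = c
    return n_internal, n_leaves, max_depth

with open("labelled_supertree.tre", "rb") as f:
    buf = np.frombuffer(f.read(), dtype=np.uint8)
print(count_nodes(buf))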

@benjeffery
Member

benjeffery commented Mar 29, 2022

This is interesting: using pure Python (no numba) you can get 14 s if you don't convert to a bytearray!!! I guess this is hitting all the fast paths in Python. Python is great if you avoid function calls! Adding numba doubles the runtime! https://gist.github.com/benjeffery/e6e37fbf36b63403fda36a0391f2f891

@jeromekelleher
Member

Very nice @benjeffery, that's impressive! I guess if we avoid creating a bunch of objects along the way and create numpy arrays instead, there is a significant gain to be made. I'm happy to help get this into tsconvert.

@hyanwong
Member Author

hyanwong commented Apr 1, 2022

I have a slightly odd request: could we also have a method to import the newick into a table collection, rather than a tree sequence? This would allow zero-length (and even negative!) "branch lengths", which would otherwise violate the tree sequence requirements. Then I can go through and spot these weird edge cases, turn them into epsilons, and leave comments in the node metadata.
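For what it's worth, the workflow I have in mind would look roughly like this, assuming a hypothetical tables_from_newick() in tsconvert that returns a tskit.TableCollection (it doesn't exist yet), and assuming the problem branches are isolated so a single pass is enough:

# Hypothetical: tables_from_newick() is not part of the current tsconvert API.
import tsconvert

EPS = 1e-8
with open("labelled_supertree.tre") as f:
    tables = tsconvert.tables_from_newick(f.read())

time = tables.nodes.time          # numpy copy of the node times column
notes = {}                        # node id -> comment; could go in node metadata
for edge in tables.edges:
    length = time[edge.parent] - time[edge.child]
    if length <= 0:               # zero or negative "branch length"
        time[edge.parent] = time[edge.child] + EPS
        notes[edge.parent] = f"branch length {length} replaced with {EPS}"
tables.nodes.time = time          # write the adjusted times back
tables.sort()
ts = tables.tree_sequence()       # now meets the tree sequence requirements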

@benjeffery
Member

I think that's a good thing to have!
