Importing large newick files #36

Open
hyanwong opened this issue Mar 25, 2022 · 20 comments

Comments

@hyanwong
Member

hyanwong commented Mar 25, 2022

We seem to take an inordinately long time to read large newick files. If we are claiming to be useful for phylogeneticists with big trees (like those produced from COVID studies), I think we should be able to do this a lot more quickly. Here's an example on my laptop of a 30 MB tree with 2.3 million samples from https://files.opentreeoflife.org/synthesis/opentree13.4/output/labelled_supertree/index.html (see labelled_supertree.tre): it takes 16 minutes.

>>> start = datetime.datetime.now(); ts=tsconvert.newick.read("labelled_supertree.tre"); print(datetime.datetime.now() - start)
0:16:08.770515

It would be a good showcase to be able to do this in a few seconds (it should be possible with disk I/O at 30 MB/s, right?)

@hyanwong
Member Author

P.S. I'm not sure the code above is right, though. From the docs it seems like tsconvert.from_newick takes a string, which wouldn't be sensible here. But tsconvert.newick.read("labelled_supertree.tre") returns a list of node objects, not a tree sequence.
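For the record, something like this is presumably the intended usage, assuming tsconvert.from_newick takes the newick string itself rather than a file path:

# Read the whole file into memory and hand the string to from_newick.
import tsconvert

with open("labelled_supertree.tre") as f:
    ts = tsconvert.from_newick(f.read())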

@jeromekelleher
Member

What's the timing for ingesting this into R (using ape)?

@hyanwong
Member Author

hyanwong commented Mar 25, 2022

Haven't dared try it yet. Dendropy only takes 3 minutes now (it has been a while since I tested it, and both my computer and the parser seem to be faster):

>>> start = datetime.datetime.now(); tree = dendropy.Tree.get(schema="newick", path="labelled_supertree.tre"); print(datetime.datetime.now() - start)
0:03:05.524834

@hyanwong
Member Author

What's the timing for ingesting this into R (using ape)?

8 seconds in R!! That's more like it.

ptm <- proc.time(); read.tree(file="/Users/yan/Downloads/opentree13.4_tree/labelled_supertree/labelled_supertree.tre"); proc.time() - ptm

Phylogenetic tree with 2392042 tips and 291478 internal nodes.

Tip labels:
  ott2, ott6, ott22687, ott471450, ott16, ott471456, ...
Node labels:
  ott93302, ott304358, mrcaott2ott3973, mrcaott2ott276, mrcaott2ott148, mrcaott2ott142555, ...

Unrooted; no branch lengths.
   user  system elapsed 
  8.549   0.276   9.241 

@jeromekelleher
Member

Dendropy is slow.

We could consider doing a numba-based newick parser in tsconvert, but it would probably be quite icky.

@hyanwong
Member Author

Dendropy is slow.

Indeed. But still 5 times faster than what tsconvert currently uses, apparently.

We could consider doing a numba-based newick parser in tsconvert, but it would probably be quite icky.

Yes, it's not obvious.

I wonder how R does it so quickly? Is there C code underlying it?

I'm half inclined, when I have time, to write a basic parser in C that only deals with the very simplest newick format (non-quoted strings, perhaps even only allowing numbers as node identifiers), just to see how fast we can get.

@jeromekelleher
Member

I wonder how R does it so quickly? Is there C code underlying it?

Yes, and it's very ugly and tricky C. Writing a fast and general newick parser is much harder than you think...

@hyanwong
Member Author

hyanwong commented Mar 25, 2022

I can believe that. But I also think it would be easy to write a non-general one that only accepted digits and the characters ( ) , : ;.
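For illustration, a single-pass parser over that restricted grammar could look something like the Python sketch below (a C version would follow the same structure). The name parse_simple_newick is hypothetical, and the sketch assumes digit-only labels, float branch lengths, and no quoting:

# Minimal sketch: digit labels, float branch lengths and ( ) , : ; only.
def parse_simple_newick(s):
    labels, parents, lengths = [], [], []
    stack = []       # indices of the currently open internal nodes
    current = None   # index of the most recently completed node
    i, n = 0, len(s)

    def new_node():
        labels.append("")
        parents.append(stack[-1] if stack else -1)
        lengths.append(0.0)
        return len(labels) - 1

    while i < n:
        c = s[i]
        if c == "(":
            stack.append(new_node())   # open a new internal node
            i += 1
        elif c == ",":
            current = None
            i += 1
        elif c == ")":
            current = stack.pop()      # close the innermost internal node
            i += 1
        elif c == ":":
            j = i + 1
            while j < n and (s[j].isdigit() or s[j] in ".eE+-"):
                j += 1
            lengths[current] = float(s[i + 1:j])
            i = j
        elif c == ";":
            break
        elif c.isdigit():
            if current is None:        # a digit after ( or , starts a leaf
                current = new_node()
            j = i
            while j < n and s[j].isdigit():
                j += 1
            labels[current] = s[i:j]   # after ) this labels the closed node
            i = j
        else:
            i += 1                     # skip anything else, e.g. whitespace
    return labels, parents, lengths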

@hyanwong
Member Author

(also, perhaps I could just steal the R one, and list it as demonstration code somewhere. Or use the Nexus Class Library https://github.com/mtholder/ncl)

@hyanwong
Member Author

Another data point, using the C++ Nexus Class Library (NCL):

> ptm <- proc.time(); a <- rncl(file="/Users/yan/Downloads/opentree13.4_tree/labelled_supertree/labelled_supertree.tre", file.format="newick"); proc.time() - ptm
   user  system elapsed 
 29.554   1.576  35.070 

@jeromekelleher
Member

Part of the reason I don't want to get into this is that it's supposed to be a one-off cost. Effort would be much better spent on convincing people that the conversion is worth doing (i.e., with applications) than on making the conversion faster.

I really don't want to spend a month working on newick parsers, and that's what it would entail.

@hyanwong
Member Author

Yes, I see what you mean. On the other hand, my use case is that each time I run my pipeline it is with a slightly different version of a large newick tree, so it's not at all a one-off for me.

Having said that, the other stages of the pipeline completely dwarf the time spent reading in the file (bunzipping a set of 3 GB compressed files, for a start).

@benjeffery
Member

I guess the issue is that if you're doing some analysis over many trees/trials with a pipeline where you have to dump to newick, then it isn't a one-off cost. Looking at the R parser, they do a load of sanitisation on the string in R: https://github.com/cran/ape/blob/10898aebdf6661a0b81ba21bf24969336b544a60/R/read.tree.R#L10
and then pass over to a C routine for the actual parsing: https://github.com/cran/ape/blob/390386e67f9ff6cd8e6e523b7c43379a1551c565/src/tree_build.c#L348
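Schematically, the sanitisation step is of this flavour (a simplified Python analogue for illustration, not what ape actually runs):

# Simplified analogue of a sanitisation pass: strip bracketed newick
# comments and whitespace so the parser only sees structural characters.
import re

def sanitise_newick(s):
    s = re.sub(r"\[[^\]]*\]", "", s)   # drop [...] comments
    return re.sub(r"\s+", "", s)       # drop whitespace and newlines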

@hyanwong
Member Author

Yep; for example, for COVID trees I assume that you might want to read the latest daily update to the tree before doing analysis.

@benjeffery
Member

We're slower than we need to be as we currently load the whole tree and then walk it to build the tskit nodes and edges. I think we could get "good enough" by doing a Python/numba implementation as JK suggests, following the R example of sanitising and then parsing.
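To make that concrete, here is a rough, hypothetical sketch of going straight from a parsed parent-index array (e.g. the output of a single-pass parser like the one sketched above) into a tskit TableCollection, with no intermediate tree objects. Branch lengths are ignored and node times are just depths from the root, which is enough for a topology-only tree like the OpenTree one:

# Hypothetical: build tskit tables directly from a parent-index array.
import numpy as np
import tskit

def tables_from_parents(parents):
    parents = np.asarray(parents, dtype=np.int64)  # parents[root] == -1
    n = parents.shape[0]
    depth = np.zeros(n, dtype=np.int64)
    for i in range(n):                 # parsers emit parents before children,
        if parents[i] != -1:           # so a single forward pass suffices
            depth[i] = depth[parents[i]] + 1
    time = (depth.max() - depth).astype(np.float64)  # parents strictly older
    is_leaf = np.ones(n, dtype=bool)
    is_leaf[parents[parents != -1]] = False

    tables = tskit.TableCollection(sequence_length=1)
    for i in range(n):
        flags = tskit.NODE_IS_SAMPLE if is_leaf[i] else 0
        tables.nodes.add_row(flags=flags, time=time[i])
        if parents[i] != -1:
            tables.edges.add_row(left=0, right=1, parent=int(parents[i]), child=i)
    tables.sort()                      # satisfy the edge ordering requirements
    return tables.tree_sequence()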

@benjeffery
Member

benjeffery commented Mar 29, 2022

So I couldn't help having a little play with this to see how Python would perform. Converting the newick string to a uint8 byte array and then processing with numba: https://gist.github.com/benjeffery/fe87b1253dc16c10103c4031f9fd2b31
This is a very rough proof-of-concept that doesn't handle labels or branch lengths, as I didn't want to commit much time to it (a stripped-down sketch of the idea is shown after the timings below).
Here are the timings:

  • tsconvert.from_newick (DendroPy): 13 min
  • newick.load (pure-Python library, doesn't make a tree sequence): 12 min
  • R (ape library, doesn't make a tree sequence): 8 s (on Yan's machine)
  • My proof-of-concept (including making and integrity-checking the tree sequence): 5 s

I don't expect the label or branch-length processing to make this much slower, maybe a couple of multiples.
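For anyone curious about the shape of the approach (this is not the gist itself), the byte-array/numba pattern is roughly: pull the file into a uint8 numpy array and do flat single passes over it inside an njit-compiled function. The toy example below just counts internal nodes, leaves and nesting depth, and assumes a sanitised, single-tree file:

# Toy illustration of the uint8 + numba scanning pattern.
import numpy as np
from numba import njit

OPEN, CLOSE, COMMA = ord("("), ord(")"), ord(",")

@njit
def count_nodes(buf):
    n_internal = 0
    n_leaves = 0
    depth = 0
    max_depth = 0
    prev = 0
    for i in range(buf.shape[0]):
        c = buf[i]
        if c == OPEN:
            n_internal += 1
            depth += 1
            if depth > max_depth:
                max_depth = depth
        elif c == CLOSE:
            depth -= 1
        # in sanitised newick, a leaf label starts right after '(' or ','
        if (prev == OPEN or prev == COMMA) and c != OPEN:
            n_leaves += 1
        prev = c
    return n_internal, n_leaves, max_depth

with open("labelled_supertree.tre", "rb") as f:
    buf = np.frombuffer(f.read(), dtype=np.uint8)
print(count_nodes(buf))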

@benjeffery
Member

benjeffery commented Mar 29, 2022

This is interesting: using pure Python (no numba) you can get 14 s if you don't convert to a bytearray!!! I guess this is hitting all the fast paths in Python. Python is great if you avoid function calls! Adding numba doubles the runtime! https://gist.github.com/benjeffery/e6e37fbf36b63403fda36a0391f2f891

@jeromekelleher
Member

Very nice @benjeffery, that's impressive! I guess if we avoid creating a bunch of objects along the way and create numpy arrays instead, there is a significant gain to be made. I'm happy to help get this into tsconvert.

@hyanwong
Member Author

hyanwong commented Apr 1, 2022

I have a slightly odd request: could we also have a method to import the newick into a table collection, rather than a tree sequence? This would allow zero-length (and even negative!) "branch lengths", which would otherwise violate the tree sequence requirements. Then I can go through and spot these weird edge cases, turn them into epsilons, and leave comments in the node metadata.
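For what it's worth, the workflow I have in mind would look roughly like this, assuming a hypothetical tables_from_newick() in tsconvert that returns a tskit.TableCollection (it doesn't exist yet), and assuming the problem branches are isolated so a single pass is enough:

# Hypothetical: tables_from_newick() is not part of the current tsconvert API.
import tsconvert

EPS = 1e-8
with open("labelled_supertree.tre") as f:
    tables = tsconvert.tables_from_newick(f.read())

time = tables.nodes.time          # numpy copy of the node times column
notes = {}                        # node id -> comment; could go in node metadata
for edge in tables.edges:
    length = time[edge.parent] - time[edge.child]
    if length <= 0:               # zero or negative "branch length"
        time[edge.parent] = time[edge.child] + EPS
        notes[edge.parent] = f"branch length {length} replaced with {EPS}"
tables.nodes.time = time          # write the adjusted times back
tables.sort()
ts = tables.tree_sequence()       # now meets the tree sequence requirements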

@benjeffery
Member

I think that's a good thing to have!
