High performance newick parsing. #2187

benjeffery · 2022-04-06T17:04:43Z

Our current newick parsing relies on the newick Python library, which, for the trees we've tested, is ~120x slower than parsing with a common R library (a C extension). My initial rough experiments with a pure-Python, one-pass, pre-allocated state machine in tskit-dev/tsconvert#36 are very promising, taking 1-2x as long as the R library. A C implementation could follow if we see the benefit and the Python implementation is amenable to conversion.
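
For illustration, here is a minimal sketch of what a one-pass, pre-allocated parser could look like. It is not the code from tskit-dev/tsconvert#36; the function name `parse_newick` and the flat `parent`/`label`/`length` lists are assumptions made for this example. It handles a single rooted tree, skips `[...]` comments rather than storing them, and does not support quoted labels.

```python
def parse_newick(text):
    """Hypothetical sketch: parse one Newick tree in a single left-to-right pass.

    Returns three parallel lists indexed by node id (ids are assigned in
    order of first appearance, so the root is node 0):
    parent ids (-1 for the root), labels, and branch lengths (None if absent).
    """
    # Every node other than the root is introduced by "(" or ",", so this
    # is an upper bound on the node count and lets us allocate up front.
    n_max = text.count("(") + text.count(",") + 1
    parent = [-1] * n_max
    label = [""] * n_max
    length = [None] * n_max

    n = 1    # nodes allocated so far; node 0 is the root
    cur = 0  # node whose label/length we are currently reading
    i, end = 0, len(text)
    while i < end:
        c = text[i]
        if c == "(":
            # First child of the current node.
            parent[n] = cur
            cur = n
            n += 1
            i += 1
        elif c == ",":
            # Next sibling: shares the parent of the current node.
            parent[n] = parent[cur]
            cur = n
            n += 1
            i += 1
        elif c == ")":
            # All children read; what follows describes their parent.
            cur = parent[cur]
            i += 1
        elif c == ":":
            j = i + 1
            while j < end and text[j] not in ",();[":
                j += 1
            length[cur] = float(text[i + 1:j])
            i = j
        elif c == "[":
            # Comments are skipped in this sketch, not stored.
            i = text.index("]", i) + 1
        elif c == ";":
            break
        elif c.isspace():
            i += 1
        else:
            j = i
            while j < end and text[j] not in ",():;[":
                j += 1
            label[cur] = text[i:j].strip()
            i = j
    return parent[:n], label[:n], length[:n]
```

Calling `parse_newick("((A:1,B:2)C:3,D:4)R;")` gives one entry per node, with parents always appearing before their children. Unary nodes such as `(A:1)B` need no special handling in this layout.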

Requirements:

Parse labels, storing them in metadata. (Open question: should labels go into node-associated individuals?)
Parse comments, storing them in node metadata.
Parse branch lengths and post-process them into node times, taking care over the numerical precision issues involved.
Support files with multiple roots.
Support unary nodes.

Support for unicode is not required.
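
To make the branch-length requirement concrete, here is a small sketch of one way to turn branch lengths into node times and to see where precision bites. The function name and the flat-list inputs are assumptions for this example, and it assumes every parent has a smaller id than its children (as in a preorder numbering).

```python
def branch_lengths_to_times(parent, length):
    """Hypothetical sketch: convert branch lengths to node times.

    parent[u] is the parent id of node u (-1 for the root) and length[u]
    is the length of the branch above u. Assumes parent[u] < u.
    Returns (time, bad), where bad lists the nodes whose computed time is
    not strictly less than their parent's time.
    """
    n = len(parent)
    depth = [0.0] * n
    # A single forward pass works because parents precede children.
    for u in range(n):
        p = parent[u]
        if p != -1:
            depth[u] = depth[p] + length[u]
    # Time is measured backwards from the deepest tip, which gets time 0.
    max_depth = max(depth)
    time = [max_depth - d for d in depth]
    # Rounding can collapse a tiny positive branch length to zero.
    bad = [u for u in range(n) if parent[u] != -1 and time[parent[u]] <= time[u]]
    return time, bad
```

With `parent = [-1, 0, 0]` and `length = [None, 1e-20, 1.0]`, node 1 ends up with the same time as the root even though its branch length is positive, which is the kind of case the post-processing has to watch for.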

jeromekelleher · 2022-04-06T18:05:37Z

Also, support for unary nodes would be good (but I suspect this will happen automatically)

benjeffery · 2022-04-06T18:24:13Z

> Also, support for unary nodes would be good (but I suspect this will happen automatically)

Yes, thanks! Added to the list. @hyanwong @szhan do you have anything to add?

hyanwong · 2022-04-06T19:47:28Z

I made this point somewhere else, but the ability to parse into a table collection would mean that we could cope with zero-length (or even negative-length) branches.

Also, I'm not sure if we want to worry about trees beginning with [&U] and [&R]. We only "do" rooted trees in tskit, AFAIK.

benjeffery · 2022-04-06T19:57:42Z

> I made this point somewhere else, but the ability to parse into a table collection would mean that we could cope with zero-length (or even negative-length) branches.

What would one then do with the table collection? There's currently no way to get to a Tree without satisfying tskit's node time constraints, right?

hyanwong · 2022-04-06T20:21:18Z

> What would one then do with the table collection?

I would run through the node times and adjust them to make a valid TS, while adding stuff to the metadata to say what I had done.

benjeffery · 2022-04-06T20:22:34Z

> > What would one then do with the table collection?

> I would run through the node times and adjust them to make a valid TS, while adding stuff to the metadata to say what I had done.

Great, I was checking that we didn't need to add a way to get to the quintuple array representation without the integrity constraints. Now that you say it, I remember you mentioning the fixing-and-recording.

hyanwong · 2022-04-08T11:42:17Z

It should be easy to find the "bad" nodes by doing

bad_edges = np.where(tables.nodes.time[tables.edges.parent] <= tables.nodes.time[tables.edges.child])[0]

right? No need to traverse the tree. Fixing them might be more difficult, however, although zero-length branches can probably be fixed by adding an epsilon or using nextafter, as we do in split_polytomies.
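
As a rough sketch of the detect-then-nudge idea, here is a standalone version that uses plain lists in place of a tskit table collection so that it runs without any dependencies. The function names are made up for this example, it only touches edges that were exactly zero-length in the input, and it is not how split_polytomies is implemented.

```python
import math


def find_bad_edges(node_time, edge_parent, edge_child):
    """Return ids of edges whose parent is not strictly older than the child."""
    return [
        e
        for e, (p, c) in enumerate(zip(edge_parent, edge_child))
        if node_time[p] <= node_time[c]
    ]


def fix_zero_length_edges(node_time, edge_parent, edge_child):
    """Hypothetical sketch: lift parents of zero-length edges by one float step.

    Returns (new_times, adjusted_nodes). Negative-length edges are left
    alone, so callers should re-run find_bad_edges afterwards.
    """
    time = list(node_time)
    zero_edges = [
        e
        for e, (p, c) in enumerate(zip(edge_parent, edge_child))
        if time[p] == time[c]
    ]
    adjusted = set()
    changed = True
    while changed:  # repeat so chains of zero-length edges are resolved
        changed = False
        for e in zero_edges:
            p, c = edge_parent[e], edge_child[e]
            if time[p] <= time[c]:
                # Smallest representable time strictly above the child's.
                time[p] = math.nextafter(time[c], math.inf)
                adjusted.add(p)
                changed = True
    return time, sorted(adjusted)
```

The returned list of adjusted nodes is what one would then record in metadata. Note that lifting a parent can, in principle, invalidate the edge above it if that edge was previously only one float step long, hence the advice to re-check.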

benjeffery · 2022-04-08T12:14:37Z

Yes, good idea. I think that fixing non-conformant tree sequences sounds like a job for a separate function (or set of functions), though, and not part of this work.

hyanwong · 2022-04-08T13:20:29Z

Yes, 100%

hyanwong · 2022-11-01T09:19:54Z

Incidentally, what is the current status of tskit's Newick parsing capabilities, @benjeffery? Do we have (relatively) fast parsing in a released version yet?

benjeffery · 2022-11-01T09:39:08Z

Still a proof of concept, I'm afraid! It's on the medium-term to-do list...

hyanwong · 2022-11-02T10:10:22Z

A quick additional note here: I wonder if we should allow nodes without branch lengths in the Newick file, and then set the time_units to uncalibrated. We could have a parameter, from_newick(missing_length=X), which specifies the value to use if the branch length is missing. Perhaps if it is None we error out on missing branch lengths. missing_length=0 could also be allowed, to create an invalid tree sequence but a valid set of tables (see tskit-dev/tsconvert#36 (comment)).

This would also be useful for the "introduction to tskit for phylogeneticists" tutorial (tskit-dev/tutorials#163)
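
A sketch of how the proposed missing_length behaviour might look at the point where a branch length token is read. Both the helper name `resolve_branch_length` and its return shape are assumptions for illustration; nothing like it exists in tskit or tsconvert as far as this thread shows.

```python
def resolve_branch_length(raw, missing_length=None):
    """Hypothetical helper: interpret one branch-length token.

    raw is the text after ":" for a node, or None/"" if the node had no
    branch length. Returns (length, was_missing) so the caller can decide
    whether to mark the result's time units as uncalibrated.
    """
    if raw is not None and raw.strip() != "":
        return float(raw), False
    if missing_length is None:
        # Default behaviour in this sketch: missing lengths are an error.
        raise ValueError("branch length missing and missing_length not given")
    return float(missing_length), True
```

With `missing_length=0` every defaulted branch has zero length, which gives a valid set of tables but node times that cannot form a valid tree sequence.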

benjeffery added the enhancement (New feature or request) and Python API (Issue is about the Python API) labels Apr 6, 2022

benjeffery added this to the Python 0.6.0 milestone Apr 6, 2022

benjeffery modified the milestones: Python 0.5.2, Python 0.5.1 Apr 6, 2022

jeromekelleher modified the milestones: Python 0.5.6, Python 0.5.7 Aug 23, 2023
