High performance newick parsing. #2187
Comments
Also, support for unary nodes would be good (but I suspect this will happen automatically).
I made this point somewhere else, but the ability to parse into a table collection would mean that we could cope with zero-length (or even negative-length) branches. Also, I'm not sure if we want to worry about trees beginning with
What would one then do with the table collection? There's currently no way to get to a
I would run through the node times and adjust them to make a valid TS, while adding stuff to the metadata to say what I had done.
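A minimal sketch of the adjust-and-record idea, using plain Python lists rather than tskit tables. The function name `fix_node_times`, the `epsilon` default and the shape of the returned record are all assumptions for illustration; nothing here is existing tskit API.

```python
def fix_node_times(node_time, edge_parent, edge_child, epsilon=1e-6):
    """Push parents up until every parent is strictly older than its children.

    Hypothetical helper: returns the adjusted times plus a record of what
    changed, which could then be stored in metadata.
    Assumes the edges describe an acyclic tree, otherwise the loop never ends.
    """
    times = list(node_time)
    changed = True
    while changed:  # repeat until no edge violates parent time > child time
        changed = False
        for p, c in zip(edge_parent, edge_child):
            if times[p] <= times[c]:
                times[p] = times[c] + epsilon
                changed = True
    # Map node ID -> (original time, adjusted time) for every node we moved.
    adjustments = {
        u: (old, new)
        for u, (old, new) in enumerate(zip(node_time, times))
        if old != new
    }
    return times, adjustments


# Node 2 is the parent of leaves 0 and 1, node 3 is the parent of node 2;
# all branches have zero length in the input.
times, adjustments = fix_node_times(
    [0.0, 0.0, 0.0, 0.0], [2, 2, 3], [0, 1, 2], epsilon=1.0
)
```

The repeated sweep is the simplest thing that works regardless of edge order; a real implementation would more likely process edges from the leaves upward in a single pass.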
Great - was checking we didn't need to add a way to get to the quintuple array representation without the integrity constraints. Now you say it I remember you mentioning the fixing-and-recording.
It should be easy to find the "bad" nodes by doing
right? No need to traverse the tree. Fixing them might be more difficult, however, although zero length branches can probably be fixed by adding an epsilon or using
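As a sketch of the kind of check meant here (not a reconstruction of any particular snippet), the "bad" edges are those whose parent is not strictly older than the child, and they can be found directly from the node and edge columns. The function name and the plain-list inputs are assumptions; with tskit tables the same comparison would presumably be done on `tables.nodes.time` indexed by `tables.edges.parent` and `tables.edges.child`.

```python
def find_bad_edges(node_time, edge_parent, edge_child):
    """Return indices of edges whose parent is not strictly older than its child."""
    return [
        i
        for i, (p, c) in enumerate(zip(edge_parent, edge_child))
        if node_time[p] <= node_time[c]
    ]


# Leaves 0 and 1 at time 0; their parent (node 2) is also at time 0,
# so both edges below node 2 have zero length. Node 3 is the root at time 1.
node_time = [0.0, 0.0, 0.0, 1.0]
edge_parent = [2, 2, 3]
edge_child = [0, 1, 2]
bad = find_bad_edges(node_time, edge_parent, edge_child)
```

This touches each edge once and needs no tree traversal; negative-length branches are caught by the same comparison.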
Yes good idea - I think that fixing non-conformant tree sequences sounds like a job for a separate function/set of functions though and not part of this work.
Yes, 100%
Incidentally, what is the current status of tskit's Newick parsing capabilities, @benjeffery? Do we have (relatively) fast parsing in a released version yet?
Still a proof of concept I'm afraid! It's on the medium term Todo list...
A quick additional note here: I wonder if we should allow nodes without branch lengths in the Newick file, and then set the This would also be useful for the "introduction to tskit for phylogeneticists" tutorial (tskit-dev/tutorials#163)
Our current newick parsing relies on the newick Python library, which for the trees we've tested is ~120x slower than parsing with a common (C extension) R library. My initial rough experiments with a pure-Python one-pass pre-allocated state machine in tskit-dev/tsconvert#36 are very promising, timing at 1-2x the R library. A C implementation could follow if we see the benefit and the Python implementation is amenable to conversion.

Requirements:
individuals?)
Support for unicode is not required.
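To illustrate what a one-pass parse into flat columns might look like, here is a minimal sketch. It is not the code from tskit-dev/tsconvert#36: the function name and the three output columns are assumptions, it grows lists rather than pre-allocating, and it assumes well-formed input with no quoted labels, comments or error handling.

```python
def parse_newick(newick):
    """Single left-to-right pass over a Newick string.

    Returns three parallel lists indexed by node ID (root is node 0):
    parent ID (-1 for the root), branch length (None if absent) and label.
    """
    parent, length, name = [-1], [None], [""]
    current = 0             # node whose label/length is currently being read
    start = 0               # start of the pending label or length token
    reading_length = False  # True once a ':' has been seen for this node
    for i, ch in enumerate(newick):
        if ch not in "(),:;":
            continue
        token = newick[start:i].strip()
        start = i + 1
        if token:
            if reading_length:
                length[current] = float(token)
            else:
                name[current] = token
        reading_length = ch == ":"
        if ch == "(" or ch == ",":
            if ch == ",":
                current = parent[current]  # step back up before adding a sibling
            parent.append(current)         # open a new child of the current node
            length.append(None)
            name.append("")
            current = len(parent) - 1
        elif ch == ")":
            current = parent[current]      # finished this node's children
        elif ch == ";":
            break
    return parent, length, name


parent, length, name = parse_newick("((A:1,B:2)C:3,D:4)E;")
```

Unary nodes such as `((A:1)B:2)C;` need no special handling, and nodes without a branch length are simply left as `None`. Turning branch lengths into node times for a table collection would be a further pass over these columns.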