parsing `from_root` can be slow (#23)
At the risk of piling yet more on this: any thoughts about building it on a professional parser like Lark? It's a joy to use (see my tutorials on it) and its LALR parser is reportedly fast. Simple languages like NumExpr and TTree::Draw should be easy to express in the non-ambiguous rules that LALR requires. I could get involved in this, particularly to support TTree::Draw's looping constructs with an awkward-array backend. I don't know when that would happen, but it's something I might want to help with someday. I've heard that TTree::Draw has about 100 unit tests in the ROOT testing suite, so that's a good place to start. At some point, I do have to integrate NumExpr with awkward-array better, and this could be part of that.
@jpivarski yes, I'm also happy to get involved here (i.e. in providing a `TTree::Draw` drop-in replacement). Do you have a pointer to the `TTree::Draw` test suite? One of the beauties of using … I'm currently using

```python
import cachetools
import formulate

@cachetools.cached(cache={})
def expression_from_root(expr):
    return formulate.from_root(expr)
```

and then do something like

```python
import numexpr as ne
import numpy as np
import uproot

# `aatable` is the awkward-array module; `var_name`, `cut_weight`, `fname`,
# `treename`, and `bin_edges` are defined elsewhere in the analysis code.
vars_a = expression_from_root(cut_weight).variables
vars_b = expression_from_root(var_name).variables
branches = vars_a.union(vars_b)

def stringy_observable(table, **kwargs):
    observable = kwargs.get('var_name')
    cut_weight = kwargs.get('cut_weight')
    observable = expression_from_root(observable).to_numexpr()
    cut_weight = expression_from_root(cut_weight).to_numexpr()
    observable = ne.evaluate(observable, table)
    cut_weight = ne.evaluate(cut_weight, table)
    return observable, cut_weight

def process_chunk(chunk):
    table = aatable.Table(chunk)
    observable, weights = stringy_observable(table, var_name=var_name, cut_weight=cut_weight)
    chunk_histo, chunk_edges = np.histogram(observable, weights=weights, bins=bin_edges)
    return chunk_histo, chunk_edges

# map
chunk_results = [
    process_chunk(chunk)
    for chunk in uproot.iterate(
        fname,
        branches=branches,
        treepath=treename,
        namedecode="utf-8",
    )
]  # possible parallelization here

# reduce
histo, edges = None, None
for chunk_histo, chunk_edges in chunk_results:
    histo = chunk_histo if histo is None else histo + chunk_histo
    edges = chunk_edges
```

which is more or less equivalent to …
Until I saw that you're caching the parsed result, it didn't sink in what "slow" meant: 1.8 seconds to parse a 65-character string is unbelievably slow. I ran the profiler and saw a lot of time spent in pyparsing.
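The profiling step described here can be reproduced with a sketch like the following; `slow_parse` below is a hypothetical stand-in for `formulate.from_root`, which you would substitute to profile the real library.

```python
import cProfile
import io
import pstats

def slow_parse(expr):
    # Hypothetical stand-in for formulate.from_root(expr); replace with the
    # real call to see where pyparsing spends its time.
    return expr.replace("(", " ( ").replace(")", " ) ").split()

profiler = cProfile.Profile()
profiler.enable()
slow_parse("sqrt(px**2 + py**2) > 25")
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
print(buffer.getvalue())
```

Sorting by cumulative time surfaces the call tree that dominates a single parse, which is how the pyparsing hotspots mentioned above would show up.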
You know what? I bet it's recreating the parser every time it parses. I know there's a "cached" something in the profile output, but I'd find this timing understandable if we're looking at the time to parse the grammar, create a state machine, and then parse the string, rather than starting from an already-built state machine. I don't think it's necessary to save the state machine to avoid 1 second during the …
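The suspected problem and fix can be sketched with stdlib tools alone. Everything below is a hypothetical stand-in: `build_parser` simulates expensive grammar/state-machine construction (in formulate the cost would come from pyparsing's setup, not `re`).

```python
import functools
import re
import time

def build_parser():
    # Stand-in for expensive grammar analysis / state-machine generation.
    time.sleep(0.05)
    return re.compile(r"[A-Za-z_]\w*")

def parse_rebuilding(expr):
    # Anti-pattern: rebuilds the whole parser on every call.
    return build_parser().findall(expr)

@functools.lru_cache(maxsize=1)
def get_parser():
    # Built once; every later call reuses the compiled machinery.
    return build_parser()

def parse_cached(expr):
    return get_parser().findall(expr)
```

With this structure, the construction cost is paid once per process instead of once per expression, which is exactly the difference between "parse the grammar, then parse the string" and "start with a state machine".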
Do we have any news on this, or ideas on how to proceed? Unfortunately, the timing blows up as the parsed string gets longer; e.g. … is unparseable.
The fundamental problem is that this project doesn't have a maintainer. (Correct me if I'm wrong!) There's no reason why the above string should run a parser out of resources, whether that's memory or a reasonably small amount of time. I don't know PyParsing all that well, but ply is fast, and Lark is fast and modern. Last year, I gave a suite of tutorials on parsing, interpreting, and code transformations (using Lark): https://github.com/jpivarski-talks/2019-05-06-adl-language-tools These are fairly simple languages (basically calculator formulae), so if anyone's willing to take it on, I can help you get started.
Let's see what @chrisburr has to comment on support and maintenance. FWIW, I bet Lark will be faster as a parser, yes. It does a very good job in …
It also gives you the option between parsers with different time complexity. (One is faster but less flexible; for languages like the ones this project targets, the faster one would be just as easy to use.)
Yes, totally agree.
Hi,

this is an excellent library. I am incorporating it into my ATLAS analysis code together with @jpivarski's `uproot` and `awkward`, as a drop-in replacement for `TTree::Draw`. I noticed that some expressions can be slow to parse; e.g. this takes almost 2 seconds:

…

Is there any way this can be sped up?

Thanks,
Lukas

Note: it's not slow in an absolute sense, of course, but if you do such an operation very often the cost can accumulate quite quickly, and I was surprised that parsing such an expression would have a noticeable time cost.