parsing `from_root` can be slow (#23)
At the risk of piling yet more on this: any thoughts about building it on a professional parser like Lark? It's a joy to use (see my tutorials on it) and its LALR parser is reportedly fast. Simple languages like NumExpr and TTree::Draw should be easy to express in the non-ambiguous rules that LALR requires. I could get involved in this, particularly to support TTree::Draw's looping constructs with an awkward-array backend. I don't know when that would happen, but it's something I might want to help with someday. I've heard that TTree::Draw has about 100 unit tests in the ROOT testing suite, so that's a good place to start. At some point, I do have to integrate NumExpr with awkward-array better, and this could be part of that.
@jpivarski yes, I'm also happy to get involved here (i.e. in providing a `TTree::Draw` drop-in replacement). Do you have a pointer to the `TTree::Draw` test suite? One of the beauties of using … I'm currently using

```python
import cachetools
import formulate

@cachetools.cached(cache={})
def expression_from_root(expr):
    return formulate.from_root(expr)
```

and then do something like

```python
import numexpr as ne
import numpy as np
import uproot

# `aatable` is the awkward-array module; `var_name`, `cut_weight`, `fname`,
# `treename`, and `bin_edges` are defined elsewhere in the analysis code.
vars_a = expression_from_root(cut_weight).variables
vars_b = expression_from_root(var_name).variables
branches = vars_a.union(vars_b)

def stringy_observable(table, **kwargs):
    observable = kwargs.get('var_name')
    cut_weight = kwargs.get('cut_weight')
    observable = expression_from_root(observable).to_numexpr()
    cut_weight = expression_from_root(cut_weight).to_numexpr()
    observable = ne.evaluate(observable, table)
    cut_weight = ne.evaluate(cut_weight, table)
    return observable, cut_weight

def process_chunk(chunk):
    table = aatable.Table(chunk)
    observable, weights = stringy_observable(table, var_name=var_name, cut_weight=cut_weight)
    chunk_histo, chunk_edges = np.histogram(observable, weights=weights, bins=bin_edges)
    return chunk_histo, chunk_edges

# map
chunk_results = [
    process_chunk(chunk)
    for chunk in uproot.iterate(
        fname,
        branches=branches,
        treepath=treename,
        namedecode="utf-8",
    )
]  # possible parallelization here

# reduce
histo, edges = None, None
for chunk_histo, chunk_edges in chunk_results:
    histo = chunk_histo if histo is None else histo + chunk_histo
    edges = chunk_edges
```

which is more or less equivalent to …
Until I saw that you're caching the parsed result, it didn't sink in what "slow" meant: 1.8 seconds to parse a 65-character string is unbelievably slow. I ran the profiler and saw a lot of time spent in pyparsing.
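The profiling step described here can be reproduced with a sketch like the following; `slow_parse` below is a hypothetical stand-in for `formulate.from_root`, which you would substitute to profile the real library.

```python
import cProfile
import io
import pstats

def slow_parse(expr):
    # Hypothetical stand-in for formulate.from_root(expr); replace with the
    # real call to see where pyparsing spends its time.
    return expr.replace("(", " ( ").replace(")", " ) ").split()

profiler = cProfile.Profile()
profiler.enable()
slow_parse("sqrt(px**2 + py**2) > 25")
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
print(buffer.getvalue())
```

Sorting by cumulative time surfaces the call tree that dominates a single parse, which is how the pyparsing hotspots mentioned above would show up.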
You know what? I bet it's recreating the parser every time it parses. I know there's a "cached" something in the profile output, but I'd find this timing understandable if we're looking at the time to parse the grammar, create a state machine, and then parse the string, rather than starting from an already-built state machine. I don't think it's necessary to save the state machine to avoid 1 second during the …
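The suspected problem and fix can be sketched with stdlib tools alone. Everything below is a hypothetical stand-in: `build_parser` simulates expensive grammar/state-machine construction (in formulate the cost would come from pyparsing's setup, not `re`).

```python
import functools
import re
import time

def build_parser():
    # Stand-in for expensive grammar analysis / state-machine generation.
    time.sleep(0.05)
    return re.compile(r"[A-Za-z_]\w*")

def parse_rebuilding(expr):
    # Anti-pattern: rebuilds the whole parser on every call.
    return build_parser().findall(expr)

@functools.lru_cache(maxsize=1)
def get_parser():
    # Built once; every later call reuses the compiled machinery.
    return build_parser()

def parse_cached(expr):
    return get_parser().findall(expr)
```

With this structure, the construction cost is paid once per process instead of once per expression, which is exactly the difference between "parse the grammar, then parse the string" and "start with a state machine".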
Do we have any news on this, or ideas on how to proceed? Unfortunately, the timing blows up as the parsed string gets longer; e.g. … is unparseable.
The fundamental problem is that this project doesn't have a maintainer. (Correct me if I'm wrong!) There's no reason why the above string should run a parser out of resources, whether that's memory or a reasonably small amount of time. I don't know PyParsing all that well, but ply is fast, and Lark is fast and modern. Last year, I gave a suite of tutorials on parsing, interpreting, and code transformations (using Lark): https://github.com/jpivarski-talks/2019-05-06-adl-language-tools These are fairly simple languages (basically calculator formulae), so if anyone's willing to take it on, I can help you get started.
Let's see what @chrisburr has to comment on support and maintenance. FWIW, I bet Lark will be faster as a parser, yes. It does a very good job in …
It also gives you the option between parsers with different time complexity. (One is faster but less flexible; for languages like the ones this project targets, the faster one would be just as easy to use.)
Yes, totally agree.
Hi,

this is an excellent library. I am incorporating it into my ATLAS analysis code together with @jpivarski's `uproot` and `awkward`, as a drop-in replacement for `TTree::Draw`. I noticed that some expressions can be slow to parse; e.g. this takes almost 2 seconds:

…

Is there any way this can be sped up?

Thanks,
Lukas

Note: it's not slow in an absolute sense, of course, but if you do such an operation very often the cost can accumulate quite quickly, and I was surprised that parsing such an expression would have a noticeable time cost.