
Allow a regular expression package to be replaced by another package. #87

aka-ao opened this issue May 14, 2024 · 2 comments
aka-ao commented May 14, 2024

Related to PR #59, the regexp package is very slow.
I suggest defining an interface so that the regular-expression processing can be injected externally as a dependency.
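A hypothetical sketch of what such an injection point could look like in Go; the interface and method names here are made up for illustration and are not an existing uap-go API:

```go
// Package uaparser: hypothetical seam for swapping the regex engine.
package uaparser

// Regexp is the subset of behaviour the parser would need from a regex engine.
type Regexp interface {
	MatchString(s string) bool
	FindStringSubmatch(s string) []string
}

// RegexpCompiler would let callers plug in an alternative engine
// (for example an RE2 binding) instead of the standard library regexp.
type RegexpCompiler interface {
	Compile(pattern string) (Regexp, error)
}
```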

dgoldstein0 (Contributor) commented

Since that PR, we've added LRU caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

I'd be open to considering faster regex implementations if it gives a large benefit, though I think it would make more sense if the regex engine we use were not exposed for dependency injection: if any regex engine is missing features we need, or implements a slightly different flavor of regex that ends up having observable behavior differences, that would be problematic.
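For illustration, a minimal sketch of what memoizing parse results behind an LRU cache looks like; the Parser and Client types and the use of hashicorp/golang-lru are stand-ins for this sketch, not the library's actual implementation:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// Client is a hypothetical parse result.
type Client struct{ Family string }

// Parser is a hypothetical stand-in for the real parser.
type Parser struct{}

// parseSlow stands in for running the hundreds of regexes from regexes.yaml.
func (p *Parser) parseSlow(ua string) *Client {
	return &Client{Family: "Other"}
}

// CachedParser memoizes parse results keyed by the raw user-agent string.
type CachedParser struct {
	parser *Parser
	cache  *lru.Cache
}

func NewCachedParser(size int) (*CachedParser, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &CachedParser{parser: &Parser{}, cache: c}, nil
}

func (cp *CachedParser) Parse(ua string) *Client {
	if v, ok := cp.cache.Get(ua); ok {
		return v.(*Client) // cache hit: no regex work at all
	}
	client := cp.parser.parseSlow(ua)
	cp.cache.Add(ua, client)
	return client
}

func main() {
	cp, _ := NewCachedParser(1000)
	fmt.Println(cp.Parse("Mozilla/5.0 ...").Family)
}
```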

masklinn commented Oct 26, 2024

> Since that PR, we've added LRU caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

While that is true, it only helps for user agents that have been seen before and are still in the cache by their second hit. FWIW, a while back a Dailymotion employee kindly provided a sample of their access logs [1], which should be realistic [2]. It has ~75k entries, of which ~20k are unique, with a very long tail of "one hit wonders": user agents seen only once and never again (this is relevant because, as I learned, LRU is pretty bad at one hit wonders at low cache sizes). On that file, when I investigated different cache algorithms, I got the following hit rates at cache size 1000 (which I understand is what you're using):

belady [3] (1000): 60% hit rate
lru (1000): 37% hit rate
s3fifo (1000): 50% hit rate
sieve (1000): 48% hit rate

So even with a cache, the regex parsing performance can be quite impactful.
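As a rough way to reproduce that kind of measurement, one can replay a log of user agents through an LRU cache of size 1000 and count hits. A minimal sketch, assuming a file with one user agent per line (the file name and format are assumptions):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	f, err := os.Open("useragents.txt") // assumed: one user agent per line
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	cache, _ := lru.New(1000)
	hits, total := 0, 0

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		ua := scanner.Text()
		total++
		if _, ok := cache.Get(ua); ok {
			hits++ // would have skipped the regex work entirely
		} else {
			cache.Add(ua, struct{}{})
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("hit rate: %.0f%% (%d/%d)\n",
		100*float64(hits)/float64(total), hits, total)
}
```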

With that said, rather than switching the regex library, what I would recommend is looking into whether somebody has implemented FilteredRE2 in Go: regexes.yaml has a lot of regexes (more than 600 device parsers as of 0.18), so a faster regex engine will not make that much of a difference in the end, I think.

What FilteredRE2 does is prefilter the regexes and then try only those which have a chance of matching (using something like Aho–Corasick to quickly eliminate regexes which cannot match because they require literal strings not present in the input). Needing to try only 10-15 regexes is a much more significant gain than trying 600 regexes a bit faster.
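A minimal sketch of that prefiltering idea, with hand-written required literals and strings.Contains standing in for the automatic atom extraction and Aho–Corasick scan that FilteredRE2 actually does; the patterns below are illustrative, not taken from regexes.yaml:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type filteredRegex struct {
	atom string         // literal that must appear for the regex to match
	re   *regexp.Regexp // the full pattern, compiled once
}

var parsers = []filteredRegex{
	{atom: "Firefox", re: regexp.MustCompile(`Firefox/(\d+)\.(\d+)`)},
	{atom: "Chrome", re: regexp.MustCompile(`Chrome/(\d+)\.(\d+)`)},
	{atom: "iPhone", re: regexp.MustCompile(`iPhone OS (\d+)_(\d+)`)},
}

// parse tries only the regexes whose required literal is present in ua,
// instead of running every pattern in order.
func parse(ua string) []string {
	for _, p := range parsers {
		if !strings.Contains(ua, p.atom) {
			continue // cheap reject: this regex cannot possibly match
		}
		if m := p.re.FindStringSubmatch(ua); m != nil {
			return m
		}
	}
	return nil
}

func main() {
	fmt.Println(parse("Mozilla/5.0 (Windows NT 10.0) Gecko/20100101 Firefox/126.0"))
}
```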

Another thing you may want to look at, though this one you'll really have to benchmark for Go specifically as regexp may or may not have this issue, is using Regexp.Match first and only performing capture (retrieving the submatches) on matches: in both re2 and rust-regex the overhead of capturing is very large (2-3x the cost of a match check), so it only takes 1-3 failures for "n match checks followed by one capture" to outperform "n captures". For this reason FilteredRE2 does not support capturing at all; after prefiltering it will only PartialMatch the selected regexes: https://github.com/google/re2/blob/main/re2/filtered_re2.cc#L98-L122
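A sketch of that "match check first, capture only on success" pattern with Go's regexp package; whether it is actually a win there needs benchmarking as noted above, and the pattern and input are illustrative only:

```go
package main

import (
	"fmt"
	"regexp"
)

var chromeRe = regexp.MustCompile(`Chrome/(\d+)\.(\d+)`)

// findVersion runs the cheap boolean match first and pays for submatch
// extraction only when the pattern actually matched.
func findVersion(ua string) (major, minor string, ok bool) {
	if !chromeRe.MatchString(ua) {
		return "", "", false // most regexes fail here, with no capture overhead
	}
	m := chromeRe.FindStringSubmatch(ua)
	return m[1], m[2], true
}

func main() {
	fmt.Println(findVersion("Mozilla/5.0 ... Chrome/125.0.6422.60 Safari/537.36"))
}
```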

Footnotes

  1. https://github.com/ua-parser/uap-python/pull/163#issuecomment-1536412054

  2. Although obviously it's realistic traffic for Dailymotion specifically; a different site can have just as realistic traffic with a different pattern.

  3. Bélády's algorithm is a non-practical cache algorithm which can see into the future and has perfect knowledge of which entries to admit, evict, or reject. It is essentially the best that can be done at a given cache size, which makes it a useful point of reference.
