-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify and regularise non-functional bounded repetitions #596
Comments
Per previous commits on the rough comparisons between regex-filtered and re2, while regex-filtered is very competitive indeed on the CPU side it suffers from memory usage issues. This stems from two issues: character classes ================= `re2` uses [ASCII-only perl character classes][1], regex uses [full-unicode Perl character classes][2] defined in terms of [UTS#18 properties][3], this leads to *much* large state graphs for `\d` and `\w` (`\s` seems to cause much less trouble). While uap-core doesn't actually specify regex semantics, [Javascript perl character classes are ASCII-only][4]. As such, a valid mitigation *for ua-parser* is to convert `\d`, `\D`, `\w`, and `\W` to the corresponding ASCII classes (I used literal enumerations from MDN but POSIX-style classes [would have worked too][5]). This was way helped by regex supporting [nesting enumerated character classes][6] as it means I don't need to special-case expanding perl-style character classes inside enumerations. Because capture amplifies the issue, this conversion reduces memory consumption by between 30% for non-captured digits: > echo -n "\d+" | cargo r -qr -- -q 13496 8826 > echo -n "[0-9]+" | cargo r -qr -- -q 10946 1322 and *two orders of magnitude* for captured word characters: > echo -n "(\w+)" | cargo r -qr -- -q 605008 73786 > echo -n "([a-zA-Z0-9_]+)" | cargo r -qr -- -q 6968 3332 Bounded repetitions =================== A large amount of bounded repetitions (`{a,b}`) was added to regexes.yaml [for catastrophic backtracking migitation][7]. While this is nice for backracking based engines, it's not relevant to regex which is based on finite automata, however bounded repetitions *does* cost significantly more than unbounded repetitions: > echo -n ".+" | cargo r -qr -- -q 7784 4838 > echo -n ".{0,100}" | cargo r -qr -- -q 140624 118326 And this also compounds with the previous item when bounded repetition is used with a perl character class (although that's not very common in `regexes.yaml`, it's mostly tacked on `.`). This can be mitigated by converting "large" bounded repetitions (arbitrarily defined as an upper bound of two digits or more) to unbounded repetitions. Results ======= The end results of that work is a 22% reduction in peak memory footprint when running ua-parser over the entire sample using core's `regexes.yaml`... and even a ~4% gain in runtime despite doing more work up-front and not optimising for that[^1]. before ------ > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt Lines: 751580 Total time: 9.363202625s 12µs / line 9.71 real 9.64 user 0.04 sys 254590976 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 15647 page reclaims 13 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 33 involuntary context switches 84520306010 instructions retired 31154406450 cycles elapsed 245909184 peak memory footprint after ----- > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt Lines: 751580 Total time: 8.754590666s 11µs / line 9.37 real 8.95 user 0.03 sys 196198400 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 12083 page reclaims 13 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 11 voluntary context switches 40 involuntary context switches 80119011397 instructions retired 28903938853 cycles elapsed 192169408 peak memory footprint the world that almost was ------------------------- Sadly as it turns out there are a few large-ish *functional* bounded repetitions, for instance ; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit) mis-captures if it's converted to `.*`. This means my original threshold of converting any repetition with two digits upper bound was a bust and I had to move up to 3 (there are no upper bounds above 50 but below 100). Opened ua-parser/uap-core#596 in case this could be improved with a cleaner project-supported signal. With the original two-digit versions, we reached *47%* peak memory footprint reduction and 9% runtime improvement: > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt Lines: 751580 Total time: 8.541360667s 11µs / line 8.75 real 8.70 user 0.02 sys 135331840 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 8367 page reclaims 13 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 25 involuntary context switches 78422091147 instructions retired 28079764502 cycles elapsed 130106688 peak memory footprint Fixes #2 [^1]: that surprised me but the gains seem consistent from one run to the next and we can clearly see a reduction in both cycles elapsed and instructions retired so I'll take it ¯\_(ツ)_/¯ IPC even increases slightly from 2.7 to 2.8 yipee [1]: https://github.com/google/re2/wiki/Syntax [2]: https://docs.rs/regex/latest/regex/#perl-character-classes-unicode-friendly [3]: https://www.unicode.org/reports/tr18/#Compatibility_Properties [4]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes [5]: https://docs.rs/regex/latest/regex/#ascii-character-classes [6]: https://docs.rs/regex/latest/regex/#character-classes [7]: ua-parser/uap-core@6e65445
Per previous commits on the rough comparisons between regex-filtered and re2, while regex-filtered is very competitive indeed on the CPU side it suffers from memory usage issues. This stems from two issues: character classes ================= `re2` uses [ASCII-only perl character classes][1], regex uses [full-unicode Perl character classes][2] defined in terms of [UTS#18 properties][3], this leads to *much* large state graphs for `\d` and `\w` (`\s` seems to cause much less trouble). While uap-core doesn't actually specify regex semantics, [Javascript perl character classes are ASCII-only][4]. As such, a valid mitigation *for ua-parser* is to convert `\d`, `\D`, `\w`, and `\W` to the corresponding ASCII classes (I used literal enumerations from MDN but POSIX-style classes [would have worked too][5]). This was way helped by regex supporting [nesting enumerated character classes][6] as it means I don't need to special-case expanding perl-style character classes inside enumerations. Because capture amplifies the issue, this conversion reduces memory consumption by between 30% for non-captured digits: > echo -n "\d+" | cargo r -qr -- -q 13496 8826 > echo -n "[0-9]+" | cargo r -qr -- -q 10946 1322 and *two orders of magnitude* for captured word characters: > echo -n "(\w+)" | cargo r -qr -- -q 605008 73786 > echo -n "([a-zA-Z0-9_]+)" | cargo r -qr -- -q 6968 3332 Bounded repetitions =================== A large amount of bounded repetitions (`{a,b}`) was added to regexes.yaml [for catastrophic backtracking migitation][7]. While this is nice for backracking based engines, it's not relevant to regex which is based on finite automata, however bounded repetitions *does* cost significantly more than unbounded repetitions: > echo -n ".+" | cargo r -qr -- -q 7784 4838 > echo -n ".{0,100}" | cargo r -qr -- -q 140624 118326 And this also compounds with the previous item when bounded repetition is used with a perl character class (although that's not very common in `regexes.yaml`, it's mostly tacked on `.`). This can be mitigated by converting "large" bounded repetitions (arbitrarily defined as an upper bound of two digits or more) to unbounded repetitions. Results ======= The end results of that work is a 22% reduction in peak memory footprint when running ua-parser over the entire sample using core's `regexes.yaml`... and even a ~4% gain in runtime despite doing more work up-front and not optimising for that[^1]. before ------ > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt Lines: 751580 Total time: 9.363202625s 12µs / line 9.71 real 9.64 user 0.04 sys 254590976 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 15647 page reclaims 13 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 33 involuntary context switches 84520306010 instructions retired 31154406450 cycles elapsed 245909184 peak memory footprint after ----- > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt Lines: 751580 Total time: 8.754590666s 11µs / line 9.37 real 8.95 user 0.03 sys 196198400 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 12083 page reclaims 13 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 11 voluntary context switches 40 involuntary context switches 80119011397 instructions retired 28903938853 cycles elapsed 192169408 peak memory footprint the world that almost was ------------------------- Sadly as it turns out there are a few large-ish *functional* bounded repetitions, for instance ; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit) mis-captures if it's converted to `.*`. This means my original threshold of converting any repetition with two digits upper bound was a bust and I had to move up to 3 (there are no upper bounds above 50 but below 100). Opened ua-parser/uap-core#596 in case this could be improved with a cleaner project-supported signal. With the original two-digit versions, we reached *47%* peak memory footprint reduction and 9% runtime improvement: > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt Lines: 751580 Total time: 8.541360667s 11µs / line 8.75 real 8.70 user 0.02 sys 135331840 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 8367 page reclaims 13 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 25 involuntary context switches 78422091147 instructions retired 28079764502 cycles elapsed 130106688 peak memory footprint Fixes #2 [^1]: that surprised me but the gains seem consistent from one run to the next and we can clearly see a reduction in both cycles elapsed and instructions retired so I'll take it ¯\_(ツ)_/¯ IPC even increases slightly from 2.7 to 2.8 yipee [1]: https://github.com/google/re2/wiki/Syntax [2]: https://docs.rs/regex/latest/regex/#perl-character-classes-unicode-friendly [3]: https://www.unicode.org/reports/tr18/#Compatibility_Properties [4]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes [5]: https://docs.rs/regex/latest/regex/#ascii-character-classes [6]: https://docs.rs/regex/latest/regex/#character-classes [7]: ua-parser/uap-core@6e65445
Note: an alternate would be to revert 6e65445 entirely, and have a build-time expansion of unbounded to bounded in a separate file when a release is cut, backtracking engine could use the second file for the backtracking protection. Compiling unbounded repetitions to bounded repetitions is very easy, the inverse much less so. Since the original report I've also discovered that some finite automaton engines can suffer from threshold effects due to the graph growth from bounded repetition: GraalVM's TRegex may not be able to JIT many of the Device regexps, and have to fall back to a significantly slower and backtracking engine. |
6e65445 converted a bunch of unbounded repetitions (
*
and+
) to bounded ({0,...}
and{1,...}
). These bounded repetitions are non-functional, in the sense that they're there to mitigate catastrophic backtracking, they're not otherwise relevant to the pattern's correctness.Now while it's great (or at least useful) for backtracking engines, they're not the only ones on the scene especially since the rise of redos / catastrophic backtracking concerns: the re2 C++ library (with bindings in many languages), Go's
regexp
and Rust'sregex
(which has C bindings) are all non-backtracking engines, and by definition don't suffer from catastrophic backtracking1.The problem is that they need to keep track of the upper bound (in order to fail if it's exceeded), which has a cost: compiling the regex
.*
in re2 allocates 650 bytes,.{0,100}
allocates 4800,.{0,300}
allocates 11420. Matching it also adds a small cost (although in my experiments I've gotten on the order of 10% increased runtime for bounded repetitions inregex
). Now obviously for such a small regex on its own it's not a big concern, but with uap-core'sregexes.yaml
having a thousand regexes many more complicated than that it starts to add.Now initially I thought that was not a big deal because those repetitions could just be pre-processed back into unbounded repetitions and bob's your uncle. The issue is, a few functional bounded repetitions exist, there may be others but so far I've found that
; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit)
fails its test if it's converted to; {0,2}(moto)(.*)(?: Build|\) AppleWebKit)
:the original version captures
e6 (XT2005DL)
, while the relaxed version capturese6 (XT2005DL) Build/PPBS29.73-81-5-14-5-4-2-6-7; wv
.The issue is that
{0,50}
is one of the ranges which was used in 6e65445: there were 4 occurrences in the file before that commit, it added 21 to 25, and there are 32 now. It's unclear how many of those extra 7 are functional (aside from the one I know is).What this means is that it's difficult to pre-process regexes back to unbounded repetitions for the benefit of non-backtracking engines, it can either be done heuristically with the risk of breaking things, or not at all and we have to leave with unnecessary memory footprint.
It would be nice if these non-functional bounded repetitions used peculiar and easily recognisable ranges, such that those could be safely reverted for memory optimisation purposes e.g. maybe the upper bound could end in
86
or93
, bounds which seem unlikely to be picked either "at random" or for a real reason, and those would signal to maintainers that those ranges are fair game for unbounding. Or maybe all the ranges added by 6e65445 could be bumped to an upper bound of 100 which would become the threshold between functional and non-functional.Footnotes
there are also hybrid backtracking engines which are very resilient to the issue, I never managed to trip the Spencer engine used by postgresql ↩
The text was updated successfully, but these errors were encountered: