Actually document the `regex` dialect and semantics #594

masklinn · 2024-07-07T17:04:53Z

While many regex dialects / implementations use similar symbols, they don't necessarily ascribe the same semantics to them. For example, `\d`, `\w`, `\s` and their negations may be ASCII-only, partially Unicode-aware, or fully Unicode-aware; the latter would be a lot more expensive than the former, possibly unnecessarily.
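To illustrate how much the answer can differ even inside a single engine, here is a small sketch using Python's standard `re` module (chosen only for illustration, it is not one of the engines this project targets specifically). For `str` patterns, `\d` matches any Unicode decimal digit by default and only `[0-9]` under `re.ASCII`:

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit that is not ASCII.
SAMPLE = "\u0663"

# Default str-pattern behaviour in Python 3: \d is Unicode-aware.
UNICODE_DIGIT = re.compile(r"\d")
# With re.ASCII, \d is restricted to [0-9].
ASCII_DIGIT = re.compile(r"\d", re.ASCII)


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    print(matches(UNICODE_DIGIT, SAMPLE))  # Unicode-aware \d
    print(matches(ASCII_DIGIT, SAMPLE))    # ASCII-only \d
```

The same pattern text accepts or rejects the same input depending on a flag, which is why the intended semantics need to be written down.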

Furthermore, from a performance / memory standpoint, 6e65445 modified the regexes to limit ReDoS risk. However, it did so inconsistently, so it's not entirely clear whether, and under which rules, non-backtracking engines that are not sensitive to catastrophic backtracking (e.g. re2, regex, regexp, ...) may convert the regexes back to unbounded repetition, since bounded repetitions are also used in semantically relevant contexts. Having a well-defined and consistent substitute for `*` and `+` (and maybe some rules ensuring new ones don't get added improperly) would allow engines to detect and substitute them on the fly, which can positively impact their memory use and runtime, as they no longer need to track the number of iterations.
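As a rough sketch of what a consumer could do if such a convention existed: assume, purely hypothetically, that `{0,100}` always stood in for `*` and `{1,100}` for `+`, and that no other use of those exact bounds was allowed. The bound `100`, the `SENTINEL` name, and the `unbound` helper are all assumptions made for illustration, not something the project defines today.

```python
import re

# Hypothetical sentinel upper bound; an assumption, not a documented value.
SENTINEL = 100

# Matches exactly {0,SENTINEL} or {1,SENTINEL}; any other bound is left alone.
_BOUNDED = re.compile(r"\{([01]),%d\}" % SENTINEL)


def unbound(pattern: str) -> str:
    """Rewrite {0,SENTINEL} to * and {1,SENTINEL} to +.

    Naive on purpose: it does not parse the regex, so it ignores
    character classes and escaped opening braces.
    """
    return _BOUNDED.sub(lambda m: "*" if m.group(1) == "0" else "+", pattern)


if __name__ == "__main__":
    print(unbound(r"foo.{0,100}bar"))
    print(unbound(r"\d{1,100}\.\d{1,3}"))
```

A semantically meaningful bound such as `{1,3}` passes through untouched, which only works if the sentinel is never used for anything else. That is the consistency this issue is asking for.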
