Actually document the regex dialect and semantics #594

Open
masklinn opened this issue Jul 7, 2024 · 0 comments


masklinn commented Jul 7, 2024

While many regex dialects / implementations use similar symbols, they don't necessarily ascribe the same semantics to them. For example, \d, \w, \s and their negations (\D, \W, \S) may be ASCII-only, partially Unicode-aware, or fully Unicode-aware. The latter is a lot more expensive than the former, possibly unnecessarily.
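To make the ambiguity concrete, here is a minimal sketch using Python's `re` module, picked only because it exposes both behaviours through the `re.ASCII` flag. The sample strings are made up for illustration and are not taken from the project's rules.

```python
import re

# Illustrative inputs only (not from any real user agent string).
arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE
accented = "caf\u00e9"               # "café", ends in a non-ASCII letter

# str patterns in Python are Unicode-aware by default.
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None
unicode_word = re.fullmatch(r"\w+", accented) is not None

# re.ASCII restricts \d, \w, \s (and their negations) to ASCII.
ascii_digits = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None
ascii_word = re.fullmatch(r"\w+", accented, re.ASCII) is not None

print(unicode_digits, ascii_digits)  # True False
print(unicode_word, ascii_word)      # True False
```

The same pattern text accepts or rejects the same input depending on which interpretation the engine picks, which is why the intended one needs to be written down.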

Furthermore, from a performance / memory standpoint, 6e65445 modified the regexes to limit ReDoS risk. However, it did so inconsistently, so it's not entirely clear whether, and for which rules, non-backtracking engines (e.g. re2, regex, regexp, ...), which are not sensitive to catastrophic backtracking, may convert the regexes back to unbounded repetition, as bounded repetitions are also used in semantically relevant contexts. Having a well-defined and consistent substitute for * and + (and maybe some rules ensuring new ones don't get added improperly) would allow engines to detect and substitute them on the fly, which can improve their memory use and runtime, as they no longer need to track the number of iterations.
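As a rough sketch of what such a substitution could look like, assume (hypothetically, since no such convention exists today) that `{0,100}` and `{1,100}` were reserved as the stand-ins for `*` and `+`, and that every other bound is semantically relevant. The bounds, the `unbound` helper and the sample pattern are all illustrative assumptions.

```python
import re

# Hypothetical reserved bounds; the project defines no such convention today.
SUBSTITUTES = {"{0,100}": "*", "{1,100}": "+"}

# Match an escaped character first so that a literal "\{" is skipped untouched.
_TOKEN = re.compile(r"\\.|\{0,100\}|\{1,100\}", re.DOTALL)


def unbound(pattern: str) -> str:
    """Rewrite the reserved bounded quantifiers back to * and +."""
    def repl(m: re.Match) -> str:
        token = m.group(0)
        # Escapes are not in the table, so they are returned unchanged.
        return SUBSTITUTES.get(token, token)
    return _TOKEN.sub(repl, pattern)


print(unbound(r"Foo/(\d{1,100})\.(\d{0,100})"))  # Foo/(\d+)\.(\d*)
print(unbound(r"Bar/(\d{1,3})"))                 # Bar/(\d{1,3}), kept as-is
```

This only works if the reserved bounds never appear with their literal meaning, which is exactly the consistency being asked for. The sketch also ignores braces inside character classes, which a real implementation would have to handle.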
