Skip to content

Replace current regex engine with PCRE2#4033

Closed
Rot127 wants to merge 1 commit intorizinorg:devfrom
Rot127:pcre2
Closed

Replace current regex engine with PCRE2#4033
Rot127 wants to merge 1 commit intorizinorg:devfrom
Rot127:pcre2

Conversation

@Rot127
Copy link
Copy Markdown
Member

@Rot127 Rot127 commented Dec 13, 2023

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

Replace the current regex engine with PCRE2.

Test plan

All green.

Closing issues

closes #3730

Partially addresses #4055

Todo

  • Add a rz_vector_pop_ptr() function which doesn't use memcpy() and use it in match_all_flat(). Or add a rz_vector_concat() function with the same functionality.
  • Terminology for groups and matches is weird.
  • docs
  • Open bug report for tcc and pcre2 compilation and __asm
  • Test rz_regex_get_match_name

@Rot127 Rot127 marked this pull request as draft December 13, 2023 23:04
@Rot127 Rot127 changed the title Add PCRE2 as dependency. Replace current regex engine with PCRE2 Dec 13, 2023
@Rot127 Rot127 added the performance A performance problem/enhancement label Dec 13, 2023
Copy link
Copy Markdown
Member

@XVilka XVilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please send it from dist- branch from rizinorg account, to check that it builds and runs fine on all supported platforms and configurations
  2. Once it's done, I think it is worth removing the old implementation as well.

@wargio
Copy link
Copy Markdown
Member

wargio commented Dec 14, 2023

would be nice to also allow PCRE (v1) for older distro

@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Dec 14, 2023

would be nice to also allow PCRE (v1) for older distro

Currently for every distro which has no pcre2 it is compiled as subproject. Shouldn't this be enough?

@XVilka
Copy link
Copy Markdown
Member

XVilka commented Dec 14, 2023

would be nice to also allow PCRE (v1) for older distro

Currently for every distro which has no pcre2 it is compiled as subproject. Shouldn't this be enough?

Yes, I think it should be enough. We have the same strategy for Capstone dependency as well.

@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Dec 14, 2023

The PCRE2 defines a huge amount of options and flags. Is it fine if we do not add an RZ_REGEX_... variant for each of it? Simply to avoid duplicates?
Or is it always preferable to have our API consistently named?

@wargio
Copy link
Copy Markdown
Member

wargio commented Dec 15, 2023

you should only use PCRE2_CASELESS | PCRE2_MULTILINE
where caseless is applied only if the grep configuration is auto or case insensitive.

Comment thread subprojects/packagefiles/pcre2/meson.build Outdated
Comment thread subprojects/packagefiles/pcre2/meson.build
Comment thread meson.build Outdated
Comment thread librz/util/regex.c Outdated
@Rot127 Rot127 force-pushed the pcre2 branch 2 times, most recently from 558652e to 4168046 Compare January 3, 2024 17:28
@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Jan 4, 2024

Tests/build still fail, but I'd like to know your opinion on the API first. Because now there are some changes in actual code now.

Copy link
Copy Markdown
Member

@XVilka XVilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new API looks fine. Have you tried comparing the performance with and without these changes?

Comment thread subprojects/pcre2.wrap
@@ -0,0 +1,5 @@
[wrap-git]
url = https://github.com/PCRE2Project/pcre2.git
revision = 52c08847921a324c804cabf2814549f50bce1265
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use release by default?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they are preparing to make a new release soon, which would be great if done before our release: https://github.com/PCRE2Project/pcre2/commits/master/

Comment thread librz/util/list.c Outdated
Comment thread librz/util/regex.c Outdated
pcre2_code_free(regex);
}

RZ_OWN RzRegexMatchData *rz_regex_match_data_new(const RzRegex *regex, RzRegexGeneralContext *context) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doxygen for these functions is more important than oxygen for _free() function ;-)

Comment thread librz/util/regex.c Outdated
@Rot127 Rot127 marked this pull request as ready for review February 2, 2024 08:06
Comment thread librz/include/rz_util/rz_regex.h Outdated
Comment thread librz/include/rz_util/rz_regex.h Outdated
Copy link
Copy Markdown
Member

@XVilka XVilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also fix test\db\archos\windows-x64\dbg_dts - remove \n or just don't include it in the capture.
macOS (Darwin) regexes should be fixed too, in particular with whitespace and newlines handling: https://ci.rizin.re/repos/27/pipeline/4083/7#L616

OpenBSD error message is puzzling:

ERROR: Regex compilation failed at 0: no more memory

Same happens on NetBSD too.

Comment thread librz/cons/pager.c
@XVilka

This comment was marked as resolved.

@XVilka

This comment was marked as resolved.

@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Feb 3, 2024

Yep, aware of it. Just wait and bundle it with other fixes, so the CI is not triggered again and again.

@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Feb 3, 2024

BSD problems seem to be a bug in PCRE2 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276252). Hopefully only affecting the JIT compiler?

Going make a minimal example and open an issue later. If it is indeed a BSD only problem, I would just exclude BSD from JIT/

@XVilka
Copy link
Copy Markdown
Member

XVilka commented Feb 3, 2024

@Rot127 yes, though FreeBSD works just fine. The problem occurs only in OpenBSD and NetBSD, please disable JIT only for those, but not the FreeBSD.

@XVilka
Copy link
Copy Markdown
Member

XVilka commented Feb 3, 2024

Once this PR is green after the rebase, please send one from the dist-fuzz- branch, to check all configurations. Also, do you want to squash all those commits into one?

@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Feb 3, 2024

Also, do you want to squash all those commits into one?

Absolutely. The commit history is a mess

@Rot127 Rot127 force-pushed the pcre2 branch 4 times, most recently from 5ff6680 to 5a5d5af Compare February 3, 2024 08:41
@thestr4ng3r

This comment was marked as resolved.

@Rot127
Copy link
Copy Markdown
Member Author

Rot127 commented Feb 3, 2024

Should be fixed now.

Comment thread meson.build Outdated
Comment thread subprojects/packagefiles/pcre2/meson.build Outdated
Comment thread librz/magic/softmagic.c
PCRE2 has way better performance than the OpenBSD
library (something around 20 times faster).

The following flags are enabled for every pattern:

- PCRE2_UTF
- PCRE2_MATCH_INVALID_UTF
- PCRE2_NO_UTF_CHECK

All the others are optional.

Changes made:

- Adds PCRE2 as subproject.
- Changes the API away from POSIX to PCRE2.
- Edits many regex patterns because:
 - ' ' is skipped in patterns, if the EXTENDED flag is set for matching. '\s' must be set now.
 - '.' doesn't match newlines by default.
- Changes the API so matches and their groups are bundled into PVectors.
- Moves the regex component to rz_util.
Comment thread test/unit/test_str.c
@XVilka
Copy link
Copy Markdown
Member

XVilka commented Feb 3, 2024

@Rot127 please send a new PR from inside the rizinorg account with the dist-fuzz- branch to trigger everything.

@Rot127 Rot127 mentioned this pull request Feb 4, 2024
10 tasks
@Rot127 Rot127 closed this Feb 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regex is slow

5 participants