Implement the new set-based character classes #523

NWilson · 2024-10-09T13:14:39Z

This is a mostly-complete implementation!

It needs some tidy-up, in places where I've left "XXX" comments. But I've hammered out the code and I'll start testing it.

I'm afraid I used recursive descent in the operator-precedence parser, but it only goes down to a maximum depth of 15 levels, so it should be OK. I just wanted to prototype quickly. I don't know "how much stack is too much" in PCRE2 - it's only pushing ~hundreds of bytes on the stack at most.

The two big bits of logic that need to be shared with existing code, are the OP_XCLASS interpreter code, and the compiler code to build the OP_CLASS/NCLASS/XCLASS code. The current code is inlined into the one place where it's used.

Anyway, this is a draft of where it's going.

zherczeg · 2024-10-09T13:28:17Z

Pattern (bracket) parsing is already recursive. Although you can also a use a simple array for simulating recursion. I suspect a pointer in the META code is enough to track the status.

PhilipHazel · 2024-10-09T15:17:40Z

I do know that there are applications running PCRE2 in a large number of threads, each of which has very limited stack, but I don't know how limited. One would hope a couple of hundred bytes would be OK. I suppose we do need a new option to turn this on because changes the meaning of things, but it seems a shame to use one of the only two remaining main options bits. If/when the Perl syntax is implemented, it shouldn't need to be enabled.

I got a compile error when trying to build your branch:

src/pcre2_compile.c:6110:28: error: passing argument 1 of 'compile_class_nested' from incompatible pointer type [-Wincompatible-pointer-types]
 6110 |       compile_class_nested(&pptr, &code);
      |                            ^~~~~
      |                            |
      |                            uint32_t ** {aka unsigned int **}
src/pcre2_compile.c:2669:39: note: expected 'const uint32_t **' {aka 'const unsigned int **'} but argument is of type 'uint32_t **' {aka 'unsigned int **'}
 2669 | compile_class_nested(const uint32_t **pptr, PCRE2_UCHAR **pcode)

I fixed this by inserting the obvious cast, but is that right? Haven't done any testing yet. I see the CI tests are failing; probably the same thing.

NWilson · 2024-10-09T15:39:06Z

I'm expecting that the Perl syntax will be able to reuse the same interpreter support and OPcodes.

I won't tackle implementing the parser & compiler for the Perl classes in this PR though, there's enough in it already. I agree it should be OK to add in a follow-on PR, and it wouldn't need any flags.

I imagined that it might be worth spending a "normal" parse bit on this flag? This flag implements a behaviour that's in JavaScript/Python/Rust and other engines, so it's more mainstream than all the other minor things I've recently added to the extra options. But you're quite possibly right, those bits are too precious now to spend on non-Perl-related syntax extensions.

PhilipHazel · 2024-10-09T15:51:18Z

How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:

PCRPCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/[[a-c] && [b-d]]/set_class
Failed: error 189 at offset 16: internal error: unknown code in parsed pattern

This happens at line 6441 in pcre2_compile.c.

PhilipHazel · 2024-10-09T16:04:14Z

Oops sorry, I told a lie. It's around line 8682 (didn't look for multiple occurrences of this error).

PhilipHazel · 2024-10-10T07:23:06Z

Thinking about the name PCRE2_SET_CLASS ... I assume that some time in the future, the Perl syntax will always be recognized, so what this option is enabling is an alternative syntax for what Perl calls "extended bracketed character classes". So perhaps a name like PCRE2_ALT_EXTENDED_CLASS is more descriptive. This is the same style as other PCRE2_ALT_xxx options.

NWilson · 2024-10-10T10:06:52Z

How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:

Not quite ready. It's just for getting feedback, if anyone's interested in looking (thanks for trying!).

I'm happy to rename from "set[-based] class" to "extended class". That will be easy enough to do.

NWilson · 2024-10-10T13:21:42Z

Simple testcases now pass, such as [abc--b].

The one thing that's broken (that I know of) is that you can only include single characters like 'a', not ranges like 'a-z' nor properties like \p{..} or \d. I still need to pull out the delicate code for doing that, into a reusable chunk I can call.

src/pcre2_match.c

src/pcre2_compile.c

NWilson · 2024-10-18T15:13:14Z

I've added docs.

I'm down to these tasks:

JIT support - is that needed for this PR, or can it come in a future one?
Need a lot more tests, I'm currently writing them. Will be done in a day or two
One chunk of code (marked by a big "XXX") needs to be pulled out into a separate function.
I'll move all the code into pcre2_compile_class.c as requested

zherczeg · 2024-10-19T05:10:29Z

Some random comments.

I don't see a negate operation (can be implemented as ^ 0x1). For example, it is needed for [^\PL-\x{100}-\x{200}].

There is no point to duplicate bitsets (for character 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.

The parser can be implemented with using a simple stack in the parser context (e.g. 32 uint32_t values, which contains the offsets of the currently active operators). This way the recursion does not require any recursion.

There should be tests for reaching the maximum allowed depth. They should throw parse errors.

Is [&&] allowed? Is it simply matches to nothing? Does [[]&&[]] depends on allow empty classes flag? Btw how "allow empty classes" affects the parsing?

There should be optimizations for ||. [[[\pS]||]||[[[\pL]]]] is just [\pS\pL]. This is easy to do during parsing.

Personally I would just implement [] and || at the first stage, do all the optimizations, parse errors, discuss all the minor details. This only requires parser modifications, so you don't need to worry about matching.

NWilson · 2024-10-22T17:12:16Z

Thank you @zherczeg for reviewing!

I don't see a negate operation (can be implemented as ^ 0x1). For example, it is needed for [^\PL-\x{100}-\x{200}].

It was there in the PR, it's OP_ECLASS_NOT. There isn't a META for it, because it's encoded as META_CLASS / META_CLASS_NOT distinction. But I had implemented the interpreter support.

There is no point to duplicate bitsets (for character 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.

True. I haven't done this optimisation.

The parser can be implemented with using a simple stack in the parser context (e.g. 32 uint32_t values, which contains the offsets of the currently active operators). This way the recursion does not require any recursion.

Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting-yard thing... but it would be more code than recursive descent.

Note that the META parser (the one that consumes characters and produces META) is just a simple for-loop, with an int depth state. It's the parser that consumes the META and produces OPs which has to parse the recursive grammar.

The grammar is certainly recursive - the question is simply whether (given that it has a fixed nesting depth of 15) we need to reduce the stack consumption, for embedded devices.

I can check how much stack space it actually uses. Currently it looks like it chews up around 100 bytes per recursion, in a 64-bit Release build, so that comes out to a couple of KiB of stack space consumed. The shunting-yarn approach would cut that down a bit, but I'm really not sure it's worth it.

There should be tests for reaching the maximum allowed depth. They should throw parse errors.

Added. Lots more tests added.

Is [&&] allowed? Is it simply matches to nothing? Does [[]&&[]] depends on allow empty classes flag? Btw how "allow empty classes" affects the parsing?

Good question. [&&] is not allowed: an operator must have something non-empty on either side.

The [[]&&[]] case does indeed depend on allow_empty_class. That flag affects not just the outermost [...], but all of the nested classes opened by a '[' character. The allow_empty_class flag affects the inner nested classes in the same way as an outermost class: if the first literal is a ']', then the flag determines whether it's a literal or a class close ']'.

There should be optimizations for ||. [[[\pS]||]||[[[\pL]]]] is just [\pS\pL]. This is easy to do during parsing.

That would be nice... but I'm so far totally focussed on completeness and correctness. I'd rather get the PR merged, than hold it up for edge-case optimisations.

Personally I would just implement [] and || at the first stage, do all the optimizations, parse errors, discuss all the minor details. This only requires parser modifications, so you don't need to worry about matching.

Even with just ||, you need changes to the matcher: [a[^b]] is pretty simple, but requires runtime op support.

NWilson · 2024-10-22T17:13:52Z

Updated task list:

Could add some more tests, there's a handful of missing cases. But it's somewhat thoroughly exercised already.
Still no JIT support. Unsure whether I need to do this to get it merged.
And certainly no (?[...]) support. That can come in a follow-up PR, soon after
I haven't implemented the special-cased support for [ab]+ (following quantifier) that XCLASS has. I guess I could do that for ECLASS too, although it works fine without.
Still a question of whether it should be parsable without any recursion, to save a KiB of stack space

NWilson · 2024-10-22T17:20:00Z

Also - regarding Zoltan's example of [[[\pS]||]||[[[\pL]]]]. The [thing||] syntax is not legal. And the extra brackets [[[thing]]] syntax is already optimised away (compiles to the same OPs as [thing] automatically).

PhilipHazel · 2024-10-23T07:32:30Z

Still no JIT support. Unsure whether I need to do this to get it merged.

Merging without JIT support should be possible. Because you have added new opcodes, a call to pcre2_jit_compile() will fail, and matching should fall back to the interpreter. There are still, I think, one or two other constructions that JIT does not support.

zherczeg · 2024-10-23T08:03:20Z

Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting-
yard thing... but it would be more code than recursive descent.

It is usually easier for optimizations, since it is easy to drop operators, and rewrite the META stream. This is hard for actual recursion, since you cannot drop functions form the function stack. Not in C at least.

Example: [[[a-f]]||[i-k||m-o]]

PUSH [ -> add OP_???CLASS to meta stream, push a pointer to the token onto the pointer (or unsigned offset) stack
PUSH [ -> similar to before, adds data to the META stream, and a pointer to it in the stack
PUSH [
COPY a-f -> just adds data to the META stream, pushes nothing
Now a ] comes, we need to find what happens. Since we have an actual class on the stack ([a-f), we terminate it.
PUSH ], and drop '[' from the stack.
Now another ] comes, and we see, that a [ on the stack, which is followed by a class in the META stream. Since that class is NOT pushed onto the stack, it is a full class literal. Then we delete the OP???_CLASS form the META stream by a memove to the offset of it (it is known in the stack). We don't add anything to the META stream, and pop the top element of the stack. Essentially [[a-f]] is converted to [a-f]

Then a || comes, and we check what is on the top of the stack. It is a '[', and it followed by a class which is not pushed onto the stack. We again delete the [] part with a memove, and the pattern is basically turned to an [a-f||...in the META stream.
PUSH || -> we don't know how the pattern will turn out, so we need to push the operator.
PUSH [
COPY i-k
PUSH ||
COPY m-o

And now the interesting part comes, since we encounter an ']'. The top item is ||, and if we go back one item in the stack, and it is || or '[', then we can delete the || from the stack, which turns [i-k||m-o] to a i-km-o. We don't push anything to the stack.

Then another ']' comes, and the [a-f||i-km-o] is turned to [a-fi-km-o]. The topmost [] is never removed.

Adding precedence to operators is not hard, and you need to handle ^ as well, since it inverts a sub group. Tracking the current stack depth is also possible (cannot be > 32). This is why I said that supporting [ and || should be the first step, and then you have a strong foundation to build things further. If you do it in the opposite way, you might need to rewrite the entire algorithm, and that is a lot of effort.

The challenging part is that the META stream size is limited, and perl has single character operators, so we need to encode everything into one uint32_t value.

Btw [a&&bc] is [a&&[b]c] or [a&&[bc]]?

NWilson · 2024-10-23T08:28:19Z

Well, as I said, "I can code up a simple shunting-yard thing... but it would be more code than recursive descent."

You are describing the classic shunting-yard thing... and it really is more code, and more complexity.

At the moment, I have implemented everything (^, ||, &&, and --), and it's both robust, and correct. So I don't see it as a bad foundation at all. I could rewrite the second parsing phase, but in my opinion the only reason to do so would be to save a tiny bit of stack space.

I have implemented most of the optimisations you describe. I haven't implemented merging [[a-f][i-k]] into a single class, which would be a bit fiddly, but I guess could be possible.

Note also that the || is very unlikely to be used in practice. It's simply there for completeness. JavaScript chose not to add it, but Rust and Python regex module did. I don't think it's worth doing any special optimisations for it.

Btw [a&&bc] is [a&&[b]c] or [a&&[bc]]?

[a&&bc] is [a&&[bc]] (also [a&&[b]c] is [a&&[ [b]c ]]). The ^ character is not an operator, it's a modifier on nested classes to make them into "yes-classes" or "not-classes". The implicit-union operator (juxtaposition) has tightest precedence (implicit union is any juxtaposition of atoms like single characters or ranges, and also nested classes; for example, [[a-b][^c-e]]). The explicit operators && || -- have looser precedence (and undefined precedence relative to each other - you aren't allowed to mix them).

I'm not at all opposed to changing the parser. I just want to be clear on the goal of the improvement.

PhilipHazel · 2024-10-23T13:44:32Z

If it's all implemented and working, I think we (that is, you :-) should proceed and get on with what is left to be done. Internal algorithms can always be changed later if that proves to be beneficial.

NWilson · 2024-10-23T16:34:59Z

Great! I am pretty-much done here.

I can see why Zoltan wants more optimisations - that's his role in life, adding a JIT after all!

I'm warming up to the idea of doing some extra work to satisfy that.

The branch is ready for review; I'll only be adding some quite local changes in the parser if I do that extra work.

The stuff I don't understand well enough:

changes in pcre2_study, and quantification, and the interpreter...
particularly auto-possessification
the "reqcu" flags
and other PCRE-specific stuff, where I just don't have the background, and I may have goofed.

The I think I do understand:

I'm confident with my parsing and matching functions. They're pure, side-effect-free things and just do what it says on the tin.

zherczeg

Unfortunately a lots of choices is confusing for me here. For me these should work like a usual expression evaluator, with unary/binary operators and precedence. The primary expressions should be boolean values, [] can be used for grouping.

zherczeg · 2024-10-23T16:40:43Z

testdata/testoutput2

+          cls:[A]
+          cls:[B]
+          op: ||
+          op: ^


Even if ^ is a modifier, it is an operator in practice. Btw is [^[^a-z]] optimized to [a-z]?

Ah I see. In the grammar, it's not an operator, because it's attached to the [^, so it's parsed as a special form of "open paren". But in the evaluator, it is an operator. The close-paren "]" emits a final OP_ECLASS_NOT if it was opened with a "[^".

Btw is [^[^a-z]] optimized to [a-z]?

Not yet, no. I will do optimisations (although these are pathological cases you suggest...). If possible I'll do it in a PR next week, although I can include it in the current PR if you require.

There are no requirements. Since I prefer to have these optimizations, I should do them.

I really will do them, I'd just rather split into a PR next week, as an extra piece of work.

Just give me a few days to get onto it. I'm away this weekend.

zherczeg · 2024-10-23T16:46:28Z

testdata/testoutput2

+    ]
+ 0: ]
+
+/[A-C--B]/B,alt_extended_class


Is there tests for [A--B--C--D] [A--^B] [A---B] [A----B] [^--A].
Is there tests to reach the maximum depth?

I have most of those tests.

The expression [A--^B] has a literal ASCII ^ character, and so it's equivalent to simply [A].

The expression [A---B] is a parse error. So are [A----B] [^--A].

I have tests to reach maximum depth - a successful test at max depth, and a parse error test at max+1.

zherczeg · 2024-10-23T16:48:28Z

testdata/testoutput2

+        eclass[
+          cls:[a]
+          cls:[\-]
+          op: ||


We should optimize these before the release.

zherczeg · 2024-10-23T16:49:44Z

testdata/testoutput2

+Failed: error 209 at offset 3: unexpected operator in character class (no preceding operand)
+
+/[a---b]/alt_extended_class
+Failed: error 208 at offset 5: invalid operator in character class


Should be [a--[-b]]

Why? We're following UTS#18 syntax here, not inventing our own. Other implementations forbid tripled operator characters.

A single - hyphen is either a literal or a range separator, with complex disambiguation. A double hyphen -- is always an operator (never one end of a range: [--a] used to mean "characters from a to hyphen" but in UTS#18 syntax it's invalid and you must instead use [\--a]). A triple hyphen is just invalid.

There is no tripled operator here. [#--] and [--a] are currently valid.

It looks like unicode recommendation lists several chars, which cannot be in a pattern without \. Do you have tests for them?

There is no tripled operator here. [#--] and [--a] are currently valid.

Sorry, maybe the comment was moved by GitHub when a file was updated? I thought you were commenting on the case /[a---b]/, saying it should act the same as [a--[-b]].

[--a] are currently valid

Correct. But it won't be legal if you enable UTS#18 mode. The sequence "--" becomes an operator, which requires an operand.

It looks like unicode recommendation lists several chars, which cannot be in a pattern without . Do you have tests for them?

That's true. But the spec is just so vague... For example: "Different variants of SYNTAX_CHAR, SPECIAL_CHAR, and NON_SYNTAX_CHAR can be used for particular contexts to maintain compatibility". And also: "This is only a sample syntax for the purposes of examples in this document. Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation."

Basically, they want each regex engine to be able to choose its own syntax, rather than require any more than the bare minimum of changes to their historical behaviour, in order to claim to be UTS#18-compliant.

It's mainly intended to discuss the Unicode matching issues in regex engines, and not force them to adopt a common syntax.

That's why I was most of all concerned to carefully match the behaviour of other regex engines (ECMAScript, Python).

Short answer: we shouldn't randomly forbid characters that are accepted in PCRE2 and Perl currently (such as requiring { to be escaped inside classes). We have to add the [ and -- metacharacters, but the rest can (and should, I believe) be kept as-is.

zherczeg · 2024-10-23T16:50:36Z

testdata/testoutput2

+Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class
+
+/[a--b&&c]/alt_extended_class
+Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class


Good question. I address this in the manpage updates included in the PR.

There's also the (overly?) detailed preliminary survey I did of other implementations of UTS#18 syntax, which I linked in an email thread: https://github.com/NWilson/pcre2/blob/user/niwilson/utr18/NEW-CLASSES.md

UTS#18 doesn't define the relative precedence of the operators

Some implementations ban mixing operators (ECMAScript spec requires this)

Some implementations give them all the same precedence... but some implementations give them a different precedence

So it's a compatibility nightmare, what a mess. I don't think we can justify picking a "side" in the battle. It would be really unfortunate if we picked some precedence for the operators, and then they updated UTS#18 with a different choice.

This choice definitely simplify the current code. However, perl supports multiple operators, so the code needs to be able to handle it. This might be another thing I need to implement.

I'll be implementing the Perl syntax, don't worry. It will drop into the current parser very easily.

PhilipHazel · 2024-10-24T07:15:51Z

I see there is still one failing CI test.

NWilson · 2024-10-24T11:50:26Z

For me these should work like a usual expression evaluator, with unary/binary operators and precedence. The primary expressions should be boolean values, [] can be used for grouping.

That is exactly how it works, I think. I didn't do anything "unusual" on purpose. There are primary expressions (atoms) which are characters/ranges/POSIX-props/Unicode-props, and there are binary operators (no unary), and [] can be used for grouping. Just as you say.

Very classic recursive descent with precedence, producing a RPN sequence of operators.

I see there is still one failing CI test.

Yes, it's weird. Moving some code around triggers the Werror=maybe-uninitialized warning in alpine's version of GCC, although I didn't really touch that code. It's full of gotos; I guess the detector is a bit sensitive.

PhilipHazel · 2024-10-24T15:09:42Z

Is this a bug?

/[abc -- b]+/alt_extended_class
    acacbac
 0: ac

I expected it to match "acac".

PhilipHazel · 2024-10-24T15:16:19Z

Re my previous comment: under DFA matching it does match "acac".

PhilipHazel · 2024-10-24T15:43:16Z

Another example where DFA does what I expect, but non-DFA doesn't. May be same bug as before.

/[\pL || [:digit:]]+/alt_extended_class
    +a03+
No match
    +a03+\=dfa
 0: a03

NWilson · 2024-10-24T15:56:57Z

Yes, it is a bug. I found a similar case earlier this afternoon while adding more tests of my own.

Very odd, something do with the fact that I changed OP_ECLASS to take part in auto-possessification, in the same way that OP_XCLASS does. I introduced the regression at that point.

It turns out to be a small one-line copy-and-paste error, when I copied a bit of code from XCLASS to the ECLASS case, but changed a break into a return (oops!).

(Anyway, all of Philip's cases are now fixed.)

HACKING

doc/pcre2_compile.3

doc/pcre2pattern.3

src/pcre2.h.in

src/pcre2_auto_possess.c

src/pcre2_compile.h

carenas · 2024-10-24T15:14:20Z

src/pcre2_extuni.c

@@ -40,7 +40,7 @@ POSSIBILITY OF SUCH DAMAGE.

 /* This module contains an internal function that is used to match a Unicode
 extended grapheme sequence. It is used by both pcre2_match() and
-pcre2_def_match(). However, it is called only when Unicode support is being
+pcre2_dfa_match(). However, it is called only when Unicode support is being


is this (the need for the dummy when unicode support is not compiled that will be never called) still true?, it is confusing but it would seem that ECLASS calls this unconditionally.

I don't think that the ECLASS code calls into pcre2_extuni at all.

I haven't changed anything in here, except noticed a typo in a comment that was copy-and-pasted into multiple files.

src/pcre2_jit_compile.c

zherczeg · 2024-10-25T05:02:12Z

testdata/testoutput2

+          cls:[a]
+          cls:[B]
+          op: ||
+        ]


I just realized an optimization is missing here.

PCRE2 traditionally uses a bitset for the first 256 characters of sets. Even for the new ECLASS, there should only be one bitset.

Should work like this:

The ECLASS contains only XCLASS byte codes.

If, and only if an XCLASS has a bitset, then the ECLASS has a bitset.

The ECLASS bitset is pre-computed, no extra operation is needed

The ECLASS should be turned to CLASS / NCLASS if nothing else only the bitset is present

No XCLASS has a bitset

I know this is a complex, although important optimization, since it reduces storage and improves performance. I can do it in a follow-up patch.

If we do full constant-folding for the CLASS/NCLASS cases, we'd not have many redundant bitsets, in practice.

I agree we could lift out the bitset, if that's important. I guess we'd need a way to fold things like CLASS && XCLASS and so on - remove the bitset from the XCLASS, and since the CLASS includes all chars >255 we can reduce CLASS && XCLASS → XCLASS.

I think I can do this fine in next week's PR, when I do the CLASS op CLASS → CLASS constant-folding.

Thanks though @zherczeg, it's a great observation, and worth doing.

If you have other priorities, I can also do these things. I don't want to put more work on you.

You previously said you wouldn't do work for Excel for free! I'll finish what I started, if you give me a bit of time to chip away at it.

NWilson · 2024-10-25T09:42:22Z

I'm done with this!!

Whew. I'm not aware of anything broken, and it's feature-complete, as far as the ALT_EXTENDED_CLASS flag goes.

I've filed tickets for the follow-up work, so it's not forgotten (optimizations; and (?[...]) support), and I'll commit to doing it beginning next week.

I've either addressed all of Zoltan and Carlo's feedback so far, or asked for the points raised to be deferred.

Many, many thanks for all the help reviewing and testing this!

zherczeg

We have discussed a lot of things in this PR, and would be good to add a specific comment (with a fixed string which is easy to search), to the code where we will likely do more things. This way we could not forget about them.

zherczeg · 2024-10-25T12:03:39Z

src/pcre2_auto_possess.c

@@ -479,13 +480,13 @@ switch(c)

  case OP_NCLASS:
  case OP_CLASS:
-#ifdef SUPPORT_WIDE_CHARS


I would add some /* revert */ comments because we should revert these after ECLASS will have a single bitset. It is easy to forget them otherwise.

I'm not quite sure what you intend here. Even when ECLASS has a single bitset, it will behave roughly the same as XCLASS, for the purposes of the auto-possessification code at least. There will always need to be the LINK_SIZE slot, holding the size of the stack which follows.

I'll add the TODO comment as you request.

Wide chars mean unicode or > 8 bit. In 8 bit mode, without utf, the regular OP_CLASS with 256 bits must handle the extended class, so it should never encounter with this opcode. In general, eclasses should be turned to normal classes/nclasses whenever it is possible, just like xclasses.

As for the implementation, eclasses should have a prediction system, which predicts their size if they cannot be turned to class/nclass. The prediction is discarded, if they can. The prediction bit (can or cannot) can be saved in the META code (you have 16 free bits for that) during the byte code length computation phase, and reused when the actual byte code is generated.

Of course, I see! If the constant-folding is sufficiently aggressive, then OP_ECLASS will never be emitted without wide-char support, the same as XCLASS. OK, that makes good sense.

Btw, both the "need a bitset for eclass" and "eclass can be turned to class/nclass" bits can be computed when the byte code length is computed (two bits in the META code), and this simplifies the actual byte code generation later, since "the oracle" tell us what to do.

src/pcre2_auto_possess.c

src/pcre2_compile.c

PhilipHazel · 2024-10-25T12:52:11Z

Concerning the documentation: Many thanks for what you have done so far, Nick. Once the Perl syntax is also implemented, I was thinking that it might be helpful to have an overall section called "Extended character classes" that explains that there is Perl syntax and everybody else's syntax and then have two separate sections, one for Perl and one for the other. I am happy to work on the documentation when the time comes.

Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ?

carenas · 2024-10-25T21:29:20Z

do we need to add (*ALT_EXTENDED_CLASS) ?

I actually had a patch implementing (?X), but didn't send it because the "incompatibilities" between the old syntax and the Unicode one are so minimal that I think it would be better not to have one, and wait to see what happens after it gets released.

src/pcre2_compile_class.c

src/pcre2_xclass.c

carenas · 2024-10-26T16:34:39Z

src/pcre2_xclass.c

+    default. */
+
+    default:
+    PCRE2_DEBUG_UNREACHABLE();


this should be PCRE2_UNREACHABLE() with no return after;

probably better not to have it, do we know of any compiler complaining?

I just copied it from the existing code above.

I'm happy with return statements after DEBUG_UNREACHABLE() assertions, because in release builds, the assertions aren't emitted at all, so something has to happen if a customer actually triggers this condition.

Regardless of whether compilers complain, I complain if I see switch statements without a default condition! An assertion is highly beneficial for maintainability.

fair enough, but then the argument still stands for using a PCRE2_UNREACHABLE() instead as that return should never happen in a release build anyway.

BTW, the original code you copied from is fairly new and was using PCRE2_DEBUG_UNREACHABLE() because it was "safer" and because the implementation for MSVC was an afterthought, and based on what we learned online with zero practical experience, so it will be also nice if you could confirm that using an __assume(0) in production makes sense in this case (which is what will be emitted in a non debug build)

Currently it looks like the macros emit:

|---|---|---|

Debug Non-debug

PCRE2_DEBUG_UNREACHABLE __assume(0) do {} while (0)

PCRE2_UNREACHABLE __assume(0) __assume(0)

This looks correct to me.

The DEBUG_UNREACHABLE version should have error-recovery code, such as a return statement, but the UNREACHABLE() version shouldn't. I like that.

I think __assume(0) is correct for MSVC.

Actually, I think it should emit the following instead:

Debug Non-debug

PCRE2_DEBUG_UNREACHABLE PCRE2_ASSERT(FALSE) do {} while (0)

PCRE2_UNREACHABLE PCRE2_ASSERT(FALSE) __assume(0)

We want some actual runtime code to call abort() when we're in a debug build.

I can make a PR to do that separately.

I have noticed recently that assert() built-in function has pretty nice output, looks better than abort()

src/pcre2_compile_class.c

NWilson · 2024-10-28T11:15:40Z

Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ?

I think not Philip. We are currently missing start-of-pattern, and in-pattern, controls for all of the following: ALLOW_EMPTY_CLASS, ALT_BSUX, DOLLAR_ENDONLY, MATCH_UNSET_BACKREF, ALT_CIRCUMFLEX, and others.

These are all flags which affect the regex dialect in some way, rather than fundamentally change what is or isn't expressible.

I don't think we need a start-of-pattern flag for any of these, nor for ALT_EXTENDED_CLASS: if a pattern author knows to add the flag, they can just as easily edit the pattern itself to match PCRE's dialect. (Admittedly, that would be fiddly for a flag like ALT_CIRCUMFLEX, but certainly possible to rewrite it in terms of ^ and \z. Similarly, anyone who wants ALT_EXTENDED_CLASS can just rewrite to use the Perl extended syntax.)

NWilson · 2024-10-28T11:17:56Z

I have addressed Carlo and Zoltan's latest review comments: more TODOs; Carlo's minor improvements.

zherczeg

The commits should be squashed. I have only a few questions. Overall this patch is a good first step.

Let's wait a bit longer for @PhilipHazel , but if he is away, I can also land this patch.

zherczeg · 2024-10-30T08:21:36Z

src/pcre2_internal.h

+the operand stack in a uint32_t. A nesting limit of 15 implies (15*2+1)=31
+stack operands required. */
+
+#define ECLASS_NEST_LIMIT  15


I thought 1 bit is enough for an xclass result value. Why do you need two?

Aha, because you can have two operators within a '[...]'. So if you have two brackets: [A--B[C--DE]] then the worst-case and peak stack usage is that the leafmost [C--DE] takes three slots on the stack (result of C, D and E, before applying || to DE and --), and the non-leafmost [A--B ...foo...] takes two slots on the stack (need to store result of matching A and B).

Hence 5 stack slots for two [ ... [...] ] brackets. It's 2n+1.

It would be worse if we allowed mixing precedence within the brackets; then each [...] could consume even more stack slots.

I thought the pushes/pops are tracked individually. Something like /[A]&&[B]||[C]--[D]/ is push [A], push [B], &&, push [C], ||, push[D], -- and the max stack size is 2.

Recursive descent should be able to compute it. This looks like another thing to do then.

For your example (/[A]&&[B]||[C]--[D]/), yes, the maximum stack size is 2.

The worst case is a stack size of three, if you have A op1 B op2 C where op2 is higher (tighter) precedence.

We have a hard limit of 15 nested brackets; which gives us a maximum stack consumption of 31 slots.

Of course we can calculate (if we want) what the actual stack slots consumed will be, but we guarantee it will fit into a uint32_t if we limit the brackets to at most 15 deep.

zherczeg · 2024-10-30T08:23:05Z

src/pcre2_pattern_info.c

@@ -384,11 +384,10 @@ while (TRUE)
 #endif
    break;

-#if defined SUPPORT_UNICODE || PCRE2_CODE_UNIT_WIDTH != 8


This should have the revert comment as well.

I thought that it was OK to just leave the (test-only) code here clean of ifdefs. But I can do if you prefer the ifdefs to come back.

zherczeg · 2024-10-30T08:24:02Z

src/pcre2_printint.c

+        if (PRINTABLE(j)) fprintf(f, "%c", j);
+          else fprintf(f, "\\x%02x", j);
+        }
+      i = j;


It seems all code is moved here. Is this correct?

Yes. I lifted out the existing code for printing character classes into a function.

Similarly, much of the "new" code in pcre2_compile_class.c is lifted out of pcre2_compile.c.

PhilipHazel · 2024-10-30T09:09:14Z

I am not away, but I have been letting you guys get on with this patch. I have not looked at it in detail.Please go ahead and merge when you are all happy. (I am also distracted getting a new desktop PC up and configured because my old one is over 10 years old and I'm not sure how long it will go on.)

* Move some existing character class code into pcre2_compile_class.c * Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of parsing [...] character classes, to emit new META codes, and new OP_ECLASS codes for nested character classes with operators * Document the behaviour relative to the UTS#18 standard * No JIT support; it falls back to the interpreter. DFA is supported.

NWilson · 2024-10-30T09:57:41Z

Thank you Philip! I've squashed again, and added the TODO comment in printint.c requested by Zoltan.

I'm ready to merge, since it's fully-functional (as far as I'm aware), and I'm eager to crack on with the follow-up PRs.

zherczeg · 2024-10-30T10:34:12Z

Ok, patch is landed. I except a lot of follow up works, but it is the nature of these large features.

NWilson · 2024-10-30T11:14:21Z

Great, thank you! My next PR should follow later this week.

zherczeg mentioned this pull request Oct 15, 2024

Improve character range matching with binary search #524

Merged

zherczeg reviewed Oct 15, 2024

View reviewed changes

src/pcre2_match.c Outdated Show resolved Hide resolved

zherczeg reviewed Oct 15, 2024

View reviewed changes

src/pcre2_compile.c Outdated Show resolved Hide resolved

NWilson force-pushed the user/niwilson/sclass branch from 0dabd15 to 7bd2a0d Compare October 22, 2024 15:35

NWilson changed the title ~~(Draft) Implement the new set-based character classes~~ Implement the new set-based character classes Oct 22, 2024

zherczeg reviewed Oct 23, 2024

View reviewed changes

carenas reviewed Oct 24, 2024

View reviewed changes

NWilson force-pushed the user/niwilson/sclass branch from e83f93b to cf38b78 Compare October 24, 2024 20:07

zherczeg reviewed Oct 25, 2024

View reviewed changes

carenas approved these changes Oct 26, 2024

View reviewed changes

NWilson mentioned this pull request Oct 29, 2024

Move character lists data before the byte code in a pattern #540

Merged

zherczeg reviewed Oct 30, 2024

View reviewed changes

NWilson force-pushed the user/niwilson/sclass branch from f28d637 to 07e6c4d Compare October 30, 2024 09:54

NWilson force-pushed the user/niwilson/sclass branch from 07e6c4d to d898150 Compare October 30, 2024 09:55

zherczeg merged commit fc38d9e into PCRE2Project:master Oct 30, 2024
15 checks passed

NWilson deleted the user/niwilson/sclass branch November 6, 2024 17:23

	Debug	Non-debug
PCRE2_DEBUG_UNREACHABLE	__assume(0)	do {} while (0)
PCRE2_UNREACHABLE	__assume(0)	__assume(0)

	Debug	Non-debug
PCRE2_DEBUG_UNREACHABLE	PCRE2_ASSERT(FALSE)	do {} while (0)
PCRE2_UNREACHABLE	PCRE2_ASSERT(FALSE)	__assume(0)

Implement the new set-based character classes #523

Implement the new set-based character classes #523

Conversation

NWilson commented Oct 9, 2024

zherczeg commented Oct 9, 2024

PhilipHazel commented Oct 9, 2024

NWilson commented Oct 9, 2024 • edited Loading

PhilipHazel commented Oct 9, 2024

PhilipHazel commented Oct 9, 2024

PhilipHazel commented Oct 10, 2024

NWilson commented Oct 10, 2024

NWilson commented Oct 10, 2024

NWilson commented Oct 18, 2024 • edited Loading

zherczeg commented Oct 19, 2024

NWilson commented Oct 22, 2024 • edited Loading

NWilson commented Oct 22, 2024 • edited Loading

NWilson commented Oct 22, 2024

PhilipHazel commented Oct 23, 2024

zherczeg commented Oct 23, 2024

NWilson commented Oct 23, 2024 • edited Loading

PhilipHazel commented Oct 23, 2024

NWilson commented Oct 23, 2024

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PhilipHazel commented Oct 24, 2024

NWilson commented Oct 24, 2024

PhilipHazel commented Oct 24, 2024

PhilipHazel commented Oct 24, 2024

PhilipHazel commented Oct 24, 2024

NWilson commented Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zherczeg Oct 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NWilson commented Oct 25, 2024 • edited Loading

zherczeg left a comment

Choose a reason for hiding this comment

zherczeg Oct 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PhilipHazel commented Oct 25, 2024

carenas commented Oct 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Oct 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NWilson commented Oct 28, 2024

NWilson commented Oct 28, 2024

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NWilson commented Oct 9, 2024 •

edited

Loading

NWilson commented Oct 18, 2024 •

edited

Loading

NWilson commented Oct 22, 2024 •

edited

Loading

NWilson commented Oct 22, 2024 •

edited

Loading

NWilson commented Oct 23, 2024 •

edited

Loading

NWilson commented Oct 24, 2024 •

edited

Loading

zherczeg Oct 25, 2024 •

edited

Loading

NWilson commented Oct 25, 2024 •

edited

Loading

zherczeg Oct 25, 2024 •

edited

Loading

carenas Oct 28, 2024 •

edited

Loading