Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the new set-based character classes #523

Merged
merged 1 commit into from
Oct 30, 2024

Conversation

NWilson
Copy link
Contributor

@NWilson NWilson commented Oct 9, 2024

This is a mostly-complete implementation!

It needs some tidy-up, in places where I've left "XXX" comments. But I've hammered out the code and I'll start testing it.

I'm afraid I used recursive descent in the operator-precedence parser, but it only goes down to a maximum depth of 15 levels, so it should be OK. I just wanted to prototype quickly. I don't know "how much stack is too much" in PCRE2 - it's only pushing ~hundreds of bytes on the stack at most.

The two big bits of logic that need to be shared with existing code, are the OP_XCLASS interpreter code, and the compiler code to build the OP_CLASS/NCLASS/XCLASS code. The current code is inlined into the one place where it's used.

Anyway, this is a draft of where it's going.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 9, 2024

Pattern (bracket) parsing is already recursive. Although you can also a use a simple array for simulating recursion. I suspect a pointer in the META code is enough to track the status.

@PhilipHazel
Copy link
Collaborator

I do know that there are applications running PCRE2 in a large number of threads, each of which has very limited stack, but I don't know how limited. One would hope a couple of hundred bytes would be OK. I suppose we do need a new option to turn this on because changes the meaning of things, but it seems a shame to use one of the only two remaining main options bits. If/when the Perl syntax is implemented, it shouldn't need to be enabled.

I got a compile error when trying to build your branch:

src/pcre2_compile.c:6110:28: error: passing argument 1 of 'compile_class_nested' from incompatible pointer type [-Wincompatible-pointer-types]
 6110 |       compile_class_nested(&pptr, &code);
      |                            ^~~~~
      |                            |
      |                            uint32_t ** {aka unsigned int **}
src/pcre2_compile.c:2669:39: note: expected 'const uint32_t **' {aka 'const unsigned int **'} but argument is of type 'uint32_t **' {aka 'unsigned int **'}
 2669 | compile_class_nested(const uint32_t **pptr, PCRE2_UCHAR **pcode)

I fixed this by inserting the obvious cast, but is that right? Haven't done any testing yet. I see the CI tests are failing; probably the same thing.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 9, 2024

I'm expecting that the Perl syntax will be able to reuse the same interpreter support and OPcodes.

I won't tackle implementing the parser & compiler for the Perl classes in this PR though, there's enough in it already. I agree it should be OK to add in a follow-on PR, and it wouldn't need any flags.

I imagined that it might be worth spending a "normal" parse bit on this flag? This flag implements a behaviour that's in JavaScript/Python/Rust and other engines, so it's more mainstream than all the other minor things I've recently added to the extra options. But you're quite possibly right, those bits are too precious now to spend on non-Perl-related syntax extensions.

@PhilipHazel
Copy link
Collaborator

How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:

PCRPCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/[[a-c] && [b-d]]/set_class
Failed: error 189 at offset 16: internal error: unknown code in parsed pattern

This happens at line 6441 in pcre2_compile.c.

@PhilipHazel
Copy link
Collaborator

Oops sorry, I told a lie. It's around line 8682 (didn't look for multiple occurrences of this error).

@PhilipHazel
Copy link
Collaborator

Thinking about the name PCRE2_SET_CLASS ... I assume that some time in the future, the Perl syntax will always be recognized, so what this option is enabling is an alternative syntax for what Perl calls "extended bracketed character classes". So perhaps a name like PCRE2_ALT_EXTENDED_CLASS is more descriptive. This is the same style as other PCRE2_ALT_xxx options.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 10, 2024

How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:

Not quite ready. It's just for getting feedback, if anyone's interested in looking (thanks for trying!).

I'm happy to rename from "set[-based] class" to "extended class". That will be easy enough to do.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 10, 2024

Simple testcases now pass, such as [abc--b].

The one thing that's broken (that I know of) is that you can only include single characters like 'a', not ranges like 'a-z' nor properties like \p{..} or \d. I still need to pull out the delicate code for doing that, into a reusable chunk I can call.

src/pcre2_match.c Outdated Show resolved Hide resolved
src/pcre2_compile.c Outdated Show resolved Hide resolved
@NWilson
Copy link
Contributor Author

NWilson commented Oct 18, 2024

I've added docs.

I'm down to these tasks:

  • JIT support - is that needed for this PR, or can it come in a future one?
  • Need a lot more tests, I'm currently writing them. Will be done in a day or two
  • One chunk of code (marked by a big "XXX") needs to be pulled out into a separate function.
  • I'll move all the code into pcre2_compile_class.c as requested

@zherczeg
Copy link
Collaborator

Some random comments.

I don't see a negate operation (can be implemented as ^ 0x1). For example, it is needed for [^\PL-\x{100}-\x{200}].

There is no point to duplicate bitsets (for character 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.

The parser can be implemented with using a simple stack in the parser context (e.g. 32 uint32_t values, which contains the offsets of the currently active operators). This way the recursion does not require any recursion.

There should be tests for reaching the maximum allowed depth. They should throw parse errors.

Is [&&] allowed? Is it simply matches to nothing? Does [[]&&[]] depends on allow empty classes flag? Btw how "allow empty classes" affects the parsing?

There should be optimizations for ||. [[[\pS]||]||[[[\pL]]]] is just [\pS\pL]. This is easy to do during parsing.

Personally I would just implement [] and || at the first stage, do all the optimizations, parse errors, discuss all the minor details. This only requires parser modifications, so you don't need to worry about matching.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 22, 2024

Thank you @zherczeg for reviewing!

I don't see a negate operation (can be implemented as ^ 0x1). For example, it is needed for [^\PL-\x{100}-\x{200}].

It was there in the PR, it's OP_ECLASS_NOT. There isn't a META for it, because it's encoded as META_CLASS / META_CLASS_NOT distinction. But I had implemented the interpreter support.

There is no point to duplicate bitsets (for character 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.

True. I haven't done this optimisation.

The parser can be implemented with using a simple stack in the parser context (e.g. 32 uint32_t values, which contains the offsets of the currently active operators). This way the recursion does not require any recursion.

Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting-yard thing... but it would be more code than recursive descent.

Note that the META parser (the one that consumes characters and produces META) is just a simple for-loop, with an int depth state. It's the parser that consumes the META and produces OPs which has to parse the recursive grammar.

The grammar is certainly recursive - the question is simply whether (given that it has a fixed nesting depth of 15) we need to reduce the stack consumption, for embedded devices.

I can check how much stack space it actually uses. Currently it looks like it chews up around 100 bytes per recursion, in a 64-bit Release build, so that comes out to a couple of KiB of stack space consumed. The shunting-yarn approach would cut that down a bit, but I'm really not sure it's worth it.

There should be tests for reaching the maximum allowed depth. They should throw parse errors.

Added. Lots more tests added.

Is [&&] allowed? Is it simply matches to nothing? Does [[]&&[]] depends on allow empty classes flag? Btw how "allow empty classes" affects the parsing?

Good question. [&&] is not allowed: an operator must have something non-empty on either side.

The [[]&&[]] case does indeed depend on allow_empty_class. That flag affects not just the outermost [...], but all of the nested classes opened by a '[' character. The allow_empty_class flag affects the inner nested classes in the same way as an outermost class: if the first literal is a ']', then the flag determines whether it's a literal or a class close ']'.

There should be optimizations for ||. [[[\pS]||]||[[[\pL]]]] is just [\pS\pL]. This is easy to do during parsing.

That would be nice... but I'm so far totally focussed on completeness and correctness. I'd rather get the PR merged, than hold it up for edge-case optimisations.

Personally I would just implement [] and || at the first stage, do all the optimizations, parse errors, discuss all the minor details. This only requires parser modifications, so you don't need to worry about matching.

Even with just ||, you need changes to the matcher: [a[^b]] is pretty simple, but requires runtime op support.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 22, 2024

Updated task list:

  • Could add some more tests, there's a handful of missing cases. But it's somewhat thoroughly exercised already.
  • Still no JIT support. Unsure whether I need to do this to get it merged.
  • And certainly no (?[...]) support. That can come in a follow-up PR, soon after
  • I haven't implemented the special-cased support for [ab]+ (following quantifier) that XCLASS has. I guess I could do that for ECLASS too, although it works fine without.
  • Still a question of whether it should be parsable without any recursion, to save a KiB of stack space

@NWilson
Copy link
Contributor Author

NWilson commented Oct 22, 2024

Also - regarding Zoltan's example of [[[\pS]||]||[[[\pL]]]]. The [thing||] syntax is not legal. And the extra brackets [[[thing]]] syntax is already optimised away (compiles to the same OPs as [thing] automatically).

@NWilson NWilson changed the title (Draft) Implement the new set-based character classes Implement the new set-based character classes Oct 22, 2024
@PhilipHazel
Copy link
Collaborator

Still no JIT support. Unsure whether I need to do this to get it merged.

Merging without JIT support should be possible. Because you have added new opcodes, a call to pcre2_jit_compile() will fail, and matching should fall back to the interpreter. There are still, I think, one or two other constructions that JIT does not support.

@zherczeg
Copy link
Collaborator

Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting-
yard thing... but it would be more code than recursive descent.

It is usually easier for optimizations, since it is easy to drop operators, and rewrite the META stream. This is hard for actual recursion, since you cannot drop functions form the function stack. Not in C at least.

Example: [[[a-f]]||[i-k||m-o]]

PUSH [ -> add OP_???CLASS to meta stream, push a pointer to the token onto the pointer (or unsigned offset) stack
PUSH [ -> similar to before, adds data to the META stream, and a pointer to it in the stack
PUSH [
COPY a-f -> just adds data to the META stream, pushes nothing
Now a ] comes, we need to find what happens. Since we have an actual class on the stack ([a-f), we terminate it.
PUSH ], and drop '[' from the stack.
Now another ] comes, and we see, that a [ on the stack, which is followed by a class in the META stream. Since that class is NOT pushed onto the stack, it is a full class literal. Then we delete the OP
???_CLASS form the META stream by a memove to the offset of it (it is known in the stack). We don't add anything to the META stream, and pop the top element of the stack. Essentially [[a-f]] is converted to [a-f]

Then a || comes, and we check what is on the top of the stack. It is a '[', and it followed by a class which is not pushed onto the stack. We again delete the [] part with a memove, and the pattern is basically turned to an [a-f||...in the META stream.
PUSH || -> we don't know how the pattern will turn out, so we need to push the operator.
PUSH [
COPY i-k
PUSH ||
COPY m-o

And now the interesting part comes, since we encounter an ']'. The top item is ||, and if we go back one item in the stack, and it is || or '[', then we can delete the || from the stack, which turns [i-k||m-o] to a i-km-o. We don't push anything to the stack.

Then another ']' comes, and the [a-f||i-km-o] is turned to [a-fi-km-o]. The topmost [] is never removed.

Adding precedence to operators is not hard, and you need to handle ^ as well, since it inverts a sub group. Tracking the current stack depth is also possible (cannot be > 32). This is why I said that supporting [ and || should be the first step, and then you have a strong foundation to build things further. If you do it in the opposite way, you might need to rewrite the entire algorithm, and that is a lot of effort.

The challenging part is that the META stream size is limited, and perl has single character operators, so we need to encode everything into one uint32_t value.

Btw [a&&bc] is [a&&[b]c] or [a&&[bc]]?

@NWilson
Copy link
Contributor Author

NWilson commented Oct 23, 2024

Well, as I said, "I can code up a simple shunting-yard thing... but it would be more code than recursive descent."

You are describing the classic shunting-yard thing... and it really is more code, and more complexity.

At the moment, I have implemented everything (^, ||, &&, and --), and it's both robust, and correct. So I don't see it as a bad foundation at all. I could rewrite the second parsing phase, but in my opinion the only reason to do so would be to save a tiny bit of stack space.

I have implemented most of the optimisations you describe. I haven't implemented merging [[a-f][i-k]] into a single class, which would be a bit fiddly, but I guess could be possible.

Note also that the || is very unlikely to be used in practice. It's simply there for completeness. JavaScript chose not to add it, but Rust and Python regex module did. I don't think it's worth doing any special optimisations for it.

Btw [a&&bc] is [a&&[b]c] or [a&&[bc]]?

[a&&bc] is [a&&[bc]] (also [a&&[b]c] is [a&&[ [b]c ]]). The ^ character is not an operator, it's a modifier on nested classes to make them into "yes-classes" or "not-classes". The implicit-union operator (juxtaposition) has tightest precedence (implicit union is any juxtaposition of atoms like single characters or ranges, and also nested classes; for example, [[a-b][^c-e]]). The explicit operators && || -- have looser precedence (and undefined precedence relative to each other - you aren't allowed to mix them).


I'm not at all opposed to changing the parser. I just want to be clear on the goal of the improvement.

@PhilipHazel
Copy link
Collaborator

If it's all implemented and working, I think we (that is, you :-) should proceed and get on with what is left to be done. Internal algorithms can always be changed later if that proves to be beneficial.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 23, 2024

Great! I am pretty-much done here.

I can see why Zoltan wants more optimisations - that's his role in life, adding a JIT after all!

I'm warming up to the idea of doing some extra work to satisfy that.

The branch is ready for review; I'll only be adding some quite local changes in the parser if I do that extra work.

The stuff I don't understand well enough:

  • changes in pcre2_study, and quantification, and the interpreter...
  • particularly auto-possessification
  • the "reqcu" flags
  • and other PCRE-specific stuff, where I just don't have the background, and I may have goofed.

The I think I do understand:

  • I'm confident with my parsing and matching functions. They're pure, side-effect-free things and just do what it says on the tin.

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately a lots of choices is confusing for me here. For me these should work like a usual expression evaluator, with unary/binary operators and precedence. The primary expressions should be boolean values, [] can be used for grouping.

cls:[A]
cls:[B]
op: ||
op: ^
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if ^ is a modifier, it is an operator in practice. Btw is [^[^a-z]] optimized to [a-z]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. In the grammar, it's not an operator, because it's attached to the [^, so it's parsed as a special form of "open paren". But in the evaluator, it is an operator. The close-paren "]" emits a final OP_ECLASS_NOT if it was opened with a "[^".

Btw is [^[^a-z]] optimized to [a-z]?

Not yet, no. I will do optimisations (although these are pathological cases you suggest...). If possible I'll do it in a PR next week, although I can include it in the current PR if you require.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no requirements. Since I prefer to have these optimizations, I should do them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really will do them, I'd just rather split into a PR next week, as an extra piece of work.

Just give me a few days to get onto it. I'm away this weekend.

]
0: ]

/[A-C--B]/B,alt_extended_class
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there tests for [A--B--C--D] [A--^B] [A---B] [A----B] [^--A].
Is there tests to reach the maximum depth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have most of those tests.

The expression [A--^B] has a literal ASCII ^ character, and so it's equivalent to simply [A].

The expression [A---B] is a parse error. So are [A----B] [^--A].

I have tests to reach maximum depth - a successful test at max depth, and a parse error test at max+1.

eclass[
cls:[a]
cls:[\-]
op: ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should optimize these before the release.

Failed: error 209 at offset 3: unexpected operator in character class (no preceding operand)

/[a---b]/alt_extended_class
Failed: error 208 at offset 5: invalid operator in character class
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be [a--[-b]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? We're following UTS#18 syntax here, not inventing our own. Other implementations forbid tripled operator characters.

A single - hyphen is either a literal or a range separator, with complex disambiguation. A double hyphen -- is always an operator (never one end of a range: [--a] used to mean "characters from a to hyphen" but in UTS#18 syntax it's invalid and you must instead use [\--a]). A triple hyphen is just invalid.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no tripled operator here. [#--] and [--a] are currently valid.

It looks like unicode recommendation lists several chars, which cannot be in a pattern without \. Do you have tests for them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no tripled operator here. [#--] and [--a] are currently valid.

Sorry, maybe the comment was moved by GitHub when a file was updated? I thought you were commenting on the case /[a---b]/, saying it should act the same as [a--[-b]].

[--a] are currently valid

Correct. But it won't be legal if you enable UTS#18 mode. The sequence "--" becomes an operator, which requires an operand.

It looks like unicode recommendation lists several chars, which cannot be in a pattern without . Do you have tests for them?

That's true. But the spec is just so vague... For example: "Different variants of SYNTAX_CHAR, SPECIAL_CHAR, and NON_SYNTAX_CHAR can be used for particular contexts to maintain compatibility". And also: "This is only a sample syntax for the purposes of examples in this document. Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation."

Basically, they want each regex engine to be able to choose its own syntax, rather than require any more than the bare minimum of changes to their historical behaviour, in order to claim to be UTS#18-compliant.

It's mainly intended to discuss the Unicode matching issues in regex engines, and not force them to adopt a common syntax.

That's why I was most of all concerned to carefully match the behaviour of other regex engines (ECMAScript, Python).

Short answer: we shouldn't randomly forbid characters that are accepted in PCRE2 and Perl currently (such as requiring { to be escaped inside classes). We have to add the [ and -- metacharacters, but the rest can (and should, I believe) be kept as-is.

Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class

/[a--b&&c]/alt_extended_class
Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I address this in the manpage updates included in the PR.

There's also the (overly?) detailed preliminary survey I did of other implementations of UTS#18 syntax, which I linked in an email thread: https://github.com/NWilson/pcre2/blob/user/niwilson/utr18/NEW-CLASSES.md

  • UTS#18 doesn't define the relative precedence of the operators
  • Some implementations ban mixing operators (ECMAScript spec requires this)
  • Some implementations give them all the same precedence... but some implementations give them a different precedence

So it's a compatibility nightmare, what a mess. I don't think we can justify picking a "side" in the battle. It would be really unfortunate if we picked some precedence for the operators, and then they updated UTS#18 with a different choice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This choice definitely simplify the current code. However, perl supports multiple operators, so the code needs to be able to handle it. This might be another thing I need to implement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be implementing the Perl syntax, don't worry. It will drop into the current parser very easily.

@PhilipHazel
Copy link
Collaborator

I see there is still one failing CI test.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 24, 2024

For me these should work like a usual expression evaluator, with unary/binary operators and precedence. The primary expressions should be boolean values, [] can be used for grouping.

That is exactly how it works, I think. I didn't do anything "unusual" on purpose. There are primary expressions (atoms) which are characters/ranges/POSIX-props/Unicode-props, and there are binary operators (no unary), and [] can be used for grouping. Just as you say.

Very classic recursive descent with precedence, producing a RPN sequence of operators.

I see there is still one failing CI test.

Yes, it's weird. Moving some code around triggers the Werror=maybe-uninitialized warning in alpine's version of GCC, although I didn't really touch that code. It's full of gotos; I guess the detector is a bit sensitive.

@PhilipHazel
Copy link
Collaborator

Is this a bug?

/[abc -- b]+/alt_extended_class
    acacbac
 0: ac

I expected it to match "acac".

@PhilipHazel
Copy link
Collaborator

Re my previous comment: under DFA matching it does match "acac".

@PhilipHazel
Copy link
Collaborator

Another example where DFA does what I expect, but non-DFA doesn't. May be same bug as before.

/[\pL || [:digit:]]+/alt_extended_class
    +a03+
No match
    +a03+\=dfa
 0: a03

@NWilson
Copy link
Contributor Author

NWilson commented Oct 24, 2024

Yes, it is a bug. I found a similar case earlier this afternoon while adding more tests of my own.

Very odd, something do with the fact that I changed OP_ECLASS to take part in auto-possessification, in the same way that OP_XCLASS does. I introduced the regression at that point.

It turns out to be a small one-line copy-and-paste error, when I copied a bit of code from XCLASS to the ECLASS case, but changed a break into a return (oops!).

(Anyway, all of Philip's cases are now fixed.)

HACKING Outdated Show resolved Hide resolved
doc/pcre2_compile.3 Outdated Show resolved Hide resolved
doc/pcre2pattern.3 Show resolved Hide resolved
src/pcre2.h.in Show resolved Hide resolved
src/pcre2_auto_possess.c Show resolved Hide resolved
src/pcre2_compile.h Show resolved Hide resolved
@@ -40,7 +40,7 @@ POSSIBILITY OF SUCH DAMAGE.

/* This module contains an internal function that is used to match a Unicode
extended grapheme sequence. It is used by both pcre2_match() and
pcre2_def_match(). However, it is called only when Unicode support is being
pcre2_dfa_match(). However, it is called only when Unicode support is being
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this (the need for the dummy when unicode support is not compiled that will be never called) still true?, it is confusing but it would seem that ECLASS calls this unconditionally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the ECLASS code calls into pcre2_extuni at all.

I haven't changed anything in here, except noticed a typo in a comment that was copy-and-pasted into multiple files.

src/pcre2_jit_compile.c Show resolved Hide resolved
cls:[a]
cls:[B]
op: ||
]
Copy link
Collaborator

@zherczeg zherczeg Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized an optimization is missing here.

PCRE2 traditionally uses a bitset for the first 256 characters of sets. Even for the new ECLASS, there should only be one bitset.

Should work like this:

  • The ECLASS contains only XCLASS byte codes.
  • If, and only if an XCLASS has a bitset, then the ECLASS has a bitset.
  • The ECLASS bitset is pre-computed, no extra operation is needed
  • The ECLASS should be turned to CLASS / NCLASS if nothing else only the bitset is present
  • No XCLASS has a bitset

I know this is a complex, although important optimization, since it reduces storage and improves performance. I can do it in a follow-up patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do full constant-folding for the CLASS/NCLASS cases, we'd not have many redundant bitsets, in practice.

I agree we could lift out the bitset, if that's important. I guess we'd need a way to fold things like CLASS && XCLASS and so on - remove the bitset from the XCLASS, and since the CLASS includes all chars >255 we can reduce CLASS && XCLASS → XCLASS.

I think I can do this fine in next week's PR, when I do the CLASS op CLASS → CLASS constant-folding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks though @zherczeg, it's a great observation, and worth doing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have other priorities, I can also do these things. I don't want to put more work on you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You previously said you wouldn't do work for Excel for free! I'll finish what I started, if you give me a bit of time to chip away at it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 25, 2024

I'm done with this!!

Whew. I'm not aware of anything broken, and it's feature-complete, as far as the ALT_EXTENDED_CLASS flag goes.

I've filed tickets for the follow-up work, so it's not forgotten (optimizations; and (?[...]) support), and I'll commit to doing it beginning next week.

I've either addressed all of Zoltan and Carlo's feedback so far, or asked for the points raised to be deferred.

Many, many thanks for all the help reviewing and testing this!

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed a lot of things in this PR, and would be good to add a specific comment (with a fixed string which is easy to search), to the code where we will likely do more things. This way we could not forget about them.

@@ -479,13 +480,13 @@ switch(c)

case OP_NCLASS:
case OP_CLASS:
#ifdef SUPPORT_WIDE_CHARS
Copy link
Collaborator

@zherczeg zherczeg Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add some /* revert */ comments because we should revert these after ECLASS will have a single bitset. It is easy to forget them otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you intend here. Even when ECLASS has a single bitset, it will behave roughly the same as XCLASS, for the purposes of the auto-possessification code at least. There will always need to be the LINK_SIZE slot, holding the size of the stack which follows.

I'll add the TODO comment as you request.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wide chars mean unicode or > 8 bit. In 8 bit mode, without utf, the regular OP_CLASS with 256 bits must handle the extended class, so it should never encounter with this opcode. In general, eclasses should be turned to normal classes/nclasses whenever it is possible, just like xclasses.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the implementation, eclasses should have a prediction system, which predicts their size if they cannot be turned to class/nclass. The prediction is discarded, if they can. The prediction bit (can or cannot) can be saved in the META code (you have 16 free bits for that) during the byte code length computation phase, and reused when the actual byte code is generated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I see! If the constant-folding is sufficiently aggressive, then OP_ECLASS will never be emitted without wide-char support, the same as XCLASS. OK, that makes good sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, both the "need a bitset for eclass" and "eclass can be turned to class/nclass" bits can be computed when the byte code length is computed (two bits in the META code), and this simplifies the actual byte code generation later, since "the oracle" tell us what to do.

src/pcre2_auto_possess.c Show resolved Hide resolved
src/pcre2_auto_possess.c Show resolved Hide resolved
src/pcre2_compile.c Outdated Show resolved Hide resolved
@PhilipHazel
Copy link
Collaborator

Concerning the documentation: Many thanks for what you have done so far, Nick. Once the Perl syntax is also implemented, I was thinking that it might be helpful to have an overall section called "Extended character classes" that explains that there is Perl syntax and everybody else's syntax and then have two separate sections, one for Perl and one for the other. I am happy to work on the documentation when the time comes.

Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ?

@carenas
Copy link
Contributor

carenas commented Oct 25, 2024

do we need to add (*ALT_EXTENDED_CLASS) ?

I actually had a patch implementing (?X), but didn't send it because the "incompatibilities" between the old syntax and the Unicode one are so minimal that I think it would be better not to have one, and wait to see what happens after it gets released.

src/pcre2_compile_class.c Outdated Show resolved Hide resolved
src/pcre2_xclass.c Outdated Show resolved Hide resolved
default. */

default:
PCRE2_DEBUG_UNREACHABLE();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be PCRE2_UNREACHABLE() with no return after;

probably better not to have it, do we know of any compiler complaining?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied it from the existing code above.

I'm happy with return statements after DEBUG_UNREACHABLE() assertions, because in release builds, the assertions aren't emitted at all, so something has to happen if a customer actually triggers this condition.

Regardless of whether compilers complain, I complain if I see switch statements without a default condition! An assertion is highly beneficial for maintainability.

Copy link
Contributor

@carenas carenas Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, but then the argument still stands for using a PCRE2_UNREACHABLE() instead as that return should never happen in a release build anyway.

BTW, the original code you copied from is fairly new and was using PCRE2_DEBUG_UNREACHABLE() because it was "safer" and because the implementation for MSVC was an afterthought, and based on what we learned online with zero practical experience, so it will be also nice if you could confirm that using an __assume(0) in production makes sense in this case (which is what will be emitted in a non debug build)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it looks like the macros emit:

|---|---|---|

Debug Non-debug
PCRE2_DEBUG_UNREACHABLE __assume(0) do {} while (0)
PCRE2_UNREACHABLE __assume(0) __assume(0)

This looks correct to me.

The DEBUG_UNREACHABLE version should have error-recovery code, such as a return statement, but the UNREACHABLE() version shouldn't. I like that.

I think __assume(0) is correct for MSVC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it should emit the following instead:

Debug Non-debug
PCRE2_DEBUG_UNREACHABLE PCRE2_ASSERT(FALSE) do {} while (0)
PCRE2_UNREACHABLE PCRE2_ASSERT(FALSE) __assume(0)

We want some actual runtime code to call abort() when we're in a debug build.

I can make a PR to do that separately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have noticed recently that assert() built-in function has pretty nice output, looks better than abort()

src/pcre2_compile_class.c Outdated Show resolved Hide resolved
@NWilson
Copy link
Contributor Author

NWilson commented Oct 28, 2024

Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ?

I think not Philip. We are currently missing start-of-pattern, and in-pattern, controls for all of the following: ALLOW_EMPTY_CLASS, ALT_BSUX, DOLLAR_ENDONLY, MATCH_UNSET_BACKREF, ALT_CIRCUMFLEX, and others.

These are all flags which affect the regex dialect in some way, rather than fundamentally change what is or isn't expressible.

I don't think we need a start-of-pattern flag for any of these, nor for ALT_EXTENDED_CLASS: if a pattern author knows to add the flag, they can just as easily edit the pattern itself to match PCRE's dialect. (Admittedly, that would be fiddly for a flag like ALT_CIRCUMFLEX, but certainly possible to rewrite it in terms of ^ and \z. Similarly, anyone who wants ALT_EXTENDED_CLASS can just rewrite to use the Perl extended syntax.)

@NWilson
Copy link
Contributor Author

NWilson commented Oct 28, 2024

I have addressed Carlo and Zoltan's latest review comments: more TODOs; Carlo's minor improvements.

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commits should be squashed. I have only a few questions. Overall this patch is a good first step.

Let's wait a bit longer for @PhilipHazel , but if he is away, I can also land this patch.

the operand stack in a uint32_t. A nesting limit of 15 implies (15*2+1)=31
stack operands required. */

#define ECLASS_NEST_LIMIT 15
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought 1 bit is enough for an xclass result value. Why do you need two?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, because you can have two operators within a '[...]'. So if you have two brackets: [A--B[C--DE]] then the worst-case and peak stack usage is that the leafmost [C--DE] takes three slots on the stack (result of C, D and E, before applying || to DE and --), and the non-leafmost [A--B ...foo...] takes two slots on the stack (need to store result of matching A and B).

Hence 5 stack slots for two [ ... [...] ] brackets. It's 2n+1.

It would be worse if we allowed mixing precedence within the brackets; then each [...] could consume even more stack slots.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the pushes/pops are tracked individually. Something like /[A]&&[B]||[C]--[D]/ is push [A], push [B], &&, push [C], ||, push[D], -- and the max stack size is 2.

Recursive descent should be able to compute it. This looks like another thing to do then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your example (/[A]&&[B]||[C]--[D]/), yes, the maximum stack size is 2.

The worst case is a stack size of three, if you have A op1 B op2 C where op2 is higher (tighter) precedence.

We have a hard limit of 15 nested brackets; which gives us a maximum stack consumption of 31 slots.

Of course we can calculate (if we want) what the actual stack slots consumed will be, but we guarantee it will fit into a uint32_t if we limit the brackets to at most 15 deep.

@@ -384,11 +384,10 @@ while (TRUE)
#endif
break;

#if defined SUPPORT_UNICODE || PCRE2_CODE_UNIT_WIDTH != 8
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have the revert comment as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that it was OK to just leave the (test-only) code here clean of ifdefs. But I can do if you prefer the ifdefs to come back.

if (PRINTABLE(j)) fprintf(f, "%c", j);
else fprintf(f, "\\x%02x", j);
}
i = j;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all code is moved here. Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I lifted out the existing code for printing character classes into a function.

Similarly, much of the "new" code in pcre2_compile_class.c is lifted out of pcre2_compile.c.

@PhilipHazel
Copy link
Collaborator

I am not away, but I have been letting you guys get on with this patch. I have not looked at it in detail.Please go ahead and merge when you are all happy. (I am also distracted getting a new desktop PC up and configured because my old one is over 10 years old and I'm not sure how long it will go on.)

* Move some existing character class code into pcre2_compile_class.c
* Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of
  parsing [...] character classes, to emit new META codes, and new
  OP_ECLASS codes for nested character classes with operators
* Document the behaviour relative to the UTS#18 standard
* No JIT support; it falls back to the interpreter. DFA is supported.
@NWilson
Copy link
Contributor Author

NWilson commented Oct 30, 2024

Thank you Philip! I've squashed again, and added the TODO comment in printint.c requested by Zoltan.

I'm ready to merge, since it's fully-functional (as far as I'm aware), and I'm eager to crack on with the follow-up PRs.

@zherczeg zherczeg merged commit fc38d9e into PCRE2Project:master Oct 30, 2024
15 checks passed
@zherczeg
Copy link
Collaborator

Ok, patch is landed. I except a lot of follow up works, but it is the nature of these large features.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 30, 2024

Great, thank you! My next PR should follow later this week.

@NWilson NWilson deleted the user/niwilson/sclass branch November 6, 2024 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants