Implement the new set-based character classes #523
Pattern (bracket) parsing is already recursive, although you can also use a simple array to simulate recursion. I suspect a pointer in the META code is enough to track the status. |
I do know that there are applications running PCRE2 in a large number of threads, each of which has very limited stack, but I don't know how limited. One would hope a couple of hundred bytes would be OK. I suppose we do need a new option to turn this on because it changes the meaning of things, but it seems a shame to use one of the only two remaining main option bits. If/when the Perl syntax is implemented, it shouldn't need to be enabled. I got a compile error when trying to build your branch:
I fixed this by inserting the obvious cast, but is that right? Haven't done any testing yet. I see the CI tests are failing; probably the same thing. |
I'm expecting that the Perl syntax will be able to reuse the same interpreter support and OPcodes. I won't tackle implementing the parser & compiler for the Perl classes in this PR though, there's enough in it already. I agree it should be OK to add in a follow-on PR, and it wouldn't need any flags. I imagined that it might be worth spending a "normal" parse bit on this flag? This flag implements a behaviour that's in JavaScript/Python/Rust and other engines, so it's more mainstream than all the other minor things I've recently added to the extra options. But you're quite possibly right, those bits are too precious now to spend on non-Perl-related syntax extensions. |
How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:
This happens at line 6441 in pcre2_compile.c. |
Oops sorry, I told a lie. It's around line 8682 (didn't look for multiple occurrences of this error). |
Thinking about the name PCRE2_SET_CLASS ... I assume that some time in the future, the Perl syntax will always be recognized, so what this option is enabling is an alternative syntax for what Perl calls "extended bracketed character classes". So perhaps a name like PCRE2_ALT_EXTENDED_CLASS is more descriptive. This is the same style as other PCRE2_ALT_xxx options. |
Not quite ready. It's just for getting feedback, if anyone's interested in looking (thanks for trying!). I'm happy to rename from "set[-based] class" to "extended class". That will be easy enough to do. |
Simple testcases now pass, such as The one thing that's broken (that I know of) is that you can only include single characters like 'a', not ranges like 'a-z' nor properties like |
I've added docs. I'm down to these tasks:
|
Some random comments.
I don't see a negate operation (can be implemented as
There is no point in duplicating bitsets (for characters 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.
The parser can be implemented using a simple stack in the parser context (e.g. 32 uint32_t values, which contain the offsets of the currently active operators). This way the nesting does not require any recursion.
There should be tests for reaching the maximum allowed depth. They should throw parse errors.
Is
There should be optimizations for
Personally I would just implement |
0dabd15 to 7bd2a0d
Thank you @zherczeg for reviewing!
It was there in the PR, it's OP_ECLASS_NOT. There isn't a META for it, because it's encoded as META_CLASS / META_CLASS_NOT distinction. But I had implemented the interpreter support.
True. I haven't done this optimisation.
Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting-yard thing... but it would be more code than recursive descent. Note that the META parser (the one that consumes characters and produces META) is just a simple for-loop, with an The grammar is certainly recursive - the question is simply whether (given that it has a fixed nesting depth of 15) we need to reduce the stack consumption, for embedded devices. I can check how much stack space it actually uses. Currently it looks like it chews up around 100 bytes per recursion, in a 64-bit Release build, so that comes out to a couple of KiB of stack space consumed. The shunting-yard approach would cut that down a bit, but I'm really not sure it's worth it.
Added. Lots more tests added.
Good question. The
That would be nice... but I'm so far totally focussed on completeness and correctness. I'd rather get the PR merged, than hold it up for edge-case optimisations.
Even with just |
Updated task list:
|
Also - regarding Zoltan's example of |
Merging without JIT support should be possible. Because you have added new opcodes, a call to pcre2_jit_compile() will fail, and matching should fall back to the interpreter. There are still, I think, one or two other constructions that JIT does not support. |
It is usually easier for optimizations, since it is easy to drop operators, and rewrite the META stream. This is hard for actual recursion, since you cannot drop functions from the function stack. Not in C at least. Example:
Then a And now the interesting part comes, since we encounter an ']'. The top item is Then another ']' comes, and the Adding precedence to operators is not hard, and you need to handle The challenging part is that the META stream size is limited, and perl has single character operators, so we need to encode everything into one uint32_t value. Btw |
Well, as I said, "I can code up a simple shunting-yard thing... but it would be more code than recursive descent." You are describing the classic shunting-yard thing... and it really is more code, and more complexity. At the moment, I have implemented everything ( I have implemented most of the optimisations you describe. I haven't implemented merging Note also that the
I'm not at all opposed to changing the parser. I just want to be clear on the goal of the improvement. |
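For readers weighing the two approaches, the explicit-stack (shunting-yard) alternative discussed above can be sketched roughly as follows. This is a minimal illustration, not code from the PR: the names (`sy_to_rpn`, `sy_prec`) are hypothetical, operands are single uppercase letters, the operators are reduced to single characters (`&` and `|`), and the relative precedence (`&` binding tighter than `|`) is an assumption made only for the example. It does not validate that operands and operators alternate.

```c
#include <stddef.h>

#define SY_MAX_STACK 32   /* fixed-size operator stack, no recursion */

/* Hypothetical precedence table: '&' binds tighter than '|'. */
static int sy_prec(char op)
{
  return (op == '&') ? 2 : (op == '|') ? 1 : 0;
}

/* Convert a simplified infix class expression to RPN.
   Returns 0 on success, -1 on a syntax or capacity error. */
static int sy_to_rpn(const char *in, char *out, size_t outsize)
{
  char ops[SY_MAX_STACK];
  int top = 0;
  size_t n = 0;

  if (outsize == 0) return -1;

  for (; *in != '\0'; in++) {
    char c = *in;
    if (c >= 'A' && c <= 'Z') {            /* operand: emit directly */
      if (n + 1 >= outsize) return -1;
      out[n++] = c;
    } else if (c == '[') {                 /* open group: push a marker */
      if (top == SY_MAX_STACK) return -1;
      ops[top++] = c;
    } else if (c == ']') {                 /* close group: flush to marker */
      while (top > 0 && ops[top - 1] != '[') {
        if (n + 1 >= outsize) return -1;
        out[n++] = ops[--top];
      }
      if (top == 0) return -1;             /* unmatched ']' */
      top--;                               /* discard the '[' marker */
    } else if (sy_prec(c) > 0) {           /* binary operator */
      while (top > 0 && ops[top - 1] != '[' &&
             sy_prec(ops[top - 1]) >= sy_prec(c)) {
        if (n + 1 >= outsize) return -1;
        out[n++] = ops[--top];
      }
      if (top == SY_MAX_STACK) return -1;
      ops[top++] = c;
    } else {
      return -1;                           /* unknown character */
    }
  }

  while (top > 0) {                        /* flush remaining operators */
    if (ops[top - 1] == '[' || n + 1 >= outsize) return -1;
    out[n++] = ops[--top];
  }
  out[n] = '\0';
  return 0;
}
```

With these assumptions, `A|B&C` becomes `ABC&|` and `[A|B]&C` becomes `AB|C&`. The trade-off raised in the thread is visible here: stack use is bounded by `SY_MAX_STACK` bytes, but the bookkeeping is more explicit than in a recursive-descent parser.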
If it's all implemented and working, I think we (that is, you :-) should proceed and get on with what is left to be done. Internal algorithms can always be changed later if that proves to be beneficial. |
Great! I am pretty-much done here. I can see why Zoltan wants more optimisations - that's his role in life, adding a JIT after all! I'm warming up to the idea of doing some extra work to satisfy that. The branch is ready for review; I'll only be adding some quite local changes in the parser if I do that extra work. The stuff I don't understand well enough:
The I think I do understand:
|
Unfortunately, having a lot of choices here is confusing for me. For me these should work like a usual expression evaluator, with unary/binary operators and precedence. The primary expressions should be boolean values, [] can be used for grouping.
cls:[A]
cls:[B]
op: ||
op: ^
Even if ^ is a modifier, it is an operator in practice. Btw, is [^[^a-z]] optimized to [a-z]?
Ah I see. In the grammar, it's not an operator, because it's attached to the [^, so it's parsed as a special form of "open paren". But in the evaluator, it is an operator. The close-paren "]" emits a final OP_ECLASS_NOT if it was opened with a "[^".
> Btw is [^[^a-z]] optimized to [a-z]?
Not yet, no. I will do optimisations (although these are pathological cases you suggest...). If possible I'll do it in a PR next week, although I can include it in the current PR if you require.
There are no requirements. Since I prefer to have these optimizations, I should do them.
I really will do them, I'd just rather split into a PR next week, as an extra piece of work.
Just give me a few days to get onto it. I'm away this weekend.
]
0: ]

/[A-C--B]/B,alt_extended_class
Are there tests for [A--B--C--D], [A--^B], [A---B], [A----B], [^--A]?
Are there tests to reach the maximum depth?
I have most of those tests.
The expression [A--^B] has a literal ASCII ^ character, and so it's equivalent to simply [A].
The expression [A---B] is a parse error. So are [A----B] and [^--A].
I have tests to reach maximum depth - a successful test at max depth, and a parse error test at max+1.
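The max-depth and max+1 tests mentioned here can be pictured with a small bracket-depth checker. This is only a sketch under stated assumptions: the function name `sk_max_depth` and the constant `SK_NEST_LIMIT` are hypothetical, the outermost `[` is assumed to count as depth 1, and real class syntax (POSIX classes such as `[:alpha:]`, a literal `]` in first position, and so on) is ignored.

```c
#define SK_NEST_LIMIT 15   /* mirrors the nesting limit discussed in the PR */

/* Return the maximum bracket nesting depth of a pattern, or -1 if the
   brackets are unbalanced or the depth exceeds SK_NEST_LIMIT. */
static int sk_max_depth(const char *p)
{
  int depth = 0, max = 0;

  for (; *p != '\0'; p++) {
    if (*p == '\\') {
      if (p[1] == '\0') return -1;   /* trailing backslash */
      p++;                           /* skip the escaped character */
    } else if (*p == '[') {
      if (++depth > SK_NEST_LIMIT) return -1;
      if (depth > max) max = depth;
    } else if (*p == ']') {
      if (--depth < 0) return -1;    /* close without matching open */
    }
  }
  return (depth == 0) ? max : -1;
}
```

A pattern with exactly 15 nested brackets returns 15, while one more level returns -1, which corresponds to the "successful at max depth, parse error at max+1" pair of tests.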
eclass[
cls:[a]
cls:[\-]
op: ||
We should optimize these before the release.
Failed: error 209 at offset 3: unexpected operator in character class (no preceding operand)

/[a---b]/alt_extended_class
Failed: error 208 at offset 5: invalid operator in character class
Should be [a--[-b]]
Why? We're following UTS#18 syntax here, not inventing our own. Other implementations forbid tripled operator characters.
A single - hyphen is either a literal or a range separator, with complex disambiguation. A double hyphen -- is always an operator (never one end of a range: [--a] used to mean "characters from a to hyphen" but in UTS#18 syntax it's invalid and you must instead use [\--a]). A triple hyphen is just invalid.
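The hyphen rule described in this comment can be sketched as a classification of a run of consecutive hyphens by its length. This is an illustrative sketch only: the names (`hy_kind`, `classify_hyphens`) are hypothetical and not from the PR, the caller is assumed to have already consumed escapes such as `\-`, and the harder literal-versus-range disambiguation for a single hyphen is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical classification of a run of '-' characters. */
typedef enum { HY_NONE, HY_SINGLE, HY_OPERATOR, HY_INVALID } hy_kind;

/* Classify the run of unescaped '-' characters starting at p[0] and
   store its length in *run_len. */
static hy_kind classify_hyphens(const char *p, size_t *run_len)
{
  size_t n = 0;

  while (p[n] == '-') n++;
  *run_len = n;

  if (n == 0) return HY_NONE;
  if (n == 1) return HY_SINGLE;     /* literal or range separator */
  if (n == 2) return HY_OPERATOR;   /* the "--" subtraction operator */
  return HY_INVALID;                /* "---" or longer: parse error */
}
```

Under these assumptions, the `---b]` tail of `[a---b]` is classified as `HY_INVALID`, consistent with the "invalid operator" error shown in the test output, while the `--b]` tail of `[a--b]` is an operator.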
There is no tripled operator here. [#--] and [--a] are currently valid.
It looks like unicode recommendation lists several chars, which cannot be in a pattern without \. Do you have tests for them?
> There is no tripled operator here. [#--] and [--a] are currently valid.
Sorry, maybe the comment was moved by GitHub when a file was updated? I thought you were commenting on the case /[a---b]/, saying it should act the same as [a--[-b]].
> [--a] are currently valid
Correct. But it won't be legal if you enable UTS#18 mode. The sequence "--" becomes an operator, which requires an operand.
> It looks like unicode recommendation lists several chars, which cannot be in a pattern without \. Do you have tests for them?
That's true. But the spec is just so vague... For example: "Different variants of SYNTAX_CHAR, SPECIAL_CHAR, and NON_SYNTAX_CHAR can be used for particular contexts to maintain compatibility". And also: "This is only a sample syntax for the purposes of examples in this document. Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation."
Basically, they want each regex engine to be able to choose its own syntax, rather than require any more than the bare minimum of changes to their historical behaviour, in order to claim to be UTS#18-compliant.
It's mainly intended to discuss the Unicode matching issues in regex engines, and not force them to adopt a common syntax.
That's why I was most of all concerned to carefully match the behaviour of other regex engines (ECMAScript, Python).
Short answer: we shouldn't randomly forbid characters that are accepted in PCRE2 and Perl currently (such as requiring { to be escaped inside classes). We have to add the [ and -- metacharacters, but the rest can (and should, I believe) be kept as-is.
Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class

/[a--b&&c]/alt_extended_class
Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class
Why?
Good question. I address this in the manpage updates included in the PR.
There's also the (overly?) detailed preliminary survey I did of other implementations of UTS#18 syntax, which I linked in an email thread: https://github.com/NWilson/pcre2/blob/user/niwilson/utr18/NEW-CLASSES.md
- UTS#18 doesn't define the relative precedence of the operators
- Some implementations ban mixing operators (ECMAScript spec requires this)
- Some implementations give them all the same precedence... but some implementations give them a different precedence
So it's a compatibility nightmare, what a mess. I don't think we can justify picking a "side" in the battle. It would be really unfortunate if we picked some precedence for the operators, and then they updated UTS#18 with a different choice.
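The "no mixing without brackets" rule that results from this reasoning can be sketched as a check over the explicit operators seen at one bracket level. This is a hypothetical illustration, not the PR's implementation: the type `xop` and function `first_mixed_operator` are invented names, and the sketch only looks at explicit operators within a single level.

```c
/* Hypothetical operator kinds for one bracket level. */
typedef enum { XOP_AND, XOP_SUB, XOP_OR } xop;

/* Given the explicit operators found at one bracket level, in order,
   return the index of the first operator that differs from the first
   one (the point where brackets would be needed), or -1 if the level
   uses at most one kind of operator. */
static int first_mixed_operator(const xop *ops, int count)
{
  for (int i = 1; i < count; i++) {
    if (ops[i] != ops[0]) return i;
  }
  return -1;
}
```

For `[a--b&&c]` the operator list is `{XOP_SUB, XOP_AND}`, so the check reports index 1, the position at which a real parser would raise the "brackets needed to clarify operator precedence" error. A level such as `[A--B--C--D]` passes.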
This choice definitely simplifies the current code. However, Perl supports multiple operators, so the code needs to be able to handle it. This might be another thing I need to implement.
I'll be implementing the Perl syntax, don't worry. It will drop into the current parser very easily.
I see there is still one failing CI test. |
That is exactly how it works, I think. I didn't do anything "unusual" on purpose. There are primary expressions (atoms) which are characters/ranges/POSIX-props/Unicode-props, and there are binary operators (no unary), and [] can be used for grouping. Just as you say. Very classic recursive descent with precedence, producing a RPN sequence of operators.
Yes, it's weird. Moving some code around triggers the |
Is this a bug?
I expected it to match "acac". |
Re my previous comment: under DFA matching it does match "acac". |
Another example where DFA does what I expect, but non-DFA doesn't. May be same bug as before.
|
Yes, it is a bug. I found a similar case earlier this afternoon while adding more tests of my own. Very odd, something to do with the fact that I changed OP_ECLASS to take part in auto-possessification, in the same way that OP_XCLASS does. I introduced the regression at that point. It turns out to be a small one-line copy-and-paste error, when I copied a bit of code from XCLASS to the ECLASS case, but changed a break into a return (oops!). (Anyway, all of Philip's cases are now fixed.) |
@@ -40,7 +40,7 @@ POSSIBILITY OF SUCH DAMAGE.

 /* This module contains an internal function that is used to match a Unicode
 extended grapheme sequence. It is used by both pcre2_match() and
-pcre2_def_match(). However, it is called only when Unicode support is being
+pcre2_dfa_match(). However, it is called only when Unicode support is being
Is this (the need for a dummy that will never be called when Unicode support is not compiled) still true? It is confusing, but it would seem that ECLASS calls this unconditionally.
I don't think that the ECLASS code calls into pcre2_extuni at all.
I haven't changed anything in here, except noticed a typo in a comment that was copy-and-pasted into multiple files.
e83f93b to cf38b78
cls:[a]
cls:[B]
op: ||
]
I just realized an optimization is missing here.
PCRE2 traditionally uses a bitset for the first 256 characters of sets. Even for the new ECLASS, there should only be one bitset.
Should work like this:
- The ECLASS contains only XCLASS byte codes.
- If, and only if an XCLASS has a bitset, then the ECLASS has a bitset.
- The ECLASS bitset is pre-computed, no extra operation is needed
- The ECLASS should be turned into CLASS / NCLASS if only the bitset is present and nothing else
- No XCLASS has a bitset
I know this is a complex, though important, optimization, since it reduces storage and improves performance. I can do it in a follow-up patch.
If we do full constant-folding for the CLASS/NCLASS cases, we'd not have many redundant bitsets, in practice.
I agree we could lift out the bitset, if that's important. I guess we'd need a way to fold things like CLASS && XCLASS and so on - remove the bitset from the XCLASS, and since the CLASS includes all chars >255 we can reduce CLASS && XCLASS → XCLASS.
I think I can do this fine in next week's PR, when I do the CLASS op CLASS → CLASS constant-folding.
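The bitset part of the constant-folding discussed here can be illustrated with plain 256-bit maps. This is a sketch under assumptions: the helper names (`fold_bitmaps`, `bitmap_set_range`, `bitmap_test`) are hypothetical and not PCRE2 functions, and only code points 0..255 are modelled; how characters above 255 are handled for each opcode is outside its scope.

```c
#include <stdint.h>

#define CLASS_BITMAP_BYTES 32   /* 256 bits, one per code point 0..255 */

typedef enum { FOLD_AND, FOLD_OR, FOLD_SUB } fold_op;

/* Fold src into dst at compile time: dst = dst op src. */
static void fold_bitmaps(uint8_t *dst, const uint8_t *src, fold_op op)
{
  for (int i = 0; i < CLASS_BITMAP_BYTES; i++) {
    switch (op) {
      case FOLD_AND: dst[i] &= src[i]; break;
      case FOLD_OR:  dst[i] |= src[i]; break;
      case FOLD_SUB: dst[i] &= (uint8_t)~src[i]; break;
    }
  }
}

/* Set the bits for code points lo..hi (inclusive, clipped to 0..255). */
static void bitmap_set_range(uint8_t *map, unsigned lo, unsigned hi)
{
  for (unsigned c = lo; c <= hi && c < 256; c++)
    map[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Test whether code point c (0..255) is in the map. */
static int bitmap_test(const uint8_t *map, unsigned c)
{
  return c < 256 && (map[c >> 3] & (1u << (c & 7))) != 0;
}
```

Folding `[A-C--B]` this way leaves a single map containing `A` and `C`, so no set operation has to run at match time for the low 256 code points.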
Thanks though @zherczeg, it's a great observation, and worth doing.
If you have other priorities, I can also do these things. I don't want to put more work on you.
You previously said you wouldn't do work for Excel for free! I'll finish what I started, if you give me a bit of time to chip away at it.
Thank you.
I'm done with this!! Whew. I'm not aware of anything broken, and it's feature-complete, as far as the ALT_EXTENDED_CLASS flag goes. I've filed tickets for the follow-up work, so it's not forgotten (optimizations; and I've either addressed all of Zoltan and Carlo's feedback so far, or asked for the points raised to be deferred. Many, many thanks for all the help reviewing and testing this! |
We have discussed a lot of things in this PR, and it would be good to add a specific comment (with a fixed string which is easy to search for) to the code where we will likely do more things. This way we won't forget about them.
@@ -479,13 +480,13 @@ switch(c)

case OP_NCLASS:
case OP_CLASS:
#ifdef SUPPORT_WIDE_CHARS
I would add some /* revert */ comments because we should revert these once ECLASS has a single bitset. It is easy to forget them otherwise.
I'm not quite sure what you intend here. Even when ECLASS has a single bitset, it will behave roughly the same as XCLASS, for the purposes of the auto-possessification code at least. There will always need to be the LINK_SIZE slot, holding the size of the stack which follows.
I'll add the TODO comment as you request.
Wide chars mean Unicode or > 8 bit. In 8-bit mode, without UTF, the regular OP_CLASS with 256 bits must handle the extended class, so it should never encounter this opcode. In general, eclasses should be turned into normal classes/nclasses whenever possible, just like xclasses.
As for the implementation, eclasses should have a prediction system, which predicts their size if they cannot be turned to class/nclass. The prediction is discarded, if they can. The prediction bit (can or cannot) can be saved in the META code (you have 16 free bits for that) during the byte code length computation phase, and reused when the actual byte code is generated.
Of course, I see! If the constant-folding is sufficiently aggressive, then OP_ECLASS will never be emitted without wide-char support, the same as XCLASS. OK, that makes good sense.
Btw, both the "need a bitset for eclass" and "eclass can be turned to class/nclass" bits can be computed when the byte code length is computed (two bits in the META code), and this simplifies the actual byte code generation later, since "the oracle" tells us what to do.
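The idea of recording two prediction bits in a META word during the length-computation pass, and reading them back during code generation, can be sketched as below. Everything here is an assumption made for illustration: the layout (code in the high 16 bits, flags in the low 16 bits), the value `SK_META_ECLASS`, and the flag and function names are hypothetical and are not taken from PCRE2's actual META encoding.

```c
#include <stdint.h>

/* Hypothetical layout: high 16 bits identify the META code, low 16 bits
   are free for data. None of these values are PCRE2's real ones. */
#define SK_META_ECLASS          0x80010000u
#define SK_FLAG_NEEDS_BITSET    0x0001u
#define SK_FLAG_FOLDS_TO_CLASS  0x0002u

/* Length pass: store the predictions in the low 16 bits. */
static uint32_t sk_meta_set_flags(uint32_t meta, uint32_t flags)
{
  return (meta & 0xffff0000u) | (flags & 0x0000ffffu);
}

/* Recover the META code without its flag bits. */
static uint32_t sk_meta_code(uint32_t meta)
{
  return meta & 0xffff0000u;
}

/* Generation pass: consult a previously stored prediction. */
static int sk_meta_has_flag(uint32_t meta, uint32_t flag)
{
  return (meta & flag & 0x0000ffffu) != 0;
}
```

In this sketch the first pass would call `sk_meta_set_flags()` once it knows whether the class folds to a plain class or needs a bitset, and the second pass only reads the flags, so it never has to repeat the analysis.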
Concerning the documentation: Many thanks for what you have done so far, Nick. Once the Perl syntax is also implemented, I was thinking that it might be helpful to have an overall section called "Extended character classes" that explains that there is Perl syntax and everybody else's syntax and then have two separate sections, one for Perl and one for the other. I am happy to work on the documentation when the time comes. Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ? |
I actually had a patch implementing |
default. */

default:
PCRE2_DEBUG_UNREACHABLE();
This should be PCRE2_UNREACHABLE() with no return after it; probably better not to have it. Do we know of any compiler complaining?
I just copied it from the existing code above.
I'm happy with return statements after DEBUG_UNREACHABLE() assertions, because in release builds, the assertions aren't emitted at all, so something has to happen if a customer actually triggers this condition.
Regardless of whether compilers complain, I complain if I see switch statements without a default condition! An assertion is highly beneficial for maintainability.
Fair enough, but then the argument still stands for using a PCRE2_UNREACHABLE() instead, as that return should never happen in a release build anyway.
BTW, the original code you copied from is fairly new and was using PCRE2_DEBUG_UNREACHABLE() because it was "safer" and because the implementation for MSVC was an afterthought, based on what we learned online with zero practical experience, so it would also be nice if you could confirm that using an __assume(0) in production makes sense in this case (which is what will be emitted in a non-debug build).
Currently it looks like the macros emit:
| | Debug | Non-debug |
|---|---|---|
| PCRE2_DEBUG_UNREACHABLE | __assume(0) | do {} while (0) |
| PCRE2_UNREACHABLE | __assume(0) | __assume(0) |
This looks correct to me.
The DEBUG_UNREACHABLE version should have error-recovery code, such as a return statement, but the UNREACHABLE() version shouldn't. I like that.
I think __assume(0) is correct for MSVC.
Actually, I think it should emit the following instead:
| | Debug | Non-debug |
|---|---|---|
| PCRE2_DEBUG_UNREACHABLE | PCRE2_ASSERT(FALSE) | do {} while (0) |
| PCRE2_UNREACHABLE | PCRE2_ASSERT(FALSE) | __assume(0) |
We want some actual runtime code to call abort() when we're in a debug build.
I can make a PR to do that separately.
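The proposed behaviour in the table above can be sketched with a pair of stand-in macros. This is not PCRE2's actual definition: the `SKETCH_*` names and the `SKETCH_DEBUG` switch are hypothetical, the standard `assert(0)` stands in for `PCRE2_ASSERT(FALSE)`, and `__builtin_unreachable()` is used as the GCC/Clang counterpart of MSVC's `__assume(0)`.

```c
#include <assert.h>

#if defined(SKETCH_DEBUG)
/* Debug builds: both forms stop the program at run time. */
#define SKETCH_DEBUG_UNREACHABLE() assert(0)
#define SKETCH_UNREACHABLE()       assert(0)
#else
/* Release builds: the DEBUG form emits nothing, so the caller must
   provide error recovery; the plain form is an optimizer hint. */
#define SKETCH_DEBUG_UNREACHABLE() do {} while (0)
#if defined(_MSC_VER)
#define SKETCH_UNREACHABLE()       __assume(0)
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNREACHABLE()       __builtin_unreachable()
#else
#define SKETCH_UNREACHABLE()       do {} while (0)
#endif
#endif

/* Example: a switch whose default case "cannot happen". The return
   after the DEBUG form is the release-build recovery path. */
static int sketch_operand_count(int op)
{
  switch (op) {
    case 0: return 0;   /* hypothetical nullary op */
    case 1: return 1;   /* hypothetical unary op */
    case 2: return 2;   /* hypothetical binary op */
    default:
      SKETCH_DEBUG_UNREACHABLE();
      return -1;
  }
}
```

This mirrors the distinction made in the thread: code following the `DEBUG` form still needs a sensible fallback such as `return -1`, whereas code following the plain form is never expected to execute in any build.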
I have noticed recently that the assert() built-in function has pretty nice output; it looks better than abort().
I think not Philip. We are currently missing start-of-pattern, and in-pattern, controls for all of the following: ALLOW_EMPTY_CLASS, ALT_BSUX, DOLLAR_ENDONLY, MATCH_UNSET_BACKREF, ALT_CIRCUMFLEX, and others. These are all flags which affect the regex dialect in some way, rather than fundamentally change what is or isn't expressible. I don't think we need a start-of-pattern flag for any of these, nor for ALT_EXTENDED_CLASS: if a pattern author knows to add the flag, they can just as easily edit the pattern itself to match PCRE's dialect. (Admittedly, that would be fiddly for a flag like ALT_CIRCUMFLEX, but certainly possible to rewrite it in terms of |
I have addressed Carlo and Zoltan's latest review comments: more TODOs; Carlo's minor improvements. |
The commits should be squashed. I have only a few questions. Overall this patch is a good first step.
Let's wait a bit longer for @PhilipHazel , but if he is away, I can also land this patch.
the operand stack in a uint32_t. A nesting limit of 15 implies (15*2+1)=31
stack operands required. */

#define ECLASS_NEST_LIMIT 15
I thought 1 bit is enough for an xclass result value. Why do you need two?
Aha, because you can have two operators within a '[...]'. So if you have two brackets: [A--B[C--DE]] then the worst-case and peak stack usage is that the leafmost [C--DE] takes three slots on the stack (result of C, D and E, before applying || to DE and --), and the non-leafmost [A--B ...foo...] takes two slots on the stack (need to store result of matching A and B).
Hence 5 stack slots for two [ ... [...] ] brackets. It's 2n+1.
It would be worse if we allowed mixing precedence within the brackets; then each [...] could consume even more stack slots.
I thought the pushes/pops are tracked individually. Something like /[A]&&[B]||[C]--[D]/ is push [A], push [B], &&, push [C], ||, push [D], -- and the max stack size is 2.
Recursive descent should be able to compute it. This looks like another thing to do then.
For your example (/[A]&&[B]||[C]--[D]/), yes, the maximum stack size is 2.
The worst case is a stack size of three, if you have A op1 B op2 C where op2 is higher (tighter) precedence.
We have a hard limit of 15 nested brackets; which gives us a maximum stack consumption of 31 slots.
Of course we can calculate (if we want) what the actual stack slots consumed will be, but we guarantee it will fit into a uint32_t if we limit the brackets to at most 15 deep.
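To make the "stack of results in a uint32_t" idea concrete, here is a minimal sketch of an RPN evaluator that keeps one bit per operand result. The item kinds (`EV_OPERAND`, `EV_AND`, and so on) and the function `eval_rpn` are hypothetical names chosen for illustration; they are not the real OP_ECLASS encoding, and each operand's match result is supplied directly rather than computed from a class.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical RPN item kinds; NOT the real PCRE2 opcodes. */
typedef enum { EV_OPERAND, EV_AND, EV_OR, EV_SUB, EV_NOT } ev_kind;

typedef struct {
  ev_kind kind;
  int value;   /* EV_OPERAND only: nonzero if the character matched */
} ev_item;

/* Evaluate an RPN sequence using a uint32_t as a stack of booleans,
   with bit 0 as the top of the stack. Returns 0 or 1, or -1 if the
   sequence is malformed or would need more than 32 slots. */
static int eval_rpn(const ev_item *items, size_t count)
{
  uint32_t stack = 0;
  int depth = 0;

  for (size_t i = 0; i < count; i++) {
    if (items[i].kind == EV_OPERAND) {
      if (depth == 32) return -1;
      stack = (stack << 1) | (uint32_t)(items[i].value != 0);
      depth++;
    } else if (items[i].kind == EV_NOT) {
      if (depth < 1) return -1;
      stack ^= 1u;                       /* negate the top slot */
    } else {
      uint32_t rhs, lhs, res;
      if (depth < 2) return -1;
      rhs = stack & 1u;
      lhs = (stack >> 1) & 1u;
      stack >>= 2;                       /* pop both operands */
      switch (items[i].kind) {
        case EV_AND: res = lhs & rhs; break;
        case EV_OR:  res = lhs | rhs; break;
        default:     res = lhs & (rhs ^ 1u); break;   /* EV_SUB */
      }
      stack = (stack << 1) | res;        /* push the result */
      depth--;
    }
  }
  return (depth == 1) ? (int)(stack & 1u) : -1;
}
```

For the `[A--B[C--DE]]` example from the thread, the sequence `A B C D E || -- || --` reaches a peak depth of five slots, matching the 2n+1 estimate for two levels of brackets, and a limit of 15 levels therefore stays within 31 of the 32 available bits.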
@@ -384,11 +384,10 @@ while (TRUE)
#endif
break;

#if defined SUPPORT_UNICODE || PCRE2_CODE_UNIT_WIDTH != 8
This should have the revert comment as well.
I thought that it was OK to just leave the (test-only) code here clean of ifdefs. But I can do that if you prefer the ifdefs to come back.
if (PRINTABLE(j)) fprintf(f, "%c", j);
else fprintf(f, "\\x%02x", j);
}
i = j;
It seems all code is moved here. Is this correct?
Yes. I lifted out the existing code for printing character classes into a function.
Similarly, much of the "new" code in pcre2_compile_class.c is lifted out of pcre2_compile.c.
I am not away, but I have been letting you guys get on with this patch. I have not looked at it in detail. Please go ahead and merge when you are all happy. (I am also distracted getting a new desktop PC up and configured because my old one is over 10 years old and I'm not sure how long it will go on.) |
f28d637 to 07e6c4d
* Move some existing character class code into pcre2_compile_class.c
* Add a new flag PCRE2_ALT_EXTENDED_CLASS to change the behaviour of parsing [...] character classes, to emit new META codes, and new OP_ECLASS codes for nested character classes with operators
* Document the behaviour relative to the UTS#18 standard
* No JIT support; it falls back to the interpreter. DFA is supported.
07e6c4d to d898150
Thank you Philip! I've squashed again, and added the TODO comment in printint.c requested by Zoltan. I'm ready to merge, since it's fully-functional (as far as I'm aware), and I'm eager to crack on with the follow-up PRs. |
Ok, patch is landed. I expect a lot of follow-up work, but that is the nature of these large features. |
Great, thank you! My next PR should follow later this week. |
This is a mostly-complete implementation!
It needs some tidy-up, in places where I've left "XXX" comments. But I've hammered out the code and I'll start testing it.
I'm afraid I used recursive descent in the operator-precedence parser, but it only goes down to a maximum depth of 15 levels, so it should be OK. I just wanted to prototype quickly. I don't know "how much stack is too much" in PCRE2 - it's only pushing ~hundreds of bytes on the stack at most.
The two big bits of logic that need to be shared with existing code, are the OP_XCLASS interpreter code, and the compiler code to build the OP_CLASS/NCLASS/XCLASS code. The current code is inlined into the one place where it's used.
Anyway, this is a draft of where it's going.