Fix OP_REFI for caseless_restrict #516

NWilson · 2024-10-07T10:46:38Z

When you introduced caseless_restrict earlier in the year, it looks like you forgot to add it to the OP_REFI (and OP_DNREFI) comparison.

It's actually a bit tricky to add, because it seems the "flags" that are tracked during compile are not written out to the bytecode.

So, I added one extra field to OP_REFI to record the caseless flag. (Currently, it's just "caseless_restrict", but as I mentioned, I'll be adding "turkish_casing" in a future PR.)

I think there's no other way to do it? I hope the flags aren't stored in the bytecode already and I just missed it.

It seems a bit wasteful, perhaps, to spend a uint8 (or uint32 in the case of the 32-bit library), but I think you basically forced this outcome back when you added the (?r) flag allowing caseless_restrict to be varied through the pattern.

NWilson · 2024-10-07T10:48:41Z

@zherczeg Zoltan, I've half-heartedly updated the JIT code here. It hardcodes the length of the opcodes in many places, so after extending OP_REFI and OP_DNREFI I had to find all references I could.

The JIT code seems to be working on all the existing tests, but fails on the one new test I added, which exercises the caseless_restrict behaviour in a backref.

I'd be grateful if you'd be able to help me out on that.

zherczeg · 2024-10-07T11:10:25Z

Nice catch! I am not sure which one is better, adding a new opcode (only the case insensitive variant may use restrict), or adding an argument. I would prefer a new opcode. @PhilipHazel what do you think?

NWilson · 2024-10-07T12:59:23Z

When I add PCRE2_EXTRA_TURKISH_CASING it would double the number of possibilities, so it would end up as 4 opcodes. That's not terrible, so I suppose either is possible.

zherczeg · 2024-10-07T13:14:09Z

I didn't know about that. Can restricted ascii and Turkish combined? I don't how that casing is working. Another option is using global options in the interpreter, if possible.

NWilson · 2024-10-07T13:37:04Z

Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk).

And it can't use global options because Philip already added an inline option /(?r: ... )/ rather than making it a pre-pattern option instead.

zherczeg · 2024-10-07T14:22:07Z

I completely forgot that it is not a pre-pattern. I simply used pattern flags here:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L448

No test failed. Anyway it could be fixed by passing extra options.

zherczeg · 2024-10-07T16:34:52Z

JIT fix ae11878

Added some test as well. It turned out that DNREFI support was not added to jit :(

Feel free to use the commit, no need to mention me in the patch.

PhilipHazel · 2024-10-07T16:36:02Z

I see the CI tests are failing.

Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:

Some internal refactoring has changed the processing so that the handling
of the PCRE_CASELESS and PCRE_MULTILINE options is done entirely at compile
time (the PCRE_DOTALL option was changed this way some time ago: version
7.7 change 16). This has made it possible to abolish the OP_OPT op code,
which was always a bit of a fudge.

Doing all the options handling at compile time should also in theory make matching a little bit faster. PCRE1 still uses the stack for backtracking, and the above change reduced the number of arguments to the recursive function, which helped with stack usage. This is not relevant for PCRE2 since the 10.30 refactoring.

I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).

As for whether to add a new opcode or an argument, either will require checking every reference to the original, so much the same amount of work. We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something? Eight is rather a lot; if I'm right about that, then perhaps an argument makes more sense. But I don't really mind which you do, though multiple opcodes might execute faster? BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match?

carenas · 2024-10-07T17:23:02Z

Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk)

IMHO the new "turk" flag would be incompatible with +r, because by definition what the later does is IGNORE any othercase that crosses the ASCII/UTF boundary (see Perl's /aa mode).

of course I might be wrong, since I haven't seen the "turk" flag implementation but I am assuming it implies the 2 entries in CaseFolding.txt we skip:

0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

zherczeg · 2024-10-07T17:40:03Z

Oh, a /* Fall through. */ is missing from the patch.

carenas · 2024-10-07T18:15:55Z

Oh, a /* Fall through. */ is missing from the patch.

Fixed in ac76507; took me a while to find the "standard" spelling though since it seems only sljit has those until now.

NWilson · 2024-10-07T19:04:56Z

Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:

Thank you Philip, that's really useful and interesting!

I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).

There's no reason to regret it now, it's most-flexible and works well.

We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something?

Yes, it would be four each, you've counted correctly.

BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match?

Aha, no! Because the other OP_...I opcodes have quite specific behaviours tied to the default case-folding relationship. For example, OP_CHARI is fixed to match exactly two characters (the one that follows it in the bytecode, and that char's "other case"). The notion of "other case" isn't affected by caseless-restrict, so no change is needed.

For caseless-restrict, it really is just the REFI and DNREFI code that needs to care about the specifics of the case equivalence.

NWilson · 2024-10-07T19:10:48Z

IMHO the new "turk" flag would be incompatible with +r, because by definition what the later does is IGNORE any othercase that crosses the ASCII/UTF boundary (see Perl's /aa mode).

I think they're compatible, logically. (?r) removes the case-folding rules that cross the ASCII/non-ASCII boundary (which is actually just two characters). And Turkish adds two case-folding rules for the Turkish alphabet. They aren't touching the same characters, so they both make sense.

Maybe a user doesn't like the "Kelvin sign"? Not a problem - you can remove that, and you still get to choose whether to have the Turkish mappings or not.

To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it?

NWilson · 2024-10-07T19:19:47Z

Thank you very much all three of you for your help, on this very fussy little detail fix!

carenas · 2024-10-07T19:31:56Z

To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it?

because it is useful only in a specific context (see #11, although the Perl link I posted earlier is probably clearer IMHO), where the user intentionally wants to make sure that a caseless match is only within (or not) ASCII to avoid "surprises".

a "turk" flag user wants to have Unicode characters, and indeed will add 2 more cross between ASCII and Unicode when doing caseless matches, which is what (?r) is meant to prevent.

you are correct it doesn't impact the same characters, and your interpretation of how it could work together is valid, but the point I was trying to make is that (?r) restricts ALL characters (eventhough they are currently only 2 pairs affected in our case tables). as a plus we would only need 3 versions of each affected opcode.

zherczeg · 2024-10-08T07:17:31Z

@carenas I see /* Fall through */ in the other places as well, if you prefer to use /* fallthrough */, change it everywhere (in another patch). I also prefer one patch for one PR.

zherczeg · 2024-10-08T07:24:31Z

I hope Perl does not plan to use 'r' for something different.

NWilson · 2024-10-08T08:36:23Z

Rebased, and fixed to "Fall through" for consistency.

carenas

Minor nitpick, but since we are not doing multiple OPs instead of the flag, the following also needs updating in HACKING IMHO:

pcre2/HACKING

Lines 362 to 369 in dcbf9a0

    
           Changeable options 
        
           ------------------ 
        
           The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and 
        
           some others may be changed in the middle of patterns by items such as (?i). 
        
           Their processing is handled entirely at compile time by generating different 
        
           opcodes for the different settings. The runtime functions do not need to keep 
        
           track of an option's state.

zherczeg

LGTM

zherczeg · 2024-10-08T10:05:49Z

@carenas if you are ok with the patch, I will land it. @NWilson please update the test conflicts and merge commits

NWilson · 2024-10-08T10:20:25Z

Rebased, squashed, and fixed the conflicts with the PR you just merged.

carenas · 2024-10-08T11:40:22Z

Usually Philip follows up with some documentation in ChangeLog, which this change (and most others) were missing.

PhilipHazel · 2024-10-08T15:43:19Z

Previously, when I wanted a new (?some-letter I used an upper case letter (e.g. (?J)) to try to keep away from potential Perl changes, but (?R) was already in use so I must have decided to take a chance on (?r).

Re ChangeLog: I've been thinking about this. Since the move to Git the contents of ChangeLog haven't really kept in step the way they used to, though as @carenas says, I've done some post hoc additions from time to time. Over the years, I've found it useful as a way of remembering what changed when, but perhaps others are less interested. (Reading the PCRE1 ChangeLog from the start - version 0.91 - is historically instructive as it reminds one of how much has changed since 1997.) The other think I've used ChangeLog for is for updating the NEWS file from it just before a release. I think there are perhaps three possible ways to go:

Close ChangeLog and in future rely only on Git log entries.
From time to time, probably just before a release, go through the Git log and update ChangeLog appropriately.
Keep hassling everybody to update ChangeLog for every change.

What do people think?

carenas · 2024-10-09T03:32:01Z

I didn't meant from my comment to be a policy discussion, just a reminder for a task that needed to be done and indeed an excuse for myself to do it if it wasn't tackled independently (as shown in #519).

I do think though that keeping the ChangeLog in good shape is important, and also that is done in a timely way so that it could be described properly (specially considering that commits and PR descriptions as of now are not consistently used to work as the source of an automated replacement).

Updating them with the committed changes is not ideal though (as they will likely result in conflicts), so probably 2 might be the only reasonable short term option?

NWilson force-pushed the user/niwilson/restrict-backref branch from 9818d1b to 609d753 Compare October 8, 2024 08:35

carenas approved these changes Oct 8, 2024

View reviewed changes

zherczeg approved these changes Oct 8, 2024

View reviewed changes

Fix OP_REFI for caseless_restrict

a7657d5

NWilson force-pushed the user/niwilson/restrict-backref branch from b330274 to a7657d5 Compare October 8, 2024 10:20

carenas approved these changes Oct 8, 2024

View reviewed changes

zherczeg merged commit 2239414 into PCRE2Project:master Oct 8, 2024
15 checks passed

NWilson deleted the user/niwilson/restrict-backref branch November 6, 2024 17:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix OP_REFI for caseless_restrict #516

Fix OP_REFI for caseless_restrict #516

NWilson commented Oct 7, 2024

NWilson commented Oct 7, 2024

zherczeg commented Oct 7, 2024

NWilson commented Oct 7, 2024

zherczeg commented Oct 7, 2024

NWilson commented Oct 7, 2024

zherczeg commented Oct 7, 2024

zherczeg commented Oct 7, 2024

PhilipHazel commented Oct 7, 2024

carenas commented Oct 7, 2024

zherczeg commented Oct 7, 2024

carenas commented Oct 7, 2024

NWilson commented Oct 7, 2024

NWilson commented Oct 7, 2024 •

edited

Loading

NWilson commented Oct 7, 2024

carenas commented Oct 7, 2024 •

edited

Loading

zherczeg commented Oct 8, 2024

zherczeg commented Oct 8, 2024

NWilson commented Oct 8, 2024

carenas left a comment

zherczeg left a comment

zherczeg commented Oct 8, 2024

NWilson commented Oct 8, 2024

carenas commented Oct 8, 2024 •

edited

Loading

PhilipHazel commented Oct 8, 2024

carenas commented Oct 9, 2024 •

edited

Loading

	Changeable options
	------------------

	The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
	some others may be changed in the middle of patterns by items such as (?i).
	Their processing is handled entirely at compile time by generating different
	opcodes for the different settings. The runtime functions do not need to keep
	track of an option's state.

Fix OP_REFI for caseless_restrict #516

Fix OP_REFI for caseless_restrict #516

Conversation

NWilson commented Oct 7, 2024

NWilson commented Oct 7, 2024

zherczeg commented Oct 7, 2024

NWilson commented Oct 7, 2024

zherczeg commented Oct 7, 2024

NWilson commented Oct 7, 2024

zherczeg commented Oct 7, 2024

zherczeg commented Oct 7, 2024

PhilipHazel commented Oct 7, 2024

carenas commented Oct 7, 2024

zherczeg commented Oct 7, 2024

carenas commented Oct 7, 2024

NWilson commented Oct 7, 2024

NWilson commented Oct 7, 2024 • edited Loading

NWilson commented Oct 7, 2024

carenas commented Oct 7, 2024 • edited Loading

zherczeg commented Oct 8, 2024

zherczeg commented Oct 8, 2024

NWilson commented Oct 8, 2024

carenas left a comment

Choose a reason for hiding this comment

zherczeg left a comment

Choose a reason for hiding this comment

zherczeg commented Oct 8, 2024

NWilson commented Oct 8, 2024

carenas commented Oct 8, 2024 • edited Loading

PhilipHazel commented Oct 8, 2024

carenas commented Oct 9, 2024 • edited Loading

NWilson commented Oct 7, 2024 •

edited

Loading

carenas commented Oct 7, 2024 •

edited

Loading

carenas commented Oct 8, 2024 •

edited

Loading

carenas commented Oct 9, 2024 •

edited

Loading