Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix OP_REFI for caseless_restrict #516

Merged

Conversation

NWilson
Copy link
Contributor

@NWilson NWilson commented Oct 7, 2024

When you introduced caseless_restrict earlier in the year, it looks like you forgot to add it to the OP_REFI (and OP_DNREFI) comparison.

It's actually a bit tricky to add, because it seems the "flags" that are tracked during compile are not written out to the bytecode.

So, I added one extra field to OP_REFI to record the caseless flag. (Currently, it's just "caseless_restrict", but as I mentioned, I'll be adding "turkish_casing" in a future PR.)

I think there's no other way to do it? I hope the flags aren't stored in the bytecode already and I just missed it.

It seems a bit wasteful, perhaps, to spend a uint8 (or uint32 in the case of the 32-bit library), but I think you basically forced this outcome back when you added the (?r) flag allowing caseless_restrict to be varied through the pattern.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 7, 2024

@zherczeg Zoltan, I've half-heartedly updated the JIT code here. It hardcodes the length of the opcodes in many places, so after extending OP_REFI and OP_DNREFI I had to find all references I could.

The JIT code seems to be working on all the existing tests, but fails on the one new test I added, which exercises the caseless_restrict behaviour in a backref.

I'd be grateful if you'd be able to help me out on that.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 7, 2024

Nice catch! I am not sure which one is better, adding a new opcode (only the case insensitive variant may use restrict), or adding an argument. I would prefer a new opcode. @PhilipHazel what do you think?

@NWilson
Copy link
Contributor Author

NWilson commented Oct 7, 2024

When I add PCRE2_EXTRA_TURKISH_CASING it would double the number of possibilities, so it would end up as 4 opcodes. That's not terrible, so I suppose either is possible.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 7, 2024

I didn't know about that. Can restricted ascii and Turkish combined? I don't how that casing is working. Another option is using global options in the interpreter, if possible.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 7, 2024

Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk).

And it can't use global options because Philip already added an inline option /(?r: ... )/ rather than making it a pre-pattern option instead.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 7, 2024

I completely forgot that it is not a pre-pattern. I simply used pattern flags here:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L448

No test failed. Anyway it could be fixed by passing extra options.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 7, 2024

JIT fix ae11878

Added some test as well. It turned out that DNREFI support was not added to jit :(

Feel free to use the commit, no need to mention me in the patch.

@PhilipHazel
Copy link
Collaborator

I see the CI tests are failing.

Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:

  1. Some internal refactoring has changed the processing so that the handling
    of the PCRE_CASELESS and PCRE_MULTILINE options is done entirely at compile
    time (the PCRE_DOTALL option was changed this way some time ago: version
    7.7 change 16). This has made it possible to abolish the OP_OPT op code,
    which was always a bit of a fudge.

Doing all the options handling at compile time should also in theory make matching a little bit faster. PCRE1 still uses the stack for backtracking, and the above change reduced the number of arguments to the recursive function, which helped with stack usage. This is not relevant for PCRE2 since the 10.30 refactoring.

I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).

As for whether to add a new opcode or an argument, either will require checking every reference to the original, so much the same amount of work. We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something? Eight is rather a lot; if I'm right about that, then perhaps an argument makes more sense. But I don't really mind which you do, though multiple opcodes might execute faster? BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match?

@carenas
Copy link
Contributor

carenas commented Oct 7, 2024

Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk)

IMHO the new "turk" flag would be incompatible with +r, because by definition what the later does is IGNORE any othercase that crosses the ASCII/UTF boundary (see Perl's /aa mode).

of course I might be wrong, since I haven't seen the "turk" flag implementation but I am assuming it implies the 2 entries in CaseFolding.txt we skip:

0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 7, 2024

Oh, a /* Fall through. */ is missing from the patch.

@carenas
Copy link
Contributor

carenas commented Oct 7, 2024

Oh, a /* Fall through. */ is missing from the patch.

Fixed in ac76507; took me a while to find the "standard" spelling though since it seems only sljit has those until now.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 7, 2024

Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:

Thank you Philip, that's really useful and interesting!

I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).

There's no reason to regret it now, it's most-flexible and works well.

We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something?

Yes, it would be four each, you've counted correctly.

BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match?

Aha, no! Because the other OP_...I opcodes have quite specific behaviours tied to the default case-folding relationship. For example, OP_CHARI is fixed to match exactly two characters (the one that follows it in the bytecode, and that char's "other case"). The notion of "other case" isn't affected by caseless-restrict, so no change is needed.

For caseless-restrict, it really is just the REFI and DNREFI code that needs to care about the specifics of the case equivalence.

@NWilson
Copy link
Contributor Author

NWilson commented Oct 7, 2024

IMHO the new "turk" flag would be incompatible with +r, because by definition what the later does is IGNORE any othercase that crosses the ASCII/UTF boundary (see Perl's /aa mode).

I think they're compatible, logically. (?r) removes the case-folding rules that cross the ASCII/non-ASCII boundary (which is actually just two characters). And Turkish adds two case-folding rules for the Turkish alphabet. They aren't touching the same characters, so they both make sense.

Maybe a user doesn't like the "Kelvin sign"? Not a problem - you can remove that, and you still get to choose whether to have the Turkish mappings or not.

To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it?

@NWilson
Copy link
Contributor Author

NWilson commented Oct 7, 2024

Thank you very much all three of you for your help, on this very fussy little detail fix!

@carenas
Copy link
Contributor

carenas commented Oct 7, 2024

To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it?

because it is useful only in a specific context (see #11, although the Perl link I posted earlier is probably clearer IMHO), where the user intentionally wants to make sure that a caseless match is only within (or not) ASCII to avoid "surprises".

a "turk" flag user wants to have Unicode characters, and indeed will add 2 more cross between ASCII and Unicode when doing caseless matches, which is what (?r) is meant to prevent.

you are correct it doesn't impact the same characters, and your interpretation of how it could work together is valid, but the point I was trying to make is that (?r) restricts ALL characters (eventhough they are currently only 2 pairs affected in our case tables). as a plus we would only need 3 versions of each affected opcode.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 8, 2024

@carenas I see /* Fall through */ in the other places as well, if you prefer to use /* fallthrough */, change it everywhere (in another patch). I also prefer one patch for one PR.

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 8, 2024

I hope Perl does not plan to use 'r' for something different.

@NWilson NWilson force-pushed the user/niwilson/restrict-backref branch from 9818d1b to 609d753 Compare October 8, 2024 08:35
@NWilson
Copy link
Contributor Author

NWilson commented Oct 8, 2024

Rebased, and fixed to "Fall through" for consistency.

Copy link
Contributor

@carenas carenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpick, but since we are not doing multiple OPs instead of the flag, the following also needs updating in HACKING IMHO:

pcre2/HACKING

Lines 362 to 369 in dcbf9a0

Changeable options
------------------
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
some others may be changed in the middle of patterns by items such as (?i).
Their processing is handled entirely at compile time by generating different
opcodes for the different settings. The runtime functions do not need to keep
track of an option's state.

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zherczeg
Copy link
Collaborator

zherczeg commented Oct 8, 2024

@carenas if you are ok with the patch, I will land it. @NWilson please update the test conflicts and merge commits

@NWilson NWilson force-pushed the user/niwilson/restrict-backref branch from b330274 to a7657d5 Compare October 8, 2024 10:20
@NWilson
Copy link
Contributor Author

NWilson commented Oct 8, 2024

Rebased, squashed, and fixed the conflicts with the PR you just merged.

@zherczeg zherczeg merged commit 2239414 into PCRE2Project:master Oct 8, 2024
15 checks passed
@carenas
Copy link
Contributor

carenas commented Oct 8, 2024

Usually Philip follows up with some documentation in ChangeLog, which this change (and most others) were missing.

@PhilipHazel
Copy link
Collaborator

Previously, when I wanted a new (?some-letter I used an upper case letter (e.g. (?J)) to try to keep away from potential Perl changes, but (?R) was already in use so I must have decided to take a chance on (?r).

Re ChangeLog: I've been thinking about this. Since the move to Git the contents of ChangeLog haven't really kept in step the way they used to, though as @carenas says, I've done some post hoc additions from time to time. Over the years, I've found it useful as a way of remembering what changed when, but perhaps others are less interested. (Reading the PCRE1 ChangeLog from the start - version 0.91 - is historically instructive as it reminds one of how much has changed since 1997.) The other think I've used ChangeLog for is for updating the NEWS file from it just before a release. I think there are perhaps three possible ways to go:

  1. Close ChangeLog and in future rely only on Git log entries.
  2. From time to time, probably just before a release, go through the Git log and update ChangeLog appropriately.
  3. Keep hassling everybody to update ChangeLog for every change.

What do people think?

@carenas
Copy link
Contributor

carenas commented Oct 9, 2024

I didn't meant from my comment to be a policy discussion, just a reminder for a task that needed to be done and indeed an excuse for myself to do it if it wasn't tackled independently (as shown in #519).

I do think though that keeping the ChangeLog in good shape is important, and also that is done in a timely way so that it could be described properly (specially considering that commits and PR descriptions as of now are not consistently used to work as the source of an automated replacement).

Updating them with the committed changes is not ideal though (as they will likely result in conflicts), so probably 2 might be the only reasonable short term option?

@NWilson NWilson deleted the user/niwilson/restrict-backref branch November 6, 2024 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants