Fix OP_REFI for caseless_restrict #516
@zherczeg Zoltan, I've half-heartedly updated the JIT code here. It hardcodes the length of the opcodes in many places, so after extending OP_REFI and OP_DNREFI I had to find all references I could. The JIT code seems to be working on all the existing tests, but fails on the one new test I added, which exercises the I'd be grateful if you'd be able to help me out on that. |
Nice catch! I am not sure which one is better, adding a new opcode (only the case insensitive variant may use restrict), or adding an argument. I would prefer a new opcode. @PhilipHazel what do you think? |
When I add |
I didn't know about that. Can restricted ASCII and Turkish be combined? I don't know how that casing works. Another option is using global options in the interpreter, if possible. |
Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk). And it can't use global options because Philip already added an inline option |
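To make the four combinations concrete, here is a minimal sketch that treats the two options as independent bits in one flags value. The DEMO_ names and bit values are hypothetical, chosen only for illustration; they are not taken from the PCRE2 sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits; names and values are illustrative only. */
enum {
  DEMO_FLAG_CASELESS_RESTRICT = 0x01,  /* the "r" option */
  DEMO_FLAG_TURKISH_CASING    = 0x02   /* the proposed "turk" option */
};

/* Pack the two independent options into a single flags value. */
static uint8_t demo_pack_flags(int restrict_on, int turk_on)
{
  uint8_t flags = 0;
  if (restrict_on) flags |= DEMO_FLAG_CASELESS_RESTRICT;
  if (turk_on)     flags |= DEMO_FLAG_TURKISH_CASING;
  return flags;
}

/* Count how many distinct flag values the two options can produce. */
static int demo_count_combinations(void)
{
  int seen[4] = {0, 0, 0, 0};
  int count = 0;
  for (int r = 0; r <= 1; r++) {
    for (int t = 0; t <= 1; t++) {
      uint8_t f = demo_pack_flags(r, t);
      assert(f < 4);                 /* two bits give at most four values */
      if (!seen[f]) { seen[f] = 1; count++; }
    }
  }
  return count;
}
```

Because each option can change inside the pattern, every caseless back reference would need to carry its own copy of such a value rather than rely on one pattern-wide setting.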
I completely forgot that it is not a pre-pattern. I simply used pattern flags here: No test failed. Anyway it could be fixed by passing extra options. |
JIT fix ae11878 Added some tests as well. It turned out that DNREFI support had not been added to the JIT :( Feel free to use the commit, no need to mention me in the patch. |
I see the CI tests are failing. Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:
Doing all the options handling at compile time should also in theory make matching a little bit faster. PCRE1 still uses the stack for backtracking, and the above change reduced the number of arguments to the recursive function, which helped with stack usage. This is not relevant for PCRE2 since the 10.30 refactoring.
I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).
As for whether to add a new opcode or an argument, either will require checking every reference to the original, so much the same amount of work. We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something? Eight is rather a lot; if I'm right about that, then perhaps an argument makes more sense. But I don't really mind which you do, though multiple opcodes might execute faster?
BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match? |
IMHO the new "turk" flag would be incompatible with of course I might be wrong, since I haven't seen the "turk" flag implementation but I am assuming it implies the 2 entries in CaseFolding.txt we skip:
|
Oh, a |
Fixed in ac76507; took me a while to find the "standard" spelling though since it seems only sljit has those until now. |
Thank you Philip, that's really useful and interesting!
There's no reason to regret it now, it's most-flexible and works well.
Yes, it would be four each, you've counted correctly.
Aha, no! Because the other For caseless-restrict, it really is just the REFI and DNREFI code that needs to care about the specifics of the case equivalence. |
I think they're compatible, logically. Maybe a user doesn't like the "Kelvin sign"? Not a problem - you can remove that, and you still get to choose whether to have the Turkish mappings or not. To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it? |
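As a rough illustration of the "Kelvin sign" case: under caseless-restrict, an ASCII character and a non-ASCII character are never treated as case-equivalent. The sketch below uses a deliberately tiny, hypothetical folding function (ASCII letters plus U+212A and U+017F); it is not PCRE2's matching code, and the demo_ names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Toy case folding: ASCII letters plus two non-ASCII code points that
   Unicode folds to ASCII letters. Everything else folds to itself. */
static uint32_t demo_fold(uint32_t c)
{
  if (c >= 0x41 && c <= 0x5A) return c + 0x20;  /* 'A'..'Z' -> 'a'..'z' */
  if (c == 0x212A) return 0x6B;                 /* KELVIN SIGN -> 'k' */
  if (c == 0x017F) return 0x73;                 /* LONG S -> 's' */
  return c;
}

static int demo_is_ascii(uint32_t c) { return c < 128; }

/* Caseless comparison of two code points. With restrict_on set, a pair
   that mixes ASCII and non-ASCII is rejected before any folding. */
static int demo_caseless_equal(uint32_t a, uint32_t b, int restrict_on)
{
  assert(a <= 0x10FFFF && b <= 0x10FFFF);
  if (a == b) return 1;
  if (restrict_on && demo_is_ascii(a) != demo_is_ascii(b)) return 0;
  return demo_fold(a) == demo_fold(b);
}
```

With restrict off, 'k' and U+212A compare equal; with restrict on they do not, while 'K' and 'k' still match. A Turkish-specific mapping would be a separate decision layered on the folding step, which is why the two options can in principle be chosen independently.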
Thank you very much all three of you for your help, on this very fussy little detail fix! |
because it is useful only in a specific context (see #11, although the Perl link I posted earlier is probably clearer IMHO), where the user intentionally wants to make sure that a caseless match is only within (or not) ASCII to avoid "surprises". a "turk" flag user wants to have Unicode characters, and indeed will add 2 more cross between ASCII and Unicode when doing caseless matches, which is what you are correct it doesn't impact the same characters, and your interpretation of how it could work together is valid, but the point I was trying to make is that |
@carenas I see |
I hope Perl does not plan to use 'r' for something different. |
9818d1b to 609d753
Rebased, and fixed to "Fall through" for consistency. |
Minor nitpick, but since we are now using a flag instead of multiple opcodes, the following also needs updating in HACKING IMHO:
Lines 362 to 369 in dcbf9a0
Changeable options
------------------
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
some others may be changed in the middle of patterns by items such as (?i).
Their processing is handled entirely at compile time by generating different
opcodes for the different settings. The runtime functions do not need to keep
track of an option's state. |
LGTM
b330274 to a7657d5
Rebased, squashed, and fixed the conflicts with the PR you just merged. |
Usually Philip follows up with some documentation in ChangeLog, which this change (and most others) were missing. |
Previously, when I wanted a new (?some-letter) I used an upper case letter (e.g. (?J)) to try to keep away from potential Perl changes, but (?R) was already in use so I must have decided to take a chance on (?r).
Re ChangeLog: I've been thinking about this. Since the move to Git the contents of ChangeLog haven't really kept in step the way they used to, though as @carenas says, I've done some post hoc additions from time to time. Over the years, I've found it useful as a way of remembering what changed when, but perhaps others are less interested. (Reading the PCRE1 ChangeLog from the start - version 0.91 - is historically instructive as it reminds one of how much has changed since 1997.) The other thing I've used ChangeLog for is updating the NEWS file from it just before a release. I think there are perhaps three possible ways to go:
What do people think? |
I didn't mean for my comment to start a policy discussion; it was just a reminder of a task that needed to be done, and indeed an excuse for me to do it if it wasn't tackled independently (as shown in #519). I do think, though, that keeping the ChangeLog in good shape is important, and also that it is done in a timely way so that each change can be described properly (especially considering that commit and PR descriptions are not currently used consistently enough to serve as the source for an automated replacement). Updating it together with the committed changes is not ideal, though (as that will likely result in conflicts), so probably 2 might be the only reasonable short-term option? |
When you introduced caseless_restrict earlier in the year, it looks like you forgot to add it to the OP_REFI (and OP_DNREFI) comparison.
It's actually a bit tricky to add, because it seems the "flags" that are tracked during compile are not written out to the bytecode.
So, I added one extra field to OP_REFI to record the caseless flag. (Currently, it's just "caseless_restrict", but as I mentioned, I'll be adding "turkish_casing" in a future PR.)
I think there's no other way to do it? I hope the flags aren't stored in the bytecode already and I just missed it.
It seems a bit wasteful, perhaps, to spend a uint8 (or uint32 in the case of the 32-bit library), but I think you basically forced this outcome back when you added the (?r) flag allowing caseless_restrict to be varied through the pattern.
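A rough sketch of what the extra field means for the bytecode layout, assuming an 8-bit library where a group number occupies two code units. The DEMO_ opcode numbers, sizes, and helper names are made up for illustration and are not PCRE2's actual definitions. The point is that the caseless back reference grows by one code unit, so any code that hardcodes its length needs revisiting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbers and sizes for an 8-bit code unit library. */
enum { DEMO_OP_END = 0, DEMO_OP_REF = 1, DEMO_OP_REFI = 2 };
enum { DEMO_IMM2_SIZE = 2 };

/* One length entry per opcode, so that no caller hardcodes a size. */
static const uint8_t demo_op_lengths[] = {
  1,                        /* DEMO_OP_END */
  1 + DEMO_IMM2_SIZE,       /* DEMO_OP_REF:  opcode + group number */
  1 + DEMO_IMM2_SIZE + 1    /* DEMO_OP_REFI: opcode + group number + flags */
};

/* Emit a caseless back reference; returns the code units written. */
static size_t demo_emit_refi(uint8_t *code, unsigned group, uint8_t flags)
{
  assert(group <= 0xFFFFu);
  code[0] = DEMO_OP_REFI;
  code[1] = (uint8_t)(group >> 8);       /* high byte of group number */
  code[2] = (uint8_t)(group & 0xFFu);    /* low byte of group number */
  code[1 + DEMO_IMM2_SIZE] = flags;      /* the new extra field */
  return demo_op_lengths[DEMO_OP_REFI];
}

/* Read back the group number stored after the opcode. */
static unsigned demo_refi_group(const uint8_t *code)
{
  return ((unsigned)code[1] << 8) | code[2];
}

/* Read back the caseless flags stored after the group number. */
static uint8_t demo_refi_flags(const uint8_t *code)
{
  return code[1 + DEMO_IMM2_SIZE];
}

/* Step to the next item using the table rather than a literal offset. */
static const uint8_t *demo_next_op(const uint8_t *code)
{
  return code + demo_op_lengths[code[0]];
}
```

In a 32-bit library the same field would occupy a whole 32-bit code unit, which is the cost mentioned above.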