-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Turkish 'i' casing #521
Add support for Turkish 'i' casing #521
Conversation
maint/GenerateUcd.py
Outdated
Turkish case-equivalences. */ | ||
const uint32_t PRIV(ucd_turkish_dotted_i_caseset) = %d; | ||
const uint32_t PRIV(ucd_turkish_dotless_i_caseset) = %d; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel very happy about adding these.
I wanted a simple and vaguely efficient way to transform /[i]/i
→ /[iİ]/
at compile-time, so I wouldn't need to change anything in the matcher. How to neatly represent a set of exactly two characters? I thought I could re-use the existing PT_CLIST mechanism.
It's vaguely tidy. But even better would be an "OP_2CHAR" that gives a dedicated efficient encoding for either-of-two characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using PT_CLIST is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In another thread, I discussed getting rid of PT_CLIST, in which case I'd replace it with a simpler "OP_2CHAR".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a new OP_VCHARI
that would point to the correct entry on this table depending of PCRE2_EXTRA_TURKISH_CASING. They are always 2 characters equivalences anyway while the casesets was meant to be used by 3 or more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is, that each of these basic opcodes requires repeated variants, and that is already a lot of code cloning in the matcher. Implementing it is easy in jit, but we should not forget about the interpreter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be worth replacing PT_CLIST with "OP_2CHAR" in a future PR. It's not necessary right now though, just to get the Turkish feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OP_2CHAR seems a reasonable idea for this very special case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope a 2CHAR could also replace CHARI. Should be faster than reading UCD records. The CHARI could be limited to 8 bit only.
@@ -5293,7 +5296,7 @@ restriction is in force). Sometimes we can just extend the original range. */ | |||
|
|||
if ((options & PCRE2_CASELESS) != 0) | |||
{ | |||
#ifndef SUPPORT_UNICODE | |||
#ifdef SUPPORT_UNICODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be an existing bug, which was harmless
@@ -737,6 +737,13 @@ def write_bitsets(list, item_size): | |||
if x > 127 and x + other_case[x] < 128: | |||
other_case[x] = 0 | |||
|
|||
# Append a couple of extra caseless sets (unreferenced by the record objects) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the records exist in the source file and were skipped (they have a T "type" and we only process the ones with "C" or "S".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. They are just a fixed pair of well-known mappings for i / I
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about adding d5fbf94 on top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll think about it. I'm still not sure we even want the Turkish characters in there, it's a temporary hack really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is just a slightly cleaner hack IMHO; and even if somehow OP_CHAR2
materializes, doing something more data driven and with less hardcoding would be easier to adjust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better remove the non standard (?t)
, it would seem that the extra ucd classes might not be needed if you are also overloading ucd_casesets
This code adds many specializations, and I am still thinking how could we optimize it. If I understand correctly, the Turkish casing removes the Their choices for constants does makes it hard. Maybe some macros could help. |
Yes, exactly correct. The most-annoying thing is that even infects the ASCII fast-path processing, not just the slower UCD record access. Thankfully, there aren't loads of places where PCRE2 needs to worry about casing of characters. And some of them don't need to be updated - for example, we don't need change the idea of casing used by the "first code-unit" support. What I think we want:
So the Turkish (and ASCII-restrict) casing should really only affect whether characters are compiled to standard or special-cased opcodes (...and also the OP_REFI impact). That's not too bad, it's relatively restricted and clear. I've largely only had to add Turkish pollution to chunks of code that already had ASCII-caseless-restrict pollution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed something. It seems the turkish flag overwrites the restricted flag. Is this correct?
How the two flags should work together?
No, it shouldn't do. They ought to work together fine. CaseFolding.txt describes the case-equivalences between characters. CASELESS_RESTRICT unlinks the equivalences (from that file) which cross the ASCII boundary; it only affects the "k/Kelvin" and "s/Long-S" equivalences. TURKISH_CASING sets new equivalences for the 4 "i" characters. I should add tests for them both at once. It looks like I forgot to do that. |
Btw, when both restricted and turkish are enabled, the |
That is what I suggested as well but couldn't convince Nick; the documentation isn't clear about the expected behaviour (which I find confusing) either:
And could even qualify as a regression for #11. |
I've added documentation in the |
Zoltan, using the "PT_CLIST" for the Turkish I characters causes an assertion failure in SLJIT:
Maybe now would be the time to introduce OP_2CHAR... although that would mean even more JIT work to implement it (but fairly simple at least, I hope). I've also tried to make the change in |
Let me do more free work for microsoft: Btw, it would be good to test ucp (no utf) caseless turkish backreferences. Is there tests for turkish + restricted? |
Picking up on a couple of points: As far as (?r) goes, we could introduce (*CASELESS_RESTRICT) without causing any incompatibility. At the same time, the use of (?r) could be deprecated for a release or two before removal. I agree with NIck (and therefore disagree with Carlo) that the need for this to vary in different parts of the pattern is minimal. However, I note that the nearest thing Perl has to (?r) is (?aa) which can be varied throughout the pattern. This output is from perltest.sh:
Perl's /aa is not the same as PCRE2's /r because /aa also makes \d etc into ASCII-only items. That must have been why I implemented (?aD) etc in PCRE2 as separate switches. Perhaps we should change (?r) into (?aR) and (?t) into (?aT), though I'm just throwing out ideas here. Oops, no, (?aT) is already taken. (?aI) is free.... This is all a mess. As far as EBCDIC is concerned, the code should compile in two different ways in an EBCDIC environment: (a) dealing with 8-bit EBCDIC as input and (b) reading Unicode encoded as UTF-8 as input. This was requested for the processing of UTF-8 files in an EBCDIC environment. In case (a) there is clearly no need for ASCII/Unicode support. As I understand it, IBM's z/OS has a native environment, which is what Ze'ev Atlas works in, but also "Unix system services" which "allows UNIX applications from other platforms to run on IBM System z mainframes running z/OS" (Wikipedia). In theory, there is a small amount of testing that can be done by compiling PCRE2 --with-ebcdic on an ASCII system, but it is many years since I tried it. |
but there is no UTF-8 in EBCDIC systems, are you sure it is not UTF-EBCDIC which is what Perl supports with if what they were actually using is UTF-16, then the code is correct because it is compatible with UTF-8, though and uses the same scalars for all 4 characters, but if it is the other UTF which seems to match better with EBCDIC CCSID 1026 then at least U+69 will be encoded as x89 and U+131 (AKA dotless i) would be in x79 (which are still 1 byte encodings) even with "UTF" |
You certainly can have UTF-8 files on an EBCDIC system. There's nothing magic about an EBCDIC system that stops you storing any binary on it. I think in UTF mode on an EBCDIC system, PCRE2 appears to just do... use normal UTF. |
I have:
There is one test case that's failing for the JIT, which I'll have a go fixing in the morning tomorrow. Otherwise, I think this is basically good to go. |
76b8567
to
d414c45
Compare
doc/html/pcre2unicode.html
Outdated
case-folding mappings crossing the ASCII boundary (the Turkish 'i' ones, as | ||
requested), but the other mappings (Kelvin sign and long S) are removed. | ||
If you require exact ASCII matching of ASCII characters in patterns, then | ||
you must use PCRE2_EXTRA_CASELESS_RESTRICT without PCRE2_EXTRA_TURKISH_CASING. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we discuss this? This means turkish is stronger than restrict, although they can exists together (simply removes i-I casing).
Restrict was promoted as a constraint, not an effect which can be changed later:
https://php.watch/versions/8.4/pcre2-regexp-syntax-changes
"When applied, it prevents the matching across ASCII and non-ASCII characters."
I don't know what is the best decision, since it is a new feature, but confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You and Carlo both think that it's confusion. So maybe we could just forbid them from being used together.
That's not easy though... since the (?r)
flag can be set mid-pattern!
If you use Turkish, you really do need the case mappings that cross the ASCII boundary. It would be broken to have some half-way behaviour, that disassociates the 'i' and 'I' characters, but doesn't add the association between those characters and U+130, U+131.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support the mutual exclusivity. I think it was even suggested before. If somebody really needs it in the future, we can think about that. And hear his use case first...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use Turkish, you really do need the case mappings that cross the ASCII boundary.
Since we are going to produce an error, we should also (at least for now) trigger it when the utf/ucp flags aren't also provided, because turkish_caseless
ONLY knows how to make that association in that case.
if an 8-bit codepage is in use, is also not clear what turkish_caseless
should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should be the behaviour be if you do /...(?r: foo ).../i,turkish_casing
? Should it deactivate the turkish casing, when ASCII-restrict is activated, so i and I become equal again? Or just "break" the turkish casing, leaving all 4 of the 'i' characters not equal to each other?
They're both nasty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind as long as the behaviour is clearly documented. The /...(?r:foo).../i,turkish_casing example is a plausible scenario for a pattern that is looking for something ascii in the middle of some Turkish, in which case turning off turkish casing does make sense. I think this is, in fact, what Perl does. With a hacked version of perltest.sh (I will tidy this up and commit in due course) I get this:
Locale: tr_TR.utf8
Perl v5.40.0
/\x{130}/gi,utf
\x{130}\x{131}iI
0: \x{130}
0: i
/\x{131}/gi,utf
\x{130}\x{131}iI
0: \x{131}
0: I
/(?aa)\x{130}/gi,utf
\x{130}\x{131}iI
0: \x{130}
/(?aa)\x{131}/gi,utf
\x{130}\x{131}iI
0: \x{131}
In other words, U+0130 matches i and U+0131 matches I in Turkish \i mode, but if ascii is forced, they don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to return a compiler error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A compiler error would not be compatible with Perl, and there is a plausible use case for what Perl does.
@@ -2835,8 +2838,8 @@ be quantified. */ | |||
|
|||
/* Here's the actual function. */ | |||
|
|||
static int parse_regex(PCRE2_SPTR ptr, uint32_t options, BOOL *has_lookbehind, | |||
compile_block *cb) | |||
static int parse_regex(PCRE2_SPTR ptr, uint32_t options, uint32_t xoptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We removed (?t)
and replaced with (*TURKISH_CASING)
, so the xoptions can't be read from the compile-context anymore (since they can now be changed with a start-of-pattern option).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we should revert the change then.
It looks like my incorrect suggestion was causing the failure:
The first character of the caseset needs to be checked for ascii, not the current character. The test was good to reveal it! Thank you! |
I have updated the CASELESS_RESTRICT and TURKISH_CASING behaviour, as requested.
Finally, these checks are not triggered if CASELESS_RESTRICT is set mid-pattern. In this case, if both CASELESS_RESTRICT and TURKISH_CASING are set, then Turkish is totally deactivated as long as CASELESS_RESTRICT is in force. I have updated the docs; but I need to update the tests to match now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this code much better.
@@ -2835,8 +2838,8 @@ be quantified. */ | |||
|
|||
/* Here's the actual function. */ | |||
|
|||
static int parse_regex(PCRE2_SPTR ptr, uint32_t options, BOOL *has_lookbehind, | |||
compile_block *cb) | |||
static int parse_regex(PCRE2_SPTR ptr, uint32_t options, uint32_t xoptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we should revert the change then.
src/pcre2_ucd.c
Outdated
Turkish case-equivalences. */ | ||
|
||
const uint32_t PRIV(ucd_turkish_dotted_i_caseset) = 112; | ||
const uint32_t PRIV(ucd_turkish_dotless_i_caseset) = 115; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth to have two variables for this? A +3 could be enough, since they are together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carlo would probably complain if he saw a hardcoded +3 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a #define
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ironically, I think it is less ugly if hardcoded all the way like this, but sure hope we can get OP_2CHAR
done soon enough to remove this ASAP
Tests updated. Surely we're done now...? I hope! |
I think the extra restrictions put up on (?r) were unnecessarily harsh so have them prtially reverted in a couple of followup commits in my fork of this so it matches Perl better (details in the commit messages themselves). we are missing documentation on the interaction between rebasing the whole thing on top of HEAD and squashing all commits with a nice descriptive commit message also missing. |
22fc8e7
to
39248b4
Compare
OK, thanks Carlo. I don't really see that we need to support I've accepted your two commits, and rebased & squashed. |
New flag: PCRE2_EXTRA_TURKISH_CASING, and pre-pattern flag (*TURKISH_CASING). Also added a pre-pattern flag (*CASELESS_RESTRICT) for this existing flag.
39248b4
to
1807a54
Compare
This patch looks good, maybe we should land it, and discuss / fix the remaining issues later. |
I think the use case is when the application sets CASELESS_RESTRICT and the creator of the pattern uses (*UTF) or (*UCP). I will do the merge and have a play. |
Does anybody understand why the scorecards action is now failing? |
Online services are not available. It says: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. Ignore this message. |
I will ignore the message, but we now have a red cross on the Actions tab. I am not sure who set up this action, or exactly what it does. |
Me neither. It says it is about security risks: https://github.com/PCRE2Project/pcre2/blame/master/.github/workflows/scorecards.yml |
It will probably work again in the next merge, it was just rate limited this time. |
No description provided.