Add substitute case callout function #512
Just a question. The substitute code is not complex; wouldn't it be a better idea to duplicate it for your use case? I would probably do this to reduce the dependence on generic code.
Haha! Yes, I totally agree. I have duplicated the code internally, twice, to make this change, but each time the PR into the Excel codebase has been rejected. The Excel managers really believe that PCRE2 and all its functions like So unfortunately, if I want to make any improvements for our application (like better Unicode support) then I have to do that in the official code.
I'm flattered. :-)
I never thought that PCRE2 was that important from a PR perspective. Is "ship that for the next forty years" sarcasm, or the actual plan? I am still curious about your longer-term collaboration plans with us. This can be discussed in private emails.
I have made some very minor updates to the documentation, including updating the dates and PCRE2 version number (note that for many doc files the date is both at the top and the bottom).
Don't forget we need to also update the generated
Sigh. Yes, of course. Done.
That's not really sarcasm, it is basically the plan. Excel is an "end-user programming language" in academic jargon. People don't configure it (like Apache or Exim), people actually write "applications" inside it. And there are billions of users, so any backwards-incompatible change has to be managed very carefully. The regex feature is going into Excel's formula language, which is Excel's "standard library". The level of caution is comparable to that of .NET or Java: never make a backwards-incompatible change, because customers depend on stability. But Excel carries this a level further: even a bugfix is regarded as "backwards-incompatible", so behaviours are updated very, very slowly. Excel is 41 years old currently, and has billions of users (literally, according to public estimates). I think Microsoft is intending it to stay in business for another 40 years. The short answer is: we will be updating our version of PCRE2 rather rarely, and very cautiously.
Thank you, I'm very grateful!
My concern (and I could be wrong, since I have only skimmed the PR and haven't seen the new API being used by an application) is how you are planning to handle the "obvious" bug that will come from the 1-to-1 character limitation with for ex: To clarify, I am not objecting to it; I just think it will need to be extended eventually anyway, so it might be better if it works with multiple characters to begin with (at least for its output), especially considering the long-term commitment.
That's a good question, it is a "bug". However, Perl has the same bug if you try In general, regex engines have poor or non-existent support for multi-character sequences, so it's consistent with everything else in PCRE2 that we don't handle these. Supporting it would be a substantial effort, with quite a major code change to I decided our goal should be the same as PCRE2's case-folding support: to function correctly for the "simple" (one-to-one) character mappings, and not support the multi-character mappings. That's really more than anyone expects from a regex engine, as measured against other engines at least.
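To make the "simple" versus multi-character distinction concrete, here is a minimal sketch in C. The helper names `simple_upper` and `full_upper` are hypothetical and are not part of PCRE2; the tiny mapping tables cover only a few code points for illustration. Unicode gives U+00DF (German Eszett) no simple uppercase mapping, while its full uppercase mapping is the two-character sequence "SS", which a one-to-one interface cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): one-to-one uppercase
   mapping for ASCII letters and Greek small beta only. */
static uint32_t simple_upper(uint32_t c)
{
    if (c >= 'a' && c <= 'z') return c - 0x20;   /* ASCII a..z */
    if (c == 0x03B2) return 0x0392;              /* Greek beta -> Beta */
    return c;   /* U+00DF has no simple uppercase mapping: unchanged */
}

/* Hypothetical helper: full uppercase mapping, which may expand one
   code point into two. Returns the number of code points written. */
static size_t full_upper(uint32_t c, uint32_t out[2])
{
    assert(out != NULL);
    if (c == 0x00DF) {            /* Eszett uppercases to "SS" */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    out[0] = simple_upper(c);
    return 1;
}
```

With the one-to-one mapping, U+00DF passes through unchanged; only an interface whose output can be longer than its input can produce "SS".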
We need to add that
Greek beta is a completely different character to German Eszett (although they do look extremely similar). They have no Unicode properties in common. Similarly, the Greek uppercase Beta is completely different to uppercase Latin "B" (although fonts often use exactly the same glyph). Same goes for the Cyrillic characters with identical appearance to Latin.
It does. U+00DF is German "Eszett", U+03B2 is Greek lower case beta.
Oh you're right, I'm so sorry. I didn't have utf8 enabled in my test. I get the same result as you.
BTW, I don't disagree with the goal, I am just concerned that this API will be baked in the next release and when that goal changes, we will need ANOTHER API to support it. I see a few options:
This arises from the discussion we had a few weeks ago on the Excel call.
We discussed improving the case-handling of the pcre2_substitute function, but in general, Philip seemed not overly-enthusiastic, simply because correct locale-aware handling of user-visible strings is hard.
I agree. This PR adds a callout (callback) function to allow a third-party Unicode engine to be used for user-visible string processing. This can be used by applications to do locale-aware case transformations.
It's still a one-char-to-one-char mapping, which is simplistic, but allows support for more locales than the current system.
Aside: PCRE2's current Unicode handling for pattern matching (using CaseFolding.txt data) is really rather good. This should not be locale-aware, since case-equivalence of characters is defined in a locale-independent manner. The uppercasing/lowercasing performed by pcre2_substitute really is a special case.
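As an illustration of the idea, here is a minimal sketch in C of a one-char-to-one-char case callout. Everything in it is an assumption made for illustration: the type `case_callout_fn` and the functions `default_case`, `turkish_case`, and `convert_case` are hypothetical names, not the interface added by this PR, and the built-in fallback handles ASCII only. The Turkish dotted and dotless "i" pair shows a locale-specific mapping that is still one-to-one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callout type: maps one code point to one code point. */
typedef uint32_t (*case_callout_fn)(uint32_t c, int to_upper, void *user_data);

/* Locale-independent fallback (ASCII only in this sketch). */
static uint32_t default_case(uint32_t c, int to_upper)
{
    if (to_upper && c >= 'a' && c <= 'z') return c - 0x20;
    if (!to_upper && c >= 'A' && c <= 'Z') return c + 0x20;
    return c;
}

/* Example application callout applying Turkish rules for i/I. */
static uint32_t turkish_case(uint32_t c, int to_upper, void *user_data)
{
    (void)user_data;                           /* unused in this sketch */
    if (to_upper && c == 'i') return 0x0130;   /* i -> dotted capital I */
    if (to_upper && c == 0x0131) return 'I';   /* dotless i -> I */
    if (!to_upper && c == 'I') return 0x0131;  /* I -> dotless i */
    if (!to_upper && c == 0x0130) return 'i';  /* dotted capital I -> i */
    return default_case(c, to_upper);
}

/* Converts the case of a buffer of code points in place, deferring to
   the callout when one is supplied. */
static void convert_case(uint32_t *text, size_t len, int to_upper,
                         case_callout_fn callout, void *user_data)
{
    assert(text != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        text[i] = callout ? callout(text[i], to_upper, user_data)
                          : default_case(text[i], to_upper);
}
```

Passing `NULL` as the callout keeps the built-in behaviour, so only applications that need locale-aware output have to bring their own Unicode engine.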