Add utf8proc_isequal_normalized #298
base: master
Conversation
A thread-local global variable... I see how this is attractive, and this would work in plain C/C++ code. I'm not quite sure this will work in Julia, or with other run-time systems that may set up their own threading mechanisms. I say this because Julia has its own calling convention (its own ABI), and I am not sure that Julia's …
I'm not sure what you mean; the thread-local is just my test/example code. It's not required at all to use the new function I added in the commit.
Right, my bad.
I'm not a huge fan of the complication of a re-entrant API. I really think the algorithm can be re-worked to avoid allocations entirely by using a fixed-length (stack) array of the most recent character for each combining class.
A version that doesn't require scratch buffers would be nice, but with my current understanding of the problem I don't see how that's possible without sacrificing performance. Valid Unicode sequences can contain arbitrarily many combining characters in any order, and when comparing two un-normalized UTF-8 strings you need some way to detect when they would normalize to the same thing even if the un-normalized versions are in different orders. The Julia algorithm did this by decomposing into the scratch buffers and sorting the combining characters there so they'd be ready for simple direct comparison; when I ported it to C I copied the normalization sorting code that already existed elsewhere in the same source file.

The worst-case scenario is that both strings are entirely one long sequence of combining characters in different orders, meaning the two scratch buffers used for sorting each need to be able to hold the entire source string decomposed to 32-bit code points. Any solution that does not require scratch buffers will necessarily have terrible performance past some threshold of consecutive combining characters, which can lead to denial-of-service attacks. At least with the scratch-buffer approach, the maximum required buffer size can be computed in advance and rejected if it's undesirable, and then the cost is just sorting the combining sequences for comparison.

The scratch-buffer approach also lends itself well to using local stack buffers as a fast path for the vast majority of "normal" input strings that don't have too many combining characters in a row, and the choice of how to handle the less usual strings falls to the user, if they even want to bother with it at all. Hence my suggestion to possibly add a simple wrapper that makes that decision internally, so users who aren't picky at the moment can use the simplified wrapper until they need to optimize. It's possible I'm overestimating the problem though; I don't have a full grasp of Unicode combining classes and combining characters, so if you think it's simpler than this please do tell me.
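For reference, here's a minimal sketch of that scratch-buffer idea using only utf8proc's existing public API (a caller-provided scratch buffer, `utf8proc_decompose` to decompose and canonically sort, then a direct comparison); the actual port is incremental and handles the details more carefully, so treat this as an illustration rather than the PR's implementation:

```c
#include <string.h>
#include <utf8proc.h>

/* 1 if canonically equivalent, 0 if not, -1 on invalid UTF-8,
 * -2 if the caller-provided scratch buffers are too small. */
static int isequal_via_scratch(const utf8proc_uint8_t *a, utf8proc_ssize_t alen,
                               const utf8proc_uint8_t *b, utf8proc_ssize_t blen,
                               utf8proc_int32_t *bufa, utf8proc_int32_t *bufb,
                               utf8proc_ssize_t bufsize)
{
    /* utf8proc_decompose also performs the canonical sort of combining marks */
    utf8proc_ssize_t ra = utf8proc_decompose(a, alen, bufa, bufsize,
                                             UTF8PROC_DECOMPOSE | UTF8PROC_STABLE);
    utf8proc_ssize_t rb = utf8proc_decompose(b, blen, bufb, bufsize,
                                             UTF8PROC_DECOMPOSE | UTF8PROC_STABLE);
    if (ra < 0 || rb < 0) return -1;              /* invalid input */
    if (ra > bufsize || rb > bufsize) return -2;  /* required size was returned */
    return ra == rb && !memcmp(bufa, bufb, (size_t)ra * sizeof(utf8proc_int32_t));
}
```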
This is not obvious to me, because you don't have to sort within each combining class, and there are only a small number of possible combining classes. For example, the simplest allocation-free algorithm would be to scan through the combining characters one combining class at a time, searching/comparing only the characters in that class before going back and searching the next class. This is a linear-time algorithm, with cost at worst proportional to the number of combining characters multiplied by the number of combining classes. And I suspect that you could be more efficient than this by storing a fixed amount of state, proportional to the number of combining classes.

(Furthermore, in a typical string the number of combining characters per character is almost always going to be small, rarely more than two or three.)

Getting this right is tricky, but will be worth it, and I don't want to lock us into a re-entrant API in the meantime.
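A rough sketch of the class-at-a-time scan (a hypothetical helper operating on runs that have already been decomposed to code points; a real version would have to work on the raw UTF-8 and decompose on the fly):

```c
#include <stdbool.h>
#include <utf8proc.h>

/* Two runs of combining characters (combining class > 0) are canonically
 * equivalent iff, for every combining class, the subsequence of characters
 * with that class is identical in both runs (canonical reordering is a
 * stable sort by class).  Worst case: run length x number of classes. */
static bool combining_runs_equal(const utf8proc_int32_t *a, utf8proc_ssize_t alen,
                                 const utf8proc_int32_t *b, utf8proc_ssize_t blen)
{
    for (int cc = 1; cc <= 255; cc++) {
        utf8proc_ssize_t i = 0, j = 0;
        for (;;) {
            /* advance to the next character of class cc in each run */
            while (i < alen && utf8proc_get_property(a[i])->combining_class != cc) i++;
            while (j < blen && utf8proc_get_property(b[j])->combining_class != cc) j++;
            if (i >= alen && j >= blen) break;        /* class cc exhausted in both */
            if (i >= alen || j >= blen) return false; /* extra class-cc characters */
            if (a[i] != b[j]) return false;           /* same class, different character */
            i++; j++;
        }
    }
    return true;
}
```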
If you think it's possible then I'll take another crack at this, after researching the combining-class machinery a bit more. I probably won't get around to it until this weekend at the earliest, though.
I've reworked the algorithm based on your suggestions and advice so that it no longer requires a scratch buffer. I'm still debugging some finer details with some test cases of mine that don't pass as they should, but my debugger is misbehaving so I'm calling it quits for this weekend. In the meantime, how are things looking now with this version? I don't think I'll need to make significant changes to fix whatever bug is present.

New example C++ usage code (to contrast with the original in the OP):

```cpp
#include <cassert>
#include <string_view>
#include <gsl/gsl>
#include <utf8proc.h>

[[nodiscard]] bool isEqualNormalized(std::string_view const a, std::string_view const b)
{
    utf8proc_processing_state_t sa{};
    utf8proc_processing_state_t sb{};
    sa.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(a)), gsl::narrow<utf8proc_ssize_t>(std::size(a))};
    sb.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(b)), gsl::narrow<utf8proc_ssize_t>(std::size(b))};
    utf8proc_isequal_normalized(&sa, &sb, UTF8PROC_STABLE);
    assert(sa.error == 0 && sb.error == 0);
    return sa.str.len == 0 && sb.str.len == 0;
}
```
```c
        combining_initialized = false;
      }
      if (a_combining_class != b_combining_class) {
        /* mismatch detected */
```
Not sure why this is a mismatch, rather than just two characters that haven't been sorted?
This is inside the branch for at least one of the combining classes being 0; the sorting is only relevant for nonzero combining classes, to my understanding. I'll need to debug it more to be sure though.
I fixed the bugs affecting my own internal tests and also added a comment better clarifying this part of the code.
```c
 * potentially quite far ahead of where the updated string views point. Use
 * the `str_at_error` members to help with analyzing errors.
 */
UTF8PROC_DLLEXPORT void utf8proc_isequal_normalized(utf8proc_processing_state_t *a, utf8proc_processing_state_t *b, utf8proc_option_t options);
```
This API seems a little complicated? Can't it just take two strings and return a boolean value?
What boolean value should be returned if both strings have the same invalid UTF-8 at the same points? What if that invalid UTF-8 isn't quite at the same points but would be after sorting combining characters?
Or return 1 if they are equal, 0 if they are unequal, and a negative `UTF8PROC_ERROR` code upon an error.
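For concreteness, something like this (a hypothetical name and signature, just to illustrate the suggested shape):

```c
/* Returns 1 if the strings are canonically equivalent, 0 if not,
 * or a negative utf8proc error code on invalid input. */
UTF8PROC_DLLEXPORT utf8proc_ssize_t utf8proc_isequal_normalized_simple(
    const utf8proc_uint8_t *a, utf8proc_ssize_t alen,
    const utf8proc_uint8_t *b, utf8proc_ssize_t blen,
    utf8proc_option_t options);
```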
I considered that, but it seems awkward to me: it encourages people to compare the result directly to 0 or 1, and if they don't compare it and just stick it in an if statement, an error will be treated as equal, because a negative error code is nonzero and therefore truthy. So you still need to store the result in a separate variable and inspect it anyway. There's also the issue of not knowing right away which string caused the error (or whether both did, with different errors), which may be difficult to determine given that there are two different internal functions the error could have come from. Also, to provide the functionality of finding the longest common sequence, the input parameters still have to be in/out, or the return value has to be a structure of multiple values. Any simplification I can think of removes one or more valuable use cases.
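To show the failure mode concretely (using the hypothetical simple signature sketched above):

```c
/* A negative error code is nonzero, so this branch is taken both when
 * the strings are equal AND whenever an error occurs. */
if (utf8proc_isequal_normalized_simple(a, alen, b, blen, UTF8PROC_STABLE)) {
    /* silently treats errors as equality */
}
```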
> There's also the issue of not knowing right away which string caused the error
I'm less worried about that. Error conditions should be rare, so in those cases you could simply re-normalize each string separately to see which went wrong (if you need that level of detail at all).
> Also, to provide the functionality of finding the longest common sequence
I'm not entirely convinced of this application — finding the longest common sequence in the normalized string, when you only have the non-normalized strings. Especially because the sorting of the combining characters makes it really difficult to say what the "common sequence" corresponds to in the non-normalized strings.
> I'm less worried about that. Error conditions should be rare, so in those cases you could simply re-normalize each string separately to see which went wrong (if you need that level of detail at all).
Errors may be rare, but their possibility, and the necessity of handling them, affects the way we can design the code in the non-error path. I have a large structure of unsanitized data, some of which is allowed to be normalized and some of which must be treated as binary, and I want to search and compare within it without paying the cost of normalizing everything, since (1) that can't be done in place and (2) I want to avoid accessing or modifying areas that don't need to be touched.

When I have a constant unsanitized string on the left-hand side and I want to find other places it appears on the right-hand side, knowing that the error is only on one side makes the code simpler: an error on the left means nothing will compare equal, so we can end early; an error on the right means just this one check can be skipped, and there's still potential for something else to compare equal. If we only get a single error with no indication of which string it's from, I have to either check the left-hand side for erroneous data in advance (expensive in the normal code path, especially with few things to compare against) or add unnecessarily complicated code that checks both strings for the error and decides whether to proceed with the comparison based on which side the error turns out to be in.
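Roughly this shape, using the state-based API (field names taken from my earlier C++ example; `needle` and `candidates` are assumed here to be the same string-view type as the `str` member):

```c
/* Search a needle against many candidates, reacting differently
 * depending on which side an error is reported for. */
for (size_t i = 0; i < ncandidates; i++) {
    utf8proc_processing_state_t sl = {0}, sr = {0};
    sl.str = needle;
    sr.str = candidates[i];
    utf8proc_isequal_normalized(&sl, &sr, UTF8PROC_STABLE);
    if (sl.error) break;     /* bad needle: nothing can ever match, stop early */
    if (sr.error) continue;  /* bad candidate: skip just this comparison */
    if (sl.str.len == 0 && sr.str.len == 0) {
        /* canonically equivalent match found */
    }
}
```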
> I'm not entirely convinced of this application — finding the longest common sequence in the normalized string, when you only have the non-normalized strings. Especially because the sorting of the combining characters makes it really difficult to say what the "common sequence" corresponds to in the non-normalized strings.
The way I coded it currently, the `str` pointers will always be updated to a boundary, never in the middle of combining characters. This is important for allowing re-entrancy after repairing invalid sequences: due to how the combining-class code works, that state is local to the function and thus dropped between calls, so it can't pick back up in the middle, as that would give wrong answers.

It's also really useful for rough sorting of autocomplete suggestions in IDEs: you can search and compare identifiers against what has been typed so far and sort the list based on which ones get the furthest or have the most in common. It doesn't matter if the sorting isn't perfect; it just needs to be fast and push the better matches closer to the top. If we don't expose this information, the alternative is having to store normalized copies of all identifiers and keep them in sync with modifications to the source code, which is messy and uses more memory and power.
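For the autocomplete case, the score is simply how far the comparison got before stopping (a hypothetical helper; `ident_len` is the candidate identifier's original length):

```c
/* Bytes of the candidate consumed before divergence or end of input;
 * larger scores float better matches toward the top of the list. */
static utf8proc_ssize_t match_score(utf8proc_ssize_t ident_len,
                                    const utf8proc_processing_state_t *s)
{
    return ident_len - s->str.len;
}
```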
Maybe we should just provide a simpler wrapper function that works the way you suggest and still also expose the more advanced version for these other use cases?
> If we only get a single error with no indication of which string it's from, I have to either check the left-hand side for erroneous data in advance
Why would you have to check in advance? Why not just check the individual strings only if an error occurs?
Because, as the other half of my sentence explained, it makes the user code more complicated. If you don't check in advance, then on error you have to check both strings, plus have logic to potentially continue with further comparisons while remembering not to re-check both strings again on a later error. That's an unnecessary annoyance when utf8proc already knows which string produced the error and could just tell the user.
I've fixed some bugs with the combining code and clarified the code a little more. I still haven't added tests/fuzzing yet, though, as we're still discussing the public API.
Based on conversation in #101 I have ported the `isequal_normalized` function from Julia to C. This is my first attempt and I haven't updated tests/fuzzing yet; at this stage I mainly want feedback on the API and code style. Once that's settled I'll figure out the other bells and whistles.

Despite my earlier comment, I did actually manage to implement the algorithm without requiring a callback or any memory allocation in the middle of it; instead it's a re-entrant algorithm that updates the string pointers to avoid re-testing parts that have already been deemed equivalent. This makes it easy to detect and fix or handle invalid UTF-8 as well, since the string pointers will be pointing right at the problem on error. I also managed to make the algorithm able to run ahead and try out the rest of the string to determine how big the scratch buffers need to be when they aren't large enough, though it does complicate the implementation slightly.
My example/test code is currently written in C++; it looks similar to this:
After other details are sorted out I can add a plain C example. Not sure if there should also be a `utf8proc_isequal_normalized_simple` that does all that and returns a simpler result. Also not sure if there should be a version that doesn't take the `custom_func`/`custom_data`.