Add utf8proc_isequal_normalized #298
base: master
Conversation
A thread-local global variable... I see how this is attractive, and this would work in plain C/C++ code. I'm not quite sure this will work in Julia, or with other run-time systems that may set up their own threading mechanisms. I say this because Julia has its own calling convention (its own ABI), and I am not sure that Julia's …
I'm not sure what you mean; the thread-local is just my test/example code. It's not required at all to use the new function I added in the commit.
Right, my bad.
I'm not a huge fan of the complication of a re-entrant API. I really think the algorithm can be re-worked to avoid allocations entirely by using a fixed-length (stack) array of the most recent character for each combining class.
A version that doesn't require scratch buffers would be nice, but with my current understanding of the problem I don't see how that's possible without sacrificing performance. Valid Unicode sequences can contain arbitrarily many combining characters in any order, and when comparing two un-normalized UTF-8 strings you need some way to detect when they would normalize to the same thing even if the un-normalized versions are in different orders. The Julia algorithm did this by decomposing into the scratch buffers and sorting the combining characters there so they'd be ready for simple direct comparison; when I ported it to C I copied the normalization sorting code that already existed elsewhere in the same source file.

The worst-case scenario is that both strings are entirely one long sequence of combining characters in different orders, meaning the two scratch buffers used for sorting each need to be able to hold the entire source string decomposed to 32-bit code points. Any solution that does not require scratch buffers will necessarily have terrible performance past some threshold of consecutive combining characters, which can lead to denial-of-service attacks. At least with the scratch-buffer approach, the maximum required buffer size can be computed in advance and rejected if it's undesirable, and then the cost is just sorting the combining sequences for comparison.

The scratch-buffer approach also lends itself well to using local stack buffers as a fast path for the vast majority of "normal" input strings that don't have too many combining characters in a row, and the choice of how to handle the less usual strings falls to the user, if they even want to bother with it at all. Hence my suggestion to possibly add a simple wrapper that makes that decision internally, so users who aren't picky at the moment can use the simplified wrapper until they need to optimize. It's possible I'm overestimating the problem though; I don't have a full grasp of Unicode combining classes and combining characters, so if you think it's simpler than this please do tell me.
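For reference, here's a minimal sketch of that scratch-buffer idea using only utf8proc's existing public API (a caller-provided scratch buffer, `utf8proc_decompose` to decompose and canonically sort, then a direct comparison); the actual port is incremental and handles the details more carefully, so treat this as an illustration rather than the PR's implementation:

```c
#include <string.h>
#include <utf8proc.h>

/* 1 if canonically equivalent, 0 if not, -1 on invalid UTF-8,
 * -2 if the caller-provided scratch buffers are too small. */
static int isequal_via_scratch(const utf8proc_uint8_t *a, utf8proc_ssize_t alen,
                               const utf8proc_uint8_t *b, utf8proc_ssize_t blen,
                               utf8proc_int32_t *bufa, utf8proc_int32_t *bufb,
                               utf8proc_ssize_t bufsize)
{
    /* utf8proc_decompose also performs the canonical sort of combining marks */
    utf8proc_ssize_t ra = utf8proc_decompose(a, alen, bufa, bufsize,
                                             UTF8PROC_DECOMPOSE | UTF8PROC_STABLE);
    utf8proc_ssize_t rb = utf8proc_decompose(b, blen, bufb, bufsize,
                                             UTF8PROC_DECOMPOSE | UTF8PROC_STABLE);
    if (ra < 0 || rb < 0) return -1;              /* invalid input */
    if (ra > bufsize || rb > bufsize) return -2;  /* required size was returned */
    return ra == rb && !memcmp(bufa, bufb, (size_t)ra * sizeof(utf8proc_int32_t));
}
```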
This is not obvious to me, because you don't have to sort within each combining class, and there are only a small number of possible combining classes. For example, the simplest allocation-free algorithm would be to scan through the combining characters one combining class at a time, searching/comparing only the characters in that class before going back and searching the next class. This is a linear-time algorithm, with cost at worst proportional to the number of combining characters multiplied by the number of combining classes. And I suspect that you could be more efficient than this by storing a fixed amount of state, proportional to the number of combining classes.

(Furthermore, in a typical string the number of combining characters per character is almost always going to be small, rarely more than two or three.)

Getting this right is tricky, but will be worth it, and I don't want to lock us into a re-entrant API in the meantime.
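A rough sketch of the class-at-a-time scan (a hypothetical helper operating on runs that have already been decomposed to code points; a real version would have to work on the raw UTF-8 and decompose on the fly):

```c
#include <stdbool.h>
#include <utf8proc.h>

/* Two runs of combining characters (combining class > 0) are canonically
 * equivalent iff, for every combining class, the subsequence of characters
 * with that class is identical in both runs (canonical reordering is a
 * stable sort by class).  Worst case: run length x number of classes. */
static bool combining_runs_equal(const utf8proc_int32_t *a, utf8proc_ssize_t alen,
                                 const utf8proc_int32_t *b, utf8proc_ssize_t blen)
{
    for (int cc = 1; cc <= 255; cc++) {
        utf8proc_ssize_t i = 0, j = 0;
        for (;;) {
            /* advance to the next character of class cc in each run */
            while (i < alen && utf8proc_get_property(a[i])->combining_class != cc) i++;
            while (j < blen && utf8proc_get_property(b[j])->combining_class != cc) j++;
            if (i >= alen && j >= blen) break;        /* class cc exhausted in both */
            if (i >= alen || j >= blen) return false; /* extra class-cc characters */
            if (a[i] != b[j]) return false;           /* same class, different character */
            i++; j++;
        }
    }
    return true;
}
```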
If you think it's possible then I'll take another crack at this, after researching the combining-class machinery a bit more. I probably won't get around to it until this weekend at the earliest, though.
I've reworked the algorithm based on your suggestions and advice so that it no longer requires a scratch buffer. I'm still debugging some finer details with some test cases of mine that don't pass as they should, but my debugger is misbehaving so I'm calling it quits for this weekend. In the meantime, how are things looking now with this version? I don't think I'll need to make significant changes to fix whatever bug is present.

New example C++ usage code (to contrast with the original in the OP):

```cpp
#include <cassert>
#include <string_view>
#include <gsl/gsl>
#include <utf8proc.h>

[[nodiscard]] bool isEqualNormalized(std::string_view const a, std::string_view const b)
{
    utf8proc_processing_state_t sa{};
    utf8proc_processing_state_t sb{};
    sa.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(a)), gsl::narrow<utf8proc_ssize_t>(std::size(a))};
    sb.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(b)), gsl::narrow<utf8proc_ssize_t>(std::size(b))};
    utf8proc_isequal_normalized(&sa, &sb, UTF8PROC_STABLE);
    assert(sa.error == 0 && sb.error == 0);
    return sa.str.len == 0 && sb.str.len == 0;
}
```
```c
        combining_initialized = false;
      }
      if (a_combining_class != b_combining_class) {
        /* mismatch detected */
```
Not sure why this is a mismatch, rather than just two characters that haven't been sorted?
This is inside the branch for at least one of the combining classes being 0; the sorting is only relevant for nonzero combining classes, to my understanding. I'll need to debug it more to be sure though.
I fixed the bugs affecting my own internal tests and also added a comment better clarifying this part of the code.
```c
 * potentially quite far ahead of where the updated string views point. Use
 * the `str_at_error` members to help with analyzing errors.
 */
UTF8PROC_DLLEXPORT void utf8proc_isequal_normalized(utf8proc_processing_state_t *a, utf8proc_processing_state_t *b, utf8proc_option_t options);
```
This API seems a little complicated? Can't it just take two strings and return a boolean value?
What boolean value should be returned if both strings have the same invalid UTF-8 at the same points? What if that invalid UTF-8 isn't quite at the same points but would be after sorting combining characters?
Or return 1 if they are equal, 0 if they are unequal, and a negative `UTF8PROC_ERROR` code upon an error.
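For concreteness, something like this (a hypothetical name and signature, just to illustrate the suggested shape):

```c
/* Returns 1 if the strings are canonically equivalent, 0 if not,
 * or a negative utf8proc error code on invalid input. */
UTF8PROC_DLLEXPORT utf8proc_ssize_t utf8proc_isequal_normalized_simple(
    const utf8proc_uint8_t *a, utf8proc_ssize_t alen,
    const utf8proc_uint8_t *b, utf8proc_ssize_t blen,
    utf8proc_option_t options);
```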
I considered that, but it seems awkward to me: it encourages people to compare the result directly to 0 or 1, and if they don't compare it and just stick it in an if statement, an error will be treated as equal, because a negative error code is nonzero and therefore truthy. So you still need to store the result in a separate variable and inspect it anyway. There's also the issue of not knowing right away which string caused the error (or whether both did, with different errors), which may be difficult to determine given that there are two different internal functions the error could have come from. Also, to provide the functionality of finding the longest common sequence, the input parameters still have to be in/out, or the return value has to be a structure of multiple values. Any simplification I can think of removes one or more valuable use cases.
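To show the failure mode concretely (using the hypothetical simple signature sketched above):

```c
/* A negative error code is nonzero, so this branch is taken both when
 * the strings are equal AND whenever an error occurs. */
if (utf8proc_isequal_normalized_simple(a, alen, b, blen, UTF8PROC_STABLE)) {
    /* silently treats errors as equality */
}
```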
> There's also the issue of not knowing right away which string caused the error
I'm less worried about that. Error conditions should be rare, so in those cases you could simply re-normalize each string separately to see which went wrong (if you need that level of detail at all).
> Also, to provide the functionality of finding the longest common sequence
I'm not entirely convinced of this application — finding the longest common sequence in the normalized string, when you only have the non-normalized strings. Especially because the sorting of the combining characters makes it really difficult to say what the "common sequence" corresponds to in the non-normalized strings.
> I'm less worried about that. Error conditions should be rare, so in those cases you could simply re-normalize each string separately to see which went wrong (if you need that level of detail at all).
Errors may be rare, but their possibility, and the necessity of handling them, affects the way we can design the code in the non-error path. I have a large structure of unsanitized data, some of which is allowed to be normalized and some of which must be treated as binary, and I want to search and compare within it without paying the cost of normalizing everything, since (1) that can't be done in place and (2) I want to avoid accessing or modifying areas that don't need to be touched.

When I have a constant unsanitized string on the left-hand side and I want to find other places it appears on the right-hand side, knowing that the error is only on one side makes the code simpler: an error on the left means nothing will compare equal, so we can end early; an error on the right means just this one check can be skipped, and there's still potential for something else to compare equal. If we only get a single error with no indication of which string it's from, I have to either check the left-hand side for erroneous data in advance (expensive in the normal code path, especially with few things to compare against) or add unnecessarily complicated code that checks both strings for the error and decides whether to proceed with the comparison based on which side the error turns out to be in.
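Roughly this shape, using the state-based API (field names taken from my earlier C++ example; `needle` and `candidates` are assumed here to be the same string-view type as the `str` member):

```c
/* Search a needle against many candidates, reacting differently
 * depending on which side an error is reported for. */
for (size_t i = 0; i < ncandidates; i++) {
    utf8proc_processing_state_t sl = {0}, sr = {0};
    sl.str = needle;
    sr.str = candidates[i];
    utf8proc_isequal_normalized(&sl, &sr, UTF8PROC_STABLE);
    if (sl.error) break;     /* bad needle: nothing can ever match, stop early */
    if (sr.error) continue;  /* bad candidate: skip just this comparison */
    if (sl.str.len == 0 && sr.str.len == 0) {
        /* canonically equivalent match found */
    }
}
```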
> I'm not entirely convinced of this application — finding the longest common sequence in the normalized string, when you only have the non-normalized strings. Especially because the sorting of the combining characters makes it really difficult to say what the "common sequence" corresponds to in the non-normalized strings.
The way I coded it currently, the `str` pointers will always be updated to a boundary, never in the middle of combining characters. This is important for allowing re-entrancy after repairing invalid sequences: due to how the combining-class code works, that state is local to the function and thus dropped between calls, so it can't pick back up in the middle, as that would give wrong answers.

It's also really useful for rough sorting of autocomplete suggestions in IDEs: you can search and compare identifiers against what has been typed so far and sort the list based on which ones get the furthest or have the most in common. It doesn't matter if the sorting isn't perfect; it just needs to be fast and push the better matches closer to the top. If we don't expose this information, the alternative is having to store normalized copies of all identifiers and keep them in sync with modifications to the source code, which is messy and uses more memory and power.
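For the autocomplete case, the score is simply how far the comparison got before stopping (a hypothetical helper; `ident_len` is the candidate identifier's original length):

```c
/* Bytes of the candidate consumed before divergence or end of input;
 * larger scores float better matches toward the top of the list. */
static utf8proc_ssize_t match_score(utf8proc_ssize_t ident_len,
                                    const utf8proc_processing_state_t *s)
{
    return ident_len - s->str.len;
}
```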
Maybe we should just provide a simpler wrapper function that works the way you suggest and still also expose the more advanced version for these other use cases?
> If we only get a single error with no indication of which string it's from, I have to either check the left-hand side for erroneous data in advance
Why would you have to check in advance? Why not just check the individual strings only if an error occurs?
Because, as the other half of my sentence explained, it makes the user code more complicated. If you don't check in advance, then on error you have to check both strings, plus have logic to potentially continue with further comparisons while remembering not to re-check both strings again on a later error. That's an unnecessary annoyance when utf8proc already knows which string produced the error and could just tell the user.
I've fixed some bugs with the combining code and clarified the code a little more. I still haven't added tests/fuzzing yet, though, as we're still discussing the public API.
Based on conversation in #101 I have ported the `isequal_normalized` function from Julia to C. This is my first attempt and I haven't updated tests/fuzzing yet; at this stage I mainly want feedback on the API and code style. Once that's settled I'll figure out the other bells and whistles.

Despite my earlier comment, I did actually manage to implement the algorithm without requiring a callback or any memory allocation in the middle of it; instead it's a re-entrant algorithm that updates the string pointers to avoid re-testing parts that have already been deemed equivalent. This makes it easy to detect and fix or handle invalid UTF-8 as well, since the string pointers will be pointing right at the problem on error. I also managed to make the algorithm able to run ahead and try out the rest of the string to determine how big the scratch buffers need to be when they aren't large enough, though it does complicate the implementation slightly.
My example/test code is currently written in C++; it looks similar to this:
After other details are sorted out I can add a plain C example. Not sure if there should also be a `utf8proc_isequal_normalized_simple` that does all that and returns a simpler result. Also not sure if there should be a version that doesn't take the `custom_func`/`custom_data`.