perf(utf8): ASCII wrapping via strict printable-only invariant #506
+46
−134
The `isAsciiOnly` check isn't exactly what its name implies. It strictly enforces printable ASCII (32-126), explicitly excluding control characters like tabs (`\t`) and newlines.

This provides a stronger guarantee than typical 7-bit ASCII checks: if `isAsciiOnly` is true, every byte is exactly 1 column wide.

However, we ignored this and were still running O(N) width loops even on the fast path. This patch deletes those loops entirely:
- If it's guaranteed printable ASCII, the display width is identical to `text.len`. We don't need to iterate N bytes just to add 1 N times.
- Since width maps 1:1 to byte index, the wrap position is simply `min(text.len, max_width)`. We don't need to scan the string to find where it overflows.

The obvious risk here is tabs (byte 9), which are strictly ASCII but variable width.
But `isAsciiOnly` checks `val >= 32`, so it returns false for `\t`. This forces tabbed content into the slow Unicode path, where `tab_width` is handled properly.

I also considered whether FFI consumers could pass `isAsciiOnly=true` for strings with tabs or newlines. But the public API doesn't expose `isAsciiOnly` directly; it derives it via `utf8.isAsciiOnly()`, which returns false for empty strings and control characters, so this optimization is safe and transparent to external users.
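
For readers who want the invariant spelled out, here is a minimal sketch of the idea in Rust. It is not the code from this patch: `is_printable_ascii`, `fast_width`, and `fast_wrap_pos` are hypothetical names, and returning `None` is only an assumed way of signalling "fall back to the slow Unicode path".

```rust
/// Hypothetical stand-in for the `isAsciiOnly` check described above:
/// true only for non-empty input whose bytes are all printable ASCII (32..=126).
/// Tabs, newlines, DEL (127), and any non-ASCII byte make it return false.
pub fn is_printable_ascii(text: &[u8]) -> bool {
    !text.is_empty() && text.iter().all(|&b| (32..=126).contains(&b))
}

/// Fast-path display width: when the flag holds, every byte is one column,
/// so the width is just the byte length and no per-byte loop is needed.
/// `None` means the caller must use the Unicode-aware path (assumption).
pub fn fast_width(text: &[u8], is_ascii_only: bool) -> Option<usize> {
    if is_ascii_only {
        Some(text.len())
    } else {
        None
    }
}

/// Fast-path wrap position: width maps 1:1 to byte index, so the wrap
/// point is min(text.len, max_width) without scanning the string.
pub fn fast_wrap_pos(text: &[u8], is_ascii_only: bool, max_width: usize) -> Option<usize> {
    if is_ascii_only {
        Some(text.len().min(max_width))
    } else {
        None
    }
}
```

In this sketch the flag is derived once from the bytes and then passed along, mirroring the description above in which callers never set it themselves. A string containing a tab fails the check, so both fast-path helpers return `None` for it.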