
Optimize char deserialization with manual UTF-8 decoder #33

Closed
tanmay4l wants to merge 3 commits into anza-xyz:master from tanmay4l:optimize-char-decode

Conversation

@tanmay4l
Contributor

Addresses the TODO comment at lines 247-250, which noted: "Could implement a manual decoder that avoids UTF-8
validate + chars() and instead performs the UTF-8 validity checks and produces a char directly. Some quick
micro-benchmarking revealed a roughly 2x speedup is possible."

Changes

Before:

let str = core::str::from_utf8(buf).map_err(invalid_utf8_encoding)?;
let c = str.chars().next().unwrap();

After:
- Manual UTF-8 decoding for 2-4 byte characters using bit masks
- Inline validation of continuation bytes (must be 10xxxxxx)
- Overlong-encoding validation (3-byte: >= U+0800; 4-byte: >= U+10000)
- Surrogate validation (rejects U+D800..U+DFFF)
- Out-of-range validation (rejects code points > U+10FFFF)
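The validation rules above can be sketched as a standalone decoder. This is an illustrative reconstruction of the approach, not the PR's actual code; the function name `decode_char` and the `Option` return type are assumptions for the sketch (the real implementation would map failures to the crate's error type):

```rust
/// Decode a single char from `buf`, assuming `buf.len()` is the sequence
/// length already derived from the leading byte (1-4). Returns None on any
/// invalid sequence. Hypothetical sketch, not the PR's implementation.
fn decode_char(buf: &[u8]) -> Option<char> {
    // Helper: a continuation byte must match 10xxxxxx.
    let cont = |b: u8| b & 0xC0 == 0x80;
    let b0 = buf[0] as u32;
    let cp = match buf.len() {
        1 if b0 < 0x80 => b0, // ASCII fast path
        2 => {
            if !cont(buf[1]) { return None; }
            let cp = ((b0 & 0x1F) << 6) | (buf[1] as u32 & 0x3F);
            if cp < 0x80 { return None; } // overlong: must be >= U+0080
            cp
        }
        3 => {
            if !cont(buf[1]) || !cont(buf[2]) { return None; }
            let cp = ((b0 & 0x0F) << 12)
                | ((buf[1] as u32 & 0x3F) << 6)
                | (buf[2] as u32 & 0x3F);
            // Overlong (< U+0800) and surrogate (U+D800..U+DFFF) checks.
            if cp < 0x800 || (0xD800..=0xDFFF).contains(&cp) { return None; }
            cp
        }
        4 => {
            if !cont(buf[1]) || !cont(buf[2]) || !cont(buf[3]) { return None; }
            let cp = ((b0 & 0x07) << 18)
                | ((buf[1] as u32 & 0x3F) << 12)
                | ((buf[2] as u32 & 0x3F) << 6)
                | (buf[3] as u32 & 0x3F);
            // Overlong (< U+10000) and out-of-range (> U+10FFFF) checks.
            if !(0x10000..=0x10FFFF).contains(&cp) { return None; }
            cp
        }
        _ => return None,
    };
    // All invalid ranges were rejected above, so this cannot fail here.
    char::from_u32(cp)
}
```

This avoids the two-pass cost of `from_utf8` (validate the whole buffer) followed by `chars().next()` (decode again), which is where the reported ~2x speedup would come from.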

@tanmay4l tanmay4l closed this Jan 22, 2026
@tanmay4l tanmay4l deleted the optimize-char-decode branch January 22, 2026 18:58
@tanmay4l tanmay4l restored the optimize-char-decode branch January 22, 2026 19:25
@tanmay4l tanmay4l reopened this Jan 22, 2026
@kskalski
Contributor

You mention using a microbenchmark. Does it make sense to include it in the PR (e.g. add it to wincode/benches) and put the comparison numbers in the PR description?

@kskalski
Contributor

Thanks, I used the code and made a few other changes on top of it, including the benchmark, in #187.

@kskalski kskalski closed this Feb 19, 2026
@tanmay4l tanmay4l deleted the optimize-char-decode branch February 22, 2026 11:41