perf: manual decode for utf-8 char deserialization using take_array by kskalski · Pull Request #187 · anza-xyz/wincode

kskalski · 2026-02-13T04:25:05Z

Avoid calling fill_exact during char deserialization, so that a Reader implementation isn't required to always provide a borrowed slice of the requested (2-4 byte) size.
Use take_array with a length that depends on the first-byte check
Manually compose code_point from the bytes for each 2-4 byte length (based on Optimize char deserialization with manual UTF-8 decoder #33)
Add a separate ReadError for an invalid UTF-8 code point value (avoids having to recreate buf to generate the error)
Add benchmark for char deserialization

This removes the only call to fill_exact in production code outside of default / forwarding implementations of Reader.
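As a rough illustration of the approach described above, here is a minimal sketch of a manual UTF-8 decode that chooses how many bytes to take from the lead byte. It is not the code from this PR: `take_array` is written here as a hypothetical free function over a byte slice (standing in for the Reader method of the same name), and `DecodeError` and `cont` are illustrative names that do not come from wincode.

```rust
/// Hypothetical error type for this sketch; not wincode's `ReadError`.
#[derive(Debug, PartialEq)]
enum DecodeError {
    UnexpectedEof,
    InvalidLead(u8),
    InvalidContinuation(u8),
    InvalidCodePoint(u32),
}

/// Copy exactly `N` bytes off the front of `input` and advance it.
/// Stand-in for a reader's `take_array`; assumed signature.
fn take_array<const N: usize>(input: &mut &[u8]) -> Result<[u8; N], DecodeError> {
    if input.len() < N {
        return Err(DecodeError::UnexpectedEof);
    }
    let (head, tail) = input.split_at(N);
    *input = tail;
    let mut out = [0u8; N];
    out.copy_from_slice(head);
    Ok(out)
}

/// Extract the 6 payload bits of a continuation byte (10xxxxxx).
fn cont(b: u8) -> Result<u32, DecodeError> {
    if b & 0xC0 == 0x80 {
        Ok(u32::from(b & 0x3F))
    } else {
        Err(DecodeError::InvalidContinuation(b))
    }
}

/// Decode one char: inspect the lead byte, take the remaining 1-3 bytes
/// as a fixed-size array, and compose the code point by hand.
fn decode_char(input: &mut &[u8]) -> Result<char, DecodeError> {
    let [b0] = take_array::<1>(input)?;
    // `min` is the smallest code point allowed for this length,
    // used to reject overlong encodings.
    let (code_point, min) = match b0 {
        0x00..=0x7F => (u32::from(b0), 0),
        0xC2..=0xDF => {
            let [b1] = take_array::<1>(input)?;
            ((u32::from(b0 & 0x1F) << 6) | cont(b1)?, 0x80)
        }
        0xE0..=0xEF => {
            let [b1, b2] = take_array::<2>(input)?;
            ((u32::from(b0 & 0x0F) << 12) | (cont(b1)? << 6) | cont(b2)?, 0x800)
        }
        0xF0..=0xF4 => {
            let [b1, b2, b3] = take_array::<3>(input)?;
            (
                (u32::from(b0 & 0x07) << 18) | (cont(b1)? << 12) | (cont(b2)? << 6) | cont(b3)?,
                0x1_0000,
            )
        }
        _ => return Err(DecodeError::InvalidLead(b0)),
    };
    if code_point < min {
        return Err(DecodeError::InvalidCodePoint(code_point));
    }
    // `char::from_u32` rejects surrogates and values above 0x10FFFF.
    char::from_u32(code_point).ok_or(DecodeError::InvalidCodePoint(code_point))
}
```

Because every branch asks for a fixed-size array, the caller never needs a borrowed contiguous slice of variable length, and the invalid-code-point error carries the value itself rather than the original bytes.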

Benchmark comparison

Checks decoding 10_000 chars from a random String

master: 72us
pr-33: 31us
this PR: 27us
bincode: 118us

wincode/src/schema/impls.rs

cpubot · 2026-02-13T18:10:20Z

For small reads like this it's totally fine to use fill_* methods rather than copying directly.

wincode/wincode/src/io/std_io.rs

Lines 103 to 106 in 85034c2

    
    if buffered_len >= n_bytes {
        // SAFETY: `filled` always points to an initialized portion of the buffer.
        return Ok(unsafe { from_raw_parts(buf.as_ptr().cast::<u8>().add(filled.start), n_bytes) });
    }

The proposed implementation will short-circuit if the buffer already contains enough bytes to fulfill the request.

fill_buf or fill_exact doesn't always entail IO -- since we accept an n_bytes argument, we can satisfy the request without necessarily filling.

Our Reader API necessarily entails some kind of underlying buffer (given definitions of fill_buf, fill_exact, peek). In general I think the optimal strategy should be "fill wherever possible unless reads may be large" -- this amortizes IO over fewer total syscalls. This strategy is reflected in current implementations. For small reads (integers, char, IpAddr, etc), we potentially trigger buffering such that subsequent small reads don't actually make IO calls, while larger sequences like Vec<T> where T: ZeroCopy, String take the direct copy path.

Consider a case where one is deserializing a struct containing all integers and chars. If those were to all use copy_into_slice, they could totally circumvent buffering depending on the underlying IO Reader implementation, in which case an IO call would be made for each individual deserialization.
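The effect can be sketched with standard-library types only. This is an analogy, not wincode's Reader: `CountingReader`, `read_u32s`, `unbuffered_calls` and `buffered_calls` are hypothetical names, and each `read` call on the inner source stands in for one IO call. Reading many small values directly from the source issues one call per value, whereas putting `std::io::BufReader` in front lets the first small read fill the buffer so that later reads are served from memory.

```rust
use std::io::{self, BufReader, Read};

/// Counts how many times `read` reaches the underlying source.
/// Each call stands in for one IO call / syscall.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Deserialize `n` little-endian u32 values, one small read at a time.
fn read_u32s<R: Read>(reader: &mut R, n: usize) -> io::Result<Vec<u32>> {
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let mut bytes = [0u8; 4];
        reader.read_exact(&mut bytes)?;
        out.push(u32::from_le_bytes(bytes));
    }
    Ok(out)
}

/// Every small read goes straight to the source.
fn unbuffered_calls(data: &[u8], n: usize) -> io::Result<usize> {
    let mut src = CountingReader { inner: data, calls: 0 };
    read_u32s(&mut src, n)?;
    Ok(src.calls)
}

/// Small reads go through a buffer that is filled in larger steps.
fn buffered_calls(data: &[u8], n: usize) -> io::Result<usize> {
    let mut buffered = BufReader::new(CountingReader { inner: data, calls: 0 });
    read_u32s(&mut buffered, n)?;
    Ok(buffered.get_ref().calls)
}
```

With 400 bytes of input and 100 values, the unbuffered path reaches the source once per value, while the buffered path needs far fewer calls because the data fits in the buffer.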

The proposed no-grow implementation will still actually call fill_buf when reads are less than buffer capacity

wincode/wincode/src/io/std_io.rs

Line 213 in 85034c2

if needed >= capacity {

and this was done specifically to avoid the case described above. But this is an implementation detail of the proposal, not something necessarily guaranteed by the API. By using the fill_* methods, we guarantee by definition that this case can't happen.

kskalski · 2026-02-14T00:48:05Z

The practical problem is that even when operating on buffers, satisfying fill_exact and fill_array may be difficult and may force the implementation to do a memmove or, worse, to allocate. This is the case for the agave reader, which operates on chunks of a larger buffer that are read in parallel. When you read a series of values you eventually get near the end of the current chunk, say at position pos = chunk.len() - 3. If you are then asked for a contiguous slice of 4 bytes, you can't just point to chunk[pos..pos + 4], since that range runs past the end of the current chunk. Even if the rest of the data is already available in another chunk, you need to do extra defragmentation:

move chunk[pos..] to chunk[0..chunk.len() - pos]
move chunk.len() - pos data from next chunk into current one
the management of positions / consume after such an operation becomes more complicated (the data is now in a different part of the buffer than usual, etc.)

I would argue that copy_to_slice is an "always cheaper" version of that, especially when the user is going to copy out of the returned slice anyway. Instead, they should copy directly from where the data already is to the destination memory. copy_to_slice is efficient for a buffered reader, since it can easily avoid unnecessary or short IOs. The crucial part is that it can populate the target slice in 2 steps: first from the data that is already available, then, possibly after doing IO or switching to the next chunk, the rest (repeating the process several times if needed).
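A minimal sketch of the multi-step copy described above, under assumptions: `ChunkedReader` is a hypothetical type over non-contiguous chunks (not the agave reader), `copy_into_slice` follows the spelling used elsewhere in this thread, and `try_borrow` is an illustrative name for a borrow that only works when the bytes happen to be contiguous.

```rust
/// Hypothetical reader over a sequence of non-contiguous chunks.
struct ChunkedReader<'a> {
    chunks: &'a [&'a [u8]],
    chunk: usize, // index of the current chunk
    pos: usize,   // read offset inside the current chunk
}

impl<'a> ChunkedReader<'a> {
    fn new(chunks: &'a [&'a [u8]]) -> Self {
        Self { chunks, chunk: 0, pos: 0 }
    }

    /// Fill `dst` completely, stepping across chunk boundaries as needed.
    /// No memmove inside the chunks and no allocation is required.
    /// Returns false if the input runs out (`dst` may be partially written).
    fn copy_into_slice(&mut self, dst: &mut [u8]) -> bool {
        let mut written = 0;
        while written < dst.len() {
            let Some(&current) = self.chunks.get(self.chunk) else {
                return false;
            };
            let available = &current[self.pos..];
            if available.is_empty() {
                // Current chunk exhausted: switch to the next one.
                self.chunk += 1;
                self.pos = 0;
                continue;
            }
            let n = available.len().min(dst.len() - written);
            dst[written..written + n].copy_from_slice(&available[..n]);
            written += n;
            self.pos += n;
        }
        true
    }

    /// Borrowing counterpart: succeeds only if `n` bytes are contiguous
    /// in the current chunk, which a chunked reader cannot guarantee.
    fn try_borrow(&self, n: usize) -> Option<&'a [u8]> {
        self.chunks.get(self.chunk)?.get(self.pos..self.pos + n)
    }
}
```

In this sketch a 4-byte request that starts 2 bytes before the end of a chunk cannot be borrowed, but the copy succeeds by taking 2 bytes from the current chunk and 2 from the next.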

It's fine to have a fill_buf(n) API when the returned slice is allowed to be shorter than n. However, I think we should avoid APIs (or at least uses of them) where the returned slice is expected to be at least n bytes long.

Arguably the defragmentation cost mentioned above is not that bad if the size of the returned slice can be bounded to n <= predictable_max, so fortunately getting rid of fill_exact is not a blocker. In the case of this PR, I think populating a small piece of uninitialized stack memory is better than forcing the reader to give access to contiguous memory (even in your reference impl we can avoid an unnecessary memmove this way). I will do some comparisons of the assembly involved in a real case, so we also know the perf implications for current out of mem readers.

cpubot · 2026-02-14T20:59:58Z

I would argue that copy_to_slice is an "always cheaper" version of that, especially when user is going to copy out of returned slice anyway.

The issue with a memcpy for small values is that it can prohibit scalarization because it is an opaque "copy some bytes" intrinsic. This is similar in motivation to #64. It gives the compiler more opportunity and more visibility to load rather than copy. That being said, since we inline so heavily, the compiler is likely to be able to produce equivalent assembly, but we have seen explicit cases where it does not do so reliably (and this can be especially problematic on bpf targets).

I do agree with one of your points in #188, that because we employ this pattern so often:

let bytes: [u8; N] = *reader.fill_array();
unsafe { reader.consume_unchecked(N) };

There is likely room for an additional helper like your proposed read_to_array (or my suggested name, get_array).

I would certainly prefer this over using copy_into_slice everywhere on small loads for the reason mentioned above. Slice impls can implement get_array() with a dereference of fill_array() and avoid the memcpy (which I propose should be the default impl in the Reader trait), while still giving flexibility to other Reader implementations that want to do something different.
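A minimal sketch of what such a helper could look like, assuming a heavily simplified trait: `Reader` below is hypothetical and uses `Option` instead of wincode's error type, `consume` replaces the unsafe `consume_unchecked`, and `read_u32_le` is an illustrative caller. The default `get_array` dereferences the borrowed array and then consumes, while other implementations remain free to override it.

```rust
/// Hypothetical, simplified reader trait; not wincode's actual `Reader`.
trait Reader {
    /// Borrow the next `N` bytes without consuming them.
    fn fill_array<const N: usize>(&mut self) -> Option<&[u8; N]>;

    /// Advance past `n` bytes previously exposed by `fill_array`.
    fn consume(&mut self, n: usize);

    /// Default impl: dereference the borrowed array, then consume.
    /// A chunked reader could override this with a copy that spans chunks.
    fn get_array<const N: usize>(&mut self) -> Option<[u8; N]> {
        let bytes = *self.fill_array::<N>()?;
        self.consume(N);
        Some(bytes)
    }
}

/// Slice-backed reader: `fill_array` is a bounds check plus a cast.
impl Reader for &[u8] {
    fn fill_array<const N: usize>(&mut self) -> Option<&[u8; N]> {
        self.get(..N).and_then(|s| s.try_into().ok())
    }

    fn consume(&mut self, n: usize) {
        *self = &self[n..];
    }
}

/// Example caller: a small fixed-size load built on `get_array`.
fn read_u32_le<R: Reader>(reader: &mut R) -> Option<u32> {
    reader.get_array::<4>().map(u32::from_le_bytes)
}
```

Callers such as `read_u32_le` only ever see an owned `[u8; N]`, so the choice between borrowing and copying stays inside each Reader implementation.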

cpubot · 2026-02-18T00:14:08Z

wincode/src/schema/impls.rs

            0xF0..=0xF4 => 4,
            _ => return Err(invalid_char_lead(b0)),
        };
+        let mut buf = [0u8; 4];


It's still not totally clear to me why we want to eliminate usage of fill_buf / fill_exact, but if we do go this route, perhaps

let mut buf = MaybeUninit::<[u8; 4]>::uninit();
// Safety: len is at most 4, so slice is up to buf.len() and casting [u8] to uninit is safe
let uninit_slice = unsafe { core::slice::from_raw_parts_mut(buf.as_mut_ptr().cast::<MaybeUninit<u8>>(), len) };
reader.copy_into_slice(uninit_slice)?;
let buf = unsafe { core::slice::from_raw_parts(buf.as_ptr().cast::<u8>(), len) };

I will post a bit more context on that in a different place. In short, the issue is that a chunked buffered reader can't guarantee that contiguous slices larger than 1 byte are always available.

For this PR I was also thinking of building on top of #33 and using take_array in each branch separately, but I can also compare with the version above.

Sounds good. Would definitely like to hear more about the background context on this

See #53 (comment)

Ok, I think we got what I wanted in a win-win fashion: the current state of the PR gets rid of fill_exact and achieves better performance than master, and even slightly better than the patch proposed in #33 (see the benchmark numbers in the PR description).

cpubot

Nice!

kskalski commented Feb 13, 2026


kskalski marked this pull request as ready for review February 13, 2026 07:56

kskalski requested a review from cpubot February 13, 2026 07:58

cpubot mentioned this pull request Feb 13, 2026

RFC: Impl BufReader for std::io::Read #53

Closed

cpubot reviewed Feb 18, 2026


tanmay4l and others added 3 commits February 19, 2026 07:54

Optimize char deserialization with manual UTF-8 decoder

5286b9d

Clippy-clean

ac9807b

Add benchmark for char deserialization

c7c9b19

kskalski force-pushed the ks/char_fill_exact branch from a0afebf to 697db74 Compare February 19, 2026 00:05

kskalski changed the title ~~feat: use reader.copy_into_slice when deserializing char~~ perf: manual decode for utf-8 char deserialization using take_array Feb 19, 2026

use take_array

090e669

kskalski force-pushed the ks/char_fill_exact branch from 697db74 to 090e669 Compare February 19, 2026 00:13

cpubot approved these changes Feb 19, 2026


kskalski merged commit 2b3a053 into anza-xyz:master Feb 19, 2026
3 checks passed

kskalski deleted the ks/char_fill_exact branch February 19, 2026 01:35

kskalski mentioned this pull request Feb 19, 2026

Optimize char deserialization with manual UTF-8 decoder #33

Closed

kskalski merged 4 commits into anza-xyz:master from kskalski:ks/char_fill_exact
