Skip to content

Investigating BufferVec::push() performance improvement #22361

@goodartistscopy

Description

@goodartistscopy

While browsing through Bevy's code I stumbled upon this comment:

impl BufferVec<T> {

    pub fn push(&mut self, value: T) -> usize {
        let element_size = u64::from(T::min_size()) as usize;
        let offset = self.data.len();

        // TODO: Consider using unsafe code to push uninitialized, to prevent
        // the zeroing. It shows up in profiles.
        self.data.extend(iter::repeat_n(0, element_size));

        // ...
    }
    // ...
}

and thought I'd have a stab at it.

We begin by using reserve() instead of extend() and then use spare_capacity_mut() to get the new data (yet uninitialized). Writeris implemented for &[MaybeUninit<u8>] so the rest is mostly the same.

pub fn push(&mut self, value: T) -> usize {
    let element_size = u64::from(T::min_size()) as usize;

    self.data.reserve(element_size);

    let spare: &mut [MaybeUninit<u8>] = self.data.spare_capacity_mut();

    let mut dest = &mut spare[..element_size];
    value.write_into(&mut Writer::new(&value, &mut dest, 0).unwrap());

    // SAFETY:
    // - new len only covers the new element, for which space was reserved
    // - all uninitialized bytes have been written to
    let offset = self.data.len();
    unsafe {
        self.data.set_len(offset + element_size);
    }

    offset / u64::from(T::min_size()) as usize
}

From my tests this works fine (but see below) and the assembly indeed shows that the zeroing is gone. Benchmarks against the baseline are a bit all other the place on my machine (MBP M2 Max), but the trend seems positive in most cases. Though in devprofile the gains are huge, between 3-10x faster. If anyone wants to test on other devices, here is the branch. Use

cargo bench -p benches --bench render -- push

to benchmark using different element types.

So far so good, except the safety comment is a lie: not all bytes have been written to, because write_into() skips any inner padding that type T might have in its GPU representation (described by its implementation of trait ShaderType). T the Rust type is "expanded" to this representation by push() and write_into().

So this is a request for comments and ideas for follow-ups; the BufferVec API does not expose the underlying CPU data, but the unintialized bytes are exposed through a reference in write_buffer() and write_buffer_range(). Maybe write_into() can be modified to zero the padding bytes too ?

In the branch linked above, see the buffer-test example for what I used to test correctness (code for this example is with the benchmarks)

cargo run -p benches --example buffer-test

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-RenderingDrawing game state to the screenC-PerformanceA change motivated by improving speed, memory usage or compile timesS-Needs-DesignThis issue requires design work to think about how it would best be accomplished

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions