Investigating `BufferVec::push()` performance improvement #22361

While browsing through Bevy's code I stumbled upon [this comment](https://github.com/bevyengine/bevy/blob/e457025cc447a9c36ac452f89c3da9ef836cd0d5/crates/bevy_render/src/render_resource/buffer_vec.rs#L348):
```rust
impl<T> BufferVec<T>
where
    T: ShaderType + WriteInto,
{
    // ...
    pub fn push(&mut self, value: T) -> usize {
        let element_size = u64::from(T::min_size()) as usize;
        let offset = self.data.len();

        // TODO: Consider using unsafe code to push uninitialized, to prevent
        // the zeroing. It shows up in profiles.
        self.data.extend(iter::repeat_n(0, element_size));

        // ...
    }
    // ...
}
```
and thought I'd have a stab at it.

We begin by calling `reserve()` instead of `extend()`, then use [`spare_capacity_mut()`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.spare_capacity_mut) to get the newly reserved, still uninitialized bytes. `Writer` accepts a `&mut [MaybeUninit<u8>]` destination, so the rest is mostly the same.

```rust
pub fn push(&mut self, value: T) -> usize {
    let element_size = u64::from(T::min_size()) as usize;

    self.data.reserve(element_size);

    let spare: &mut [MaybeUninit<u8>] = self.data.spare_capacity_mut();

    let mut dest = &mut spare[..element_size];
    value.write_into(&mut Writer::new(&value, &mut dest, 0).unwrap());

    let offset = self.data.len();
    // SAFETY:
    // - new len only covers the new element, for which space was reserved
    // - all uninitialized bytes have been written to
    unsafe {
        self.data.set_len(offset + element_size);
    }

    offset / element_size
}
```
From my tests this works fine (but see below), and the assembly confirms that the zeroing is gone. Benchmarks against the baseline are a bit all over the place on my machine (MBP M2 Max), but the trend seems positive in most cases. In the `dev` profile, though, the gains are huge: 3-10x faster. If anyone wants to test on other devices, [here is the branch](https://github.com/goodartistscopy/bevy/tree/buffer-vec-push-opt). Use

```sh
cargo bench -p benches --bench render -- push
```
to benchmark using different element types. 
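For anyone who wants to poke at the pattern outside of Bevy, here is a minimal sketch of the same reserve / write / `set_len` sequence on a plain `Vec<u8>`. The function name `push_bytes_uninit` is made up for illustration and is not part of Bevy or encase; it copies raw bytes instead of going through `write_into()`.

```rust
use std::mem::MaybeUninit;

/// Appends `bytes` to `data` without zero-filling the new region first.
/// Returns the byte offset at which the appended bytes start.
/// (Hypothetical helper, for illustration only.)
pub fn push_bytes_uninit(data: &mut Vec<u8>, bytes: &[u8]) -> usize {
    let offset = data.len();
    let n = bytes.len();

    // Make sure at least `n` bytes of spare capacity exist past `len`.
    data.reserve(n);

    // The spare capacity is exposed as possibly-uninitialized bytes.
    let spare: &mut [MaybeUninit<u8>] = data.spare_capacity_mut();
    for (dst, src) in spare[..n].iter_mut().zip(bytes) {
        dst.write(*src);
    }

    // SAFETY: `reserve(n)` guarantees capacity >= offset + n, and the loop
    // above initialized every byte in `offset..offset + n`.
    unsafe { data.set_len(offset + n) };

    offset
}
```

Here the safety argument holds because every one of the `n` new bytes is written. In `push()` the same argument only holds if `write_into()` touches every byte of the element.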

So far so good, except the safety comment in the new `push()` is a lie: not all bytes have been written to, because `write_into()` skips any inner padding that type `T` might have in its GPU representation (described by its implementation of the `ShaderType` trait). `push()` and `write_into()` "expand" the Rust type `T` into this representation, and the padding bytes in between stay uninitialized.
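To make the padding concrete, here is a rough sketch that computes which byte ranges a member-by-member writer would never touch. It is not how encase computes layouts: `Member` and `padding_ranges` are hypothetical names, and the rule used (each member at the next multiple of its alignment, struct size rounded up to the largest member alignment) is a simplification that ignores arrays, nested structs, and address-space-specific constraints.

```rust
use std::ops::Range;

/// Alignment and size of one struct member, in bytes.
/// (Hypothetical helper type, for illustration only.)
#[derive(Clone, Copy)]
pub struct Member {
    pub align: usize,
    pub size: usize,
}

/// Rounds `n` up to the next multiple of `align` (`align` must be non-zero).
fn round_up(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}

/// Returns the total struct size and the byte ranges that are padding,
/// i.e. bytes that writing each member at its offset never touches.
pub fn padding_ranges(members: &[Member]) -> (usize, Vec<Range<usize>>) {
    let mut gaps = Vec::new();
    let mut end = 0; // one past the last member byte placed so far
    let mut struct_align = 1;

    for m in members {
        let offset = round_up(end, m.align);
        if offset > end {
            gaps.push(end..offset); // padding before this member
        }
        end = offset + m.size;
        struct_align = struct_align.max(m.align);
    }

    let size = round_up(end, struct_align);
    if size > end {
        gaps.push(end..size); // trailing padding
    }
    (size, gaps)
}
```

For a WGSL-style `struct { a: f32, b: vec3<f32> }` (`f32`: align 4, size 4; `vec3<f32>`: align 16, size 12), this gives a 32-byte element with padding at bytes 4..16 and 28..32, so 16 of the 32 bytes would stay uninitialized after a `push()`.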

So this is a request for comments and ideas for follow-ups. The `BufferVec` API does not expose the underlying CPU data directly, but the uninitialized bytes are exposed through a reference in [`write_buffer()`](https://github.com/bevyengine/bevy/blob/e457025cc447a9c36ac452f89c3da9ef836cd0d5/crates/bevy_render/src/render_resource/buffer_vec.rs#L420) and [`write_buffer_range()`](https://github.com/bevyengine/bevy/blob/e457025cc447a9c36ac452f89c3da9ef836cd0d5/crates/bevy_render/src/render_resource/buffer_vec.rs#L445). Maybe `write_into()` could be modified to zero the padding bytes too?
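As a starting point for discussion, here is a hand-written sketch of what "zero the padding too" could look like for a single fixed layout. Everything in it is an assumption made for illustration: `Element`, `ELEMENT_SIZE`, `write_at`, `zero`, and `push_element` are hypothetical, the layout is hard-coded for `struct { a: f32, b: vec3<f32> }`, and little-endian byte order is assumed. A real fix would presumably live in the generated `write_into()` code rather than in `push()`.

```rust
use std::mem::MaybeUninit;
use std::ops::Range;

/// Assumed GPU layout of `struct { a: f32, b: vec3<f32> }`:
/// `a` at 0..4, padding 4..16, `b` at 16..28, padding 28..32.
pub const ELEMENT_SIZE: usize = 32;

pub struct Element {
    pub a: f32,
    pub b: [f32; 3],
}

/// Copies `bytes` into `dst` starting at `offset`.
fn write_at(dst: &mut [MaybeUninit<u8>], offset: usize, bytes: &[u8]) {
    for (d, s) in dst[offset..offset + bytes.len()].iter_mut().zip(bytes) {
        d.write(*s);
    }
}

/// Zero-fills the given padding range of `dst`.
fn zero(dst: &mut [MaybeUninit<u8>], range: Range<usize>) {
    for d in &mut dst[range] {
        d.write(0);
    }
}

/// Appends one element without pre-zeroing the whole element,
/// but explicitly zeroing the padding. Returns the element index.
pub fn push_element(data: &mut Vec<u8>, value: &Element) -> usize {
    let offset = data.len();
    data.reserve(ELEMENT_SIZE);
    let dest = &mut data.spare_capacity_mut()[..ELEMENT_SIZE];

    write_at(dest, 0, &value.a.to_le_bytes());
    zero(dest, 4..16); // inner padding before `b`
    for (i, c) in value.b.iter().enumerate() {
        write_at(dest, 16 + 4 * i, &c.to_le_bytes());
    }
    zero(dest, 28..32); // trailing padding

    // SAFETY: space for ELEMENT_SIZE bytes was reserved, and members plus
    // padding together cover every byte in 0..ELEMENT_SIZE of `dest`.
    unsafe { data.set_len(offset + ELEMENT_SIZE) };

    offset / ELEMENT_SIZE
}
```

With this layout, only 16 padding bytes are zeroed per element instead of all 32, and `set_len()`'s requirement that the new elements be initialized is actually met. Whether the saving survives for types with little or no padding is something the benchmarks would have to show.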

In the branch linked above, see the `buffer-test` example for what I used to test correctness (the code for this example lives alongside the benchmarks):
```sh
cargo run -p benches --example buffer-test
```
