feat(injector): add an extend method to Nucleo's injector #74

Open: wants to merge 3 commits into master from add-extend-method-to-injector

Conversation

alexpasmantier

Description

This pull request adds an extend method to the Injector struct.

The main motivation comes from trying to optimize loading times for https://github.com/alexpasmantier/television, which led me to take a look at Nucleo's boxcar implementation.

The proposed extend method does the following for an incoming batch of values:

  • reserve all corresponding indexes at once, reducing contention on the inflight atomic (see the sketch below)
  • compute the start and end locations and allocate the necessary buckets upfront
  • route and insert the individual values into the relevant buckets
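
To illustrate the first bullet, here is a minimal, self-contained sketch of the index-reservation idea. The Counter type and its methods are made-up stand-ins for the boxcar's inflight counter, not actual nucleo code:

use std::sync::atomic::{AtomicU64, Ordering};

struct Counter {
    inflight: AtomicU64,
}

impl Counter {
    // push reserves one index per call: one atomic RMW per element
    fn push(&self) -> u64 {
        self.inflight.fetch_add(1, Ordering::Relaxed)
    }

    // extend reserves `count` indexes with a single RMW, so a batch of N
    // elements causes one contended operation instead of N
    fn extend(&self, count: u64) -> std::ops::Range<u64> {
        let start = self.inflight.fetch_add(count, Ordering::Relaxed);
        start..start + count
    }
}

fn main() {
    let c = Counter { inflight: AtomicU64::new(0) };
    let first = c.push();       // reserves index 0
    let batch = c.extend(100);  // reserves indexes 1..101 at once
    assert_eq!((first, batch), (0, 1..101));
}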

Benchmarks

I took the liberty of adding Criterion as a dev dependency in order to run a couple of benchmarks and assess whether this was a meaningful feature to add.
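
The actual benches/main.rs isn't reproduced here, but the sequential comparison looks roughly like the following reconstruction. It goes through the public Injector API (with the extend call assumed to mirror the Vec::extend signature from this PR's diff) rather than the private boxcar module, so treat it as a sketch rather than the exact benchmark code:

use std::sync::Arc;

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use nucleo::{Config, Nucleo};

fn grow_boxcar(c: &mut Criterion) {
    let mut group = c.benchmark_group("grow_boxcar");
    for n in [100usize, 1_000, 50_000] {
        group.bench_with_input(BenchmarkId::new("push", n), &n, |b, &n| {
            b.iter(|| {
                let nucleo: Nucleo<String> = Nucleo::new(Config::DEFAULT, Arc::new(|| {}), None, 1);
                let injector = nucleo.injector();
                // one push (and one atomic reservation) per item
                for i in 0..n {
                    injector.push(i.to_string(), |s, cols| cols[0] = s.as_str().into());
                }
            });
        });
        group.bench_with_input(BenchmarkId::new("extend", n), &n, |b, &n| {
            b.iter(|| {
                let nucleo: Nucleo<String> = Nucleo::new(Config::DEFAULT, Arc::new(|| {}), None, 1);
                let injector = nucleo.injector();
                // a single extend call for the whole batch
                injector.extend((0..n).map(|i| i.to_string()), |s, cols| cols[0] = s.as_str().into());
            });
        });
    }
    group.finish();
}

criterion_group!(benches, grow_boxcar);
criterion_main!(benches);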

cargo bench raw output
     Running benches/main.rs (target/release/deps/main-97d28d594921087e)
Gnuplot not found, using plotters backend
grow_boxcar/push/100    time:   [3.2966 µs 3.2992 µs 3.3018 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
grow_boxcar/extend/100  time:   [3.2370 µs 3.2387 µs 3.2410 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
grow_boxcar/push/1000   time:   [9.2759 µs 9.6346 µs 10.035 µs]
Found 30 outliers among 100 measurements (30.00%)
  19 (19.00%) low severe
  3 (3.00%) high mild
  8 (8.00%) high severe
grow_boxcar/extend/1000 time:   [6.7988 µs 6.8025 µs 6.8068 µs]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
grow_boxcar/push/50000  time:   [268.11 µs 270.10 µs 272.19 µs]
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe
grow_boxcar/extend/50000
                        time:   [223.23 µs 227.82 µs 233.67 µs]
Found 21 outliers among 100 measurements (21.00%)
  9 (9.00%) high mild
  12 (12.00%) high severe
grow_boxcar/push/500000 time:   [4.2144 ms 4.2321 ms 4.2528 ms]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
grow_boxcar/extend/500000
                        time:   [2.0938 ms 2.0998 ms 2.1062 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
grow_boxcar/push/5000000
                        time:   [47.687 ms 47.796 ms 47.911 ms]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
grow_boxcar/extend/5000000
                        time:   [42.718 ms 42.783 ms 42.855 ms]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking grow_boxcar/push/20000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 21.3s, or reduce sample count to 20.
grow_boxcar/push/20000000
                        time:   [210.50 ms 211.12 ms 211.89 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking grow_boxcar/extend/20000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.5s, or reduce sample count to 20.
grow_boxcar/extend/20000000
                        time:   [184.64 ms 185.70 ms 186.74 ms]

grow_boxcar_push_threaded/push/100
                        time:   [20.058 µs 20.115 µs 20.178 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
grow_boxcar_push_threaded/extend/100
                        time:   [19.374 µs 19.420 µs 19.466 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
grow_boxcar_push_threaded/push/1000
                        time:   [74.038 µs 74.317 µs 74.594 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
grow_boxcar_push_threaded/extend/1000
                        time:   [37.466 µs 37.632 µs 37.789 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe
grow_boxcar_push_threaded/push/50000
                        time:   [3.3598 ms 3.3668 ms 3.3734 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) low mild
grow_boxcar_push_threaded/extend/50000
                        time:   [276.76 µs 277.82 µs 278.91 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe
grow_boxcar_push_threaded/push/500000
                        time:   [29.937 ms 30.243 ms 30.524 ms]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
grow_boxcar_push_threaded/extend/500000
                        time:   [3.9092 ms 3.9219 ms 3.9363 ms]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
Benchmarking grow_boxcar_push_threaded/push/5000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 19.6s, or reduce sample count to 20.
grow_boxcar_push_threaded/push/5000000
                        time:   [193.18 ms 199.94 ms 206.54 ms]
Benchmarking grow_boxcar_push_threaded/extend/5000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, or reduce sample count to 80.
grow_boxcar_push_threaded/extend/5000000
                        time:   [55.980 ms 56.358 ms 56.751 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking grow_boxcar_push_threaded/push/20000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 117.7s, or reduce sample count to 10.
grow_boxcar_push_threaded/push/20000000
                        time:   [919.62 ms 944.07 ms 971.58 ms]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
Benchmarking grow_boxcar_push_threaded/extend/20000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 20.1s, or reduce sample count to 20.
grow_boxcar_push_threaded/extend/20000000
                        time:   [207.17 ms 209.00 ms 211.02 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe


Observations

Sequential execution

The first benchmark compares, for different sizes of input:

  • sequentially pushing each value into the boxcar
  • extending the boxcar with all values at once
          100 lines    1000 lines   50_000 lines   500_000 lines   5_000_000 lines   20_000_000 lines
push      3.2992 µs    9.6346 µs    270.10 µs      4.2321 ms       47.796 ms         211.12 ms
extend    3.2387 µs    6.8025 µs    227.82 µs      2.0998 ms       42.783 ms         185.70 ms

While extend does look slightly faster than push for most input sizes, I was pretty skeptical at that point that the difference really justified the extra complexity.

The slight edge is, I believe, mostly explained by the fact that extend can pre-allocate all the buckets beforehand.

Adding values from multiple threads

The second benchmark compares, for different sizes of input:

  • N threads each sequentially pushing values from their own batch into the boxcar
  • N threads each extending the boxcar with their own batch of values (see the usage sketch further down)
          100 lines    1000 lines   50_000 lines   500_000 lines   5_000_000 lines   20_000_000 lines
push      20.115 µs    74.317 µs    3.3668 ms      30.243 ms       199.94 ms         944.07 ms
extend    19.420 µs    37.632 µs    277.82 µs      3.9219 ms       56.358 ms         209.00 ms

In this case, the difference becomes quite significant across the entire range of input sizes, mostly (I believe) due to much less contention on the atomics. IMHO this is a nice, low-hanging optimization for the library.
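
For concreteness, the threaded scenario looks roughly like the following usage sketch. It is hedged: the Injector-level extend signature is assumed to mirror the Vec::extend shown later in the diff, and the batch contents are made up.

use std::sync::Arc;
use std::thread;

use nucleo::{Config, Nucleo};

fn main() {
    // Single-column matcher; the no-op Arc'd closure is the redraw notification callback.
    let nucleo: Nucleo<String> = Nucleo::new(Config::DEFAULT, Arc::new(|| {}), None, 1);
    let injector = nucleo.injector();

    // Four hypothetical batches of 1000 items each.
    let batches: Vec<Vec<String>> = (0..4)
        .map(|t| (0..1000).map(|i| format!("item-{t}-{i}")).collect())
        .collect();

    thread::scope(|s| {
        for batch in batches {
            // The injector is cheap to clone and meant to be sent to other threads.
            let injector = injector.clone();
            s.spawn(move || {
                // One extend call per batch instead of one push per item.
                injector.extend(batch, |item, cols| cols[0] = item.as_str().into());
            });
        }
    });
}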

Curious to have some feedback on this.

Cheers

@the-mikedavis (Member) left a comment:

The performance difference looks promising!

I'm not super familiar with this code myself so I just have some minor/style comments.

Outdated review threads (now resolved): src/boxcar.rs, src/lib.rs, Cargo.toml
Co-authored-by: Michael Davis <[email protected]>
@alexpasmantier force-pushed the add-extend-method-to-injector branch 2 times, most recently from bca0298 to ba6c552 on February 7, 2025 at 17:48
@alexpasmantier force-pushed the add-extend-method-to-injector branch from ba6c552 to fb31691 on February 7, 2025 at 18:25
@alexpasmantier (Author) commented on Feb 10, 2025:

Any thoughts on how to proceed with these changes?
Are there any requirements you feel aren't met yet and should be improved on?

Should we wait for input from @pascalkuthe?

(same thing for #75)

@the-mikedavis (Member) commented:

I'm not that familiar with this code but I think this looks good. @pascalkuthe should have a look as well. He's a bit busy at the moment with work so it might take him a while to find some time to look at this (and #75).

Unrelated: also consider upstreaming both of these changes to https://github.com/ibraheemdev/boxcar. If I read the history correctly, this module is vendored from that crate, and it would be nice to share these improvements with that crate's users too. (It looks to have quite a few dependents judging by the download numbers, so these changes could be quite impactful :)

let end_location = Location::of(start_index + count - 1);

// Allocate necessary buckets upfront
if start_location.bucket != end_location.bucket {
Review comment (Member):

This is a pessimisation. The upfront allocation is only supposed to be used for avoiding contention when allocating a new shard, and that is only needed for the end bucket and the bucket after it. For the other buckets it's not needed, as they will all be allocated contention-free from within this function.

The correct logic would look like this:

let alloc_entry = end_location.bucket_len - (end_location.bucket_len >> 3);
if end_location.entry >= alloc_entry && (start_location.bucket != end_location.bucket || start_location.entry <= alloc_entry) {
    if let Some(next_bucket) = self.buckets.get(end_location.bucket as usize + 1) {
        Vec::get_or_alloc(next_bucket, end_location.bucket_len << 1, self.columns);
    }
} 

if start_location.bucket != end_location.bucket {
    let bucket_ptr = unsafe { self.buckets.get_unchecked(end_location.bucket as usize) };
    Vec::get_or_alloc(bucket_ptr, end_location.bucket_len, self.columns);
}

We probably want to turn alloc_entry into a function on Location since it's used in multiple places now.

}
// Route each value to its corresponding bucket
let mut location;
for (i, v) in values.into_iter().enumerate() {
Review comment (Member):

ExactSizeIterator is a safe trait that can have bugs or lie about its size. Unsafe code cannot rely on the reported length being correct. You need to track the index (with .enumerate()), and if we go past the claimed length you need to panic (fewer is fine, since that would just mean a permanent gap in the vec).
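
A small, self-contained sketch of the guard being asked for here (the names are made up and this is a safe analogue, not the PR's boxcar code): the claimed length only decides how many slots get reserved, and the write loop must panic if the iterator yields more than that.

fn write_batch<I>(slots: &mut Vec<Option<u32>>, values: I)
where
    I: IntoIterator<Item = u32>,
    I::IntoIter: ExactSizeIterator,
{
    let values = values.into_iter();
    let count = values.len(); // claimed length; may be wrong
    let start = slots.len();
    slots.resize(start + count, None); // "reserve" count slots up front

    for (i, v) in values.enumerate() {
        // more items than claimed would overrun the reserved range
        assert!(i < count, "values reported an incorrect len");
        slots[start + i] = Some(v);
    }
    // fewer items than claimed just leaves None gaps, which is safe here
}

fn main() {
    let mut slots = Vec::new();
    write_batch(&mut slots, 0..4);
    assert_eq!(slots, vec![Some(0), Some(1), Some(2), Some(3)]);
}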

.len()
.try_into()
.expect("overflowed maximum capacity");
if count == 0 {
Review comment (Member):

As said, the reported len can be wrong, so you want an assert inside this function like assert_eq!(values.next(), None, "values reported incorrect len").

@@ -182,6 +182,83 @@ impl<T> Vec<T> {
index
}

/// Extends the vector by appending multiple elements at once.
pub fn extend<I>(&self, values: I, fill_columns: impl Fn(&T, &mut [Utf32String]))
Review comment (Member):

This is a non-trivial unsafe function; I would like to see some unit tests (just of the boxcar in isolation).

// if we are at the end of the bucket, move on to the next one
if location.entry == location.bucket_len - 1 {
// safety: `location.bucket + 1` is always in bounds
bucket = unsafe { self.buckets.get_unchecked((location.bucket + 1) as usize) };
Review comment (Member):

I don't think this is true. end_location could be the last bucket (which would make this UB).

I think this check should be at the start of the function (and simply check whether the bucket changed compared to the previous location).
