API for Batch Input Creation #86

Open
N9199 opened this issue Feb 8, 2023 · 8 comments
Labels: enhancement (New feature or request), research-needed

Comments

N9199 commented Feb 8, 2023

I'm looking for a way to create input for each batch. As an example, if I wanted to benchmark something like a sorting algorithm, I'd need a new unsorted vector for each batch, either by copying one already created or by shuffling the one already used. In both cases the setup isn't negligible, so I'm wondering if nanobench has an API that helps with that? I was thinking of something like Google Benchmark's PauseTiming and ResumeTiming, or Rust Criterion's iter_batched and iter_batched_ref.
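
For context, a minimal sketch of the Google Benchmark pattern mentioned above, where PauseTiming/ResumeTiming exclude the per-iteration setup from the timed region; generate_shuffled_vector() and my_amazing_sorter() are placeholder names, not part of any library:

#include <benchmark/benchmark.h>
#include <vector>

std::vector<int> generate_shuffled_vector();  // placeholder setup
void my_amazing_sorter(std::vector<int>& v);  // placeholder workload

static void BM_Sort(benchmark::State& state) {
    for (auto _ : state) {
        state.PauseTiming();                   // setup is excluded from timing
        auto v = generate_shuffled_vector();
        state.ResumeTiming();
        my_amazing_sorter(v);                  // only this part is measured
        benchmark::DoNotOptimize(v);
    }
}
BENCHMARK(BM_Sort);
BENCHMARK_MAIN();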

martinus (Owner) commented Feb 8, 2023

There is no API for that, because measurement quality could be very low if the timer has to be restarted often. Instead, please do 2 measurements: first measure the runtime of just the setup (e.g. creating & shuffling), then benchmark everything, and subtract the first.
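
A minimal sketch of that subtraction approach with nanobench's existing Bench::run(); generate_shuffled_vector() and my_amazing_sorter() are placeholder names:

#include <nanobench.h>
#include <vector>

std::vector<int> generate_shuffled_vector();  // placeholder setup
void my_amazing_sorter(std::vector<int>& v);  // placeholder workload

void bench_with_subtraction() {
    ankerl::nanobench::Bench bench;

    // 1) measure the setup alone (creating & shuffling)
    bench.run("setup only", [&] {
        auto v = generate_shuffled_vector();
        ankerl::nanobench::doNotOptimizeAway(v);
    });

    // 2) measure setup + sort; subtract the "setup only" time from this result
    bench.run("setup + sort", [&] {
        auto v = generate_shuffled_vector();
        my_amazing_sorter(v);
        ankerl::nanobench::doNotOptimizeAway(v);
    });
}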

martinus closed this as completed Feb 8, 2023
helmesjo commented Feb 10, 2023

@martinus

Couldn't this be solved fairly easily by doing all the setup first (for x iterations; from the docs it seems nanobench calculates up front how many iterations to run)? Then it would just be a matter of picking the next "batch" from a contiguous list:

std::vector<int> generate_shuffled_vector();

ankerl::nanobench::BenchBatch<std::vector<int>>()
  .generateBatch([]() -> std::vector<int> {
    // the returned batch can just be stored in whatever structure
    // you see fit (e.g. in this case perhaps
    // `std::vector<std::vector<int>>`)
    return generate_shuffled_vector();
  })
  .run("amazing sorter", [](std::vector<int>& batch) {
    // `batch` is just the next entry, and next, and next...
    my_amazing_sorter(batch);
  });

As long as the generated batch is either copyable or movable, all should be fine...

Obviously there is some overhead here, but it should be fairly stable (depending on what your "batch" is).
Also, I guess most scenarios requiring this are probably non-trivial, so I bet the `batches[next++]` lookup won't cause totally useless results... But I'm only guessing here. It just seems that being able to benchmark something as simple as a sorting algorithm would be expected.

Also I want to get away from google-bench, but need this. 🙃
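
For reference, the "pick the next batch from a contiguous list" idea can be approximated today by pre-generating a pool of inputs and only indexing into it inside the measured lambda. This is only a sketch using nanobench's existing API (epochs, epochIterations, run); generate_shuffled_vector() and my_amazing_sorter() are placeholders:

#include <nanobench.h>
#include <cstddef>
#include <vector>

std::vector<int> generate_shuffled_vector();  // placeholder setup
void my_amazing_sorter(std::vector<int>& v);  // placeholder workload

void bench_sorter() {
    constexpr std::size_t kEpochs = 11;
    constexpr std::size_t kItersPerEpoch = 100;
    constexpr std::size_t kInputs = kEpochs * kItersPerEpoch;

    // build all inputs before the measurement starts
    std::vector<std::vector<int>> inputs;
    inputs.reserve(kInputs);
    for (std::size_t i = 0; i < kInputs; ++i) {
        inputs.push_back(generate_shuffled_vector());
    }

    std::size_t next = 0;
    ankerl::nanobench::Bench()
        .epochs(kEpochs)
        .epochIterations(kItersPerEpoch)
        .run("amazing sorter", [&] {
            // wrap around defensively; note a reused input comes back
            // already sorted, which is exactly the limitation this issue
            // is about
            my_amazing_sorter(inputs[next++ % kInputs]);
        });
}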

N9199 (Author) commented Feb 10, 2023

To add to what @helmesjo said, I looked into how Criterion implements it and at the current implementation of run. It seems they do essentially the same thing, but they add a preparation step where they construct the inputs. Now maybe I'm missing something, but couldn't the same be done in nanobench, with the input construction happening before `pc.beginMeasure();`?

martinus (Owner) commented

I've reopened the issue and linked all similar issues I previously closed to this one. I guess public interest is too large for me to brush it away...

What I absolutely do not want is starting/stopping the timer in the tight measurement loop (which seems to be what Criterion is doing). That would make it impossible to benchmark anything that's fast, especially because I also need to start/stop the Linux performance counters. I need to contemplate more on how best to do this, but I currently don't have much free time on my hands for it.

jonas-schulze (Contributor) commented

I would love to have this feature as well. Maybe the easiest would be a different entry point for benchmarks that require setup (and teardown!) to be excluded from the measurements, say `FancyBench().run()`, as -- I assume -- they are not what you consider "fast". 😉

haydenflinner commented

I've landed here because I think I need this feature to benchmark 'something fast': Protobuf encode+decode. Basically, I have an encode+decode step which benchmarks out at 100ns. If I break this apart into two separate functions, one bench for encode and one for decode, I get 30ns for encode and 1ns for decode. Naively these results don't really make sense; I'd expect X+Y=Z, i.e. 30+something=100. I suspect either something more is being optimized out (hard to tell because I'm no good at disassembly 😃), or this might be an artifact of running the two things in separate loops so that the branching/caching get along better. I just don't expect such a huge true difference, because the message I'm encoding/decoding is less than a cache line in size.

But so far the only examples I've seen of carrying state larger than a word across runs while ensuring things won't be optimized out is the gtest approach of redoing the setup within the hot loop and starting/stopping a timer around the part you're interested in. I understand this might add some overhead, but I'm not that concerned with overhead of that size, especially because when my code is embedded in another process I will be far less certain exactly what each line of code costs, due to different cache conditions/inlinability versus the microbenchmark.
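
One thing worth ruling out in a case like this is dead-code elimination. A hedged sketch using nanobench's doNotOptimizeAway; MyMessage, encode() and decode() are placeholders standing in for the actual protobuf types:

#include <nanobench.h>
#include <string>

struct MyMessage { int value = 0; };          // placeholder message type
std::string encode(const MyMessage& m);       // placeholder encoder
MyMessage decode(const std::string& bytes);   // placeholder decoder

void bench_codec() {
    MyMessage msg{42};

    ankerl::nanobench::Bench().run("encode", [&] {
        auto bytes = encode(msg);
        ankerl::nanobench::doNotOptimizeAway(bytes);  // keep the result alive
    });

    std::string bytes = encode(msg);
    ankerl::nanobench::Bench().run("decode", [&] {
        auto out = decode(bytes);
        ankerl::nanobench::doNotOptimizeAway(out);    // keep the result alive
    });
}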

RT2Code commented Oct 19, 2023

I understand why you don't want to pause the timer between epoch iterations, and I agree with you about that, but we really need at least fixtures between epochs.

It's probably not ideal, but here's an example of a possible implementation:
RT2Code@c79a081

And my use case:

Signal<void(int)> signal;
bench.epochIterations(1000).run("Connect", [&]()
{
	signal.connect(slot);
},
[] {}, // Setup function (doing nothing here)
[&] // Teardown function
{
	signal.reset();
});

`Connect` is a function with side effects, since it appends the slot to a vector, so using a fixed iteration count is better here in order to get consistent results between runs. The teardown function resets the signal between epochs. Without it, subsequent epochs would cause the vector to grow, leading to severe variations in the timing measurements.

lakinwecker commented Jul 17, 2024

Instead, please do 2 measurements: first measure runtime of just the setup (e.g. creating & shuffling), then benchmark everything, and subtract the first.

Couldn't we just provide an API to do exactly this? You provide code for a "setup", then you provide code for a setup + operation, and then you run the tight loops on both, and do the subtraction?

    ankerl::nanobench::Bench().batch(1000).runWithSetup("Foo bar",
    [&]() {
        // setup: the hypothetical runWithSetup would hand this result
        // to the measured lambda below
        Foo foo = expensiveSetup();
        return foo;
    },
    [&](Foo& foo) {
        for (size_t i = 0; i < 1000; ++i) {
            foo.bar();
        }
        // ...
    });

Something like this?
