
Metabench is measuring the wrong thing #124

Closed
ericniebler opened this issue May 29, 2016 · 18 comments
ericniebler commented May 29, 2016

Very few people are manipulating type lists of > a few dozen elements. I find the current tests to be largely meaningless for real world metaprograms. It would be more interesting to run the tests up to, say, 100, but repeatedly and with lots of different types to force the number of instantiations up.

In addition, benchmarks are meaningless if the total CPU time is less than about 5s. So, for each N in [0,100] each test should keep adding unique instantiations of the algorithm being measured until the CPU time is high enough and then divide by the number of instantiations to get an accurate idea of the performance profile of that algorithm at that N.

That way, we can really start the work of optimizing our libs for the real world.
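
To illustrate the proposed scheme concretely, here is a self-contained sketch (all names are made up and the "algorithm" is a toy reverse, not any library's): for a fixed N, the generated file would contain K structurally distinct instantiations, and the measured compile time would then be divided by K offline.

```cpp
#include <type_traits>

template <typename... Ts> struct list {};
template <int Rep, int I> struct item {};  // element types unique to repetition Rep

// Toy stand-in for the algorithm being measured: recursive reverse.
template <typename L, typename Acc = list<>> struct reverse;
template <typename Acc> struct reverse<list<>, Acc> { using type = Acc; };
template <typename T, typename... Ts, typename... As>
struct reverse<list<T, Ts...>, list<As...>> : reverse<list<Ts...>, list<T, As...>> {};

// One repetition at N = 3; every Rep yields a brand-new input, so the compiler
// cannot reuse a previously memoized instantiation.
template <int Rep> using input  = list<item<Rep, 0>, item<Rep, 1>, item<Rep, 2>>;
template <int Rep> using output = typename reverse<input<Rep>>::type;

// The generated source file would unroll K repetitions like these two:
static_assert(std::is_same<output<0>, list<item<0, 2>, item<0, 1>, item<0, 0>>>::value, "");
static_assert(std::is_same<output<1>, list<item<1, 2>, item<1, 1>, item<1, 0>>>::value, "");
// ... up to output<K - 1>, and the total compile time would be divided by K.
```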

ericniebler changed the title from "Metabench if measuring the wrong thing" to "Metabench is measuring the wrong thing" on May 29, 2016

brunocodutra commented May 29, 2016

So, for each N in [0,100] each test should keep adding unique specializations of the algorithm being measured until the CPU time is high enough and then divide by the number of specializations to get an accurate idea of the performance profile at that N.

I'm not sure I follow your idea. So you propose that, for a given N, instead of instantiating the algorithm under test for a single list of size N, we instantiate it for K/N lists of the same size N, each containing different types (so as to avoid memoization)?

ericniebler (Author) commented

For each N, instantiate the algorithm with K different lists to avoid memoization. Increase K until CPU time > 5s. Then divide the CPU time by K to get an accurate idea of the cost of each instantiation at that N.


brunocodutra commented May 29, 2016

For each N, instantiate the algorithm with K different lists to avoid memoization. Increase K until CPU time > 5s. Then divide the CPU time by K to get an accurate idea of the cost of each instantiation at that N.

Right. I see two problems with this approach as is:

  1. 5s for every N for every algorithm for every library would make it impossible to update benchmark results using Travis.
  2. Even if we agree on some X < 5s, how would we go about increasing K until CPU time > X? I mean, Metabench works roughly by generating static source files for each N and then measuring the time taken to compile those source files. We could surely put that in a loop and keep increasing K, regenerating a source file until the time taken to compile it gets high enough, but that too would be prohibitively expensive time-wise, even for very small values of X.

I think the best we could do would be to define some static K (per algorithm) and, for every N, generate a source file that instantiates the given algorithm for K/N lists of size N. Then we measure how long it takes to compile that source file and divide the result by K/N. This way we would have roughly the same resolution with respect to time for every N.
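
To make the K/N idea concrete, here is a self-contained sketch, not Metabench's actual generator, of roughly what one generated source file could compute for a given N under a fixed per-algorithm budget K (all names are made up; it uses C++17 pack folding for brevity):

```cpp
#include <cstddef>
#include <utility>

template <typename... Ts> struct list {};
template <std::size_t Rep, std::size_t I> struct item {};

// Toy stand-in for the algorithm under test: length via naive recursion.
template <typename L> struct size_of;
template <> struct size_of<list<>> { static constexpr std::size_t value = 0; };
template <typename T, typename... Ts>
struct size_of<list<T, Ts...>> { static constexpr std::size_t value = 1 + size_of<list<Ts...>>::value; };

// Build one distinct list of length N for repetition Rep, so the compiler
// cannot reuse a previous instantiation of the algorithm.
template <std::size_t Rep, typename Seq> struct make_input;
template <std::size_t Rep, std::size_t... Is>
struct make_input<Rep, std::index_sequence<Is...>> { using type = list<item<Rep, Is>...>; };

// Instantiate the algorithm once per repetition; the fold just forces every
// instantiation to actually happen.
template <std::size_t N, typename Reps> struct run_all;
template <std::size_t N, std::size_t... Rs>
struct run_all<N, std::index_sequence<Rs...>> {
    static constexpr std::size_t value =
        (std::size_t{0} + ... +
         size_of<typename make_input<Rs, std::make_index_sequence<N>>::type>::value);
};

// With a fixed budget K = 1200: N = 3 yields 400 repetitions, N = 100 yields 12,
// so the total work per generated file stays roughly constant across N.
constexpr std::size_t N = 3, K = 1200;
static_assert(run_all<N, std::make_index_sequence<K / N>>::value == K, "");
```

The point is only that the total number of elements processed per file stays around K for every N, so small N no longer yield compile times too short to measure reliably.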

Does that seem reasonable?


ldionne commented May 29, 2016

@ericniebler I think you have a point that horizontal benchmarks (with many small instantiations) have value and that we're not measuring that right now. However, saying that the current benchmarks are largely meaningless is wrong in my experience, as many "real world" metaprograms will start from a small list but generate a larger one from it, and then call algorithms on that larger list.

In any case, I don't think that's a problem with the Metabench module itself, but just an observation that the benchmark suite is incomplete. What I would do is basically what @brunocodutra says above (except I don't understand the need for K/N): for each algorithm, fix some constant K that makes sense. K is the length of the list that we're interested in benchmarking, and it should be something that we're expecting to see in real world metaprograms. Then, for each N, just create N different lists of length K and run the algorithm on each of these lists. That won't give us a precise timing of the algorithm for that K, but the overall curve will give us a hint as to whether instantiating this algorithm many times with different lists is costly or not. I don't think it is very useful to know the precise timing for small Ks by dividing by the number of lists; just seeing the total timing should be enough. And we can achieve all of this without changing the current module at all; we just need to add new benchmarks.
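
For contrast with the K/N scheme, here is a rough sketch of the generated file this variant implies (names made up; real benchmarks would of course call into the libraries under test), with the list length fixed at some realistic K and the repetition count N as the x-axis:

```cpp
template <typename... Ts> struct list {};
template <int Rep, int I>  struct item {};

// Placeholder for whatever algorithm the benchmark targets.
template <typename L> struct algorithm { using type = L; };

// Generated for a fixed, "realistic" length K = 10; one line per repetition:
using r0 = algorithm<list<item<0, 0>, item<0, 1>, /* ... */ item<0, 9>>>::type;
using r1 = algorithm<list<item<1, 0>, item<1, 1>, /* ... */ item<1, 9>>>::type;
// ... up to r(N-1); the total compile time of the file is plotted against N.
```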

What do you think?


brunocodutra commented May 29, 2016

What do you think?

Hmm that's a bit different from what I proposed and I'm not sure I got the idea. Do you expect the timings to vary across instances of a given algorithm for lists of the same size?

My idea was simply to increase the total compilation time for small N such that we reduce the noise imposed by the OS. A good way to do that, I think, would be to simply keep the product number_of_lists * list_size constant, hence the K/N, where K would be some arbitrary constant larger than MAX_N and N would be the size of the list.


ldionne commented May 29, 2016

I don't expect the timings to vary across instances of the same algorithm given inputs of the same size, but what we'd be measuring is the total compiler overhead when instantiating the algorithm multiple times. The compiler has internal structures that it needs to maintain and build as the compilation progresses. The 100th instantiation of the same algorithm (on a different input of the same size) could be different from the 1st instantiation due to these structures.

So I think your idea (and Eric's) is different from mine in the following sense. What you want is a precise timing of algorithms on small inputs. What I'm suggesting is an imprecise (well, not too precise) timing of calling the same algorithm many times on different inputs of the same size. What I'm proposing requires less fiddling with the data, and I think it is even more useful because that's what we're interested in at the end of the day.

If we try to measure the time of a single algorithm on a small input, we'll have to measure the algorithm on many small inputs and then divide, as you suggest. However, that requires making the assumption that the time for the 100th instantiation is roughly the same as the 1st instantiation, which I don't think is necessarily true.

brunocodutra (Collaborator) commented

The 100th instantiation of the same algorithm (on a different input of the same size) could be different from the 1st instantiation due to these structures.

I see now, makes sense.

If we try to measure the time of a single algorithm on a small input, we'll have to measure the algorithm on many small inputs and then divide, as you suggest. However, that requires making the assumption that the time for the 100th instantiation is roughly the same as the 1st instantiation, which I don't think is necessarily true.

Interesting reasoning. Personally I wouldn't think the compiler overhead would play an important role here, but that's indeed something we'd have to check.

brunocodutra referenced this issue Jun 17, 2016
Also, use -nostdinc++ with locally-built libc++ to avoid standard library include path conflicts.

Closes #126

brunocodutra commented Aug 2, 2016

Following up from #129 I think it would be a great idea to provide benchmarks for complete solutions to a couple of classical TMP problems as a way to assess the performance of our libraries as perceived by the end user.

So, any ideas for a classical TMP problem?

@porkybrain @edouarda @jonathanpoelen @gnzlbg


ldionne commented Aug 2, 2016

Note: This is related but not exactly the same as this issue. This issue is about horizontal microbenchmarks, while what @brunocodutra is talking about is macro benchmarks. I think both would be useful, of course, but the question is what we want to invest effort in first.


edouarda commented Aug 2, 2016

Micro benchmarks are a good indicator of what to expect at the macro level.

Macro benchmarks may result in libraries being optimized for the cases of the macro benchmark.


ldionne commented Aug 2, 2016

Fair point. Then perhaps we want to focus on higher precision micro benchmarks as proposed above. Basically it would be about trying to get a precise measurement of the algorithm for small inputs, which would require executing the algorithm with many different lists of small size.


brunocodutra commented Aug 2, 2016

Note: This is related but not exactly the same as this issue.

Right, the idea was to group all related discussions under this generic issue, which is called "Metabench is measuring the wrong thing".

Micro benchmarks are a good indicator of what to expect at the macro level. Macro benchmarks may result in libraries being optimized for the cases of the macro benchmark.

I didn't mean to replace the existing microbenchmarks, but rather to provide macro benchmarks in addition to the benchmarks we provide today. Microbenchmarks are for us metaprogramming developers who understand what they mean, while macro benchmarks would be targeted at users who just want a general idea of the performance of the different libraries under various complex scenarios. Finer-grained benchmarks would be just a tab away anyway, so I don't see how macro benchmarks could do any harm.

I also think it would be quite fun to benchmark fairly complex scenarios.


gnzlbg commented Aug 2, 2016

To me it makes sense to focus on micro-benchmarks when experimenting with optimizations.

Still, it might make sense to have some micro-benchmarks that are derived from typical use cases of meta-programming rather than from "I think I can optimize this, let's make up a micro-benchmark to measure it".

If somebody is looking for ideas I would suggest looking at how meta-programming is used to implement tuple, tuple algorithms, variant, ... and the <functional> header (std::bind and std::invoke in particular). I would say that those are some of the most common ways through which people "benefit" from meta-programming even if they don't write meta-programs themselves.

There are obviously tons of libraries that use meta-programming and that people do use, but I don't know how impactful it would be to try to improve, say, some Boost.Spirit patterns as opposed to std::tuple.
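
For concreteness, here is one shape such a use-case-derived benchmark could take (purely illustrative; widget, make_case, and touch are made-up names, and the repetition count would be driven by the benchmark generator):

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

template <int Rep, int I> struct widget {};  // distinct element types per repetition

// Build a tuple of 8 element types that are unique to this repetition.
template <int Rep, std::size_t... Is>
auto make_case(std::index_sequence<Is...>)
    -> std::tuple<widget<Rep, static_cast<int>(Is)>...> {
    return std::tuple<widget<Rep, static_cast<int>(Is)>...>{};
}

// Touch the tuple the way ordinary user code does.
template <int Rep>
void touch() {
    auto t = make_case<Rep>(std::make_index_sequence<8>{});
    using first = std::tuple_element_t<0, decltype(t)>;
    static_assert(std::is_same<first, widget<Rep, 0>>::value, "");
    (void)std::get<3>(t);
}

// The generated file would unroll many repetitions:
template void touch<0>();
template void touch<1>();
// ... template void touch<K - 1>();
```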

ericniebler (Author) commented

common_type is another example.
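
A minimal sketch of what a common_type microbenchmark along these lines could look like (illustrative only; the meters wrapper and the chosen operand types are made up):

```cpp
#include <type_traits>

// Each repetition introduces a fresh class type with a conversion to double,
// so every std::common_type specialization below is distinct and nothing is
// reused across repetitions.
template <int Rep>
struct meters { double value; operator double() const { return value; } };

template <int Rep>
using result = typename std::common_type<meters<Rep>, double, float>::type;

// The generated file would unroll many repetitions (and/or grow the number of
// arguments per call):
static_assert(std::is_same<result<0>, double>::value, "");
static_assert(std::is_same<result<1>, double>::value, "");
// ...
```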


brunocodutra commented Aug 3, 2016

common_type is another example.

That's a great idea for another microbenchmark.

As for macrobenchmarks, I was thinking of something more involved, such as parsing raw literals. This example in particular would be tricky to benchmark, though, because the number of digits is limited by sizeof(std::uintmax_t), even for binary literals.
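
For reference, the kind of compile-time parsing such a macrobenchmark would exercise looks roughly like the following (a minimal sketch, not a proposed benchmark; parse_bin and _bin are made-up names). As noted, the value itself caps the useful digit count, but the parsing work, one instantiation per digit, is what the benchmark would scale:

```cpp
#include <cstdint>

// Fold the digit characters of the literal into a value, one instantiation
// per character.
template <std::uintmax_t Acc, char... Cs>
struct parse_bin { static constexpr std::uintmax_t value = Acc; };

template <std::uintmax_t Acc, char C, char... Cs>
struct parse_bin<Acc, C, Cs...>
    : parse_bin<(Acc << 1) | static_cast<std::uintmax_t>(C - '0'), Cs...> {};

// Raw literal operator template: receives the literal's characters as a pack.
template <char... Cs>
constexpr std::uintmax_t operator"" _bin() {
    return parse_bin<0, Cs...>::value;
}

static_assert(101101_bin == 45, "");
```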

odinthenerd (Contributor) commented

Sorry for being way late to the party. I think there are examples of huge lists being used; kvasir.io is one of them: there is only one "device initialization", which can have a few thousand elements and which is optimized using sort and fold. Currently it crashes above 500 elements or so because sort is still not powerful enough.
Obviously @ericniebler is right that the typical use case is many, many uses of small algorithms. Full-solution benchmarks would be nice too, since they would reflect the memoization effects of parts of the respective algorithms. For example, many implementations of the at algorithm use lists of const void*; these need to be generated for every unique index, but they can be reused by the compiler for the same index into a different list.
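
For context, here is a condensed reconstruction from memory of the const void* indexing technique being referred to (names made up; real implementations differ in detail). The pack of const void* parameters depends only on the index I, so the compiler can reuse it when indexing different lists at the same position:

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

template <typename... Ts> struct list {};
template <typename T>     struct wrap { using type = T; };

template <std::size_t I> using skip = void const;  // every skipped position erases to const void*

template <typename Seq> struct at_helper;
template <std::size_t... Is>
struct at_helper<std::index_sequence<Is...>> {
    // The first sizeof...(Is) parameters are const void*, so this signature
    // depends only on the index, not on the element types being skipped.
    template <typename T, typename... Rest>
    static wrap<T> select(skip<Is>*..., wrap<T>*, Rest*...);
};

template <typename L, std::size_t I> struct at;
template <typename... Ts, std::size_t I>
struct at<list<Ts...>, I> {
    using type = typename decltype(at_helper<std::make_index_sequence<I>>::select(
        static_cast<wrap<Ts>*>(nullptr)...))::type;
};

static_assert(std::is_same<at<list<int, char, double, float>, 2>::type, double>::value, "");
```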
Full-solution tests are a lot of work, though. I am mentoring a student who will be doing most of the implementation of parameter2. Maybe I can talk him into doing the implementation with different MPLs; in any case it will take a while, as he is still learning some of the material.

brunocodutra (Collaborator) commented

@porkybrain Sounds interesting. What are some of the more involved MP problems that must be addressed by parameter2?

Then perhaps we want to focus on higher precision micro benchmarks as proposed above.

@ldionne Any ideas on how we could go about this?


ldionne commented Oct 16, 2016

I'm closing this in favor of #148 and #151, which tackle the two conversations taking place here independently.

ldionne closed this as completed on Oct 16, 2016