
Metabench is measuring the wrong thing #124

Closed
ericniebler opened this issue May 29, 2016 · 18 comments
ericniebler commented May 29, 2016

Very few people are manipulating type lists of > a few dozen elements. I find the current tests to be largely meaningless for real world metaprograms. It would be more interesting to run the tests up to, say, 100, but repeatedly and with lots of different types to force the number of instantiations up.

In addition, benchmarks are meaningless if the total CPU time is less than about 5s. So, for each N in [0,100] each test should keep adding unique instantiations of the algorithm being measured until the CPU time is high enough and then divide by the number of instantiations to get an accurate idea of the performance profile of that algorithm at that N.

That way, we can really start the work of optimizing our libs for the real world.
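
To illustrate the proposed scheme concretely, here is a self-contained sketch (all names are made up and the "algorithm" is a toy reverse, not any library's): for a fixed N, the generated file would contain K structurally distinct instantiations, and the measured compile time would then be divided by K offline.

```cpp
#include <type_traits>

template <typename... Ts> struct list {};
template <int Rep, int I> struct item {};  // element types unique to repetition Rep

// Toy stand-in for the algorithm being measured: recursive reverse.
template <typename L, typename Acc = list<>> struct reverse;
template <typename Acc> struct reverse<list<>, Acc> { using type = Acc; };
template <typename T, typename... Ts, typename... As>
struct reverse<list<T, Ts...>, list<As...>> : reverse<list<Ts...>, list<T, As...>> {};

// One repetition at N = 3; every Rep yields a brand-new input, so the compiler
// cannot reuse a previously memoized instantiation.
template <int Rep> using input  = list<item<Rep, 0>, item<Rep, 1>, item<Rep, 2>>;
template <int Rep> using output = typename reverse<input<Rep>>::type;

// The generated source file would unroll K repetitions like these two:
static_assert(std::is_same<output<0>, list<item<0, 2>, item<0, 1>, item<0, 0>>>::value, "");
static_assert(std::is_same<output<1>, list<item<1, 2>, item<1, 1>, item<1, 0>>>::value, "");
// ... up to output<K - 1>, and the total compile time would be divided by K.
```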

ericniebler changed the title from "Metabench if measuring the wrong thing" to "Metabench is measuring the wrong thing" on May 29, 2016

brunocodutra commented May 29, 2016

So, for each N in [0,100] each test should keep adding unique specializations of the algorithm being measured until the CPU time is high enough and then divide by the number of specializations to get an accurate idea of the performance profile at that N.

I'm not sure I follow your idea. So you propose that, for a given N, instead of instantiating the algorithm under test for a single list of size N, we instantiate it for K/N lists of the same size N, each containing different types (so as to avoid memoization)?

ericniebler (Author) commented

For each N, instantiate the algorithm with K different lists to avoid memoization. Increase K until CPU time > 5s. Then divide the CPU time by K to get an accurate idea of the cost of each instantiation at that N.


brunocodutra commented May 29, 2016

For each N, instantiate the algorithm with K different lists to avoid memoization. Increase K until CPU time > 5s. Then divide the CPU time by K to get an accurate idea of the cost of each instantiation at that N.

Right. I see two problems with this approach as is:

  1. 5s for every N for every algorithm for every library would make it impossible to update benchmark results using Travis.
  2. Even if we agree on some X < 5s, how would we go about increasing K until CPU time > X? I mean, Metabench works roughly by generating static source files for each N and then measuring the time taken to compile those source files. We could surely put that in a loop and keep increasing K, regenerating a source file until the time taken to compile it gets high enough, but that too would be prohibitively expensive time-wise, even for very small values of X.

I think the best we could do would be to define some static K (per algorithm) and, for every N, generate a source file that instantiates the given algorithm for K/N lists of size N. Then we measure how long it takes to compile that source file and divide the result by K/N. This way we would have roughly the same resolution with respect to time for every N.
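
To make the K/N idea concrete, here is a self-contained sketch, not Metabench's actual generator, of roughly what one generated source file could compute for a given N under a fixed per-algorithm budget K (all names are made up; it uses C++17 pack folding for brevity):

```cpp
#include <cstddef>
#include <utility>

template <typename... Ts> struct list {};
template <std::size_t Rep, std::size_t I> struct item {};

// Toy stand-in for the algorithm under test: length via naive recursion.
template <typename L> struct size_of;
template <> struct size_of<list<>> { static constexpr std::size_t value = 0; };
template <typename T, typename... Ts>
struct size_of<list<T, Ts...>> { static constexpr std::size_t value = 1 + size_of<list<Ts...>>::value; };

// Build one distinct list of length N for repetition Rep, so the compiler
// cannot reuse a previous instantiation of the algorithm.
template <std::size_t Rep, typename Seq> struct make_input;
template <std::size_t Rep, std::size_t... Is>
struct make_input<Rep, std::index_sequence<Is...>> { using type = list<item<Rep, Is>...>; };

// Instantiate the algorithm once per repetition; the fold just forces every
// instantiation to actually happen.
template <std::size_t N, typename Reps> struct run_all;
template <std::size_t N, std::size_t... Rs>
struct run_all<N, std::index_sequence<Rs...>> {
    static constexpr std::size_t value =
        (std::size_t{0} + ... +
         size_of<typename make_input<Rs, std::make_index_sequence<N>>::type>::value);
};

// With a fixed budget K = 1200: N = 3 yields 400 repetitions, N = 100 yields 12,
// so the total work per generated file stays roughly constant across N.
constexpr std::size_t N = 3, K = 1200;
static_assert(run_all<N, std::make_index_sequence<K / N>>::value == K, "");
```

The point is only that the total number of elements processed per file stays around K for every N, so small N no longer yield compile times too short to measure reliably.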

Does that seem reasonable?


ldionne commented May 29, 2016

@ericniebler I think you have a point that horizontal benchmarks (with many small instantiations) have value and that we're not measuring that right now. However, saying that the current benchmarks are largely meaningless is wrong in my experience, as many "real world" metaprograms will start from a small list but generate a larger one from it, and then call algorithms on that larger list.

In any case, I don't think that's a problem with the Metabench module itself, but just an observation that the benchmark suite is incomplete. What I would do is basically what @brunocodutra says above (except I don't understand the need for K/N): for each algorithm, fix some constant K that makes sense. K is the length of the list that we're interested in benchmarking, and it should be something that we're expecting to see in real world metaprograms. Then, for each N, just create N different lists of length K and run the algorithm on each of these lists. That won't give us a precise timing of the algorithm for that K, but the overall curve will give us a hint as to whether instantiating this algorithm many times with different lists is costly or not. I don't think it is very useful to know the precise timing for small Ks by dividing by the number of lists; just seeing the total timing should be enough. And we can achieve all of this without changing the current module at all; we just need to add new benchmarks.
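
For contrast with the K/N scheme, here is a rough sketch of the generated file this variant implies (names made up; real benchmarks would of course call into the libraries under test), with the list length fixed at some realistic K and the repetition count N as the x-axis:

```cpp
template <typename... Ts> struct list {};
template <int Rep, int I>  struct item {};

// Placeholder for whatever algorithm the benchmark targets.
template <typename L> struct algorithm { using type = L; };

// Generated for a fixed, "realistic" length K = 10; one line per repetition:
using r0 = algorithm<list<item<0, 0>, item<0, 1>, /* ... */ item<0, 9>>>::type;
using r1 = algorithm<list<item<1, 0>, item<1, 1>, /* ... */ item<1, 9>>>::type;
// ... up to r(N-1); the total compile time of the file is plotted against N.
```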

What do you think?


brunocodutra commented May 29, 2016

What do you think?

Hmm that's a bit different from what I proposed and I'm not sure I got the idea. Do you expect the timings to vary across instances of a given algorithm for lists of the same size?

My idea was simply to increase the total compilation time for small N such that we reduce the noise imposed by the OS. A good way to do that, I think, would be to simply keep the product number_of_lists * list_size constant, hence the K/N, where K would be some arbitrary constant larger than MAX_N and N would be the size of the list.


ldionne commented May 29, 2016

I don't expect the timings to vary across instances of the same algorithm given inputs of the same size, but what we'd be measuring is the total compiler overhead when instantiating the algorithm multiple times. The compiler has internal structures that it needs to maintain and build as the compilation progresses. The 100th instantiation of the same algorithm (on a different input of the same size) could be different from the 1st instantiation due to these structures.

So I think your idea (and Eric's) is different from mine in the following sense. What you want is a precise timing of algorithms on small inputs. What I'm suggesting is an imprecise (well, not too precise) timing of calling the same algorithm many times on different inputs of the same size. What I'm proposing requires less fiddling with the data, and I think it is even more useful because that's what we're interested in at the end of the day.

If we try to measure the time of a single algorithm on a small input, we'll have to measure the algorithm on many small inputs and then divide, as you suggest. However, that requires making the assumption that the time for the 100th instantiation is roughly the same as the 1st instantiation, which I don't think is necessarily true.

brunocodutra (Collaborator) commented

The 100th instantiation of the same algorithm (on a different input of the same size) could be different from the 1st instantiation due to these structures.

I see now, makes sense.

If we try to measure the time of a single algorithm on a small input, we'll have to measure the algorithm on many small inputs and then divide, as you suggest. However, that requires making the assumption that the time for the 100th instantiation is roughly the same as the 1st instantiation, which I don't think is necessarily true.

Interesting reasoning. Personally I wouldn't think the compiler overhead would play an important role here, but that's indeed something we'd have to check.

brunocodutra referenced this issue Jun 17, 2016
Also, use -nostdinc++ with locally-built libc++ to avoid standard library include path conflicts.

Closes #126

brunocodutra commented Aug 2, 2016

Following up from #129 I think it would be a great idea to provide benchmarks for complete solutions to a couple of classical TMP problems as a way to assess the performance of our libraries as perceived by the end user.

So, any ideas for a classical TMP problem?

@porkybrain @edouarda @jonathanpoelen @gnzlbg


ldionne commented Aug 2, 2016

Note: This is related but not exactly the same as this issue. This issue is about horizontal microbenchmarks, while what @brunocodutra is talking about is macro benchmarks. I think both would be useful, of course, but the question is what we want to invest effort in first.


edouarda commented Aug 2, 2016

Micro benchmarks are a good indicator of what to expect at the macro level.

Macro benchmarks may result in libraries being optimized for the cases of the macro benchmark.


ldionne commented Aug 2, 2016

Fair point. Then perhaps we want to focus on higher precision micro benchmarks as proposed above. Basically it would be about trying to get a precise measurement of the algorithm for small inputs, which would require executing the algorithm with many different lists of small size.


brunocodutra commented Aug 2, 2016

Note: This is related but not exactly the same as this issue.

Right, the idea was to group all related discussions under this generic issue, which is called "Metabench is measuring the wrong thing".

Micro benchmarks are a good indicator of what to expect at the macro level. Macro benchmarks may result in libraries being optimized for the cases of the macro benchmark.

I didn't mean to replace the existing microbenchmarks, but rather to provide macro benchmarks in addition to the benchmarks we provide today. Microbenchmarks are for us metaprogramming developers who understand what they mean, while macro benchmarks would be targeted at users who just want a general idea of the performance of the different libraries under various complex scenarios. Finer-grained benchmarks would be just a tab away anyway, so I don't see how macro benchmarks could do any harm.

I also think it would be quite fun to benchmark fairly complex scenarios.


gnzlbg commented Aug 2, 2016

To me it makes sense to focus on micro-benchmarks when experimenting with optimizations.

Still, it might make sense to have some micro-benchmarks that are derived from typical use cases of meta-programming rather than from "I think I can optimize this, let's make up a micro-benchmark to measure it".

If somebody is looking for ideas I would suggest looking at how meta-programming is used to implement tuple, tuple algorithms, variant, ... and the <functional> header (std::bind and std::invoke in particular). I would say that those are some of the most common ways through which people "benefit" from meta-programming even if they don't write meta-programs themselves.

There are obviously tons of libraries that use meta-programming and that people do use, but I don't know how impactful it would be to try to improve, say, some Boost.Spirit patterns as opposed to std::tuple.
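
For concreteness, here is one shape such a use-case-derived benchmark could take (purely illustrative; widget, make_case, and touch are made-up names, and the repetition count would be driven by the benchmark generator):

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

template <int Rep, int I> struct widget {};  // distinct element types per repetition

// Build a tuple of 8 element types that are unique to this repetition.
template <int Rep, std::size_t... Is>
auto make_case(std::index_sequence<Is...>)
    -> std::tuple<widget<Rep, static_cast<int>(Is)>...> {
    return std::tuple<widget<Rep, static_cast<int>(Is)>...>{};
}

// Touch the tuple the way ordinary user code does.
template <int Rep>
void touch() {
    auto t = make_case<Rep>(std::make_index_sequence<8>{});
    using first = std::tuple_element_t<0, decltype(t)>;
    static_assert(std::is_same<first, widget<Rep, 0>>::value, "");
    (void)std::get<3>(t);
}

// The generated file would unroll many repetitions:
template void touch<0>();
template void touch<1>();
// ... template void touch<K - 1>();
```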

ericniebler (Author) commented

common_type is another example.
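
A minimal sketch of what a common_type microbenchmark along these lines could look like (illustrative only; the meters wrapper and the chosen operand types are made up):

```cpp
#include <type_traits>

// Each repetition introduces a fresh class type with a conversion to double,
// so every std::common_type specialization below is distinct and nothing is
// reused across repetitions.
template <int Rep>
struct meters { double value; operator double() const { return value; } };

template <int Rep>
using result = typename std::common_type<meters<Rep>, double, float>::type;

// The generated file would unroll many repetitions (and/or grow the number of
// arguments per call):
static_assert(std::is_same<result<0>, double>::value, "");
static_assert(std::is_same<result<1>, double>::value, "");
// ...
```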


brunocodutra commented Aug 3, 2016

common_type is another example.

That's a great idea for another microbenchmark.

As for macrobenchmarks, I was thinking of something more involved, such as parsing raw literals. This example in particular would be tricky to benchmark, though, because the number of digits is limited by sizeof(std::uintmax_t), even for binary literals.
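
For reference, the kind of compile-time parsing such a macrobenchmark would exercise looks roughly like the following (a minimal sketch, not a proposed benchmark; parse_bin and _bin are made-up names). As noted, the value itself caps the useful digit count, but the parsing work, one instantiation per digit, is what the benchmark would scale:

```cpp
#include <cstdint>

// Fold the digit characters of the literal into a value, one instantiation
// per character.
template <std::uintmax_t Acc, char... Cs>
struct parse_bin { static constexpr std::uintmax_t value = Acc; };

template <std::uintmax_t Acc, char C, char... Cs>
struct parse_bin<Acc, C, Cs...>
    : parse_bin<(Acc << 1) | static_cast<std::uintmax_t>(C - '0'), Cs...> {};

// Raw literal operator template: receives the literal's characters as a pack.
template <char... Cs>
constexpr std::uintmax_t operator"" _bin() {
    return parse_bin<0, Cs...>::value;
}

static_assert(101101_bin == 45, "");
```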

odinthenerd (Contributor) commented

Sorry for being way late to the party. I think there are examples of huge lists being used; kvasir.io is one of them: there is only one "device initialization", which can have a few thousand elements and which is optimized using sort and fold. Currently it crashes above 500 elements or so because sort is still not powerful enough.
Obviously @ericniebler is right that the typical use case is many, many uses of small algorithms. Full-solution benchmarks would be nice too, since they would reflect the memoization effects of parts of the respective algorithms. For example, many implementations of the at algorithm use lists of const void*; these need to be generated for every unique index, but they can be reused by the compiler for the same index into a different list.
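
For context, here is a condensed reconstruction from memory of the const void* indexing technique being referred to (names made up; real implementations differ in detail). The pack of const void* parameters depends only on the index I, so the compiler can reuse it when indexing different lists at the same position:

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

template <typename... Ts> struct list {};
template <typename T>     struct wrap { using type = T; };

template <std::size_t I> using skip = void const;  // every skipped position erases to const void*

template <typename Seq> struct at_helper;
template <std::size_t... Is>
struct at_helper<std::index_sequence<Is...>> {
    // The first sizeof...(Is) parameters are const void*, so this signature
    // depends only on the index, not on the element types being skipped.
    template <typename T, typename... Rest>
    static wrap<T> select(skip<Is>*..., wrap<T>*, Rest*...);
};

template <typename L, std::size_t I> struct at;
template <typename... Ts, std::size_t I>
struct at<list<Ts...>, I> {
    using type = typename decltype(at_helper<std::make_index_sequence<I>>::select(
        static_cast<wrap<Ts>*>(nullptr)...))::type;
};

static_assert(std::is_same<at<list<int, char, double, float>, 2>::type, double>::value, "");
```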
Full-solution tests are a lot of work, though. I am mentoring a student who will be doing most of the implementation of parameter2. Maybe I can talk him into doing the implementation with different MPLs; in any case it will take a while, as he is still learning some of the material.

brunocodutra (Collaborator) commented

@porkybrain Sounds interesting. What are some of the more involved MP problems that must be addressed by parameter2?

Then perhaps we want to focus on higher precision micro benchmarks as proposed above.

@ldionne Any ideas on how we could go about this?


ldionne commented Oct 16, 2016

I'm closing this in favor of #148 and #151, which tackle the two conversations taking place here independently.

ldionne closed this as completed on Oct 16, 2016