Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement min_element (benchmark linked in description) #99

Closed
wants to merge 1 commit into from

Conversation

ecrypa
Copy link
Contributor

@ecrypa ecrypa commented Oct 5, 2019

See #73.

Performance of Clang 8 and GCC 9 is similar (v5 refers to this version; v4 is based on metal::bind and metal::arg).

I suspect that the most expensive operation in this version is the recursive call of typename _min_element_worker<false>::template type. The recursion processes one element per step (no fast-tracking), and each call uses O(N) template arguments. It should be more efficient to cut the list into smaller pieces, calculate the minima of the pieces, and then calculate the minima of the minima. I guess you know the tree-based lookup of kvasir as explained in this blog post from Odin Holmes?

@ecrypa ecrypa changed the title implement min_element including benchmarks implement min_element (benchmark linked in description) Oct 5, 2019
Copy link
Owner

@brunocodutra brunocodutra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you've mastered Odin's technique to recurse on template alias.

I'm actually surprised by the fact the performance of the one-liner based on higher-order composition turns out not to be that bad at all. Maintaining a custom implementation would only pay off if it outperforms the one-liner by a considerable margin. I think we can get there with fast-tracking, I suggest you have a look at metal::fold_left and metal::fold_right.

It may also be worth it to get some inspiration from metal::at, in particular the mutual recursion between detail::grouper and detail::_at, which builds a recursive tree-like structure that is then unrolled by evaluating the nested ::type. I think it's similar to the idea discussed in that blog post by Odin, in fact I think I was inspired by it at that time.

Either way, I'd be fine with the one-liner if you think it's not worth it to invest time in the fast-tracked version.

include/metal/list/min_element.hpp Show resolved Hide resolved
template<class...> class comp, class v0, class v1,
class... tail>
using type = typename _min_element_worker<!sizeof...(
tail)>::template type<comp, if_<comp<v1, v0>, v1, v0>, tail...>;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me of the way metal::fold_left and metal::fold_right are implemented, we can probably leverage the same strategy for fast tracking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought about that. In the ideal case, we find a SFINAE-friendly version of combiner and feed that into fold_left directly.

If we can not find a SFINAE-friendly combiner then the fast-tracking is tedious: Both arguments are mentioned twice (if_<comp<v1, v0>, v1, v0>). Do you have an idea how to solve that elegantly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a way, but it is slow on GCC9. I suspect that the aliases are recursively expanded twice for both arguments.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n the ideal case, we find a SFINAE-friendly version of combiner and feed that into fold_left directly.

A SFINAE-friendly version of combiner is called metal::bind. If there were a better way to implement it, then we could just define min_element as the beautiful one-liner.

If we can not find a SFINAE-friendly combiner then the fast-tracking is tedious.

Specially so, because it's an almost exact copy of metal::fold_left. Do you think we could extract the internal implementation and make the c alias as you defined a parameter such that one could implement both fold_left and min_element (and perhaps other algorithms) without any code duplication?

it is slow on GCC9.

There's a trick you could try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is slow on GCC9.

There's a trick you could try.

Yeah, good idea. Let me call that v9.foldlike_tempsave. It does help, but it is still slower than the one-liner. Weirdly, an even faster workaround is given by v10.foldlike_indirection. It is unbelievably fast on GCC 9, and it is one of the three equally fast winners on Clang 8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could extract the internal implementation and make the c alias as you defined a parameter such that one could implement both fold_left and min_element (and perhaps other algorithms) without any code duplication?

Currently, the fastest min_element implementation on Clang 8 and GCC 9 is v10.foldlike_indirection. If I remember correctly then you should be able to change the combiner (c) from

template<template<typename...> class comp, class x, class y>
using _min_element_combiner = if_<comp<y, x>, y, x>;

to

template<template<typename...> class expr, class x, class y>
using _fold_left_combiner = expr<x, y>;

in order to obtain an implementation of metal::fold_left. I suspect that it should be equally fast, but I hesitate to touch such a fundamental well tested algorithm.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unbelievably fast on GCC 9, and it is one of the three equally fast winners on Clang 8.

Wow amazing!

If I remember correctly then you should be able to change the combiner (c) from

You're definitely onto something. It's possible this indirection would trip MSVC, but worst case we can implement something analogous to detail::forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly then you should be able to change the combiner [...]

It seems to work. See #100. I am not sure if it is a good idea, though.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 5, 2019

I'm actually surprised by the fact the performance of the one-liner based on higher-order composition turns out not to be that bad at all. Maintaining a custom implementation would only pay off if it outperforms the one-liner by a considerable margin. I think we can get there with fast-tracking, I suggest you have a look at metal::fold_left and metal::fold_right.

I do not know how to apply fast-tracking here because both arguments are mentioned twice. I tried this (v6), which is fast on clang 8, but does not compile on GCC 9.

It may also be worth it to get some inspiration from metal::at, in particular the mutual recursion between detail::grouper and detail::_at, which builds a recursive tree-like structure that is then unrolled by evaluating the nested ::type. I think it's similar to the idea discussed in that blog post by Odin, in fact I think I was inspired by it at that time.

Yes, those are interesting. I will definitely have a look there because I want to learn the techniques. However, I am not sure if it is worth to introduce more complexity here if the one-liner does the job well.

Either way, I'd be fine with the one-liner if you think it's not worth it to invest time in the fast-tracked version.

I believe that the one-liner is a good starting point. We could implement it, provide docs, and create tests. When more performance is required, it can be improved.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 6, 2019

Looks like you've mastered Odin's technique to recurse on template alias.

I keep on reading the following two sources again and again, and I can not decide if I should trust that technique or not:

  1. http://odinthenerd.blogspot.com/2017/03/the-alias-flaw-why-kvasirmpl-is-faster.html
  2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59498

I believe that the one-liner is a good starting point. We could implement it, provide docs, and create tests. When more performance is required, it can be improved.

After some more benchmarking, I am convinced that the one-liner (v4, higher-order metafunction) is the best I can achieve currently: It is fastest on GCC 9, and it is not too bad on Clang 8. For reference, you can find all implementations in ecrypa@40b2acc (should I add commits to this PR?).

@brunocodutra
Copy link
Owner

Yes, those are interesting. I will definitely have a look there because I want to learn the techniques. However, I am not sure if it is worth to introduce more complexity here if the one-liner does the job well.

Agreed.

I keep on reading the following two sources again and again, and I can not decide if I should trust that technique or not:

It does work, but only when a dependent nested lookup is involved, that's why that "silly" (sizeof...(Ts)<10000000) is there. If we replace it by true it fails with the same error you see in that bug report even on Clang 9 with C++20 enabled. The fact all three major compilers consistently agree on this tells me there must be something buried deep in the language specification that implies this surprising behavior, but I wouldn't be able to tell you exactly what that is.

ecrypa referenced this pull request in ecrypa/metal Oct 6, 2019
@ecrypa ecrypa mentioned this pull request Oct 6, 2019
@ecrypa ecrypa closed this Oct 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants