Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement min_element #100

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

ecrypa
Copy link
Contributor

@ecrypa ecrypa commented Oct 6, 2019

Continuation of #99 (comment).

I copied the code of fold_left because I want to benchmark old and new versions. Preliminary results can be found in the section "Change of fold left" at https://ecrypa.github.io/metal-benchmark/.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 7, 2019

Summary

This generalization of fold_left seems to work, but I do not like it. I will concisely try to explain my dislike. Then I turn back to bind-free composition outside of fold_left. Finally, I ask specific questions.

Why I dislike the approach of this PR

Semantically, a fold can be fully specified by a single template<class init, class arg> class expr. Instead, this implementation (v2) asks for a comp (which is an expression that specifies an ordering), and it additionally asks for that weird template<template<class, class> class comp, class init, class arg> class combiner. Inside the fold, comp and combiner are composed to the real expr that is used for the folding. From my point of view, this is not a natural interface for fold_left, and other use cases may require even more arguments than just comp and combiner. Key question: Why do we compose inside of the fold_left implementation?

How to compose user-provided ordering and min_element logic

I can think of three options to consider:

  1. composition inside fold_left (see above)
  2. composition outside, based on metal::bind (works, but not blazingly fast; see v4)
  3. composition outside, based on native C++ templates (SFINAE issues)

I believe that (1) diminishes the utility of fold_left as a general building block because the fold_left interface can offer a limited level of composition only.

Improving the performance of (2) is highly desirable because user-code performance can be expected to benefit from efficient composition, too. However, I expect the optimization to be difficult (impossible?) because it is already better than (you @brunocodutra) expected.

When @brunocodutra pointed me at the SFINAE issues of (3), I was scared. I surely do not fully understand it yet, but I have gained some optimism that it can be mitigated. Therefor, I would like to understand what kind of failures we have to consider.

Failures to be considered by approach (3) to min_element

... and how to handle them

(A) user-provided ordering expression expects a wrong number of arguments

We should be able to preserve SFINAE-friendlyness by simple pattern matching.

(B) user-provided ordering fires a static_assert on certain input

I believe that there is no way at all to preserve SFINAE-friendlyness in that case, right?

(C) user-provided ordering (UPO) causes a SF when passed certain arguments

Since the arguments are forwarded to the UPO not before min_element is "called", there should be the desired SF on min_element.

(D) ... (What else can happen?)

As far as I understand that whole SFINAE business, the following techniques should help in any conceivable case: We can change our external composition from

template<class init, class arg>
using combiner = if_<comp<arg, init>, arg, init>;

to

template<
  template<class> class di /* disguised_identity */,
  class init, class arg>
using combiner = if_<comp<di<arg>, di<init>>, di<arg>, di<init>>;

or

template<
  template<template<class...> class> dei /* disguised_expr_identity */,
  class init, class arg>
using combiner = if_<typename dei<comp>::template type<arg, init>, arg, init>;

or a combination of both.

If the private interface (for internal use only) of fold_left is changed such that it accepts one of the two latter aliases (or its combination) then we can achieve a much more general SFINAE-friendly composition than with the current draft implementation.

Specific questions

I would be particularly thankful for input on any of these specific points:

  • What do you think about the internal composition (1) in this PR?
  • Which failures other than (A)--(C) should I consider when testing native composition (3)?
  • Do you agree that (D) should help in any case of failure? Can we get away without it, though?
  • Any other hints how I should proceed?

@brunocodutra
Copy link
Owner

brunocodutra commented Oct 8, 2019

Why do we compose inside of the fold_left implementation?

Agree 100% that this is not something we would like to do, considering Metal provides first class higher order composition that could just be reused. The only reason why we would want to do something like this would be to get better benchmarks, but frankly it may not be worth it, specially if we think we're paying for a performance improvement no one needs.

However, I expect the optimization to be difficult (impossible?) because it is already better than (you @brunocodutra) expected.

A very pleasant surprise I should add :)

We should be able to preserve SFINAE-friendlyness by simple pattern matching.

Unfortunately, you can't do that in general.

I believe that there is no way at all to preserve SFINAE-friendlyness in that case, right?

Correct, more generally, any error that originates in the definition as opposed to declaration of a type template is unrecoverable through SFINAE and considered a user error.

Since the arguments are forwarded to the UPO not before min_element is "called", there should be the desired SF on min_element.

I'm not sure I understand this statement, could you elaborate?

As far as I understand that whole SFINAE business, the following techniques should help in any conceivable case

I'm not sure I understand, how would di or dei help with SFINAE?

Perhaps I should clarify what I mean by SFINAE-friendliness.
I'll start by defining a way to testing for it:

template<template<class...> class tmpl, class... args>
using is_sfinae_friendly = metal::is_invocable<metal::lambda<tmpl>, args...>;

For any tmpl and any args, tmpl is only SFINAE-friendly if instantiating is_sfinae_friendly<tmpl, args...> does not trigger a compilation error.

Metal unit tests systematically check for its own SFINAE-friendliness by attempting to evaluate algorithms on all sorts of ill-defined types. You'll see this test data has been carefully crafted to cover various edge cases and, in particular, attempts to trip metal::as_number by defining a nested ::value to be various things other than the expected constant integral value convertible to metal::int_. You'll see that unit tests expect metal::as_number to survive instantiations against all of these without triggering a compilation error.

Checking whether some template expression is SFINAE-friendly is in general very complex and requires deep understanding of the standard that I do not claim to have myself. Put simply, a substitution failure is only recoverable if it's triggered as part of the declaration of a type due to its own instantiation or the instantiation of one of its immediate type arguments. A substitution error that gets triggered as part of the definition of a type is never recoverable. For the purpose of this definition, a template alias is considered a declaration.

If we turn back to the original definition of _min_element_val:

template<template<typename...> class expr>
struct _min_element_val<lambda<expr>> {
  template<typename x, typename y>
  using combiner = if_<expr<x, y>, x, y>;
  // ...
};

I claim combiner is not SFINAE friendly, because a substitution error that is triggered by expr<x, y> as part of the instantiation of combiner<x, y> though part of its declaration, is triggered by the substitution of expr, which is not one of its immediate type arguments. Therefore such a substitution error leads to a compilation error.

You can test this by attempting to instantiate is_sfinae_friendly<_min_element_val<lambda<std::is_void>>, void, some_extra_argument>, which should trigger a compilation error stating that std::is_void<void, some_extra_argument> cannot be instantiated.

I hope this helps.

include/metal/value/fold_left_v2.hpp Outdated Show resolved Hide resolved
include/metal/value/fold_left_v2.hpp Outdated Show resolved Hide resolved
@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 9, 2019

Wow! Thanks a lot for your excellent explanations. I very much appreciate the time you are spending to teach me these things.

While your explanations are extremely useful for me, I suspect that it is most efficient (for both of us) if I do not respond to each of your points individually. Instead, I propose to respond with (hopefully) improved code. Should you be more interested in my thoughts/explanations then please let me know.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 9, 2019

I start to feel more confident about the application of these metaprogramming techniques. I am looking forward to your comments to this latest approach.

By the way: Should I close-and-open, or is it fine to continue here?

@ecrypa ecrypa changed the title generalize fold_left; implement min_element implement min_element Oct 9, 2019
@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 9, 2019

We should be able to preserve SFINAE-friendlyness by simple pattern matching.

Unfortunately, you can't do that in general.

What an inconsistent behavior of compilers here! Clang matches nothing, MSVC matches everything, and GCC/ICC need C++17 in order to match in the way we want them to (C++14 is not enough). I can not tell if and where to submit bug reports.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 9, 2019

Benchmarks are promising (Clang 8, GCC 9). The winner metal::v14::min_element (similar to the version of this PR) is based on the new helper dcall.

If there are no SFINAE troubles then this is my favorite version.

@brunocodutra
Copy link
Owner

I suspect that it is most efficient (for both of us) if I do not respond to each of your points individually

I'm fine either way, but I'll try to summarize all my thoughts and answer all your question on this single comment to hopefully make it easier for both of us to make sense of all the discussions we have in flight.

Should I close-and-open, or is it fine to continue here?

It's totally fine to keep updating this one. If you're wondering, force-pushing is fine too.

What an inconsistent behavior of compilers here!

Whoa, I didn't actually check other compilers. You know you've pushed too far when compilers stop agreeing with each other. I don't know which, if any, is correct here, but experience tells me clang is usually right.

Benchmarks are promising (Clang 8, GCC 9). The winner metal::v14::min_element (similar to the version of this PR) is based on the new helper dcall.

Impressive! This may be it! Just wondering, what happens if we get rid of class... _ in the definition of combiner?

I think dcall may get even faster if you specialize _dcaller to help the compiler with memoization. I imagine something like this:

template<bool>
struct _dcaller {};

template<>
struct _dcaller<true> {
    template<template<class...> class expr, class... vals>
    using type = expr<vals...>;
};

template<template<class...> class expr, class... vals>
using dcall = typename _dcaller<sizeof...(vals) >= 0>::template type<expr, vals...>;

Whether dcall is SFINAE-friendly, I don't know, but I quickly reimplemented metal::bind in terms of dcall instead of _bind_impl to see if it would work, but unfortunately it doesn't pass SFINAE test cases. I'd encourage you to define unit tests for min_element upfront to help filtering out alternatives that are not viable.

example/src/list.cpp Outdated Show resolved Hide resolved
@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 12, 2019

Just wondering, what happens if we get rid of class... _ in the definition of combiner?

I had once introduced that pack in order to improve the performance by reducing the number of class-template instantiations. I wanted a single memoized instantiation of _dcaller to be reused for all _dcalls. At the same time, _dcall must provide SFINAE guarantees. The latter is achieved by making the _dcaller depend on the arguments. For performance reasons, I had tried to avoid x and y as template arguments of _dcaller because that causes many class-template instantiations. Instead, I introduced the argument dependence by disguising the actual number of arguments.

However, a benchmark has shown no significant differences in terms of compile time. That is why I have removed that empty pack. The new version (current version of this PR) uses the plain old technique that is used all over the place in metal: The _dcaller depends on all arguments of the expr. A single argument also suffices, but my benchmarks did not show any significant compile-time advantage either.

I suspect that the memory footprint is better with the empty-pack trick, but I did not have any issues with lists of size 500+.

Without measurable advantage I classified that empty-pack trick as premature optimization.

I think dcall may get even faster if you specialize _dcaller to help the compiler with memoization. I imagine something like this:

Although I have discarded that empty-pack trick, I will keep that specialization technique in mind! Is it important to provide a definition of the unspecialized class template, or can I also omit the {}?

Whether dcall is SFINAE-friendly, I don't know, but I quickly reimplemented metal::bind in terms of dcall instead of _bind_impl to see if it would work, but unfortunately it doesn't pass SFINAE test cases.

I observed that the sizeof... obfuscation does not work if the compiler sees the number of arguments too early. That is the reason why I had to introduce that empty pack in combiner (it was not sufficient to have a _binary_dcall with two arguments plus that empty pack).

Thanks for the hint to make use of existing unit tests.

I'd encourage you to define unit tests for min_element upfront to help filtering out alternatives that are not viable.

When optimizing my approaches, I used some hand-crafted tests in the examples, which you also discovered. I did see plenty of other attempts failing, so this one should not be super unfriendly to SFINAE :-).

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 13, 2019

Without measurable advantage I classified that empty-pack trick as premature optimization.

Just to back this up: here are the benchmarks using Clang 9 (see v11, v12, v13, and v14 in particular). Meanwhile, I have customized metabench to measure the "number of types" as reported by Clang with -Xclang -print-stats. Results do differ, but there are no major differences among v11..v14. I should add that I do not understand the output of -Xclang -print-stats (see sample here), and I just used the number from the second line in this quote:

*** AST Context Stats:
73 types total.

@brunocodutra
Copy link
Owner

Amazing work! Do consider contributing that to metaben.ch, it would be awesome to have a toggle button to switch between time and number of types.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 13, 2019

Do consider contributing that to metaben.ch,

I have opened an issue there in order to share that idea.

ecrypa pushed a commit to ecrypa/metal that referenced this pull request Oct 13, 2019
@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 13, 2019

I like how this is boiling down to a two-liner (after some crazy intermediate attempts).

Latest benchmarks of selected versions:

@brunocodutra
Copy link
Owner

I missed one of your questions last time.

Is it important to provide a definition of the unspecialized class template, or can I also omit the {}?

Without the {}, types are declared but not defined and I fear that may open a whole can of worms when it comes to ODR violations and undefined behavior, so I prefer to define everything.

@brunocodutra
Copy link
Owner

Benchmarks look pretty sweet! Those extra 30 ms on a list with 1000 elements definitely don't seem worth the pile of code that fast tracking requires.

I wonder if redefining metal::apply in a similar way to dcall would shave off some of those 30 ms, but either way that's definitely out of scope for this PR.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 15, 2019

Benchmarks look pretty sweet! Those extra 30 ms on a list with 1000 elements definitely don't seem worth the pile of code that fast tracking requires.

The most efficient solutions are the neat ones. One exception is the bulky solution v7.alias_fasttrack, which is efficient on clang, but inefficient on GCC.

I wonder if redefining metal::apply in a similar way to dcall would shave off some of those 30 ms, but either way that's definitely out of scope for this PR.

By that kind of optimization we can tweak the efficiency in the range of v11...v15. Fastest (and hardest to use) are v11 and v14, but that advantage is small and Clang-only. GCC 9 is equally fast with v11..v15.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 15, 2019

I just benchmarked two different approaches to obtain the index (instead of the value). The winner (currently in this PR) is v17.ind.find. That was surprising to me, because my favorite was v16.ind.accumulate. I should analyze if the huge difference originates from indices<seq> or from the accumulate itself.

See benchmark results for Clang 9 and GCC 9.

Other ideas how to obtain the index?

@brunocodutra
Copy link
Owner

Other ideas how to obtain the index?

Have you tried metal::transpose?

template<class lbd, class is_empty>
struct _min_element {};

template<class lbd>
struct _min_element<lbd, true_> {
    template<class>
    using type = number<0>;
};

template<class lbd>
struct _min_element<lbd, false_> {
    template<class state, class next>
    using custom_min = if_<apply<lbd, back<transpose<next, state>>>, next, state>;

    template<class seq>
    using type = front<apply<lambda<custom_min>, transpose<pair<indices<seq>, seq>>>>;
};

Note: not tested.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 16, 2019

For huge lists, it is slower than find (Clang 9, GCC 9). v19 is very close to your version, v18 is slightly optimized (but also slower than find for huge lists). Smaller lists are hard to measure.

@brunocodutra
Copy link
Owner

brunocodutra commented Oct 16, 2019

Could you stack this against a version that returns the value itself, so we can see the cost of recovering the index?

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 17, 2019

You can estimate the cost of recovering the index from the latest benchmarks. v11 is the fastest version, and it returns the value. v15 is my most favorite version, and it returns the value, too. Index return is implemented in v16.ind, v17.ind, v18.ind, and v19.ind only. The latter (v16..v19) are based on v15 (using invoke or apply). You can find the implementations in my fork's branch called benchmark/min-element; permalink to the current version is here.

Do we need better resolution for small lists?

@brunocodutra
Copy link
Owner

I missed the fact v11 was returning the value, thanks.

Do we need better resolution for small lists?

I don't think it's worth it, small lists are certainly not the bottleneck of a metaprogram.

@ecrypa ecrypa force-pushed the feature/min-element branch 2 times, most recently from 5d2f4e6 to 658bb5b Compare October 26, 2019 10:59
@ecrypa ecrypa marked this pull request as ready for review October 26, 2019 11:02
Copy link
Owner

@brunocodutra brunocodutra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for taking so long to review your changes, I've been really short on time lately.
I really appreciate the amount of work you have put into this so far, I'd be ready to merge if it were not for GCC 4.7 being silly.

This is a shot in the dark, but have you tried the following pattern instead? We follow in most places and for whatever reason it seems to work well for GCC 4.7, probably thanks to the extra indirections through detail::call. One of the examples for sort even uses the same predicate based on sizeof, which seems to be the reason why GCC 4.7 has a hard time with the example for min_element.

template<class seq, class lbd = metal::lambda<metal::less>>
using min_element = call<detail::_min_element<seq>::template type, lbd>;

template<class lbd>
struct _min_element_impl {
    template<class x, class y>
    using custom_min = if_<invoke<lbd, y, x>, y, x>;

    template<class... vals>
    using type = fold_left<lambda<custom_min>, vals...>;
};

template<class seq>
struct _min_element {};

template<class... vals>
struct _min_element<list<vals...>> {
    template<typename lbd>
    using type = call<_min_element_impl<lbd>::template type, vals...>
};

/// \see min, sort
#if !defined(METAL_WORKAROUND)
template<class seq, class lbd = metal::lambda<metal::less>>
using min_element = apply<
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A shot in the dark, but have you tried the following pattern instead? We follow in most places and for whatever reason it seems to work well for GCC 4.7. One of the examples for sort even uses the same smaller predicate based on sizeof.

template<class seq, class lbd = metal::lambda<metal::less>>
using min_element = call<detail::_min_element<seq>::template type, lbd>;

template<class lbd>
struct _min_element_impl {
    template<class x, class y>
    using custom_min = if_<invoke<lbd, y, x>, y, x>;

    template<class... vals>
    using type = fold_left<lambda<custom_min>, vals...>;
};

template<class seq>
struct _min_element {};

template<class... vals>
struct _min_element<list<vals...>> {
    template<typename lbd>
    using type = call<_min_element_impl<lbd>::template type, vals...>
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing your suggestion in #112. As an alternative, I am also testing 9d833d8 in this PR.

CHECK((metal::min_element<metal::list<ENUM(INC(N), NUMBERS FIX(INC(M)))>>), (metal::number<0>)); \
/**/

GEN(MATRIX)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test coverage!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is derived from the test of metal::sort.

It bothers me that the test is passing while the example fails on GCC 4.7.

ecrypa pushed a commit to ecrypa/metal that referenced this pull request Oct 27, 2019
@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 28, 2019

Good news is that 39a5465 passes the list example on GCC 4.7. Bad news is that it (v22) is slower than higher-order metafunction (v4) on GCC 9. Weirdly, Clang 9 is fastest with this fix.

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 28, 2019

Nevermind, I was looking at MSVC out of habit. GCC 4.7 still fails to compile the example. 👎

@ecrypa
Copy link
Contributor Author

ecrypa commented Oct 31, 2019

For some reason, GCC 4.7 chooses the overload value<> sfinae(...); for the min_element example. I turned it into a hard error and reduced it to a minimal example.

Can you give me any advices how to overcome that issue?

#define WORKAROUND_THAT_BREAKS_GCC_4_7
// removing the line above makes GCC 4.7 happy

template<
  template<template<class...> class...> class tmpl,
  template<class...> class... exprs>
struct _forwarder {
  using type = tmpl<exprs...>;
};

template<
  template<template<class...> class...> class tmpl,
  template<class...> class... exprs>
#if defined(WORKAROUND_THAT_BREAKS_GCC_4_7)
using forward = typename _forwarder<tmpl, exprs...>::type;
#else
using forward = tmpl<exprs...>;
#endif

struct _min_element_impl {
  template<template<class...> class>
  using type = long;
};

template<template<class...> class expr>
struct _min_element {
  using type = forward<_min_element_impl::template type, expr>;
};

template<class T>
using some_expr = T;

using some_val = _min_element<some_expr>::type;

some_val foo = 4;

@brunocodutra
Copy link
Owner

If GCC 4.7 is causing so much trouble, it may be time we let it go.
All it takes is updating Travis configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants