
[FEA]: Make CUB block algorithms usable with cuda::std::array #1877

Open
1 task done
pauleonix opened this issue Jun 18, 2024 · 14 comments · Fixed by #2380
Labels
feature request New feature or request.

Comments

@pauleonix
Contributor

pauleonix commented Jun 18, 2024

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

#1764 deprecated cub::ArrayWrapper in favor of cuda::std::array without providing a replacement for users using it with CUB's block algorithms that take C-style arrays. cub::ArrayWrapper provided access to its C-style array member, but cuda::std::array doesn't (because std::array doesn't either). Inside CUB this is solved by accessing cuda::std::array's .__elems_ (an implementation detail), which is not a good solution for users.

Describe the solution you'd like

Replace the C-style arrays in CUB's interfaces with cuda::std::span. Use fbusato's minimal "concepts" for array-like types as proposed in #2286.

Describe alternatives you've considered

cuda::std::span<T, N> seems like the right candidate for the interfaces but with T being a template type in the CUB algorithm this would still need a duplication of the interfaces to stay backward compatible.

As C-style arrays decay to pointers when returned from functions, a simple function like cub::to_array(cuda::std::array) seems impossible. It could be done with a macro that accesses .__elems_ but macros have their own problems.

Adding an API to cuda::std::array that is not available for std::array is probably against libcu++'s principles.

Adding a whole new cuda::array for access to the C-style array member seems over the top although it might still be easier to maintain than adding overloads taking cuda::std::array parameters to all CUB block algorithms. (Edit: This can be solved using cuda::std::span for the interface as pointed out by miscco below)

I am currently using my own array wrapper to avoid plain C-style arrays in my code when interfacing with CUB's block algorithms.

Additional context

No response

@pauleonix pauleonix added the feature request New feature or request. label Jun 18, 2024
@miscco
Collaborator

miscco commented Jun 18, 2024

Thanks for opening the discussion.

Sorry for dropping ArrayWrapper; it was an internal type we considered no longer in use.

Regarding your request to expand the API of cuda::std::array:

I am not in favor of that, because the actual problem is that the algorithm interface is suboptimal, not that cuda::std::array is lacking an API. Instead of adding a wrapper class to work around a suboptimal interface, we should consider improving the interface of the CUB algorithms to take e.g. a cuda::std::span.

@gevtushenko what is your opinion here? I am not deep enough into the cub algorithms to estimate the effort needed to change their API.

@pauleonix
Contributor Author

pauleonix commented Jun 18, 2024

@miscco As mentioned under the PR, I wasn't a user of ArrayWrapper, although I would have been had I known about it.

I agree that changing the algorithms' interface would be ideal, but it is either a breaking change or blows up the number of functions in the API.

I'm not sure span is ideal here, because CUB seems to make users use local arrays by design (to avoid performance blunders), which the more flexible span would change.

I just wanted to also consider what would be a minimal, non-breaking change (i.e. a conversion macro or a cuda::array/cub::array with .c_array which certainly both have pros and cons as well).

@pauleonix
Contributor Author

Ah, I just realized that replacing C-style arrays with spans in the interface would probably be non-breaking as well. My point about that being a significant design change still stands though.

@bernhardmgruber
Contributor

bernhardmgruber commented Jun 18, 2024

the actual problem is that the algorithm interface is suboptimal and not that cuda::std::array is lacking an API.

That actually nails the issue. I agree!

I am not that deep into the cub algorithms that I could estimate the effort needed to change their API

The CUB interfaces taking references to native arrays are widespread, but the refactoring should be straightforward, since native arrays passed into the APIs would just convert to spans. Also, spans can be indexed like native arrays, so the changes should be fairly contained.

I agree that changing the algorithms' interface would be ideal, but it is either a breaking change or blows up the number of functions in the API.

I think neither breaking change nor API duplication is needed. Native arrays should convert nicely:

#include <span>
void foo_old(int (&arr)[4]) {}
void foo_new(std::span<int, 4> arr) {}
int main() {
    int arr[4]{1, 2, 3, 4};
    foo_old(arr);
    foo_new(arr);
}

I'm not sure span is ideal here, because CUB seems to make users use local arrays by design (to avoid performance blunders), which the more flexible span would change.

If a span<int, 4> does not perform as fast as an int [4], we have a performance bug in the implementation. It should lead to the same generated code. It may take a bit more compile time though.

@pauleonix
Contributor Author

@bernhardmgruber I meant that users could pass spans that point to e.g. global memory. Although my hunch that this wasn't (easily) possible with the current API might be wrong.

@pauleonix
Contributor Author

I would hope compilers are able to put local arrays into registers even when they are accessed through a span (assuming everything is inlined).

@pauleonix
Contributor Author

@bernhardmgruber I fear using span would not be as non-breaking as we expected. Implicit conversion from a C-style array to a span<T, N> only works when T is a concrete type, as in your example; it fails when T has to be deduced, because implicit conversions are not considered during template argument deduction.

This version fails to compile:

#include <span>
template <typename T>
void foo_old(T (&arr)[4]) {}
template <typename T>
void foo_new(std::span<T, 4> arr) {}
int main() {
    int arr[4]{1, 2, 3, 4};
    foo_old(arr);
    foo_new(arr); // error: no matching function, T cannot be deduced
}

See also this StackOverflow answer. So to avoid a breaking change, one would have to duplicate the APIs, where the old interface would explicitly convert to span and call the new interface.

@fbusato
Contributor

fbusato commented Sep 5, 2024

Please also consider a solution based on minimal "concepts" for array-like types. #2286 proposes a similar approach.
This approach works with raw arrays, std::array, std::span, std::mdspan, std::vector, etc.

@pauleonix
Contributor Author

pauleonix commented Sep 5, 2024

@fbusato I like the suggestion. The only downside I see is that the interface is less clear to the reader. But maybe calling the template type something like ArrayLike instead of Input is enough to make it readable.

Other than that, I only see the same issue as with taking span: it allows users to shoot themselves in the foot performance-wise by passing spans that don't correspond to registers/local memory. Or is there a way to check this at compile time as well? Either way, I don't think this downside is significant enough not to do it.

@miscco
Collaborator

miscco commented Sep 5, 2024

I mean the obviously correct solution is to constrain the algorithms with std::ranges::contiguous_range

https://godbolt.org/z/T6ehMf158

The issue with that is that we only backport ranges to C++17, which ... is not C++11

@pauleonix
Contributor Author

pauleonix commented Sep 5, 2024

@miscco I was rather thinking of someone using statically sized spans pointing to global memory instead of first using cub::BlockLoad or similar, thereby causing non-coalesced access.

@bernhardmgruber
Contributor

@miscco I was rather thinking of someone using statically sized spans pointing to global memory instead of first using cub::BlockLoad or similar, thereby causing non-coalesced access.

You can also do that today if you have a statically sized array in global memory: it can be passed straight to a CUB agent taking a reference to such an array.

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Sep 6, 2024
This allows accepting more data types beyond arrays of exact static sizes.

No SASS changes on CUB device histogram test with CTK 12.6.

Fixes NVIDIA#1877 for AgentHistogram
@pauleonix
Contributor Author

pauleonix commented Sep 6, 2024

@bernhardmgruber Yeah, but having a global buffer of arrays is a rather rare occurrence, especially given that vector types exist. And using casts to achieve the same is hopefully off the beaten path enough not to happen accidentally.

Still, I'm a fan of more flexibility. Maybe there are even some very creative, proper use cases that are enabled, or rather simplified, by this interface change. I have been wondering whether ranges::contiguous_range is actually stricter than necessary, as it does not accept e.g. something using range adaptors. In terms of performance it probably does not matter whether I pass such a view or put the results into a (register) array first (i.e. it's not as important as for device algorithms, where it enables kernel fusion). But in terms of simple code it could be quite nice to be able to pass views, and it would make code using STL ranges:: algorithms and CUB block algorithms look more similar.

Edit: I guess the main problem with that idea is that views from range adaptors don't have static size even if they could?

@bernhardmgruber
Contributor

I am reopening this issue because there are more CUB agents that take references to statically sized arrays. Also, our workaround of reaching into the guts of ::cuda::std::array to pass data to the histogram agents (introduced because we deprecated cub::ArrayWrapper) has now been removed.
