
[EPIC] Optimize thrust::transform for newer architectures #1947

Open · 5 of 19 tasks
bernhardmgruber opened this issue Jul 5, 2024 · 3 comments

Labels: cub (for all items related to CUB), thrust (for all items related to Thrust)

bernhardmgruber (Contributor) commented Jul 5, 2024:

Motivation
It is increasingly hard to reach SOL (speed of light, i.e. theoretical peak throughput) on newer GPU architectures, starting with A100 and H100, especially for simple kernels such as thrust::transform(..., thrust::plus{}), which basically load a few values and perform very little compute. CUB algorithms already counter this by processing several elements per thread, but internal research hints at the necessity to further increase the amount of data in flight.
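For reference, a complete minimal version of the thrust::plus example above (array sizes and values are illustrative):

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

int main() {
    const std::size_t n = 1 << 26;
    thrust::device_vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    // Two loads, one add, one store per element: almost no compute to hide memory latency with.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<float>{});
}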

Use case
thrust::transform is an important primitive for many algorithms and also occurs in BabelStream, a highly relevant HPC benchmark often used to produce representative numbers for comparing the performance of hardware architectures. We should therefore dedicate some effort to ensuring thrust::transform performs well.

Approach
The main strategy is to have more "bytes in flight" when reading, with the concrete amount depending on the target architecture (a tuning parameter). There are multiple ways to generate more loads. Again, internal research points to using either prefetching or the tensor memory accelerator (TMA, e.g. via memcpy_async) on newer architectures. Excessive unrolling and loading into registers works as well, but has the drawback of consuming a large number of registers on architectures requiring many bytes in flight.
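To make the staging option concrete, here is a rough, simplified sketch that stages one tile per block through shared memory with cooperative_groups::memcpy_async, using a hard-coded element-wise op and no tuning; it only illustrates the mechanism, not the eventual CUB implementation:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int tile_size = 256; // illustrative; the real value would be a per-architecture tuning parameter

// Assumes blockDim.x == tile_size, one element per thread.
__global__ void transform_staged(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float tile[tile_size];
    auto block = cg::this_thread_block();
    const int base = blockIdx.x * tile_size;
    if (base >= n) return; // whole block out of range
    const int valid = min(tile_size, n - base);

    // Issue one bulk asynchronous copy for the whole tile; depending on the
    // architecture and alignment this can lower to cp.async or TMA-style hardware paths.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * valid);
    cg::wait(block); // wait until the staged tile has arrived

    if (threadIdx.x < static_cast<unsigned>(valid))
        out[base + threadIdx.x] = tile[threadIdx.x] + 1.0f; // the user-provided op would run here
}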

Address stability
For the loading strategy, we also have to consider the address stability of data items. Users sometimes rely on the ability to recover the index into an input array from the reference to a loaded element:

transform(par, a, a + n, a, [a,b,c](const T& e) { 
    const auto i = &e - a;     // &e expected to point into global memory
    return e + b[i] + c[i];
});

Such a user-provided function object inhibits any optimization which loads elements from global memory into registers or shared memory before passing them as arguments, thus leaving prefetching as the only viable optimization. Address-oblivious function objects can benefit from a larger variety of optimizations (like TMA or pipelined loading into registers).
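For contrast, a sketch of an address-oblivious formulation of the same computation, taking the elements by value via a zip iterator and avoiding any pointer arithmetic, which leaves the implementation free to stage the data:

transform(par,
          make_zip_iterator(make_tuple(a, b, c)),
          make_zip_iterator(make_tuple(a + n, b + n, c + n)),
          a,
          [](tuple<T, T, T> t) {   // elements taken by value, no address taken
              return get<0>(t) + get<1>(t) + get<2>(t);
          });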

Further concerns
Furthermore, the computational intensity and the shared memory/register consumption of the user-provided function object influence the loading strategy. Longer computations seem to require more data in flight. Shared memory is contested between TMA and user-side computation. Register pressure limits unrolling.

Status quo
thrust::transform (CUDA) is currently built on top of cub::DeviceFor::Bulk, which eventually dispatches independently of the used data types or the number of input and output streams. Because cub::DeviceFor::Bulk is index-based, the involved input and output data streams are not visible and no tuning based on this information is possible. The situation is similar with cub::DeviceFor::ForEach et al.
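Schematically, the current lowering looks roughly like this (a simplified sketch, not the actual Thrust source): all stream information is hidden inside an opaque index-based functor by the time cub::DeviceFor::Bulk is reached.

#include <cub/device/device_for.cuh>

// All stream information is baked into the functor and invisible to CUB's tuning.
struct indexed_transform_op {
    const float* in1;
    const float* in2;
    float* out;

    __device__ void operator()(int i) const { out[i] = in1[i] + in2[i]; }
};

cudaError_t launch(const float* in1, const float* in2, float* out, int n, cudaStream_t stream) {
    // CUB only sees the number of items and an opaque callable over indices.
    return cub::DeviceFor::Bulk(n, indexed_transform_op{in1, in2, out}, stream);
}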

Strategy
I propose to add a new CUB algorithm cub::DeviceTransform governing transformations of N input streams into a single output stream (maybe M output streams if use cases arise) and to rebase thrust::transform on top of it.
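Purely as an illustration of the proposed shape (names and signature hypothetical, to be settled in the tasks below), such an algorithm could expose the input streams directly to the dispatch/tuning logic instead of hiding them inside an index-based functor:

// Hypothetical sketch of the call shape only:
cub::DeviceTransform::Transform(
    cuda::std::make_tuple(in1, in2),  // N input iterators, visible to tuning
    out,                              // single output iterator
    num_items,
    cuda::std::plus<>{},
    stream);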

Future tasks after merging cub::DeviceTransform

  1. (13 of 13 tasks) cub, thrust - bernhardmgruber
  2. cub - bernhardmgruber
  3. (6 of 6 tasks) thrust - bernhardmgruber
  4. cub - bernhardmgruber
  5. (1 of 3 tasks) thrust - bernhardmgruber
  6. bernhardmgruber
@bernhardmgruber bernhardmgruber added thrust For all items related to Thrust. cub For all items related to CUB labels Jul 5, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jul 5, 2024
@bernhardmgruber bernhardmgruber self-assigned this Jul 5, 2024
bernhardmgruber (Contributor, Author) commented:
It turns out the C++ standard does not guarantee address stability of function arguments passed to user-provided callables in the context of parallel algorithms. See:
* https://eel.is/c++draft/algorithms.parallel#user-1
* https://eel.is/c++draft/algorithms.parallel#exec-3
The only exceptions are for_each and for_each_n:
* exception for for_each: https://eel.is/c++draft/alg.foreach#9
* exception for for_each_n: https://eel.is/c++draft/alg.foreach#25
IIRC from a discussion with @gevtushenko, this is consistent with what Thrust/CUB should promise. However, I could not find any solid hints in our documentation. When changing this behavior, we may thus break users who accidentally relied on it (see also: Hyrum's Law).

@wmaxey wmaxey moved this from Todo to In Progress in CCCL Jul 17, 2024
bernhardmgruber (Contributor, Author) commented:
We discussed address stability again today and concluded the following:

  • Don't break users relying on stable parameter addresses, even if the C++ standard does not give that guarantee.
  • Provide an opt-in in Thrust, so users can communicate that they do not require stable addresses.
  • On top of that, try to detect whether users take parameters by value and, if they do, allow copying of the data (see the sketch after this list).
  • Update our documentation and examples to take parameters by value.
  • Provide a dual API in CUB, like cub::DeviceFor's, with address-stable and address-unstable functions.
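To illustrate the detection idea, here is a sketch of a trait that inspects a non-generic callable's call operator (generic lambdas and overloaded operator() would need different handling; the actual opt-in and detection mechanism is an open design question):

#include <type_traits>

// Deduce the first parameter type of a (non-generic) callable's operator().
template <class F>
struct first_param : first_param<decltype(&F::operator())> {};

template <class R, class C, class Arg, class... Rest>
struct first_param<R (C::*)(Arg, Rest...) const> { using type = Arg; };

template <class R, class C, class Arg, class... Rest>
struct first_param<R (C::*)(Arg, Rest...)> { using type = Arg; };

// A callable taking its element by value cannot observe the element's address,
// so the implementation is free to copy the data before invoking it.
template <class F>
inline constexpr bool takes_element_by_value =
    !std::is_reference_v<typename first_param<F>::type>;

// Usage sketch:
//   auto by_value = [](float x) { return x + 1.0f; };
//   auto by_ref   = [](const float& x) { return x + 1.0f; };
//   static_assert(takes_element_by_value<decltype(by_value)>);
//   static_assert(!takes_element_by_value<decltype(by_ref)>);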

bernhardmgruber (Contributor, Author) commented:
Address stability: I just encountered this in the tests on my A6000. If we use a kernel serving parameters from shared memory and the user performs pointer arithmetic with a pointer to global memory, the kernel crashes (Release build) and the following error code is reported at the next cudaDeviceSynchronize:

717 (operation not supported on global/shared address space)

That's at least better than a garbage result and the kernel continuing with wrong data.
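For reference, a minimal sketch of the failure mode (kernel and array names hypothetical): the element is served from shared memory while the user's pointer arithmetic assumes it lives in global memory.

// Assumes blockDim.x == 256.
__global__ void transform_from_smem(const float* a, const float* b, float* out, int n) {
    __shared__ float staged[256];
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) staged[threadIdx.x] = a[idx];   // element is staged in shared memory
    __syncthreads();
    if (idx >= n) return;
    const float& e = staged[threadIdx.x];        // reference now points into shared, not global, memory
    const auto i = &e - a;                       // pointer arithmetic across address spaces
    out[idx] = e + b[i];                         // reportedly fails with error 717 (Release build, A6000)
}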
