Motivation
It's becoming increasingly hard to reach SOL (speed of light) on newer GPU architectures, starting with A100 and H100, especially for simple kernels like thrust::transform(..., thrust::plus{}), which load a few values and perform little compute. CUB algorithms already counter this by processing several elements per thread, but internal research suggests the amount of data in flight needs to be increased further.
Use case
thrust::transform is an important primitive for many algorithms and also occurs in BabelStream, a highly relevant HPC benchmark often used to produce representative numbers for comparing the performance of hardware architectures. We should therefore dedicate some effort to ensuring that thrust::transform performs well.
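For reference, the pattern in question is essentially the BabelStream "add" kernel expressed through Thrust (a minimal sketch; sizes and values are illustrative):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cstddef>

int main()
{
  const std::size_t n = 1 << 28;
  thrust::device_vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
  // reads two values and writes one per element, with a single addition as "compute"
  thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<float>{});
  return 0;
}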
Approach
The main strategy is to have more "bytes in flight" when reading, with the concrete amount depending on the target architecture (a tuning parameter). There are multiple ways to generate more loads. Again, internal research points to using either prefetching or the tensor memory accelerator (TMA, e.g. via memcpy_async) on newer architectures. Excessive unrolling and loading into registers works as well, but has the drawback of consuming a large number of registers on architectures that require many bytes in flight.
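To illustrate the memcpy_async flavor (a rough sketch only, with a hypothetical kernel and tile size; a real implementation would also pipeline tiles and pick the tile size per architecture), a block could stage its tile in shared memory before applying the function object:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// hypothetical sketch: stage one tile per block via the async-copy hardware,
// then transform it; launched with gridDim.x == ceil(n / TileSize)
template <int TileSize, class T, class F>
__global__ void transform_tile_sketch(const T* in, T* out, int n, F f)
{
  __shared__ T tile[TileSize];
  auto block = cg::this_thread_block();
  const int base  = blockIdx.x * TileSize;
  const int valid = min(TileSize, n - base);

  cg::memcpy_async(block, tile, in + base, sizeof(T) * valid); // bulk copy in flight
  cg::wait(block);                                             // consume only after arrival

  for (int i = threadIdx.x; i < valid; i += blockDim.x)
    out[base + i] = f(tile[i]);
}

TMA on Hopper is programmed somewhat differently, but the structural idea of keeping larger asynchronous copies in flight is the same.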
Address stability
For the loading strategy, we also have to consider the address stability of data items. Users sometimes rely on being able to recover an element's index inside an input array from the reference of a loaded element:
transform(par, a, a + n, a, [a, b, c](const T& e) {
  const auto i = &e - a; // &e expected to point into global memory
  return e + b[i] + c[i];
});
Such a user-provided function object inhibits any optimization that loads elements from global memory into registers or shared memory before passing them as arguments, leaving prefetching as the only applicable optimization. Address-oblivious function objects can benefit from a larger variety of optimizations (like TMA or pipelined loading into registers).
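For contrast, the same computation can be written address-obliviously by zipping the inputs (reusing the names from the snippet above; requires thrust/iterator/zip_iterator.h and thrust/zip_function.h), which leaves the implementation free to serve the arguments from registers or shared memory:

auto in = thrust::make_zip_iterator(thrust::make_tuple(a, b, c));
transform(par, in, in + n, a,
          thrust::make_zip_function([](const T& e, const T& x, const T& y) {
            return e + x + y; // no pointer arithmetic, no assumption about &e
          }));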
Further concerns
Furthermore, the computational intensity and the shared memory/register consumption of the user-provided function object influence the loading strategy: longer computations seem to require more data in flight, shared memory is contested between TMA and the user-side computation, and register pressure limits unrolling.
Status quo
thrust::transform (CUDA) is currently built on top of cub::DeviceFor::Bulk, which eventually dispatches independently of the data types used or the number of input and output streams. Because cub::DeviceFor::Bulk is index-based, the involved input and output data streams are not visible, and no tuning based on this information is possible. The situation is similar with cub::DeviceFor::ForEach et al.
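Schematically, the current lowering looks roughly like the following (a simplified sketch, not the literal Thrust code path; iterator and op names are placeholders): everything the algorithm sees is a count of indices and an opaque function object, with the actual iterators hidden inside the lambda capture.

// simplified sketch of the status quo: cub::DeviceFor::Bulk only sees indices
// and an opaque op, so it cannot tune for the types or number of data streams
auto op = [=] __device__(int i) { out[i] = f(in1[i], in2[i]); };
cub::DeviceFor::Bulk(num_items, op, stream);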
Strategy
I propose to add a new CUB algorithm cub::DeviceTransform governing transformations of N input streams into a single output stream (maybe M output streams if use cases arise) and to rebase thrust::transform on top of it.
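One possible shape of such an interface (purely illustrative; names, parameter order, and the tuple type are not meant to be final) would take the input streams as a tuple, so the dispatch layer can see their value types and choose a per-architecture loading strategy:

// illustrative sketch only: N typed input iterators, one output iterator, and
// the transformation op are all visible to the dispatch/tuning layer
cub::DeviceTransform::Transform(
  cuda::std::make_tuple(in1, in2),                     // N input streams
  out,                                                 // single output stream
  num_items,
  [] __device__(float x, float y) { return x + y; });  // transformation op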
Address stability: I just encountered this in the tests on my A6000. If we use a kernel serving parameters from shared memory and the user performs pointer arithmetic with a pointer to global memory, the kernel crashes (Release build), and the following error code is reported at the next cudaDeviceSynchronize:
717 (operation not supported on global/shared address space)
That's at least better than a garbage result and the kernel continuing with wrong data.
Future tasks after merging cub::DeviceTransform
- cub::DeviceTransform #2091
- thrust::transform to use cub::DeviceTransform #2263
- thrust::transform #2717
- num_items to cub::DeviceTransform #2714