Make thrust::transform use cub::DeviceTransform #2389
Conversation
Force-pushed from 66b7312 to 885b8e8.
```cpp
thrust::transform(
  c.begin(), c.end(), b.begin(), cuda::std::allow_copied_arguments([=] __device__ __host__(const T& ci) {
    return ci * scalar;
  }));
```
This is what users are expected to write. They need to wrap their callables in cuda::allow_copied_arguments(...) or specialize cuda::allows_copied_arguments.
Can you elaborate on the conditions under which cuda::std::allow_copied_arguments will be required? Is it for peak performance?
It is exciting that the functions I am using in my own code are going to speed up just like that, simply by updating CCCL.
The sad answer for now is that you need to wrap your callable in allow_copied_arguments every time you call a Thrust API that may use thrust::transform underneath and your callable does not rely on the addresses of its arguments. We have some ideas on how to detect this automatically for a set of known types and functors, e.g. thrust::plus<int>, so this will automatically enable at least some use cases.
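To make that concrete, here is a minimal sketch using the names discussed in this thread (the wrapper was later renamed to cuda::proclaim_copyable_arguments in this PR, and the header below is an assumption):

```cpp
// Sketch only: allow_copied_arguments is the spelling used in this thread; the
// header and exact namespace are assumptions.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
// #include <cuda/functional> // assumed location of allow_copied_arguments

void add_vectors(thrust::device_vector<int>& a,
                 thrust::device_vector<int>& b,
                 thrust::device_vector<int>& c)
{
  // Today: the user explicitly proclaims that thrust::plus<int> does not care
  // about the addresses of its arguments, enabling the bulk-copy fast path.
  thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                    cuda::std::allow_copied_arguments(thrust::plus<int>{}));

  // Envisioned: known functors like thrust::plus<int> are detected as
  // address-oblivious automatically, so the plain call below would also take
  // the fast path.
  thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<int>{});
}
```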
Is it due to the use of shared memory? If you take the address, it won't be the same as the original data?
Exactly. cub::DeviceTransform on Hopper+ will use an asynchronous bulk copy from global to shared memory and then run the transformation function on the data in shared memory. This generally leads to higher saturation of the memory bandwidth. But if your callable has a const T& value parameter, that reference will now point to shared memory instead of global memory, so &value may not be what you expect.
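For contrast, a hedged sketch of a callable that does depend on its argument's address and therefore must not be proclaimed address-oblivious (the functor and member names are illustrative):

```cpp
#include <cstddef>

// Illustrative only: this functor derives an index from the address of its
// argument, so it relies on `x` living inside the original global-memory
// buffer. Under the bulk-copy path, &x would point into shared memory and the
// subtraction below would be meaningless; such a callable must not be wrapped.
struct index_of
{
  const float* base; // start of the input buffer in global memory

  __host__ __device__ std::size_t operator()(const float& x) const
  {
    return static_cast<std::size_t>(&x - base);
  }
};
```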
When the user uses const T value as the callable argument, the address cannot be taken anymore. So can the fast-path optimization be enabled when the callable argument is passed by value?
If const T value happens to be const std::reference_wrapper<U> value, or const thrust::device_reference<U> value, or any other proxy reference type, and your buffer contains Us, you can still recover the address of the Us in global memory even though your function syntactically takes its arguments by copy.
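A hedged sketch of that situation (the functor name is made up; cuda::std::reference_wrapper stands in for any proxy reference type):

```cpp
#include <cuda/std/functional> // cuda::std::reference_wrapper

// Illustrative only: the parameter is taken by value, but it is a proxy
// reference, so &r.get() still yields the address of the original element in
// global memory. The callable is therefore address-dependent despite its
// by-value signature, and the fast path cannot be enabled for it by default.
struct touches_original_element
{
  int* base; // start of the buffer in global memory

  __host__ __device__ bool operator()(cuda::std::reference_wrapper<int> r) const
  {
    return &r.get() == base;
  }
};
```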
So you want to change it in such a way that even for those types, the fast path can be used to dereference the reference_wrapper object? If not, why is the fast path not enabled by default when const std::reference_wrapper<U> is the callable argument, so that std::reference_wrapper<U> is copied via the fast path and the dereferencing can be done inside the user's kernel?
I want to understand if there is any disadvantage if the user has a kernel that takes const T value and the user always wraps it with cuda::std::allow_copied_arguments vs not wrapping with it.
> So you want to change it in such a way that even for those types, the fast path can be used to dereference the reference_wrapper object?

No, I cannot enable the fast path for by-value arguments like std::reference_wrapper<U> or thrust::reference<U>.

> I want to understand if there is any disadvantage if the user has a kernel that takes const T value and the user always wraps it with cuda::std::allow_copied_arguments vs not wrapping with it.

If T is cheap to copy, or the compiler is able to inline your function object, then there is no downside to using const T value. Wrapping your callable in cuda::std::allow_copied_arguments also has no downside.
Force-pushed from 86d12c4 to ef846de.
Force-pushed from ef846de to 08b32dc.
Force-pushed from 43ce459 to c662a2f.
We discussed this PR in the code review hour and concluded that we are fine with putting the address stability traits into libcu++ and the …
🟨 CI finished in 7h 41m: Pass: 99%/417 | Total: 7d 07h | Avg: 25m 11s | Max: 2h 03m | Hits: 82%/39381
Force-pushed from 18cd7ea to 7839465.
Benchmark of … However, the 1.23% slowdown seems tolerable given the other improvements.
🟩 CI finished in 11h 59m: Pass: 100%/433 | Total: 8d 02h | Avg: 26m 54s | Max: 1h 12m | Hits: 76%/41615
It seems we are blocked on: #2402
Make thrust::transform use cub::DeviceTransform: force-pushed from e2c0dca to 9cf075f.
🟩 CI finished in 4h 14m: Pass: 100%/394 | Total: 8d 19h | Avg: 32m 10s | Max: 1h 37m | Hits: 55%/25800
Force-pushed from 15d7373 to b902ff8.
Added benchmark on H200 (PR description). Looking good except our fibonacci transformation:
🟩 CI finished in 1h 35m: Pass: 100%/394 | Total: 8d 00h | Avg: 29m 15s | Max: 1h 14m | Hits: 70%/25848
```cpp
  static constexpr bool allows_copied_arguments = true;
};

//! Creates a new function object from an existing one, allowing its arguments to be copies of whatever source they come
```
Should this use doxygen keywords?
```diff
-//! Creates a new function object from an existing one, allowing its arguments to be copies of whatever source they come
+//! @brief Creates a new function object from an existing one, allowing its arguments to be copies of whatever source they come
```
I don't know, should it? I intended a simple block of documentation for this function, so I figured I would not need any Doxygen commands.
Force-pushed from b902ff8 to 9eed1ed.
🟩 CI finished in 2h 26m: Pass: 100%/394 | Total: 8d 05h | Avg: 30m 01s | Max: 1h 26m | Hits: 59%/25850
```
@@ -18,6 +18,10 @@ Function wrapper
   - Creates a forwarding call wrapper that proclaims return type
   - libcu++ 1.9.0 / CCCL 2.0.0 / CUDA 11.8

 * - ``cuda::proclaim_copyable_arguments``
   - Creates a forwarding call wrapper that proclaims that arguments can be freely copied before invocation the wrapped callable
```
```diff
-  - Creates a forwarding call wrapper that proclaims that arguments can be freely copied before invocation the wrapped callable
+  - Creates a forwarding call wrapper that proclaims that arguments can be freely copied before invocation of the wrapped callable
```
For the U32 2^20 fibonacci benchmark, we are launching 2048 blocks before and 683 blocks after. Achieved occupancy decreased by 36% (although theoretical occupancy did not). H200 has 144 SMs, which results in 113.7 (before) and 37.9 (after) warps per SM. The latter is smaller than the maximum number of resident warps per SM (48). Playing around a bit, I now assume the slowdown could be caused by the work division, especially items per thread and the distribution of blocks to SMs. I made a few attempts at tuning how we compute items per thread. Diff of the proposed implementation with the new items-per-thread logic on H200 with 132 SMs:

Alright, the fix best fitting into the existing work division model is:

```diff
  const int items_per_thread_evenly_spread =
-   static_cast<int>(::cuda::std::min(Offset{items_per_thread}, num_items / (config->sm_count * block_dim)));
+   static_cast<int>(::cuda::std::min(Offset{items_per_thread}, num_items / (config->sm_count * block_dim * config->max_occupancy)));
```

which maximizes occupancy for small problem sizes. That is, if the computed items per thread needed to sustain peak bytes in flight would lead to too few blocks to fill, or evenly fill, all SMs, we rather process less data per block and generate more blocks to max out the device. ncu is also indicating something like that in the PM Sampling: we can see that before, we launched a lot more blocks, but the SMs were maxed out.
> Should this use …

I was actually more leaning towards … I am going to benchmark the new version soon, so I can also include …

Well, it's rather inconclusive. Normal division is about 1% faster on H100, and about 1% slower on H200. I will leave the code as is then.
Force-pushed from 02b18f5 to 089504e.
I think I am happy with the result. I only violate the "no regressions of more than 2% compared to the previous implementation on 2^24+ problem sizes" rule that @gevtushenko gave me a while back, and only for the fibonacci benchmark. We need 2.5%:
🟩 CI finished in 2h 22m: Pass: 100%/394 | Total: 8d 06h | Avg: 30m 14s | Max: 1h 26m | Hits: 61%/25850
* Introduces address stability detection and opt-in in libcu++
* Mark lambdas in Thrust BabelStream benchmark address oblivious

Fixes: NVIDIA#2263
Force-pushed from a018e7c to 52586c9.
Talked offline with @gevtushenko who is fine with the results.
🟩 CI finished in 1h 55m: Pass: 100%/394 | Total: 8d 00h | Avg: 29m 17s | Max: 1h 23m | Hits: 61%/25866
* Add transform benchmark requiring a stable address
* Make thrust::transform use cub::DeviceTransform
* Introduces address stability detection and opt-in in libcu++
* Mark lambdas in Thrust BabelStream benchmark address oblivious
* Optimize prefetch cub::DeviceTransform for small problems

Fixes: NVIDIA#2263
This PR makes thrust::transform use cub::DeviceTransform for the CUDA backend. It also:

* optimizes the cub::DeviceTransform prefetch algorithm for small problem sizes

Benchmark on H100
Benchmark on H200

Resolve before:

* cub::DeviceTransform … #2086
* cub::DeviceTransform … #2396 (optional, but big gains)
* … (cub::DeviceFor)