-
Notifications
You must be signed in to change notification settings - Fork 285
PTX shfl_sync
#3241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PTX shfl_sync
#3241
Conversation
🟩 CI finished in 1h 49m: Pass: 100%/170 | Total: 3d 02h | Avg: 26m 12s | Max: 1h 08m | Hits: 76%/22526
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 170)
| # | Runner |
|---|---|
| 125 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a quick review. I would love to have @ahendriksen's opinion, since it touches his work on the PTX exposure. Also, he has a way better PTX understanding than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have sent some comments in private as well. The data parameter should be a template parameter to allow shuffling any 32-bit value.
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
🟩 CI finished in 1h 37m: Pass: 100%/170 | Total: 2d 17h | Avg: 23m 17s | Max: 1h 05m | Hits: 82%/22529
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 170)
| # | Runner |
|---|---|
| 125 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
|
@ahendriksen @miscco I modified the return type and added the predicate as an output parameter in the last commit |
🟨 CI finished in 2h 59m: Pass: 98%/164 | Total: 3d 03h | Avg: 27m 26s | Max: 1h 13m | Hits: 434%/15316
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
| # | Runner |
|---|---|
| 122 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 12 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
🟨 CI finished in 2h 42m: Pass: 97%/164 | Total: 2d 16h | Avg: 23m 27s | Max: 1h 13m | Hits: 452%/17656
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
| # | Runner |
|---|---|
| 122 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 12 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 3h 06m: Pass: 100%/164 | Total: 2d 17h | Avg: 23m 56s | Max: 1h 06m | Hits: 458%/17656
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
| # | Runner |
|---|---|
| 122 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 12 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check for consistency with the PTX docs.
I see that there are some higher-level helper functions for shfl_sync as well. I am not opposed to including those. But we need a clean separation. @bernhardmgruber : what do you think would be good strategy here?
🟩 CI finished in 2h 13m: Pass: 100%/164 | Total: 2d 15h | Avg: 23m 15s | Max: 1h 07m | Hits: 452%/17656
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
| # | Runner |
|---|---|
| 122 | linux-amd64-cpu16 |
| 19 | linux-amd64-gpu-v100-latest-1 |
| 12 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding this and bearing with all the reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to trust @ahendriksen on the implementation. I would still like to see @miscco's opinion on the use of CCCL_ASSERT in the tests.
🟩 CI finished in 1h 24m: Pass: 100%/158 | Total: 2d 01h | Avg: 18m 57s | Max: 1h 11m | Hits: 86%/248208
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| } | ||
|
|
||
| auto res4 = cuda::ptx::shfl_sync_bfly(data, pred4, 2 /*offset*/, 0b11111 /*clamp*/, FullMask); | ||
| assert(res4 == threadIdx.x ^ 2 && pred4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion appears to be broken, assuming that the intent was to do res4 == (threadIdx.x ^ 2)
third_party/gpus/cccl/v3_1/libcudacxx/test/libcudacxx/cuda/ptx/ptx.shfl.compile.pass.cpp:48:30:
error: ^ has lower precedence than ==; == will be evaluated first [-Werror,-Wparentheses]
48 | assert(res4 == threadIdx.x ^ 2 && pred4);
| ~~~~~~~~~~~~~~~~~~~~^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same applies to other uses of ^ in this test file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect timing. I can add it to #6429
Related to #2976
Description
Provide C++ implementation of PTX
shfl_sync.In addition to CUDA intrinsics, the function provide the following features: