Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: CUB device_transform breaks nvc++ -stdpar #2402

Closed
1 task done
Tracked by #2263 ...
dkolsen-pgi opened this issue Sep 10, 2024 · 4 comments · Fixed by #2664
Closed
1 task done
Tracked by #2263 ...

[BUG]: CUB device_transform breaks nvc++ -stdpar #2402

dkolsen-pgi opened this issue Sep 10, 2024 · 4 comments · Fixed by #2664
Assignees
Labels
bug Something isn't working right.

Comments

@dkolsen-pgi
Copy link
Collaborator

Is this a duplicate?

Type of Bug

Compile-time Error

Component

CUB

Describe the bug

PR #2086 breaks stdexec example nvexec.launch when compiled with NVC++. Compilation fails with unhelpful errors such as error: namespace "cooperative_groups" has no member "thread_block_tile". @ericniebler

PR #2086 added two new files to the CUB headers. One of them, cub/device/dispatch/dispatch_transform.cuh, which is indirectly included from cub/cub.cuh, contains #include <cooperative_groups.h>. The header <cooperative_groups.h> is entirely wrapped by an #if defined(__cplusplus) && defined(__CUDACC__) block. When compiling with nvc++ -stdpar=gpu, the macro __CUDACC__ is not defined, so <cooperative_groups.h> is a no-op. Subsequent attempts to use stuff from the cooperative_groups namespace fail with undefined identifiers.

This doesn't break NVC++'s stdpar parallel algorithms yet because nothing in the parallel algorithm implementation includes cub/cub.cuh or cub/device/device_transform.cuh. But that will change if thrust::transform is changed to use the new CUB transform algorithms. I would like to get this fixed before that happens, when the impact of this bug is still small.

I don't know the correct way to fix this. Some possibilities are:

  1. Change <cooperative_groups.h> to work with nvc++ -stdpar. (CUB would still need to deal with the issue as long as a CUDA Toolkit without the cooperative groups change is still supported.)
  2. Change CUB to not use cooperative groups.
  3. Change CUB to not use cooperative groups when being compiled with 'nvc++ -stdpar'.
  4. Change cub/cub.cuh to not include <cub/device/device_transform.cuh>. Any code that wants to use the new CUB transform algorithms needs to include <cub/device/device_transform.cuh> explicitly. (This then pushes the problem to Thrust, which would need to adopt option 2 or 3.)

All the options have tradeoffs, and I don't know how best to balance those tradeoffs.

How to Reproduce

Though first noticed by stdexec example nvexec.launch, which includes <cub/cub.cuh>, it can be reproduced with a much smaller test, with NVC++ that uses the latest main branch of CCCL.

$ cat cg.cpp
#include <cub/cub.cuh>
int main() { }
$ nvc++ --c++20 -stdpar cg.cpp
"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: expected a ">"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                          ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: expected a ";"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                                    ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 68: error: namespace "cooperative_groups" has no member "coalesced_group"
  struct _async_copy_group_supported<cooperative_groups::coalesced_group> : public _CG_STL_NAMESPACE::true_type {};
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 70: error: namespace "cooperative_groups" has no member "thread_block"
  struct _async_copy_group_supported<cooperative_groups::thread_block> : public _CG_STL_NAMESPACE::true_type {};
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 70: error: class "cooperative_groups::__v1::details::_async_copy_group_supported<<error-type>>" has already been defined (previous definition at line 68)
  struct _async_copy_group_supported<cooperative_groups::thread_block> : public _CG_STL_NAMESPACE::true_type {};
         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                       ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: expected a ">"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                                        ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: expected a ";"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                                                 ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                       ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: expected a ">"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                        ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: expected a ";"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                                  ^

12 errors detected in the compilation of "cg.cpp".

Expected behavior

It should be possible to use CUB with nvc++ -stdpar without errors.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@dkolsen-pgi dkolsen-pgi added the bug Something isn't working right. label Sep 10, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Sep 10, 2024
@bernhardmgruber bernhardmgruber self-assigned this Sep 10, 2024
@bernhardmgruber
Copy link
Contributor

Hi! I am sorry this causes a breakage for nvc++. I didn't know that cooperative_groups are not supported on nvc++. I hope we can detect such a breakage sooner, e.g. when nvc++ CI jobs land #1488.

Since I am leaving for parental leave very soon, the only quick solution I see is

  1. Change cub/cub.cuh to not include <cub/device/device_transform.cuh>.

and then figure out how we can proceed later.

@bernhardmgruber
Copy link
Contributor

Discussed with @jrhemstad, who is going to follow-up on this for the short term.

@bernhardmgruber
Copy link
Contributor

bernhardmgruber commented Oct 30, 2024

I discussed this briefly with @jrhemstad yesterday and we would like to fix cooperative groups in the long run (option 1). However, this may still take a while. In the meantime, once #2396 is merged, we can disable the ublkcp kernel that uses cooperative groups when compiling with nvc++ (option 3). The prefetch implementation should work with nvc++ and also deliver solid runtime improvements.

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Oct 30, 2024
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Oct 30, 2024
@bernhardmgruber
Copy link
Contributor

I could reproduce and workaround the issue by disabling CG and the ublkcp kernel:

~/cccl $ cat cg.cpp 
#include <cub/cub.cuh>
int main() {}
~/cccl $ nvc++ -Icub -Ithrust -Ilibcudacxx/include --c++20 -stdpar cg.cpp
~/cccl $

That's the extent to which I could test CUB with nvc++.

@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Oct 30, 2024
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Nov 4, 2024
fbusato pushed a commit to fbusato/cccl that referenced this issue Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working right.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants