
[EPIC] RAPIDS Should not need to patch CCCL #1939

Open · 1 task
jrhemstad opened this issue Jul 3, 2024 · 9 comments

jrhemstad (Collaborator) commented Jul 3, 2024

RAPIDS currently has to apply a variety of patches to CCCL source code to work around various issues.

This issue is to track eliminating the need for RAPIDS to apply any patches to CCCL source code.

To start, for each patch we need a separate issue created with the following information:

  • Link to patch file
  • Summary of what the patch is for
  • Link to PR that added patch and any other relevant context
  • High-level description of the proposed solution, or TBD

All issues should be added to this task list:

Tasks

trxcllnt (Member) commented Jul 3, 2024

The patches we apply to all RAPIDS libraries:

The patches we apply exclusively to cuDF:

bdice (Contributor) commented Jul 7, 2024

I made some progress here. The patch reverting PR 211 is actually masking a bug in cuSpatial, which I fixed. This patch can be removed from rapids-cmake and cuDF. See these PRs I just opened. They should merge in this order:

  1. Fix multipolygon geometry iterator. rapidsai/cuspatial#1402
  2. Remove CCCL patch for PR 211. rapidsai/cudf#16207
  3. Remove CCCL patch for PR 211. rapidsai/rapids-cmake#640

Next steps would be to investigate the cuDF-only patches, specifically the macros that reduce compile time (64-bit dispatch and fewer CUB arch policies).
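[For context on the 64-bit dispatch patch: Thrust normally branches at runtime between 32-bit and 64-bit index types, which means both template instantiations get compiled into the binary; cuDF's patch removes the branch and always uses 64-bit offsets. The following is a minimal standalone sketch of that pattern, not Thrust's actual dispatch macro; scan_kernel_stub and the dispatch function names are hypothetical.]

#include <cstdint>
#include <cstdio>
#include <limits>

// One "kernel" instantiation per offset type; stands in for a real
// __global__ entry point.
template <typename OffsetT>
void scan_kernel_stub(OffsetT num_items)
{
    std::printf("compiled for %zu-byte offsets, n = %lld\n",
                sizeof(OffsetT), static_cast<long long>(num_items));
}

// Unpatched pattern: branch at runtime, but *both* instantiations are
// compiled into the binary.
void dispatch_both(std::int64_t num_items)
{
    if (num_items <= std::numeric_limits<std::int32_t>::max())
        scan_kernel_stub(static_cast<std::int32_t>(num_items)); // 32-bit path
    else
        scan_kernel_stub(num_items); // 64-bit path
}

// Patched pattern: always dispatch on 64-bit offsets, halving the
// number of kernel instantiations.
void dispatch_64bit_only(std::int64_t num_items)
{
    scan_kernel_stub(num_items);
}

int main()
{
    dispatch_both(1000);
    dispatch_64bit_only(1000);
}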

bdice mentioned this issue Jul 8, 2024
jrhemstad (Collaborator, Author) commented Jul 8, 2024

> Next steps would be to investigate the cuDF-only patches, specifically the macros that reduce compile time (64-bit dispatch and fewer CUB arch policies).

I just filed: #1958

For fewer CUB arch policies, I'm still a bit confused, to be honest. My understanding of ChainedPolicy in CUB is that this patch should have no impact on compile time, because CUB only instantiates kernels for the architectures you build for. I need @gevtushenko to look at that patch because he knows ChainedPolicy far better than I do.
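[To make that reasoning concrete, here is a paraphrased, host-compilable sketch of the ChainedPolicy mechanism, simplified from the version in CUB's util_device.cuh; kPtxArch stands in for __CUDA_ARCH__, and the two policy structs with their items_per_thread values are invented for illustration.]

#include <cstdio>
#include <type_traits>

// Stand-in for the architecture a translation unit is compiled for
// (__CUDA_ARCH__ in real device code).
constexpr int kPtxArch = 700;

template <int PolicyPtxVersion, typename PolicyT, typename PrevPolicyT>
struct ChainedPolicy
{
    // Compile-time selection: only ActivePolicy is used when generating
    // device code for kPtxArch, no matter how long the chain is.
    using ActivePolicy =
        std::conditional_t<(kPtxArch < PolicyPtxVersion),
                           typename PrevPolicyT::ActivePolicy, PolicyT>;

    // Host-side selection: a runtime fallback chain over the device's PTX
    // version. Trimming the chain only removes these host branches.
    template <typename FunctorT>
    static void Invoke(int ptx_version, FunctorT& op)
    {
        if (ptx_version < PolicyPtxVersion)
            PrevPolicyT::Invoke(ptx_version, op);
        else
            op.template invoke<PolicyT>();
    }
};

// Chain terminator: the oldest policy points at itself.
template <int PolicyPtxVersion, typename PolicyT>
struct ChainedPolicy<PolicyPtxVersion, PolicyT, PolicyT>
{
    using ActivePolicy = PolicyT;

    template <typename FunctorT>
    static void Invoke(int /*ptx_version*/, FunctorT& op)
    {
        op.template invoke<PolicyT>();
    }
};

// Hypothetical tuning policies, newest last.
struct Policy520 : ChainedPolicy<520, Policy520, Policy520>
{
    static constexpr int items_per_thread = 15;
};
struct Policy600 : ChainedPolicy<600, Policy600, Policy520>
{
    static constexpr int items_per_thread = 17;
};
using MaxPolicy = Policy600;

struct PrintPolicy
{
    template <typename ActivePolicyT>
    void invoke()
    {
        std::printf("items_per_thread = %d\n", ActivePolicyT::items_per_thread);
    }
};

int main()
{
    // Device-code view: resolved entirely at compile time for kPtxArch.
    static_assert(std::is_same_v<MaxPolicy::ActivePolicy, Policy600>);

    PrintPolicy op;
    MaxPolicy::Invoke(700, op); // host dispatch picks Policy600
    MaxPolicy::Invoke(520, op); // host dispatch falls back to Policy520
}

[Because ActivePolicy is resolved at compile time per architecture, pruning older links from the chain removes only host-side Invoke branches, which is consistent with the measurement in the next comment.]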

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jul 9, 2024
While upgrading CCCL, we ran into a test failure in cuSpatial. We added a patch to revert some changes from CCCL but the root cause was a bug in cuSpatial. I have fixed that bug here: rapidsai/cuspatial#1402

Once that PR is merged, we can remove this CCCL patch.

See also:
- rapids-cmake patch removal: rapidsai/rapids-cmake#640
- Original rapids-cmake patch: rapidsai/rapids-cmake#511
- CCCL epic to remove RAPIDS patches: NVIDIA/cccl#1939

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)

URL: #16207
rapids-bot bot pushed a commit to rapidsai/rapids-cmake that referenced this issue Jul 9, 2024
While upgrading CCCL, we ran into a test failure in cuSpatial. We added a patch to revert some changes from CCCL but the root cause was a bug in cuSpatial. I have fixed that bug here: rapidsai/cuspatial#1402

Once that PR is merged, we can remove this CCCL patch.

See also:
- #511
- NVIDIA/cccl#1939

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)

URL: #640
@bdice
Copy link
Contributor

bdice commented Jul 9, 2024

All three PRs above have merged, which should fix up the RAPIDS CCCL devcontainer builds. The remaining cuDF patches should apply cleanly over the current CCCL. To continue removing those cuDF patches, I think #1958 is the next step.

gevtushenko (Collaborator) commented
@jrhemstad thrust_faster_scan_compile_times.diff should not affect build times and can be deleted. cudf is compiled for Volta+. The mentioned patch essentially shifts the head of the chained policy to cut off older architectures. This saves a few instantiations of host functions but doesn't affect device code in any way. Since device code generation dominates compilation time, this patch doesn't lead to compile-time improvements. Here's one data point to demonstrate this:

With patch (chained policy starts at 600):

$ cat _deps/cccl-src/cub/cub/device/dispatch/tuning/tuning_scan.cuh | grep "ChainedPolicy<600"
      , ChainedPolicy<600, Policy600, Policy600>
$ for i in {1..10}; do echo "// comment " >> ../src/reductions/scan/scan_exclusive.cu; ninja > /dev/null; python ../scripts/sort_ninja_log.py .ninja_log | grep exclusive | awk -F',' '{print $1}' >> patched.csv; done; awk '{sum += $1} END {print sum/NR}' patched.csv
3407.3
$ cuobjdump --dump-sass CMakeFiles/cudf.dir/src/reductions/scan/scan_exclusive.cu.o | grep DeviceScanKernel | wc -l
480

Without patch (chained policy starts at 520):

$ cat _deps/cccl-src/cub/cub/device/dispatch/tuning/tuning_scan.cuh | grep "ChainedPolicy<600"
      , ChainedPolicy<600, Policy600, Policy520>
$ for i in {1..10}; do echo "// comment " >> ../src/reductions/scan/scan_exclusive.cu; ninja > /dev/null; python ../scripts/sort_ninja_log.py .ninja_log | grep exclusive | awk -F',' '{print $1}' >> unpatched.csv; done; awk '{sum += $1} END {print sum/NR}' unpatched.csv
3403.9
$ cuobjdump --dump-sass CMakeFiles/cudf.dir/src/reductions/scan/scan_exclusive.cu.o | grep DeviceScanKernel | wc -l
480

As you can see, the build-time difference is within the noise level, and the number of kernels doesn't change either. I believe the situation shouldn't be any different for reduction, but to be on the safe side, I'd recommend repeating the experiment for radix sort.
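[Concretely, per the grep output above, the whole patch amounts to changing the PrevPolicyT link on the newest entry of the chain. A sketch reusing the ChainedPolicy and Policy520 types from the earlier comment; CUDF_PATCHED is a made-up switch standing in for the applied diff.]

// Reusing the ChainedPolicy/Policy520 types from the sketch above.
#ifdef CUDF_PATCHED
// Patched: the chain terminates at SM 6.0; host-side fallback branches
// for older devices disappear.
struct Policy600 : ChainedPolicy<600, Policy600, Policy600>
#else
// Unpatched: the host dispatch can fall back to the SM 5.2 policy.
struct Policy600 : ChainedPolicy<600, Policy600, Policy520>
#endif
{
    static constexpr int items_per_thread = 17;
};
// Either way, ActivePolicy for any compiled architecture >= 6.0 is
// Policy600, so the generated SASS (480 DeviceScanKernel instances in the
// measurement above) is identical.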

jrhemstad (Collaborator, Author) commented
@robertmaynard or @davidwendt, would one of you mind checking whether removing this patch from cudf impacts compile time?

davidwendt commented Jul 17, 2024

Reference this patch file: https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/cmake/thirdparty/patches/thrust_faster_scan_compile_times.diff
I ran several build tests on libcudf, and the radix-sort and reduce patches are definitely still needed.
Removing the tuning_scan.cuh patch showed a smaller impact on build time, but still enough to be flagged by the Build Metrics Report, which highlights any time changes greater than 20%.
Also, the size of libcudf.so increased by about 2% without the tuning_scan patch.
I could probably be convinced to remove this one patch, but since we really want the other two, I'm not sure why I would at this time.

jrhemstad (Collaborator, Author) commented
> I'm not sure why I would at this time.

Because CCCL currently builds RAPIDS from source as part of CCCL's CI. That CI job is broken right now because internal changes within Thrust are breaking RAPIDS' patches. We want to reduce the surface area of RAPIDS' patches as quickly as possible.

robertmaynard (Contributor) commented
> > I'm not sure why I would at this time.
>
> Because CCCL currently builds RAPIDS from source as part of CCCL's CI. That CI job is broken right now because internal changes within Thrust are breaking RAPIDS' patches. We want to reduce the surface area of RAPIDS' patches as quickly as possible.

A rapids-cmake patch failing to apply isn't a CMake error. This is in fact currently leveraged in libcudf 24.06, where we apply two sets of patches to support different CCCL versions.

If I look at a recent CCCL PR (#2006), I see the RAPIDS cudf builds completing without issue. In the logs we see:

2024-07-18T21:41:11.4476761Z -- Found libcudacxx: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/libcudacxx-config.cmake (found suitable version "2.6.0.0", minimum required is "2.6.0.0")
2024-07-18T21:41:11.4490667Z -- Found Thrust: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/thrust/thrust/cmake/thrust-config.cmake (found suitable exact version "2.6.0.0")
2024-07-18T21:41:11.4510794Z -- Found CUB: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/cub/cub/cmake/cub-config.cmake (found suitable version "2.6.0.0", minimum required is "2.6.0.0")
2024-07-18T21:41:11.4532772Z -- Found CCCL: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/lib/cmake/cccl/cccl-config.cmake (found version "2.6.0.0")
2024-07-18T21:41:11.4581525Z -- rapids-cmake [CCCL]: failed to apply diff thrust_disable_64bit_dispatching.diff
rapids-cmake [CCCL]: git diff output: error: patch failed: thrust/thrust/system/cuda/detail/dispatch.h:44
2024-07-18T21:41:11.4582554Z -- error: thrust/thrust/system/cuda/detail/dispatch.h: patch does not apply
2024-07-18T21:41:11.4583282Z -- 
2024-07-18T21:41:11.4584169Z -- rapids-cmake [CCCL]: applied diff thrust_faster_sort_compile_times.diff to fix issue: 'Improve Thrust sort compile times by not unrolling loops for inlined comparators [https://github.com/rapidsai/cudf/pull/10577]'
2024-07-18T21:41:11.4585668Z -- rapids-cmake [CCCL]: applied diff thrust_faster_scan_compile_times.diff to fix issue: 'Improve Thrust scan compile times by reducing the number of kernels generated [https://github.com/rapidsai/cudf/pull/8183]'
2024-07-18T21:41:11.5233800Z -- CPM: Using local package [email protected]
2024-07-18T21:41:11.5260202Z -- CPM: Using local package [email protected]

While we failed to apply the thrust_disable_64bit_dispatching patch, the CMake process continued without issue.
