Implement the new tuning API for `DeviceTransform` #6914

bernhardmgruber · 2025-12-08T17:57:22Z

Fixes: #6919
Fixes: #5057
Fixes: #3017

Compile time of cub.test.device.transform.lid_0 using nvcc 13.0 and clang 20 for sm86, sm120:

branch:
1m49.900s
1m50.615s
1m50.255s

main:
1m56.917s
1m57.378s
1m59.371s

Compile time of cub.test.device.transform.lid_0 for sm86, sm120 using clang 20 in CUDA mode:

branch:
real 1m40.627s
real 1m40.675s
real 1m40.912s

main:
real 1m39.273s
real 1m39.669s
real 1m39.835s

copy-pr-bot · 2025-12-08T17:57:26Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


c/parallel/src/transform.cu

cub/cub/device/dispatch/kernels/kernel_transform.cuh

miscco · 2025-12-10T08:35:31Z

cub/cub/device/dispatch/kernels/kernel_transform.cuh

+#if _CCCL_HAS_CONCEPTS()
+  requires transform_policy_hub<ArchPolicies>
+#endif // _CCCL_HAS_CONCEPTS()


Nitpick: I believe we should either use the concept emulation or plain SFINAE in C++17 too.

Hmm. We could also static_assert, but ArchPolicies is already used in the kernel attributes before we reach the body. And a static_assert would only be evaluated in the device path.

How would I write that using concept emulation and have the concept check before the __launch_bounds__?

We could write:

Suggested change

-#if _CCCL_HAS_CONCEPTS()
-  requires transform_policy_hub<ArchPolicies>
-#endif // _CCCL_HAS_CONCEPTS()
+_CCCL_TEMPLATE(typename PolicySelector,
+               typename Offset,
+               typename Predicate,
+               typename F,
+               typename RandomAccessIteratorOut,
+               typename... RandomAccessIteratorsIn)
+_CCCL_REQUIRES(transform_policy_selector<PolicySelector>)

Yeah, but as discussed on Slack before, we would need to get transform_policy_selector and then policy_selector working, which we couldn't because of the is_constant_expression check. Let's leave it.
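For readers unfamiliar with the "plain SFINAE in C++17" option mentioned in this thread, here is a minimal sketch in standard C++ only. It does not use the CCCL macros, and `arch_id`, `policy`, `is_policy_selector`, `block_threads_for`, `good_selector` and `not_a_selector` are all hypothetical stand-ins, not the names used in the PR. The constraint sits in the template parameter list, so it is checked before anything after the parameter list (such as function attributes) is instantiated.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for illustration only (not the CCCL types).
enum class arch_id { sm_60 = 60, sm_86 = 86, sm_120 = 120 };
struct policy { int block_threads; int items_per_thread; };

// Detection idiom: true if a `const S` can be called with an arch_id
// and the call returns exactly `policy`.
template <typename S, typename = void>
struct is_policy_selector : std::false_type {};

template <typename S>
struct is_policy_selector<S, std::void_t<decltype(std::declval<const S&>()(arch_id{}))>>
    : std::is_same<decltype(std::declval<const S&>()(arch_id{})), policy> {};

// C++17 SFINAE: the constraint lives in the template parameter list,
// so overload resolution rejects non-selectors up front.
template <typename Selector,
          std::enable_if_t<is_policy_selector<Selector>::value, int> = 0>
constexpr int block_threads_for(Selector selector, arch_id arch) {
  return selector(arch).block_threads;
}

// A selector with made-up tuning values.
struct good_selector {
  constexpr policy operator()(arch_id arch) const {
    return arch == arch_id::sm_120 ? policy{512, 8} : policy{256, 4};
  }
};

// A type without a call operator, rejected by the constraint.
struct not_a_selector {};

static_assert(is_policy_selector<good_selector>::value);
static_assert(!is_policy_selector<not_a_selector>::value);
static_assert(block_threads_for(good_selector{}, arch_id::sm_120) == 512);
```

Note that this sketch only checks the call signature. It does not check that the selector can be evaluated at compile time, which is the part the thread identifies as the blocker.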

cub/cub/device/dispatch/kernels/kernel_transform.cuh

cub/cub/device/dispatch/tuning/tuning_transform.cuh

miscco · 2025-12-10T08:45:51Z

cub/cub/device/dispatch/tuning/tuning_transform.cuh

+    bool all_inputs_contiguous                  = true;
+    bool all_input_values_trivially_reloc       = true;
+    bool can_memcpy_contiguous_inputs           = true;
+    bool all_value_types_have_power_of_two_size = ::cuda::is_power_of_two(output.value_type_size);
+    for (const auto& input : inputs)
+    {
+      all_inputs_contiguous &= input.is_contiguous;
+      all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable;
+      // the vectorized kernel supports mixing contiguous and non-contiguous iterators
+      can_memcpy_contiguous_inputs &= !input.is_contiguous || input.value_type_is_trivially_relocatable;
+      all_value_types_have_power_of_two_size &= ::cuda::is_power_of_two(input.value_type_size);
+    }


Nitpick: While the single loop is technically more efficient, I believe it would improve readability if we did

const bool all_inputs_contiguous = ::cuda::std::all_of(inputs.begin(), inputs.end(), [](const auto& input) { return input.is_contiguous; });

Can I do this later? Maybe we have std::ranges::all_of by then.
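To make the trade-off concrete, here is a minimal host-only sketch of both styles. It uses `std::all_of` from the standard library rather than `::cuda::std::all_of`, and `iterator_info`, `input_summary` and the two `summarize_*` functions are hypothetical names; only the three member names are taken from the diff above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of one input iterator; member names follow the diff.
struct iterator_info {
  std::size_t value_type_size;
  bool is_contiguous;
  bool value_type_is_trivially_relocatable;
};

// Hypothetical bundle of the flags computed from all inputs.
struct input_summary {
  bool all_inputs_contiguous;
  bool all_input_values_trivially_reloc;
  bool can_memcpy_contiguous_inputs;
};

inline bool operator==(const input_summary& a, const input_summary& b) {
  return a.all_inputs_contiguous == b.all_inputs_contiguous
      && a.all_input_values_trivially_reloc == b.all_input_values_trivially_reloc
      && a.can_memcpy_contiguous_inputs == b.can_memcpy_contiguous_inputs;
}

// Style used in the PR: one pass over the inputs, updating every flag.
inline input_summary summarize_with_loop(const std::vector<iterator_info>& inputs) {
  input_summary s{true, true, true};
  for (const auto& input : inputs) {
    s.all_inputs_contiguous &= input.is_contiguous;
    s.all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable;
    s.can_memcpy_contiguous_inputs &=
      !input.is_contiguous || input.value_type_is_trivially_relocatable;
  }
  return s;
}

// Style suggested in the review: one named algorithm call per flag (one pass each).
inline input_summary summarize_with_all_of(const std::vector<iterator_info>& inputs) {
  const bool contiguous = std::all_of(inputs.begin(), inputs.end(),
    [](const auto& input) { return input.is_contiguous; });
  const bool reloc = std::all_of(inputs.begin(), inputs.end(),
    [](const auto& input) { return input.value_type_is_trivially_relocatable; });
  const bool can_memcpy = std::all_of(inputs.begin(), inputs.end(),
    [](const auto& input) {
      return !input.is_contiguous || input.value_type_is_trivially_relocatable;
    });
  return input_summary{contiguous, reloc, can_memcpy};
}
```

Both functions return the same flags. The `all_of` version walks the inputs once per flag, which is the efficiency cost the nitpick acknowledges.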

cub/cub/device/dispatch/tuning/tuning_transform.cuh

bernhardmgruber · 2025-12-11T11:43:11Z

I see tiny changes in the generated SASS for cub.bench.transform.babelstream.base, notably in the fill kernels (no inputs) for complex<float>. The compiler now generates STG.E.ENL2.256, which it didn't do before.

The fill kernel for int128 seems to have degraded from generating STG.E.128 to a lot more STG.E.

All kernels with a functor marked as __callable_permitting_copied_arguments show no changes. That's good.

It feels a bit like the items per thread changed for the fill kernels.

bernhardmgruber · 2025-12-11T13:00:43Z

It feels a bit like the items per thread changed for the fill kernels.

They did. Previously, we had a tuning policy for sm_120 that was not taken into account :D This PR now uses it.

bernhardmgruber · 2025-12-11T13:19:45Z

I disabled the sm120 fill policy and now the only SASS diff for filling is on:

void cub::_V_300300_SM_1200::detail::transform::transform_kernel<cub::_V_300300_SM_1200::detail::transform::policy_hub<false, true, cuda::std::__4::tuple<cuda::__4::counting_iterator<long, 0, 0>>, unsigned long*>::policy1000, long, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cuda::__4::counting_iterator<long, 0, 0>>(long, int, bool, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cub::_V_300300_SM_1200::detail::transform::kernel_arg<cuda::__4::counting_iterator<long, 0, 0>>)

which is a thrust::tabulate of a counting_iterator<long> and an unsigned long*.

bernhardmgruber · 2025-12-11T16:45:54Z

Found the final issue with the fill kernels. I disabled the vectorized tunings when we have input streams (they were tuned for output-only use cases). The SASS of cub.bench.transform.fill.base now matches the baseline on sm120.

gevtushenko · 2026-01-14T00:52:01Z

cub/cub/util_device.cuh

+concept policy_selector = requires(T hub, ::cuda::arch_id arch) {
+  requires ::cuda::std::regular<Policy>;
+  { hub(arch) } -> _CCCL_CONCEPT_VSTD::same_as<Policy>;
+  { __needs_a_constexpr_value(hub(arch)) };


question: what was the intention for __needs_a_constexpr_value here? Do you want to check if hub's operator() can return compile-time information? If so, I don't think this works. Let's add a test for a non-constexpr operator() to cover this. For instance, the following type satisfies this concept:

struct policy_selector_all {
  auto operator()(arch_id id) const -> a_policy {
    int r = rand() % 5;
    return a_policy{static_cast<arch_id>(r)};
  }
};

Maybe something along the following lines would fix this:

template <auto>
inline constexpr bool __needs_a_constexpr_value = true;
// ...
requires __needs_a_constexpr_value<T{}(arch_id{})>;

While fixing this I got this error and I am massively impressed:

/home/bgruber/dev/cccl/lib/cmake/cub/../../../cub/cub/device/dispatch/tuning/tuning_reduce.cuh(474): error: expression must have a constant value
  static_assert(__needs_a_constexpr_value2<policy_selector{}(::cuda::arch_id::sm_60)>);
  ^
/home/bgruber/dev/cccl/lib/cmake/cub/../../../cub/cub/util_arch.cuh(150): note #61-D: integer operation result is out of range
  ::cuda::std::clamp(nominal_4B_items_per_thread * 4 / target_type_size, 1, nominal_4B_items_per_thread * 2);
  ^
/home/bgruber/dev/cccl/lib/cmake/cub/../../../cub/cub/device/dispatch/tuning/tuning_reduce.cuh(440): note #2693-D: called from:
  auto [scaled_items, scaled_threads] = scale_mem_bound(threads_per_block, items_per_thread, accum_size);
  ^

The concept check did actually validate whether the computation of the tuning policy was sound. It fails here because a default-constructed policy selector like reduce::policy_selector{} now has a bunch of zero-valued data members, like accum_size, leading to the division by zero later. So we cannot test from just the type of a policy selector whether it returns a policy at compile time. I will drop the constexpr test from the concept.
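The behavior described in this thread can be reproduced in a small standalone sketch. It is a C++17 approximation using a detection trait instead of the concept and the `template <auto>` variable from the suggestion, and every name in it (`arch_id`, `policy`, `require_constant`, `returns_policy_at_compile_time`, and the three selectors with their values) is hypothetical. The idea is the same as in the suggestion: the selector call is forced into a template argument, so it must be a constant expression.

```cpp
#include <cstdlib>
#include <type_traits>

// Hypothetical stand-ins for illustration only (not the CCCL types).
enum class arch_id { sm_60 = 60, sm_86 = 86, sm_120 = 120 };
struct policy { int block_threads; int items_per_thread; };

// A template argument must be a constant expression, so naming
// require_constant<(expr, 0)> forces `expr` to be evaluated at compile time.
template <int>
struct require_constant {};

template <typename S, typename = void>
struct returns_policy_at_compile_time : std::false_type {};

template <typename S>
struct returns_policy_at_compile_time<
  S,
  std::void_t<require_constant<(static_cast<void>(S{}(arch_id::sm_86)), 0)>>>
    : std::true_type {};

// Selector that can be evaluated at compile time.
struct constexpr_selector {
  constexpr policy operator()(arch_id) const { return policy{256, 4}; }
};

// Selector that depends on runtime state, like the rand() example above.
struct runtime_selector {
  policy operator()(arch_id) const { return policy{256, 1 + std::rand() % 5}; }
};

// Selector whose result depends on a data member that is zero by default.
struct scaled_selector {
  int accum_size = 0;
  constexpr policy operator()(arch_id) const {
    return policy{256, 16 / accum_size};
  }
};

static_assert(returns_policy_at_compile_time<constexpr_selector>::value);
static_assert(!returns_policy_at_compile_time<runtime_selector>::value);
// A default-constructed scaled_selector divides by zero, so the check fails
// even though a properly initialized instance works at compile time.
static_assert(!returns_policy_at_compile_time<scaled_selector>::value);
static_assert(scaled_selector{4}(arch_id::sm_86).items_per_thread == 4);
```

The last two assertions mirror the conclusion above: a check that default-constructs the selector from its type alone rejects selectors that are perfectly usable at compile time once their data members are set.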

I was btw doing this investigation while I had Cursor (with claude-4.5-opus-high) trying to fix it as well. Cursor was fast in figuring out how to run the tests, but got really lost when the first two attempts of changing the concept definition didn't fix the problem.

gevtushenko

Excited to see the new tuning machinery at work! Code is much more readable now and we no longer have to parse PTX 🎉

cub/benchmarks/bench/transform/common.h

cub/benchmarks/bench/transform/babelstream.cu

cub/cub/device/dispatch/tuning/tuning_transform.cuh

gevtushenko · 2026-01-15T01:13:29Z

c/parallel/src/transform.cu

+  build_ptr->cache                      = new transform::cache();
+


important: to my understanding, we no longer have to parse the policy from the PTX. Let's drop this.

This is for caching the items per thread we computed based on the occupancy in the transform dispatch, which involves a bunch of CUDA API calls and that's why we cache it. In ordinary CUB we can do this in a static variable, since there is one such variable for each template instantiation of the CUB algo, and each template instantiation with a distinct set of types may have a different config. In CCCL.C, we type-erase the iterators so the caching mechanism has to work differently, which is handled by the above cache. In my understanding, this is still needed.
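The difference between the two caching schemes can be sketched as follows. This is a simplified host-only illustration, not the code from CUB or CCCL.C: `query_items_per_thread`, `expensive_query_calls`, `cache`, `build_result` and both accessor functions are hypothetical, and the constant 8 stands in for the result of the occupancy-based computation.

```cpp
#include <optional>

// Counts how often the (hypothetical) expensive query runs.
inline int expensive_query_calls = 0;

// Stand-in for the occupancy-based computation that needs CUDA API calls.
inline int query_items_per_thread() {
  ++expensive_query_calls;
  return 8; // made-up value
}

// Templated algorithm: each instantiation gets its own function-local static,
// so the query runs once per distinct set of template arguments.
template <typename Iterator>
int items_per_thread_templated() {
  static const int cached = query_items_per_thread();
  return cached;
}

// Type-erased algorithm: there is only one function for all iterator types,
// so the cached value has to live in an object owned by each build.
struct cache {
  std::optional<int> items_per_thread;
};

struct build_result {
  cache* cache_ptr; // hypothetical counterpart of build_ptr->cache
};

inline int items_per_thread_type_erased(build_result& build) {
  if (!build.cache_ptr->items_per_thread) {
    build.cache_ptr->items_per_thread = query_items_per_thread();
  }
  return *build.cache_ptr->items_per_thread;
}
```

With a single static variable in the type-erased function, two builds with different iterator types would share one cached value, which is why the per-build cache object is needed in this sketch.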

github-actions · 2026-01-15T06:27:00Z

😬 CI Workflow Results

🟥 Finished in 6h 01m: Pass: 98%/133 | Total: 7d 08h | Max: 6h 01m | Hits: 52%/174121


github-project-automation bot added this to CCCL Dec 8, 2025

github-project-automation bot moved this to Todo in CCCL Dec 8, 2025

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 8, 2025

bernhardmgruber force-pushed the tuning_transform branch from 4244463 to 43feb21 Compare December 8, 2025 22:44


bernhardmgruber marked this pull request as ready for review December 9, 2025 07:44

bernhardmgruber requested review from a team as code owners December 9, 2025 07:44

bernhardmgruber requested review from fbusato and gevtushenko December 9, 2025 07:44

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Dec 9, 2025

bernhardmgruber force-pushed the tuning_transform branch from fca1221 to 2aade5f Compare December 9, 2025 08:03