Conversation

@Aminsed Aminsed commented Nov 2, 2025

Summary

  • keep ThreadReduce accumulator types pinned to the block value type across BlockScan and BlockReduce
  • apply the same accumulator fix to the raking specialization so all paths use the intended type
  • add a regression test that exercises BlockScan with a functor returning a wider type

Motivation

Issue #5668 shows that BlockScan widens the accumulator when the scan functor returns a wider type than the block value type. That implicit widening breaks user code that relies on the original type and can even select deleted overloads.
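To make the failure mode concrete, here is a hypothetical sketch (the type and its operators are illustrative, not taken from the issue) of how a widened accumulator can select a deleted overload:

  // Hypothetical user type: assignment from the wider type is deleted,
  // so a value implicitly widened to long long fails to compile.
  struct checked_int
  {
    int value;
    checked_int& operator=(int v) { value = v; return *this; }
    checked_int& operator=(long long) = delete; // selected once the accumulator widens
  };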

Explanation

ThreadReduce was deducing its accumulator type from the functor's return type instead of the block value type T. The patch explicitly instantiates ThreadReduce with AccumT = T everywhere BlockScan and BlockReduce dispatch through it, including the raking specialization. The new unit test exercises an operator that returns long long for int inputs and verifies that the accumulator remains int.
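A minimal sketch of the pattern the regression test exercises, assuming a 128-thread block and an illustrative widening_sum operator (the names and block size are mine, not necessarily the actual test's):

  #include <cub/block/block_scan.cuh>

  // Scan operator that takes two ints but returns long long: the shape
  // that previously made ThreadReduce deduce a long long accumulator.
  struct widening_sum
  {
    __device__ long long operator()(int a, int b) const
    {
      return static_cast<long long>(a) + b;
    }
  };

  __global__ void scan_kernel(const int* in, int* out)
  {
    using BlockScan = cub::BlockScan<int, 128>;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int item = in[threadIdx.x];
    // With the fix, the accumulator stays int even though the operator
    // returns a wider type.
    BlockScan(temp_storage).InclusiveScan(item, item, widening_sum{});
    out[threadIdx.x] = item;
  }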

Rationale

  • Minimal surface area: the change touches only the ThreadReduce call sites; public APIs and template parameters stay the same.
  • Consistent behavior: every BlockScan reduction path now uses the same accumulator type, avoiding divergent code paths.
  • Regression coverage: the new Catch2 test guards against future regressions triggered by wider-returning operators.

Testing

  • pre-commit run --files cub/cub/block/block_scan.cuh cub/cub/block/block_reduce.cuh cub/cub/block/specializations/block_reduce_raking_commutative_only.cuh cub/test/catch2_test_block_scan.cu

@Aminsed Aminsed requested review from a team as code owners November 2, 2025 17:27
@Aminsed Aminsed requested review from fbusato and pciolkosz November 2, 2025 17:27
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 2, 2025
copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 2, 2025
@Aminsed Aminsed force-pushed the fix-blockscan-accum branch from bc27d15 to c122fad on November 2, 2025 17:31
  // Reduce partials
- T partial = cub::ThreadReduce(inputs, reduction_op);
+ T partial =
+   cub::ThreadReduce<::cuda::std::remove_reference_t<decltype(inputs)>, ReductionOp, T, T>(inputs, reduction_op);
Contributor
This looks like a partial solution: it could regress for small integer types, for example a reduction/scan over int8_t. It is better to perform the computation in 32-bit and cast back at the end.

Contributor Author
Good catch. I dropped the explicit T and taught ThreadReduce to keep its __accumulator_t promotion, so int8_t still widens to 32-bit.
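For context, a simplified, hypothetical sketch of that promotion (CUB's internal __accumulator_t machinery is more involved than this trait):

  #include <cuda/std/type_traits>

  // Integral types narrower than int accumulate in 32-bit; everything
  // else keeps the block value type T.
  template <typename T>
  using promoted_accum_t =
    ::cuda::std::conditional_t<::cuda::std::is_integral_v<T> && (sizeof(T) < sizeof(int)), int, T>;

  static_assert(::cuda::std::is_same_v<promoted_accum_t<signed char>, int>); // int8_t widens to 32-bit
  static_assert(::cuda::std::is_same_v<promoted_accum_t<int>, int>);         // int stays int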

{
  // Reduce partials
- T partial = cub::ThreadReduce(inputs, ::cuda::std::plus<>{});
+ T partial = cub::ThreadReduce<::cuda::std::remove_reference_t<decltype(inputs)>, ::cuda::std::plus<>, T, T>(
+   inputs, ::cuda::std::plus<>{});
Contributor
nit: please isolate the first template parameter with a using alias to improve readability.
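For reference, the suggested alias would have looked something like this (hypothetical, since the call site was later reverted to a plain ThreadReduce call):

  // Isolate the deduced input type behind an alias for readability.
  using InputsT = ::cuda::std::remove_reference_t<decltype(inputs)>;
  T partial = cub::ThreadReduce<InputsT, ::cuda::std::plus<>, T, T>(inputs, ::cuda::std::plus<>{});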

Contributor Author
@fbusato Thanks! I reverted that spot to plain ThreadReduce(inputs, …), so there’s nothing left to alias. Let me know if you’d still like a using helper there.

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 3, 2025
@Aminsed Aminsed force-pushed the fix-blockscan-accum branch from 85c2484 to 0bcd084 on November 4, 2025 02:52