Skip to content

Conversation

@griwes
Copy link
Contributor

@griwes griwes commented Dec 18, 2025

Description

This PR adds {lower,upper}_bound device algorithms to both CUB and c.parallel.

In the case of CUB, the implementation is very straightforward and directly follows current implementation in Thrust (which I have cleaned up as a drive by change).

In the case of c.parallel, because of how CUB's for_each passes in kernel arguments, repeating the slight madness of the current for operator construction for the for_each algorithm itself felt beyond annoying, but I needed a kernel pointer; so instead of reusing the kernels available in CUB, I adapted the static-block-size for_each kernel to accept all the necessary arguments as separate kernel arguments, then construct the for operator expected by CUB inside the kernel, and finally invoke the CUB for_each agent with that operator. So, it's a manually constructed kernel, but it reuses both the agent code and binary search helper types from CUB.

Resolves #6695

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners December 18, 2025 06:13
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 18, 2025
@griwes griwes requested a review from elstehle December 18, 2025 06:13
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 18, 2025
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Comment on lines +59 to +60
const unsigned int thread_count = 256;
const size_t items_per_block = 512;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these essentially hardcoded somewhere in the CUB as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There's no tunings for for, and this just follows that.

I'm planning to start working on a warp level binary search algorithm in the new year, and then build a device wide one on top of that, as a replacement of the current approach - we'll do actual tunings then.

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C side looks good to me. Thanks!

@github-actions

This comment has been minimized.

@griwes griwes enabled auto-merge (squash) December 19, 2025 22:51
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

{
pushed = try_push_context();
auto exec_status =
Invoke(d_data, num_items, d_values, num_values, d_out, op, build.cc, (CUfunction) build.kernel, stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: can we use static_cast here? If not, maybe reinterpret_cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors what for does, I can go clean it up a little in both places later but I'd rather not multiply the drive by fixes here.

@github-actions

This comment has been minimized.

@griwes griwes disabled auto-merge January 9, 2026 10:25
@github-actions

This comment has been minimized.

@griwes griwes requested a review from a team as a code owner January 12, 2026 08:02
@griwes griwes enabled auto-merge (squash) January 12, 2026 08:03
@github-actions
Copy link
Contributor

😬 CI Workflow Results

🟥 Finished in 5h 38m: Pass: 99%/133 | Total: 6d 05h | Max: 5h 37m | Hits: 71%/177855

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Add vectorized cub::DeviceFind::UpperBound and cub::DeviceFind::LowerBound algorithms

5 participants