base: branch-25.08
MCPDLP #231
Conversation
…r of primal step size and dual step size, update the kernels to launch multiple threads and support a very wide batch size accordingly
… if batch is called with trust region restart
…less movement member
…d dual and just make them wider
namespace cuopt::linear_programming::detail {

// This class is used to start a batched dot product
// With large problem size (>10K) and small batch size (<100), this is faster than using Segmented Reduce
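To make the comparison concrete, here is a minimal sketch of the per-segment cuBLAS approach described above, assuming the batch members are stored contiguously. The names (`batched_dot`, `d_x`, `d_y`, `d_results`) are illustrative placeholders, not the actual members of the class in this PR.

```cpp
#include <cublas_v2.h>

// Hypothetical sketch: one device-wide cuBLAS dot call per batch member,
// instead of a single segmented reduction over all members.
void batched_dot(cublasHandle_t handle,
                 int batch_size,     // number of segments (small, < ~100)
                 int n,              // segment length (large, > ~10K)
                 const double* d_x,  // batch_size contiguous segments of length n
                 const double* d_y,  // same layout as d_x
                 double* d_results)  // device array holding batch_size results
{
  // Device pointer mode keeps each result on the GPU, so every call is
  // asynchronous and no host synchronization happens between segments.
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
  for (int i = 0; i < batch_size; ++i) {
    cublasDdot(handle, n,
               d_x + static_cast<size_t>(i) * n, 1,
               d_y + static_cast<size_t>(i) * n, 1,
               d_results + i);
  }
}
```

Because each segment is long, every individual dot call can still saturate the device, which is why this wins when the number of segments is small.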
Come to think of it, I'm not surprised: iirc SegmentedReduce does a 1 block : 1 segment mapping, and in your case that's pretty terrible, so it makes sense that parallel device-wide BLAS dot calls beat it.
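For reference, here is a minimal sketch of the segmented-reduce formulation being compared against, assuming fixed-size contiguous segments; the element-wise products and the segment offsets are generated through iterators rather than materialized. Names are illustrative and this is not the code in the PR; the fixed-size overload mentioned below may expose a different entry point.

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cuda_runtime.h>

// Multiply the two zipped inputs element-wise (x[i] * y[i]).
struct multiply_pair {
  __host__ __device__ double operator()(const thrust::tuple<double, double>& t) const
  {
    return thrust::get<0>(t) * thrust::get<1>(t);
  }
};

// Map segment index i to its starting offset i * n (fixed-size segments).
struct fixed_offset {
  int n;
  __host__ __device__ int operator()(int i) const { return i * n; }
};

void segmented_dot(int batch_size, int n,
                   const double* d_x, const double* d_y, double* d_results,
                   cudaStream_t stream)
{
  auto products = thrust::make_transform_iterator(
    thrust::make_zip_iterator(thrust::make_tuple(d_x, d_y)), multiply_pair{});
  auto begin_offsets = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0), fixed_offset{n});
  auto end_offsets = begin_offsets + 1;  // segment i ends where segment i+1 begins

  // Standard two-phase cub call: query temporary storage size, then reduce.
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes, products, d_results,
                                  batch_size, begin_offsets, end_offsets, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes, products, d_results,
                                  batch_size, begin_offsets, end_offsets, stream);
  cudaFree(d_temp_storage);
}
```

With fewer than ~100 segments, a 1 block : 1 segment mapping launches far too few blocks to keep the GPU busy, which matches the observation above.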
Although I just realized they added a new overload optimized for fixed-size segments; I wasn't aware of it. Maybe this performs better?
NVIDIA/cccl#3969
Very good catch!! I will test that right away. It might make my life way simpler
This is still slower than using multiple dot products :(
Dang .-.
Looking at their benchmarks, they only test segment sizes up to 1024, so I guess they don't optimize at all for few-segment scenarios. It would be nice if they said so in their docs!
This is possibly a candidate for 25.10 but may still make 25.08.
@Kh4ster Shall we move this to 25.10?
I'm going to move this to 25.10; we can move it back if it gets finished.
De-prioritized for 25.10 and slated for the 25.12 release.
This is still under heavy development; it is being opened now to allow preliminary reviews.