Conversation

Kh4ster (Contributor) commented Jul 24, 2025

This is still very much a work in progress.

It is being opened to allow preliminary reviews.

Kh4ster added 30 commits July 2, 2025 18:33
…r of primal step size and dual step size, update the kernels to launch multiple threads and support a very wide batch size accordingly
… if batch is called with trust region restart
@Kh4ster Kh4ster added this to the 25.08 milestone Jul 24, 2025
@Kh4ster Kh4ster requested a review from a team as a code owner July 24, 2025 16:48
@Kh4ster Kh4ster added feature request and non-breaking labels Jul 24, 2025
@Kh4ster Kh4ster requested review from kaatish and hlinsen July 24, 2025 16:48
@Kh4ster Kh4ster added the pdlp label Jul 24, 2025
@Kh4ster Kh4ster marked this pull request as draft July 24, 2025 16:49
copy-pr-bot bot commented Jul 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@Kh4ster Kh4ster removed request for kaatish and hlinsen July 24, 2025 16:50
@Kh4ster Kh4ster self-assigned this Jul 24, 2025
namespace cuopt::linear_programming::detail {

// This class is used to start a batched dot product
// With large problem size (>10K) and small batch size (<100), this is faster than using Segmented Reduce
Contributor

Come to think of it, I'm not surprised. IIRC, SegmentedReduce uses a 1 block : 1 segment mapping, which is pretty terrible in your case, so it makes sense that parallel device-wide BLAS dot calls beat it.
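
As an aside, a minimal sketch of the two strategies being compared may help. The names, the contiguous batch_size × n layout, and the stream handling below are illustrative assumptions only, not the PR's actual implementation.

```cpp
// Sketch only -- assumes batch entry b owns elements [b*n, (b+1)*n).
#include <cub/device/device_segmented_reduce.cuh>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Strategy A: one device-wide cuBLAS dot per batch entry, round-robined over
// a few streams. Each call launches enough blocks to fill the GPU on its own,
// so a small batch of large dots still saturates the device.
void batched_dot_cublas(cublasHandle_t handle,
                        const double* x,  // batch_size * n values
                        const double* y,  // batch_size * n values
                        double* results,  // device pointer, batch_size values
                        int n,
                        int batch_size,
                        const std::vector<cudaStream_t>& streams)
{
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
  for (int b = 0; b < batch_size; ++b) {
    cublasSetStream(handle, streams[b % streams.size()]);
    const size_t off = static_cast<size_t>(b) * n;
    cublasDdot(handle, n, x + off, 1, y + off, 1, results + b);
  }
}

// Strategy B: cub::DeviceSegmentedReduce::Sum over a precomputed element-wise
// product buffer (xy[i] = x[i] * y[i]). CUB maps roughly one block per
// segment, so with batch_size < 100 most SMs sit idle.
void batched_dot_segmented_reduce(const double* xy,    // batch_size * n values
                                  double* results,     // batch_size values
                                  const int* offsets,  // batch_size + 1 values, offsets[b] = b * n
                                  int batch_size,
                                  cudaStream_t stream)
{
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(
    d_temp, temp_bytes, xy, results, batch_size, offsets, offsets + 1, stream);
  cudaMallocAsync(&d_temp, temp_bytes, stream);
  cub::DeviceSegmentedReduce::Sum(
    d_temp, temp_bytes, xy, results, batch_size, offsets, offsets + 1, stream);
  cudaFreeAsync(d_temp, stream);
}
```

With a batch of, say, 50 segments, strategy B launches on the order of 50 blocks, well below the SM count of a modern data-center GPU, while each dot in strategy A can occupy the whole device, which matches the 1 block : 1 segment observation above.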

Contributor

Although I just realized they added a new overload optimized for fixed segment sizes. I wasn't aware of it; maybe this performs better?
NVIDIA/cccl#3969

Contributor Author

Very good catch! I will test that right away; it might make my life much simpler.

Contributor Author

This is still slower than using multiple dot products :(

Contributor

Dang .-.
Looking at their benchmarks, they only test segment sizes up to 1024, so I guess they don't optimize at all for few-segment scenarios. It would be nice if they said so in their docs!
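
For completeness, a third option that sidesteps the one-block-per-segment mapping when segments are few and large is a hand-written kernel that assigns several blocks to each segment and folds the block-level partial sums with atomics. This is only a rough sketch with hypothetical names and launch parameters, not the kernel from this PR.

```cpp
#include <cuda_runtime.h>

// Each segment (one dot product of length n) gets `blocks_per_segment` blocks.
// `results` must be zero-initialized before launch; double atomicAdd needs
// sm_60 or newer. Launch with exactly 256 threads per block (see below).
__global__ void batched_dot_kernel(const double* x,
                                   const double* y,
                                   double* results,
                                   int n,
                                   int blocks_per_segment)
{
  const int segment          = blockIdx.x / blocks_per_segment;
  const int block_in_segment = blockIdx.x % blocks_per_segment;
  const double* xs = x + static_cast<size_t>(segment) * n;
  const double* ys = y + static_cast<size_t>(segment) * n;

  // Grid-stride loop restricted to this segment's group of blocks.
  double local = 0.0;
  for (int i = block_in_segment * blockDim.x + threadIdx.x; i < n;
       i += blocks_per_segment * blockDim.x) {
    local += xs[i] * ys[i];
  }

  // Block-level tree reduction in shared memory.
  __shared__ double smem[256];
  smem[threadIdx.x] = local;
  __syncthreads();
  for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
    if (threadIdx.x < offset) { smem[threadIdx.x] += smem[threadIdx.x + offset]; }
    __syncthreads();
  }
  if (threadIdx.x == 0) { atomicAdd(&results[segment], smem[0]); }
}

// Example launch: batch_size segments, 32 blocks per segment, 256 threads.
// batched_dot_kernel<<<batch_size * 32, 256>>>(x, y, results, n, 32);
```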

tmckayus (Contributor)

This is possibly a candidate for 25.10, but it may still make 25.08.

rgsl888prabhu (Collaborator)

@Kh4ster Shall we move this to 25.10?

tmckayus (Contributor)

> @Kh4ster Shall we move this to 25.10?

I'm going to move this to 25.10; we can move it back if it gets finished.

@tmckayus tmckayus modified the milestones: 25.08, 25.10 Jul 31, 2025
@anandhkb anandhkb modified the milestones: 25.10, 25.12 Sep 17, 2025
anandhkb

De-prioritized for 25.10 and slated for the 25.12 release.
