-
Notifications
You must be signed in to change notification settings - Fork 69
[Greedy scheduler] Cap fusion to limit shared memory usage #5328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Review updated until commit 0ef2c6f Description
Changes walkthrough 📝
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
|
|
||
| void handle(TopKOp* topk) override { | ||
| checkDomainConstraints( | ||
| ir_utils::getTvOutput(topk)->getLogicalDomain(), {topk->dim()}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use of the output here was wrong. Need to check the input as it has a larger extent.
|
!test --diff |
|
!test --diff |
1876804 to
9855676
Compare
|
!test --diff |
This reverts commit 0b87334.
| // may be too strict and fragile as a test. After all, we would | ||
| // just need a reasonably tight upper bound. Consider relaxing the | ||
| // condition if necessary. | ||
| EXPECT_EQ(expected_size, ke.getStaticSmemSize()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏 impressive job by the agent!
tests/cpp/test_argsort.cpp
Outdated
| // It doesn't seem consistent whether compilation or launch should | ||
| // fail if the requirement of static shared memory exceeds the default | ||
| // limit but within the opt-in larger limit. As we should move to | ||
| // dynamic allocaitons anyway, don't assert for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // dynamic allocaitons anyway, don't assert for now. | |
| // dynamic allocations anyway, don't assert for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly, I think we should skip on these since we didn't run anything.
|
!build |
In the greedy schedule so far, there's just a very naive limit on the usage of shared memory buffers (https://github.com/NVIDIA/Fuser/blob/main/csrc/scheduler/greedy.cpp#L703-L719). This PR improves the estimation of the buffer requirement. Specifically, it adds two utility functions to compute the required sizes of the block radix sort and block scan (
computeBlockRadixSortTempStorageBytesandcomputeBlockScanTempStorageBytes), both of which are automatically generated with Cursor. They seem to make sense and the unit tests agree.The scheduler is updated to use them whenever batching is supported. If batching is not supported (e.g., topk), each call still needs to be limited so that a single thread block can handle, i.e., the constrained size is allowed to be up to 1024.
Note: We are still using statically allocated buffers, but we should manage them ourselves and dynamically allocate or reuse buffers much like we do for other normal shared memory uses. It is not yet critical for our immediate problems. In either case, we would still need to know how much each CUB call would require, so the two util functions would still be necessary.
Closes #5044