Allow allocation and loop to have different positions of device IDs #5323
Conversation
Review updated until commit 2e4e25b
!test --diff

!test

!test
csrc/multidevice/utils.cpp
```diff
   }();

-  for (auto&& [index, id] : enumerate(domain)) {
+  for (auto&& [index, id] : enumerate(domain | TensorDomain::kNoReductions)) {
```
For reduce-scatter outputs, we may return the rDIDx in unshardedSizes if it is ordered before iDIDx in the loop domain, which leads to incorrect shape deduction. See, e.g., LowerCollectiveTest.NoncontigReduceScatter.
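A standalone sketch of the idea (the types and the `unshardedSizes` helper below are hypothetical stand-ins, not nvFuser's actual API): filtering out reduction axes before walking the loop domain ensures that a reduction device axis such as rDIDx never contributes to the deduced shape, even when it is ordered before iDIDx.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum class IterType { Iteration, Reduction };

struct IterDomainInfo {
  IterType type;
  bool is_device_dim;  // parallelized on DIDx
};

// Hypothetical helper: scale each sharded size by the mesh size when its
// axis is device-parallel, skipping reduction axes entirely so that an
// rDIDx ordered before iDIDx is never mistaken for the sharded axis.
std::vector<int64_t> unshardedSizes(
    const std::vector<IterDomainInfo>& loop_domain,
    std::vector<int64_t> sharded_sizes,
    int64_t mesh_size) {
  size_t logical_index = 0;
  for (const IterDomainInfo& id : loop_domain) {
    if (id.type == IterType::Reduction) {
      continue;  // reductions (including rDIDx) don't appear in the shape
    }
    if (id.is_device_dim) {
      sharded_sizes.at(logical_index) *= mesh_size;
    }
    ++logical_index;
  }
  return sharded_sizes;
}

int main() {
  // Reduce-scatter-like output: rDIDx ordered before iDIDx in the loop
  // domain. Only iDIDx should scale the first logical size.
  std::vector<IterDomainInfo> loop_domain = {
      {IterType::Reduction, /*is_device_dim=*/true},   // rDIDx
      {IterType::Iteration, /*is_device_dim=*/true},   // iDIDx
      {IterType::Iteration, /*is_device_dim=*/false},  // iN
  };
  for (int64_t s : unshardedSizes(loop_domain, {4, 16}, /*mesh_size=*/8)) {
    std::cout << s << " ";  // prints: 32 16
  }
  std::cout << "\n";
  return 0;
}
```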
```diff
   for (size_t i = tv->getMaybeAllocationDomain().size(); i > 0; i--) {
     auto r_id = tv->getMaybeAllocationDomain()[i - 1];
-    if (r_id->isReduction() || r_id->isBroadcast()) {
+    if (r_id->isReduction() || r_id->isBroadcast() || r_id->isDeviceDim()) {
```
DIDx is not reordered to the front in the allocation domain and may appear in the innermost position. See, e.g., test_matmul.test_linear_reduce_scatter (the copy kernel is pointwise scheduled and code-generated).
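To illustrate why the extra `isDeviceDim()` check matters (a minimal standalone sketch; `AxisInfo` and `innermostAllocatedAxis` are made-up names, not nvFuser's API): when scanning the allocation domain from the innermost position, a device dimension occupies no local memory and must be skipped just like reductions and broadcasts.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct AxisInfo {
  std::string name;
  bool is_reduction = false;
  bool is_broadcast = false;
  bool is_device_dim = false;  // sharded across devices; local extent is 1
};

// Hypothetical helper: index of the innermost axis that is actually
// allocated on this device, or -1 if every axis is trivial locally.
int innermostAllocatedAxis(const std::vector<AxisInfo>& alloc_domain) {
  for (size_t i = alloc_domain.size(); i > 0; i--) {
    const AxisInfo& id = alloc_domain[i - 1];
    if (id.is_reduction || id.is_broadcast || id.is_device_dim) {
      continue;  // contributes nothing to the local allocation
    }
    return static_cast<int>(i - 1);
  }
  return -1;
}

int main() {
  // DIDx sits innermost in the allocation domain, as in a pointwise-
  // scheduled copy kernel feeding a reduce-scatter.
  std::vector<AxisInfo> alloc_domain = {
      {"iM"}, {"iN"}, {"iDIDx", false, false, true}};
  std::cout << innermostAllocatedAxis(alloc_domain) << "\n";  // prints: 1
  return 0;
}
```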
Wonderful!
!test --diff
The previous implementation ensured that allocation and loop domains were identical due to limitations in the stack (#4381). With recent changes, we can support different allocation and loop domains, so this reordering is no longer necessary.
This PR reorders device IDs to the front of the loop domain; they remain at their original positions in the allocation domain. A sketch of the intended semantics follows.
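As a rough illustration (standalone code; `reorderDeviceDimsToFront` is a hypothetical name, not the pass's actual entry point): a stable partition moves device-parallel axes to the front of the loop domain without disturbing the relative order of the remaining axes, while the allocation domain is a separate list and stays untouched.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Axis {
  std::string name;
  bool is_device_dim = false;
};

// Hypothetical reordering: a stable partition keeps the relative order of
// the device axes and of the remaining axes.
void reorderDeviceDimsToFront(std::vector<Axis>& loop_domain) {
  std::stable_partition(
      loop_domain.begin(), loop_domain.end(),
      [](const Axis& a) { return a.is_device_dim; });
}

int main() {
  std::vector<Axis> loop_domain = {{"iM"}, {"iDIDx", true}, {"iN"}};
  reorderDeviceDimsToFront(loop_domain);
  for (const Axis& a : loop_domain) {
    std::cout << a.name << " ";  // prints: iDIDx iM iN
  }
  std::cout << "\n";
  return 0;
}
```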
Follow-up PRs will update this pass for the Stream parallel type.

Benchmark results: no notable difference.
This table compares the maximum of the minimum time across all ranks for the Transformer forward and backward passes. Results are for 8 H100 GPUs.
(Benchmark table: test_transformer_forward and test_transformer_backward timings.)