
Conversation

@Priya2698 (Collaborator) commented Oct 4, 2025

The previous implementation forced the allocation and loop domains to be identical due to limitations in the stack (#4381). With recent changes, we can support different allocation and loop domains, so that reordering is no longer necessary.

This PR reorders device IDs to the front of the loop domain only; in the allocation domain they remain at their original positions.
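To make the intended ordering concrete, here is a minimal, standalone sketch (not the actual nvFuser pass; the Axis struct and axis names are hypothetical, and std::stable_partition only mimics what reorderParallelizedToFront does to the loop domain):

```cpp
// Conceptual sketch: device-parallel axes move to the front of the loop
// ordering, while the allocation ordering keeps them at their original
// positions. All names here are illustrative, not nvFuser types.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Axis {
  std::string name;
  bool is_device_dim;
};

int main() {
  // Hypothetical tensor whose middle axis is sharded across devices (iDIDx).
  std::vector<Axis> allocation = {{"i0", false}, {"iDIDx", true}, {"i1", false}};

  // The loop ordering starts as a copy and then stably moves device dims to
  // the front; the allocation ordering is left untouched.
  std::vector<Axis> loop = allocation;
  std::stable_partition(
      loop.begin(), loop.end(), [](const Axis& a) { return a.is_device_dim; });

  for (const Axis& a : loop) {
    std::cout << a.name << ' ';  // prints: iDIDx i0 i1
  }
  std::cout << '\n';
  for (const Axis& a : allocation) {
    std::cout << a.name << ' ';  // prints: i0 iDIDx i1
  }
  std::cout << '\n';
  return 0;
}
```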

Follow-up PRs will update this pass for the Stream parallel type.

Benchmark results: No notable difference.

The table below compares the maximum across all ranks of each rank's minimum time for the Transformer forward and backward passes. Results are for 8 H100 GPUs.

| Test                                  | Main   | PR     |
|---------------------------------------|--------|--------|
| Forward (test_transformer_forward)    | 2.2129 | 2.2107 |
| Backward (test_transformer_backward)  | 4.1178 | 4.1033 |

github-actions bot commented Oct 4, 2025

Review updated until commit 2e4e25b

Description

  • Fix vectorization validation by ignoring device dimensions

  • Update sharded ID lookup to skip reduction domains

  • Remove redundant loop-allocation domain reordering

  • Strengthen error checking for logical axis mapping


Changes walkthrough 📝

Relevant files

Bug fix

csrc/device_lower/validation.cpp: Skip device dims in vectorization validation (+1/-1)
  • Skip device dimensions during vectorization validation
  • Prevent device dims from affecting contiguity checks

Enhancement

csrc/multidevice/utils.cpp: Improve sharded ID lookup and error handling (+5/-4)
  • Use kNoReductions filter when searching sharded IDs
  • Replace index-based loop with direct domain iteration
  • Add NVF_ERROR for missing producing logical axis

csrc/preseg_passes/finalize_multidevice_domains.cpp: Simplify allocation domain finalization (+2/-33)
  • Remove redundant loop domain reordering
  • Simplify allocation domain setup
  • Directly apply allocation domain without permutation
  • Call reorderParallelizedToFront unconditionally

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Logic Change

The condition in the loop domain validation has been extended to skip device dimensions, which may affect contiguity checks. This change should be validated to ensure it does not inadvertently skip meaningful non-reduction, non-broadcast dimensions that are not device-related.

if (r_id->isReduction() || r_id->isBroadcast() || r_id->isDeviceDim()) {
  continue;
}

Error Handling

The logic for handling cases where the producing logical axis is not found has changed from continuing the loop to throwing an error. This stricter behavior should be reviewed to confirm it is safe across all calling contexts and does not break existing valid use cases.

NVF_ERROR(
    sharded_axis != -1,
    "Producing logical axis not found for ",
    sharded_id);
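As a standalone illustration of the old versus new behavior (hypothetical helper and axis names; a plain exception stands in here for NVF_ERROR, which throws when its condition is false):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the index of the logical axis matching `sharded`, or -1 if absent.
int64_t findProducingLogicalAxis(
    const std::vector<std::string>& logical,
    const std::string& sharded) {
  for (size_t i = 0; i < logical.size(); ++i) {
    if (logical[i] == sharded) {
      return static_cast<int64_t>(i);
    }
  }
  return -1;
}

int main() {
  const std::vector<std::string> logical = {"i0", "i1"};
  const int64_t sharded_axis = findProducingLogicalAxis(logical, "i1");

  // Previously a -1 result was silently skipped (continue), hiding the
  // inconsistency; the PR instead reports it as an error right away.
  if (sharded_axis == -1) {
    throw std::runtime_error("Producing logical axis not found for i1");
  }
  return 0;
}
```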
Allocation-Loop Domain Mismatch

The PR removes logic that enforces alignment between allocation and loop domains via permutation and reordering. While this is intentional, the removal of safety checks and the new assumption of independent domain ordering should be carefully validated, especially for resharding cases.

  tv->setAllocationDomain(new_allocation_domain, new_contiguity);
  reorderParallelizedToFront(tv);
}

@Priya2698 (Collaborator, Author):

!test --diff

@Priya2698 (Collaborator, Author):

!test

@Priya2698 (Collaborator, Author):

!test

@Priya2698 requested a review from wujingyue October 7, 2025 05:02
@Priya2698 marked this pull request as ready for review October 7, 2025 05:02
@Priya2698 (Collaborator, Author) commented on the diff:

  }();

- for (auto&& [index, id] : enumerate(domain)) {
+ for (auto&& [index, id] : enumerate(domain | TensorDomain::kNoReductions)) {

For reduce-scatter outputs, we may return the rDIDx in unshardedSizes if it is ordered before iDIDx in the loop domain, which leads to incorrect shape deduction (e.g., LowerCollectiveTest.NoncontigReduceScatter).
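The following standalone sketch reproduces that failure mode with hypothetical axis flags (std::copy_if stands in for the TensorDomain::kNoReductions filter):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

struct Axis {
  std::string name;
  bool is_device_dim;
  bool is_reduction;
};

int main() {
  // Loop domain of a reduce-scatter output where rDIDx is ordered before iDIDx.
  const std::vector<Axis> loop = {
      {"rDIDx", true, true}, {"iDIDx", true, false}, {"i1", false, false}};

  auto first_device = [](const std::vector<Axis>& dom) {
    return *std::find_if(
        dom.begin(), dom.end(), [](const Axis& a) { return a.is_device_dim; });
  };

  // Without filtering, the search returns the reduction axis, which would make
  // the unsharded-shape deduction use the wrong dimension.
  assert(first_device(loop).name == "rDIDx");

  // Filtering out reductions first (the kNoReductions idea) yields the
  // intended iteration axis instead.
  std::vector<Axis> no_reductions;
  std::copy_if(
      loop.begin(), loop.end(), std::back_inserter(no_reductions),
      [](const Axis& a) { return !a.is_reduction; });
  assert(first_device(no_reductions).name == "iDIDx");
  return 0;
}
```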

@Priya2698 (Collaborator, Author) commented on the diff:

  for (size_t i = tv->getMaybeAllocationDomain().size(); i > 0; i--) {
    auto r_id = tv->getMaybeAllocationDomain()[i - 1];
-   if (r_id->isReduction() || r_id->isBroadcast()) {
+   if (r_id->isReduction() || r_id->isBroadcast() || r_id->isDeviceDim()) {

DIDx is not reordered to the front of the allocation domain and may appear in the innermost position, e.g., test_matmul.test_linear_reduce_scatter (the copy kernel is pointwise scheduled and codegen'd).
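A standalone sketch of what that skip buys during the vectorization/contiguity scan (hypothetical axis flags, not the nvFuser implementation):

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Axis {
  std::string name;
  bool is_reduction = false;
  bool is_broadcast = false;
  bool is_device_dim = false;
};

int main() {
  // Hypothetical allocation domain with the device axis in the innermost
  // position, as in the pointwise-scheduled copy kernel mentioned above.
  const std::vector<Axis> allocation = {
      {"i0"}, {"i1"}, {"iDIDx", false, false, true}};

  // Scan from the innermost dimension outward, ignoring reduction, broadcast,
  // and (now) device dimensions, mirroring the updated condition in the diff.
  for (size_t i = allocation.size(); i > 0; i--) {
    const Axis& r_id = allocation[i - 1];
    if (r_id.is_reduction || r_id.is_broadcast || r_id.is_device_dim) {
      continue;
    }
    std::cout << "innermost vectorizable candidate: " << r_id.name << '\n';  // i1
    break;
  }
  return 0;
}
```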

@wujingyue (Collaborator) left a comment:

Wonderful!

@Priya2698 (Collaborator, Author):

!test --diff

@Priya2698 merged commit 4b8dd52 into main Oct 7, 2025
59 of 61 checks passed
@Priya2698 deleted the pm/alloc_order branch October 7, 2025 23:40