
Conversation

@wujingyue (Collaborator) commented Oct 2, 2025

For #5289

github-actions bot commented Oct 2, 2025

Review updated until commit 89f3173

Description

  • Added stream-parallelized matmul support via loop sharding

  • Introduced shardByStream helper for tensor stream sharding

  • Implemented findStreamIterDomain to detect stream-parallelized domains

  • Refactored ForLoop creation with static createFromIterDomain


Changes walkthrough 📝

Relevant files

Enhancement

  • csrc/host_ir/host_ir.cpp: Add shardByStream and update ForLoop creation (+17/-1)
      • Added shardByStream function to create stream-sharded tensor views
      • Implemented conservative allocation domain setup using self-replay
      • Converted createForLoopFromIterDomain to static method of ForLoop

  • csrc/host_ir/lowering.cpp: Support stream-parallelized lowering with for-loops (+115/-5)
      • Added findStreamIterDomain to detect stream-parallelized loop domains
      • Enhanced lowerSegment to handle stream-parallelized expressions via for-loops
      • Implemented cloning and sharding of expressions in stream-parallel context
      • Added allocation and replacement logic for stream-sharded tensors

  • csrc/host_ir/host_ir.h: Declare shardByStream and ForLoop factory method (+6/-2)
      • Declared shardByStream helper function
      • Added static createFromIterDomain method to ForLoop
      • Maintained existing class structures and clone declarations

Tests

  • tests/cpp/test_stream.cpp: Add stream-parallelized matmul test (+54/-3)
      • Added StreamTest fixture with Host IR lowering enabled
      • Added Matmul test for stream-parallelized matmul correctness
      • Updated test structure to use fixture-based setup

Miscellaneous

  • csrc/host_ir/container.h: Suggest performance improvement for insertion (+3/-0)
      • Added comment suggesting linked list for faster insertion
      • Kept existing top-level expressions and executor structures

Dependencies

  • csrc/multidevice/multidevice.h: Include vector header (+2/-0)
      • Added vector include for future container use
      • No functional changes to multidevice declarations

  • csrc/transform_replay.h: Update includes in transform_replay.h (+2/-5)
      • Added unordered_map include
      • Removed unused algorithm, unordered_set, and vector includes

Bug fix

  • csrc/runtime/fusion_executor_cache.h: Fix warning typo in comment (+1/-1)
      • Fixed typo: "WARING" to "WARNING"
      • No other changes to fusion executor cache interface

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The function findStreamIterDomain(TensorView*) checks only the front of the loop domain for a stream IterDomain, based on the assumption that FinalizeMultideviceDomains places it there. However, this assumption may not always hold, leading to missed stream IterDomains if their position is not guaranteed. The code should validate this positioning assumption or search the entire loop domain.

    IterDomain* findStreamIterDomain(TensorView* tv) {
      const std::vector<IterDomain*>& loop = tv->getLoopDomain();
      // FinalizeMultideviceDomains pass puts the stream IterDomain to the
      // front.
      if (!loop.empty() && loop.front()->isStream()) {
        return loop.front();
      }
      return nullptr;
    }
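
    A minimal sketch of the whole-domain search suggested above, using the same getLoopDomain() and isStream() calls as the quoted code; the name and the function itself are illustrative, not code from this PR:

    // Hypothetical variant: scan the entire loop domain instead of relying on
    // FinalizeMultideviceDomains placing the stream IterDomain at the front.
    IterDomain* findStreamIterDomainAnywhere(TensorView* tv) {
      for (IterDomain* id : tv->getLoopDomain()) {
        if (id->isStream()) {
          return id;
        }
      }
      return nullptr;
    }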
    Performance

    The comment suggests that setting the allocation domain via setAllocationDomain(out->getLoopDomain(), false) in shardByStream is conservative and suboptimal. This may lead to inefficient memory layouts. Consider implementing the suggested algorithm from alias_analysis.cpp to make a more informed decision on contiguity.

    Maintainability

    There are two similar functions, findStreamIterDomain and getShardedIterDomain, searching for stream IterDomains in different domains (loop vs. allocation). This duplication increases maintenance cost. Consider unifying them into a single function with a domain type parameter as suggested in the comment.

    // Finds the stream-parallelized IterDomain in the loop domain of a TensorView,
    // or nullptr if not found.  This is different from `getShardedIterDomain(tv,
    // ParallelType::Stream)`, which searches the allocation domain.  Consider
    // unifying them into one function with an extra DomainType parameter.
    IterDomain* findStreamIterDomain(TensorView* tv) {
      const std::vector<IterDomain*>& loop = tv->getLoopDomain();
      // FinalizeMultideviceDomains pass puts the stream IterDomain to the
      // front.
      if (!loop.empty() && loop.front()->isStream()) {
        return loop.front();
      }
      return nullptr;
    }
    
    // Finds the stream IterDomain in the outputs of a segment.
    IterDomain* findStreamIterDomain(const std::vector<Val*>& outs) {
      for (auto* out : ir_utils::filterByType<TensorView>(outs)) {
        if (auto* stream_id = findStreamIterDomain(out)) {
          return stream_id;
        }
      }
      return nullptr;
    }
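
    As a rough sketch of the unification suggested in that comment, a single helper could take the domain to search as a parameter. The DomainType enum and the use of getMaybeAllocationDomain() below are assumptions for illustration only:

    enum class DomainType { Loop, Allocation };

    // Sketch: one lookup parameterized by which domain to search and which
    // parallel type to look for.
    IterDomain* getShardedIterDomain(
        TensorView* tv,
        ParallelType parallel_type,
        DomainType domain_type) {
      const std::vector<IterDomain*> domain = domain_type == DomainType::Loop
          ? tv->getLoopDomain()
          : tv->getMaybeAllocationDomain();
      for (IterDomain* id : domain) {
        if (id->getParallelType() == parallel_type) {
          return id;
        }
      }
      return nullptr;
    }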

    @wujingyue (Collaborator Author):

    !test


    NVFUSER_DECLARE_CLONE_AND_CREATE

    static ForLoop* createFromIterDomain(Val* index, IterDomain* iter_domain);
    Collaborator Author:

    I'm on the fence about this. The method is coupled with the ForLoop class so I moved here to save some typing. The downside is less access control because createFromIterDomain could access private fields/methods of ForLoop.
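
    For illustration, the free-function alternative discussed here would amount to keeping the original declaration outside the class; this sketch mirrors the signature above and is not code from the PR:

    // Sketch: a free function cannot reach ForLoop's private members, at the
    // cost of a longer name and an extra declaration to maintain.
    ForLoop* createForLoopFromIterDomain(Val* index, IterDomain* iter_domain);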

    @wujingyue (Collaborator Author):

    !test

    @wujingyue (Collaborator Author):

    !test

    wujingyue marked this pull request as ready for review October 8, 2025 01:02
    @nsarka (Member) left a comment:

    LGTM, I just had a minor question

    }

    std::vector<Val*> cloned_outs = ir_cloner.clone(group.outputs());
    // All expressions in the group are expected to be stream parallelized in
    Member:

    Do we enforce this constraint? If so is there an assertion somewhere?

    Collaborator Author:

    We don't, but we should. I'm waiting for an isResharding-like method to do that easily.

    @Priya2698 (Collaborator) left a comment:

    Apologies for the delayed review; it fell off my radar.
    I have left some initial questions.
    I am working on #5309, and hope to have a PR soon, which should unblock this PR.


    // Finds the stream IterDomain in the outputs of a segment.
    IterDomain* findStreamIterDomain(const std::vector<Val*>& outs) {
    for (auto* out : ir_utils::filterByType<TensorView>(outs)) {
    Collaborator:

    So we are finding the stream ID in any of the outputs of a segment? Why not use the above variation directly with any one of the segment outputs, since they must all have mapped stream IDs?

    Collaborator Author:

    Because I'm not sure about CPU-scalar TensorViews from composite ops. But I should probably harden the check to enforce that every TensorView has a Stream IterDomain. Wdyt?
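
    A rough sketch of that hardened check, assuming the NVF_ERROR macro and the helpers quoted earlier; illustrative only, not part of this PR:

    // Sketch: require every output TensorView to expose a stream IterDomain
    // instead of returning the first one found.
    IterDomain* findStreamIterDomain(const std::vector<Val*>& outs) {
      IterDomain* stream_id = nullptr;
      for (auto* out : ir_utils::filterByType<TensorView>(outs)) {
        IterDomain* id = findStreamIterDomain(out);
        NVF_ERROR(
            id != nullptr,
            "Expected a stream-parallelized IterDomain in ",
            out->toString());
        stream_id = id;
      }
      return stream_id;
    }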

    Collaborator:

    In their blackbox state, it does not look like we can currently support SDPA ops, for example. So adding an assert makes sense to signal that something is wrong. I guess this is something I need to fix in PropagateShardingsPass also.

    Collaborator Author:

    > In their blackbox state, it does not look like we can currently support SDPA ops, for example.

    Why not? At least, batch and/or head can be easily parallelized on stream without changing the implementation of the SDPA op, assuming ShardByStreams are added properly of course.
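
    For illustration only, the kind of scheduling this refers to could be as small as the following, where attn_out is a hypothetical SDPA output with a [batch, head, seq, head_dim] loop domain:

    // Hypothetical: stream-parallelize the batch axis of an SDPA output
    // without touching the SDPA implementation itself.
    attn_out->axis(0)->parallelize(ParallelType::Stream);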

    auto* out = ops::newValLike(in, *in->getDataType())->as<TensorView>();

    TransformReplay::selfReplay(in->domain(), out->domain());
    // This is conservative and suboptimal. Consider reusing the algorithm in
    Collaborator:

    Will this be resolved by #5316?

    Collaborator Author:

    No. It's one of the cases where out's contiguity ought to differ from in's due to the slicing effect.

    Collaborator:

    Oh okay, got it!
    So in such cases the replay may in fact overwrite a correct contiguity, since most users of selfReplay create the new TensorDomain via the ops API, which sets the contiguity correctly. This is something we should consider for #5316.

    @wujingyue (Collaborator Author):

    !test

    wujingyue merged commit 3260d70 into main Oct 19, 2025
    65 of 67 checks passed
    wujingyue deleted the wjy/matmul branch October 19, 2025 23:09
    tbqh pushed a commit that referenced this pull request Nov 12, 2025