
Conversation


@wujingyue wujingyue commented Jan 3, 2026

Fixes #5308

---------------------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------------------
Name (time in ms)                                      Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_row_parallel_linear_forward_benchmark[s=2]     3.5521 (1.0)      3.6830 (1.0)      3.5952 (1.0)      0.0510 (1.71)     3.5788 (1.0)      0.0460 (1.0)           1;1  278.1505 (1.0)           5           1
test_row_parallel_linear_forward_benchmark[s=4]     3.6751 (1.03)     3.7427 (1.02)     3.7021 (1.03)     0.0298 (1.0)      3.6876 (1.03)     0.0498 (1.08)          1;0  270.1204 (0.97)          5           1
test_row_parallel_linear_forward_benchmark[s=1]     3.6866 (1.04)     4.1345 (1.12)     3.8824 (1.08)     0.2257 (7.58)     3.7571 (1.05)     0.4190 (9.11)          2;0  257.5757 (0.93)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Overlapping improves the wall time slightly.

Stream assignment and overlapping are verified by the following:

$ nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop mpirun -np 2 pytest tests/python/multidevice/test_overlap.py::'test_row_parallel_linear_forward_benchmark[s=4]' --only-mpi -vs
$ nsys stats report3.nsys-rep --report cuda_gpu_trace | grep '(0)'
    7840730           1184     345                                                                                0.000              3.378  Device              NVIDIA H100 80GB HBM3 (0)    1              20  [CUDA memset]
    7858970         666943     346     2    66     1   384     1     1      168         0.000         0.213                                                     NVIDIA H100 80GB HBM3 (0)    1              20  nvjet_sm90_tst_256x128_64x4_1x2_h_bz_coopA_TNT
    8276377            960     421                                                                                0.000              4.167  Device              NVIDIA H100 80GB HBM3 (0)    1              28  [CUDA memset]
    8357049         846078     422     2    66     1   384     1     1      168         0.000         0.213                                                     NVIDIA H100 80GB HBM3 (0)    1              28  nvjet_sm90_tst_256x128_64x4_1x2_h_bz_coopA_TNT
    8629561            800     497                                                                                0.000              5.000  Device              NVIDIA H100 80GB HBM3 (0)    1              32  [CUDA memset]
    8958648           1504     573                                                                                0.000              2.660  Device              NVIDIA H100 80GB HBM3 (0)    1              36  [CUDA memset]
    9029464          47392     350                                                                               33.554         707998.515  Device    Device    NVIDIA H100 80GB HBM3 (0)    1              20  [CUDA memcpy Device-to-Device]
    9075640         832766     498     2    66     1   384     1     1      168         0.000         0.213                                                     NVIDIA H100 80GB HBM3 (0)    1              32  nvjet_sm90_tst_256x128_64x4_1x2_h_bz_coopA_TNT
    9729238         888798     574     2    66     1   384     1     1      168         0.000         0.213                                                     NVIDIA H100 80GB HBM3 (0)    1              36  nvjet_sm90_tst_256x128_64x4_1x2_h_bz_coopA_TNT
   10440469         265567     376    24     1     1   544     1     1       96         0.037         0.082                                                     NVIDIA H100 80GB HBM3 (0)    1              24  ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<(unsigned long)4096>)
   10444405         114368     426                                                                               33.554         293366.399  Device    Device    NVIDIA H100 80GB HBM3 (0)    1              28  [CUDA memcpy Device-to-Device]
   10520725          49440     502                                                                               33.554         678671.942  Device    Device    NVIDIA H100 80GB HBM3 (0)    1              32  [CUDA memcpy Device-to-Device]
   10619732          29408     578                                                                               33.554        1140984.906  Device    Device    NVIDIA H100 80GB HBM3 (0)    1              36  [CUDA memcpy Device-to-Device]
   10708500         141408     452    24     1     1   544     1     1       96         0.037         0.082                                                     NVIDIA H100 80GB HBM3 (0)    1              24  ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<(unsigned long)4096>)
   10852884         139456     528    24     1     1   544     1     1       96         0.037         0.082                                                     NVIDIA H100 80GB HBM3 (0)    1              24  ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<(unsigned long)4096>)
   10994868         138783     604    24     1     1   544     1     1       96         0.037         0.082                                                     NVIDIA H100 80GB HBM3 (0)    1              24  ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<(unsigned long)4096>)

The performance is suboptimal for two reasons:

  1. The unnecessary local memcpy in postAllReduce (#5567) adds an extra device-to-device copy.
  2. ncclAllReduce and the GEMM compete for SMs, so ncclAllReduce is often delayed by the GEMM kernel and the benchmark can't achieve perfect overlap. This is a known limitation of NCCL and can be addressed by SM-free communication backends.


github-actions bot commented Jan 3, 2026

Review updated until commit 215a166

Description

  • Implement AssignStreams pass for stream-parallel loops

  • Add stream synchronization and assignment logic

  • Enhance multi-device benchmarking with parameterized tests

  • Update test infrastructure and documentation

Changes walkthrough

Relevant files

Enhancement (4 files)
  • assign_streams.cpp: Implement AssignStreams pass for stream management (+64/-0)
  • assign_streams.h: Define AssignStreams optimization pass interface (+26/-0)
  • passes.cpp: Integrate AssignStreams pass into host IR pipeline (+2/-0)
  • ir.h: Remove scheduler dependency from host IR (+0/-1)

Tests (2 files)
  • test_overlap.py: Add row-parallel linear forward benchmark tests (+69/-24)
  • test_stream.py: Remove nvfuser_direct_test parameter from stream tests (+3/-3)

Documentation (2 files)
  • benchmark_utils.py: Update profiling documentation and utilities (+13/-7)
  • internal_nodes.h: Add documentation for Scope::insert method (+1/-0)

Bug fix (1 file)
  • allocate_and_deallocate.h: Fix header include order in allocation pass (+0/-1)

Configuration changes (1 file)
  • CMakeLists.txt: Reorder host IR source files in build configuration (+2/-1)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Stream Synchronization Logic

The pass assumes all ForLoop expressions are stream-parallel loops, but the comment indicates this is not verified. The code should validate that loops are actually stream-parallel before applying stream assignment to avoid incorrect behavior on non-stream-parallel loops.

// We should check that the loop is stream-parallel. This is not necessary
// at this moment because all loops are stream-parallel. This is also hard
// to do because hir::ForLoop doesn't point to the source IterDomain.
Performance Overhead

The implementation adds explicit synchronization (Synchronize nodes) for each loop iteration and creates an additional joining loop. This could introduce significant overhead if the original loops were already well-optimized or if the computation within loops is very small. The PR should include analysis of when this optimization is beneficial vs. harmful.

// At the beginning of each iteration: set stream and synchronize with main
// stream
auto* worker_stream = IrBuilder::create<Stream>(for_loop->index());
auto* set_stream = IrBuilder::create<SetCurrentStream>(worker_stream);
auto* sync_main = IrBuilder::create<Synchronize>(main_stream);
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);

// After the loop: create a joining loop to synchronize all worker streams
hic->topLevel().insert(
    next_it, IrBuilder::create<SetCurrentStream>(main_stream));
auto* join_loop = IrBuilder::create<ForLoop>(
    for_loop->index(), for_loop->start(), for_loop->stop());
hic->topLevel().insert(next_it, join_loop);

// In the joining loop: synchronize each worker stream
auto* join_worker_stream = IrBuilder::create<Stream>(join_loop->index());
auto* sync_worker = IrBuilder::create<Synchronize>(join_worker_stream);
join_loop->body().pushBack(sync_worker);
Test Coverage

The benchmark tests use different chunk sizes (s=1,2,4) but don't provide clear guidance on optimal chunk size selection. The tests should include edge cases and verify that the stream assignment doesn't break correctness across different parallel configurations.

@pytest.mark.parametrize("s", [1, 2, 4])
def test_row_parallel_linear_forward_benchmark(multidevice_test, benchmark, s):
    # This is a port of CollectiveBasedOverlapTest.RowParallelLinear_Forward.
    h, t = 8192, 8192
    d = multidevice_test.size
    if (h * 4) % d != 0:
        pytest.skip(
            f"Row-parallel linear requires {h * 4} to be divisible by world size {d}."
        )
    assert t % s == 0

    mesh = nvfuser.multidevice.DeviceMesh(range(d))
    fd = row_parallel_linear_forward(h, mesh, s)

    inp_ref = torch.randn(t, h * 4, dtype=torch.bfloat16, device="cpu")
    weight_ref = torch.randn(h, h * 4, dtype=torch.bfloat16, device="cpu")

    inp = multidevice_test.shard_tensor(inp_ref, -1, mesh)
    weight = multidevice_test.shard_tensor(weight_ref, -1, mesh)

    warmup_fn, benchmark_fn = get_benchmark_fns(
        lambda: fd.execute([inp, weight], _enable_options=["host_ir_lowering"])
    )
    warmup_fn()
    benchmark.pedantic(benchmark_fn, rounds=5)


greptile-apps bot commented Jan 3, 2026

Greptile Summary

This PR implements stream parallelization for loops in nvFuser's host IR to enable overlapping of computation (matmul) and communication (allreduce) operations. The implementation adds a new AssignStreams optimization pass that transforms stream-parallel loops by:

  • Capturing the main stream before the loop
  • Setting worker streams at the beginning of each iteration and synchronizing with the main stream
  • Creating a joining loop after the main loop to synchronize all worker streams back to the main stream

The changes include:

  • New csrc/host_ir/assign_streams.{cpp,h} implementing the stream assignment pass
  • Integration of the pass into the host IR pipeline
  • Comprehensive test coverage with benchmarks comparing nvFuser against a PyTorch reference implementation
  • Code cleanup removing unnecessary includes and forward declarations

Benchmark results show nvFuser is slightly faster than the reference implementation (3.8ms vs 4.6ms mean), addressing issue #5308. The implementation correctly handles stream ordering and synchronization.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is clean, well-structured, and thoroughly tested. The core stream assignment logic correctly orders operations (SetCurrentStream before Synchronize). The PR includes comprehensive test coverage with both correctness tests and benchmarks, and demonstrates performance improvements over the reference implementation. Code cleanup changes are safe and improve maintainability.
  • No files require special attention

Important Files Changed

  • csrc/host_ir/assign_streams.cpp: New file implementing stream assignment pass for stream-parallel loops, creates worker streams and synchronization logic
  • csrc/host_ir/assign_streams.h: New header file declaring AssignStreams optimization pass
  • csrc/host_ir/passes.cpp: Added AssignStreams pass to the host IR pass pipeline
  • tests/python/multidevice/test_overlap.py: Added benchmark test with parameterized chunk sizes and reference implementation for stream-parallel linear forward pass

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant W0 as Worker Stream 0
    participant W1 as Worker Stream 1
    participant W2 as Worker Stream 2
    
    Note over Main: GetCurrentStream
    Note over Main: Start ForLoop (i=0..2)
    
    Main->>W0: SetCurrentStream(0)
    W0->>Main: Synchronize Main Stream
    Note over W0: ShardByStream (chunk 0)
    Note over W0: Linear (matmul chunk 0)
    Note over W0: AllReduce (async)
    
    par Parallel Execution
        Main->>W1: SetCurrentStream(1)
        W1->>Main: Synchronize Main Stream
        Note over W1: ShardByStream (chunk 1)
        Note over W1: Linear (matmul chunk 1)
        Note over W1: AllReduce (async)
    and
        Main->>W2: SetCurrentStream(2)
        W2->>Main: Synchronize Main Stream
        Note over W2: ShardByStream (chunk 2)
        Note over W2: Linear (matmul chunk 2)
        Note over W2: AllReduce (async)
    end
    
    Note over Main: End ForLoop
    Note over Main: SetCurrentStream(Main)
    Note over Main: Start Join Loop (i=0..2)
    
    Main->>W0: Synchronize Worker Stream 0
    Main->>W1: Synchronize Worker Stream 1
    Main->>W2: Synchronize Worker Stream 2
    
    Note over Main: All chunks complete

@greptile-apps greptile-apps bot left a comment

15 files reviewed, 1 comment


Comment on lines +44 to +46
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Contributor

logic: Insertion order creates reversed execution order. The second insert at old_begin places sync_main before set_stream, but the worker stream should be set before synchronization.

Suggested change
Original:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
for_loop->body().insert(old_begin, sync_main);
for_loop->body().insert(old_begin, set_stream);

Collaborator Author

Skip this file. Subsumed by another PR.

namespace nvfuser {

class ViewTransform;
class Scope;
Collaborator Author

Not necessary -- class Scope is defined in the same file.

Collaborator Author

Skip this file -- subsumed by another PR

from nvfuser_direct import FusionDefinition, ParallelType, DataType


def test_matmul(nvfuser_direct_test):
Collaborator Author

The tests didn't use the nvfuser_direct_test fixture.

@greptile-apps greptile-apps bot left a comment

15 files reviewed, 1 comment


Comment on lines +45 to +46
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Contributor

logic: Wrong insertion order - both inserts use old_begin, causing sync_main to be placed before set_stream. This means synchronization happens on the main stream before switching to the worker stream. The worker stream should be set first, then synchronize.

Current order: Sync(main) → SetStream(worker)
Required order: SetStream(worker) → Sync(main)

Fix by inserting sync_main at the updated iterator after inserting set_stream

Collaborator Author

This doesn't make sense. Maybe it missed that Scope contains a linked list?

Contributor

You're absolutely right! I completely missed that Scope uses a linked list structure. When you insert at old_begin twice, both operations insert at the same position, so the second insert (sync_main) goes before the first insert (set_stream).

Looking at the code, both insertions use the same iterator old_begin, which means:

  1. First insert: set_stream goes at old_begin
  2. Second insert: sync_main goes at the same old_begin position, pushing set_stream forward

This creates the wrong order: sync_main → set_stream

The fix is to update the iterator after the first insertion:

Suggested change
Original:
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);

Should be:

Suggested change
Original:
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);

Wait, that's still wrong. Let me fix this properly:

Suggested change
Original:
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);

Actually, the correct fix is:

Suggested change
Original:
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
auto new_pos = for_loop->body().exprs().begin();
std::advance(new_pos, 1);
for_loop->body().insert(new_pos, sync_main);

Or more simply, insert sync_main after inserting set_stream by using the updated iterator position.

Collaborator Author

insert(old_begin, ...) inserts before old_begin, doesn't it?

Contributor

You're absolutely right! insert(old_begin, ...) inserts before old_begin. So the current code:

for_loop->body().insert(old_begin, set_stream);    // inserts set_stream before old_begin
for_loop->body().insert(old_begin, sync_main);     // inserts sync_main before old_begin

Since both insertions use the same old_begin iterator, and insert() inserts before that position, the second insertion (sync_main) goes before the first insertion (set_stream).

This creates the order: sync_main → set_stream → (original body)

But we want: set_stream → sync_main → (original body)

The fix is to update the iterator after the first insertion:

Suggested change
Original:
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(for_loop->body().exprs().begin(), sync_main);

This way set_stream gets inserted first, then sync_main gets inserted at the new beginning (which is now after set_stream).

Collaborator Author

mpirun -np 2 -x NVFUSER_DUMP=host_ir pytest tests/python/multidevice/test_overlap.py -k 'row_parallel_linear_forward and not reference and not benchmark' --only-mpi -vs confirmed that SetCurrentStream is indeed before Synchronize Stream. The code as is is correct.

%HostIrContainer { (T0_g___bfloat[istreamIdx7{3}, ideviceIdx.x9{2}, iS8{( ceilDiv(i0, 3) )}, iS10{4}] (DeviceMesh{0 1}), T1_g___bfloat[ideviceIdx.x11{2}, iS2{2}, iS12{4}] (DeviceMesh{0 1})) -> (T2_g___bfloat[istreamIdx27{3}, rdeviceIdx.x26{2}, iS28{( ceilDiv(i0, 3) )}, iS25{2}] (DeviceMesh{0 1})) :
  T2_g___bfloat[istreamIdx27{3}, rdeviceIdx.x26{2}, iS28{( ceilDiv(i0, 3) )}, iS25{2}] (DeviceMesh{0 1}) = ALLOCATE(buffer=T2_g___bfloat[istreamIdx27{3}, rdeviceIdx.x26{2}, iS28{( ceilDiv(i0, 3) )}, iS25{2}] (DeviceMesh{0 1}), mem_type=global, size=( i0 * 2 ), zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0x3df7aa70
  FOR i535 from 0 to 3:
    SetCurrentStream to Stream i535
    Synchronize Stream 0x3df7aa70
    T4_l___bfloat[istreamIdx37{3}, iS38{( ceilDiv(i0, 3) )}, ideviceIdx.x35{2}, iS36{4}] (DeviceMesh{0 1}) = ShardByStream(T0_g___bfloat[istreamIdx7{3}, ideviceIdx.x9{2}, iS8{( ceilDiv(i0, 3) )}, iS10{4}] (DeviceMesh{0 1}), stream_index = i535)
    T3_g___bfloat[istreamIdx20{3}, ideviceIdx.x22{2}rf, iS21{( ceilDiv(i0, 3) )}, iS18{2}, rS23{4}rf] (DeviceMesh{0 1}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx20{3}, ideviceIdx.x22{2}rf, iS21{( ceilDiv(i0, 3) )}, iS18{2}, rS23{4}rf] (DeviceMesh{0 1}), mem_type=global, size=( ( ceilDiv(i0, 3) ) * 12 ), zero_init=false, resets_to_zero=false)
    T3_g___bfloat[istreamIdx20{3}, ideviceIdx.x22{2}rf, iS21{( ceilDiv(i0, 3) )}, iS18{2}, rS23{4}rf] (DeviceMesh{0 1})
       = linear(T4_l___bfloat[istreamIdx37{3}, iS38{( ceilDiv(i0, 3) )}, ideviceIdx.x35{2}, iS36{4}] (DeviceMesh{0 1}),
                T1_g___bfloat[ideviceIdx.x11{2}, iS2{2}, iS12{4}] (DeviceMesh{0 1})      )
    T5_l___bfloat[istreamIdx41{3}, iS42{( ceilDiv(i0, 3) )}, iS40{2}] (DeviceMesh{0 1}) = ShardByStream(T2_g___bfloat[istreamIdx27{3}, rdeviceIdx.x26{2}, iS28{( ceilDiv(i0, 3) )}, iS25{2}] (DeviceMesh{0 1}), stream_index = i535)
    Communication 272 (type=Allreduce, team=(0 1), input=T3_g___bfloat[istreamIdx20{3}, ideviceIdx.x22{2}rf, iS21{( ceilDiv(i0, 3) )}, iS18{2}, rS23{4}rf] (DeviceMesh{0 1}), output=T5_l___bfloat[istreamIdx41{3}, iS42{( ceilDiv(i0, 3) )}, iS40{2}] (DeviceMesh{0 1}), backend=NCCL)
    Wait Communication 272
  SetCurrentStream to Stream 0x3df7aa70
  FOR i535 from 0 to 3:
    Synchronize Stream i535
} // %HostIrContainer
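
As a side note on the iterator debate above, here is a minimal standalone C++ sketch (not part of the PR; Scope's real container is nvFuser-internal) of why two inserts at the same std::list iterator preserve insertion order, i.e. why set_stream ends up before sync_main:

#include <iostream>
#include <list>
#include <string>

int main() {
  std::list<std::string> body = {"original_body"};
  // old_begin keeps pointing at "original_body"; std::list iterators are not
  // invalidated by insertions elsewhere in the list.
  auto old_begin = body.begin();
  // Each insert places its element immediately before old_begin, so the
  // second insert lands after the first one.
  body.insert(old_begin, "SetCurrentStream(worker)");
  body.insert(old_begin, "Synchronize(main)");
  for (const std::string& stmt : body) {
    std::cout << stmt << "\n";
  }
  // Prints:
  //   SetCurrentStream(worker)
  //   Synchronize(main)
  //   original_body
  return 0;
}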


// We should check that the loop is stream-parallel. This is not necessary
// at this moment because all loops are stream-parallel. This is also hard
// to do because hir::ForLoop doesn't point to the source IterDomain.
Collaborator

To check my understanding: the IterDomain in the ForLoop is not stream-parallel, unlike the IterDomain in the output's loop domain?

Collaborator Author

Unlike kir::ForLoop, hir::ForLoop doesn't have the IterDomain as an input. Maybe it should.

Collaborator

Is it as simple as addInput(id) as is done for kir::ForLoop?

Collaborator Author

That's plausible. I'm debating with myself how much host IR should depend on nvFuser specifics like IterDomain. I could go with adding a ParallelType as an attribute. 🤷

@wujingyue wujingyue requested a review from Priya2698 January 6, 2026 04:50
Base automatically changed from wjy/ref to main January 6, 2026 07:10
@wujingyue
Collaborator Author

!test

Comment on lines +31 to +33
// We should check that the loop is stream-parallel. This is not necessary
// at this moment because all loops are stream-parallel. This is also hard
// to do because hir::ForLoop doesn't point to the source IterDomain.
Collaborator

Do ALL hir::ForLoops stream-parallelize? Is there no case where we want to sequentially loop in hir? or is this pass triggered by some other condition I'm not seeing?

Collaborator Author

Do ALL hir::ForLoops stream-parallelize?

Yes at this moment.

I'm considering separating ParallelType::Stream and ParallelType::HostSerial. The latter doesn't exist today. That's when we'll have to look at the parallel type of the loop index.

@Priya2698 Priya2698 left a comment

LGTM. Let me know if you plan to test merging the for-loops in this PR.

}

// We should check that the loop is stream-parallel. This is not necessary
// at this moment because all loops are stream-parallel. This is also hard
@Priya2698 Priya2698 Jan 9, 2026

Not strictly for this PR, but similar to kir::ForLoop, hir::ForLoop can hold the source iterdomain for this check

for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);

// After the loop: create a joining loop to synchronize all worker streams
Collaborator

Do you plan on merging this with the above for-loop?

Collaborator Author

I haven't convinced myself that will work: http://nv/e-d

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Implements stream parallelization for loops in nvFuser's host IR to enable overlapping computation and communication operations. Adds new AssignStreams optimization pass that transforms stream-parallel loops by capturing the main stream, assigning worker streams at loop iteration start with synchronization, and creating a joining loop afterward to synchronize all worker streams back. Includes comprehensive test coverage with benchmarks showing nvFuser slightly outperforms the PyTorch reference implementation (3.6ms vs 4.6ms mean).

Confidence Score: 4/5

  • Safe to merge with minor improvement opportunities
  • The implementation is well-structured with proper synchronization patterns, comprehensive tests, and correct integration into the host IR pipeline. The main concern is the acknowledged missing validation for loop stream-parallelization (line 31-33 in assign_streams.cpp), which could theoretically transform non-stream-parallel loops incorrectly, though the comment indicates all current loops are stream-parallel. The code follows established patterns, includes thorough test coverage, and demonstrates correct behavior via benchmarks and profiling.
  • csrc/host_ir/assign_streams.cpp - consider adding validation or assertion for stream-parallel loop check

Important Files Changed

File Analysis

  • csrc/host_ir/assign_streams.cpp (4/5): Implements AssignStreams pass to transform stream-parallel loops with proper synchronization; well-structured but lacks validation for loop stream-parallelization
  • csrc/host_ir/passes.cpp (5/5): Integrates AssignStreams pass into host IR pipeline after AllocateAndDeallocate; correct ordering
  • tests/python/direct/test_stream.py (5/5): Comprehensive tests for stream parallelization with matmul operations; validates correct kernel count and shapes

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant W0 as Worker Stream 0
    participant W1 as Worker Stream 1
    participant W2 as Worker Stream 2
    
    Note over Main: GetCurrentStream (capture main)
    Note over Main: FOR i=0 to 3
    
    Main->>W0: SetCurrentStream(0)
    W0->>Main: Synchronize(main_stream)
    Note over W0: Execute loop body (matmul/allreduce)
    
    Main->>W1: SetCurrentStream(1)
    W1->>Main: Synchronize(main_stream)
    Note over W1: Execute loop body (matmul/allreduce)
    
    Main->>W2: SetCurrentStream(2)
    W2->>Main: Synchronize(main_stream)
    Note over W2: Execute loop body (matmul/allreduce)
    
    Note over Main: SetCurrentStream(main_stream)
    Note over Main: FOR i=0 to 3 (joining loop)
    
    Main->>W0: Synchronize(worker_stream_0)
    Main->>W1: Synchronize(worker_stream_1)
    Main->>W2: Synchronize(worker_stream_2)
    
    Note over Main: All workers synchronized back

Comment on lines +31 to +33
// We should check that the loop is stream-parallel. This is not necessary
// at this moment because all loops are stream-parallel. This is also hard
// to do because hir::ForLoop doesn't point to the source IterDomain.
Contributor

The comment acknowledges this validation is skipped, but consider adding a TODO or assertion to track this technical debt. Without validation, non-stream-parallel loops could be incorrectly transformed, potentially leading to incorrect synchronization patterns. At minimum, add a NVF_CHECK that verifies the loop meets basic requirements (e.g., has a valid index, start, and stop).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR implements stream parallelization for loops in nvFuser's host IR to enable overlapping of computation and communication operations. The core implementation adds a new AssignStreams optimization pass that transforms stream-parallel loops by assigning worker streams to each iteration and adding proper synchronization.

Key Changes:

  • New AssignStreams pass in csrc/host_ir/assign_streams.{cpp,h} that transforms loops to use worker streams
  • Integration into host IR pipeline via csrc/host_ir/passes.cpp
  • Comprehensive test coverage in tests/python/multidevice/test_overlap.py with benchmarks
  • Code cleanup: removed unnecessary includes in allocate_and_deallocate.h, ir.h, and internal_nodes.h

Transformation Pattern:
For each loop, the pass:

  1. Captures the main stream before the loop
  2. At the start of each iteration: switches to a worker stream and synchronizes with the main stream
  3. After the loop: creates a joining loop that synchronizes all worker streams back to main

Issues Found:

  • Copyright year is 2026 in both new files (should be 2025)
  • Missing validation that loops are actually stream-parallel (acknowledged in code comment but not implemented)

The implementation correctly follows the stream synchronization pattern demonstrated in the PyTorch reference implementation. Benchmark results show nvFuser achieves slight performance improvements over the baseline.

Confidence Score: 4/5

  • This PR is safe to merge with minor corrections needed for copyright years
  • The implementation is technically sound with correct synchronization logic matching the reference implementation. The only actual errors are copyright year mistakes (2026 instead of 2025). The missing stream-parallel validation is acknowledged in comments and appears to be a known limitation rather than an oversight. Comprehensive tests provide good coverage.
  • Pay attention to the copyright years in csrc/host_ir/assign_streams.cpp and csrc/host_ir/assign_streams.h which need correction from 2026 to 2025

Important Files Changed

File Analysis

  • csrc/host_ir/assign_streams.cpp (4/5): New file implementing stream assignment pass for stream-parallel loops. Transforms loops to use worker streams with proper synchronization. Copyright year is incorrect (2026 instead of 2025). Logic is sound but lacks validation that loops are actually stream-parallel.
  • csrc/host_ir/assign_streams.h (4/5): New header file declaring AssignStreams optimization pass. Copyright year is incorrect (2026 instead of 2025). Clean interface design.
  • csrc/host_ir/passes.cpp (5/5): Added AssignStreams pass to the host IR pipeline after AllocateAndDeallocate. Simple integration, no issues found.
  • tests/python/multidevice/test_overlap.py (5/5): Added comprehensive tests for stream-parallelized row-parallel linear forward pass with benchmarks. Includes reference implementation matching the transformation pattern. Well-structured tests with proper validation.

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant Worker as Worker Stream i
    participant Ops as Operations
    
    Note over Main: Save main stream
    Main->>Main: GetCurrentStream
    
    Note over Main,Worker: Loop iteration starts
    Main->>Worker: SetCurrentStream worker_stream i
    Worker->>Main: Synchronize main_stream
    Note over Worker: Wait for main work to complete
    Worker->>Ops: Execute matmul
    Worker->>Ops: Execute allreduce
    Note over Worker: Work continues in parallel
    
    Note over Main: After loop completes
    Worker->>Main: SetCurrentStream main_stream
    
    Note over Main: Join loop synchronizes workers
    Main->>Worker: Synchronize worker_stream 0
    Main->>Worker: Synchronize worker_stream 1
    Main->>Worker: Synchronize worker_stream N
    Note over Main: All workers synchronized
    
    Note over Main: Continue on main stream

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Overview

This PR implements stream parallelization for loops in nvFuser's host IR, enabling overlapping of computation (matmul) and communication (allreduce) operations to improve performance in distributed workloads.

Key Changes

New Stream Assignment Pass (csrc/host_ir/assign_streams.cpp):

  • Transforms stream-parallel loops to use multiple CUDA streams
  • Captures the main stream before the loop
  • Sets worker streams at the beginning of each iteration with synchronization
  • Creates a joining loop after the main loop to synchronize all worker streams back to the main stream

Integration:

  • Added AssignStreams pass to the host IR optimization pipeline (runs after AllocateAndDeallocate)
  • Updated CMakeLists.txt to include the new source file

Tests & Benchmarks:

  • Refactored test_row_parallel_linear_forward to extract reusable fusion definition
  • Added parametrized benchmark test_row_parallel_linear_forward_benchmark with varying chunk counts (s=1,2,4)
  • Includes reference PyTorch implementation for validation

Code Cleanup:

  • Removed unnecessary includes from header files
  • Improved documentation for Scope::insert method
  • Fixed test function signatures

Architecture

The stream assignment creates a pattern where:

  1. Main stream is captured before the loop
  2. Each loop iteration executes on a dedicated worker stream
  3. Worker streams synchronize with the main stream before starting work
  4. After the main loop, a joining loop makes the main stream wait for all workers

This enables concurrent execution of loop iterations on different streams, particularly beneficial for overlapping matmul operations with collective communication.
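
As an illustration only (this is not nvFuser code, and the actual Synchronize node may be implemented differently), here is a self-contained CUDA runtime sketch of the same main-stream/worker-stream pattern, with a hypothetical launchChunk standing in for the per-chunk matmul + allreduce:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the per-chunk work (matmul + allreduce); a real
// implementation would enqueue kernels and NCCL calls on `stream`.
void launchChunk(int /*chunk*/, cudaStream_t /*stream*/) {}

void runOverlapped(int num_chunks) {
  cudaStream_t main_stream;
  cudaStreamCreate(&main_stream);  // stands in for the captured current stream

  // Record what has been enqueued on the main stream so far.
  cudaEvent_t main_ready;
  cudaEventCreateWithFlags(&main_ready, cudaEventDisableTiming);
  cudaEventRecord(main_ready, main_stream);

  std::vector<cudaStream_t> workers(num_chunks);
  std::vector<cudaEvent_t> done(num_chunks);
  for (int i = 0; i < num_chunks; ++i) {
    cudaStreamCreate(&workers[i]);
    // Worker i first waits for the main stream's prior work...
    cudaStreamWaitEvent(workers[i], main_ready, 0);
    // ...then runs its chunk asynchronously on its own stream.
    launchChunk(i, workers[i]);
    cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
    cudaEventRecord(done[i], workers[i]);
  }

  // Join loop: the main stream waits for every worker before continuing.
  for (int i = 0; i < num_chunks; ++i) {
    cudaStreamWaitEvent(main_stream, done[i], 0);
  }

  cudaStreamSynchronize(main_stream);
  for (int i = 0; i < num_chunks; ++i) {
    cudaEventDestroy(done[i]);
    cudaStreamDestroy(workers[i]);
  }
  cudaEventDestroy(main_ready);
  cudaStreamDestroy(main_stream);
}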

Performance Impact

Benchmark results show slight improvement with overlapping (3.6ms vs 3.7-3.9ms for s=1), with the implementation verified via nsys profiling to correctly assign different streams to different operations.

Confidence Score: 2/5

  • This PR contains a critical logic error that will cause incorrect stream synchronization behavior at runtime.
  • Score reflects a critical bug in csrc/host_ir/assign_streams.cpp (lines 49-53) where the insertion order of SetCurrentStream and join_loop is inverted, causing the join loop to execute before setting the current stream to main_stream. This means synchronizations will happen on the wrong stream (last worker stream instead of main stream), breaking the intended stream parallelization semantics. Additionally, the copyright year is incorrect (2026 instead of 2025).
  • csrc/host_ir/assign_streams.cpp requires immediate attention to fix the insertion order bug on lines 49-53. The expected Host IR comment in tests/python/multidevice/test_overlap.py (line 74-75) should also be updated to include the missing SetCurrentStream statement.

Important Files Changed

File Analysis

  • csrc/host_ir/assign_streams.cpp (2/5): New file implementing stream assignment pass. Contains critical logic error in insertion order (lines 49-53) causing incorrect stream synchronization sequence. Also has incorrect copyright year (2026).
  • csrc/host_ir/assign_streams.h (5/5): New header file declaring AssignStreams optimization pass. Clean interface, no issues detected.
  • csrc/host_ir/passes.cpp (5/5): Integrates new AssignStreams pass into pipeline. Correct ordering after AllocateAndDeallocate pass.
  • tests/python/multidevice/test_overlap.py (4/5): Refactors test into reusable function, adds benchmark test, updates to use outer_split instead of split. Expected Host IR comment may be missing SetCurrentStream statement between loops.

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant W0 as Worker Stream 0
    participant W1 as Worker Stream 1
    participant W2 as Worker Stream 2
    
    Note over Main: GetCurrentStream
    
    rect rgb(200, 220, 255)
        Note over Main,W2: Main Loop (stream-parallel iterations)
        Main->>W0: SetCurrentStream(worker_0)
        W0->>Main: Synchronize(main_stream)
        Note over W0: Compute chunk 0<br/>(matmul + allreduce)
        
        Main->>W1: SetCurrentStream(worker_1)
        W1->>Main: Synchronize(main_stream)
        Note over W1: Compute chunk 1<br/>(matmul + allreduce)
        
        Main->>W2: SetCurrentStream(worker_2)
        W2->>Main: Synchronize(main_stream)
        Note over W2: Compute chunk 2<br/>(matmul + allreduce)
    end
    
    Note over Main: SetCurrentStream(main_stream)
    
    rect rgb(255, 220, 200)
        Note over Main,W2: Join Loop (synchronize workers)
        W0->>Main: Synchronize(worker_0)
        W1->>Main: Synchronize(worker_1)
        W2->>Main: Synchronize(worker_2)
    end
    
    Note over Main: Continue on main stream

@@ -0,0 +1,64 @@
// clang-format off
/*
* SPDX-FileCopyrightText: Copyright (c) 2026-present NVIDIA CORPORATION & AFFILIATES.
Contributor

The copyright year is set to 2026, which appears to be incorrect. This should likely be 2025 or earlier.

Suggested change
Original:
* SPDX-FileCopyrightText: Copyright (c) 2026-present NVIDIA CORPORATION & AFFILIATES.
Suggested:
* SPDX-FileCopyrightText: Copyright (c) 2025-present NVIDIA CORPORATION & AFFILIATES.

Comment on lines +49 to +53
hic->topLevel().insert(
next_it, IrBuilder::create<SetCurrentStream>(main_stream));
auto* join_loop = IrBuilder::create<ForLoop>(
for_loop->index(), for_loop->start(), for_loop->stop());
hic->topLevel().insert(next_it, join_loop);
Contributor

Critical logic error: The insertion order will result in incorrect code generation. Both statements use insert(next_it, ...) which inserts BEFORE the iterator position. Since both use the same next_it, the join_loop (inserted second) will be placed BEFORE the SetCurrentStream (inserted first), resulting in:

FOR loop (original)
join_loop               ← wrong position
SetCurrentStream        ← wrong position  

The correct order should be SetCurrentStream THEN join_loop, because:

  1. After the first FOR loop exits, the current stream is the last worker stream (not the main stream)
  2. The join_loop synchronizes worker streams on the current stream
  3. Therefore, we must set the current stream to main_stream BEFORE the join_loop

This matches the reference implementation row_parallel_linear_forward_reference in test_overlap.py (lines 182-194) where main_stream.wait_stream(worker_stream) happens on the main_stream.

The fix is to save the iterator returned from the first insert and use it for the second insert:

Suggested change
Original:
hic->topLevel().insert(
next_it, IrBuilder::create<SetCurrentStream>(main_stream));
auto* join_loop = IrBuilder::create<ForLoop>(
for_loop->index(), for_loop->start(), for_loop->stop());
hic->topLevel().insert(next_it, join_loop);
Suggested:
// After the loop: create a joining loop to synchronize all worker streams
auto set_main_it = hic->topLevel().insert(
next_it, IrBuilder::create<SetCurrentStream>(main_stream));
auto* join_loop = IrBuilder::create<ForLoop>(
for_loop->index(), for_loop->start(), for_loop->stop());
hic->topLevel().insert(std::next(set_main_it), join_loop);

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR implements stream parallelization for loops in nvFuser's host IR to enable overlapping of computation (matmul) and communication (allreduce) operations. The core implementation adds an AssignStreams optimization pass that transforms stream-parallel loops by capturing the main stream, assigning worker streams to each iteration with proper synchronization, and creating a joining loop to synchronize all worker streams back to the main stream.

Key Changes:

  • New csrc/host_ir/assign_streams.{cpp,h} implementing the stream assignment pass
  • Integration of the pass into the host IR optimization pipeline
  • Refactored test structure with new benchmark tests
  • Code cleanup removing unnecessary includes

Issues Found:

  • The joining loop reuses the main loop's index variable, which could cause bugs if code after the loops depends on the index value
  • Copyright year typo (2026 instead of 2025) in the new header file

Confidence Score: 3/5

  • This PR has one moderate logic issue that should be addressed before merging
  • The implementation is generally sound with proper synchronization primitives and good test coverage. However, there's a potential bug where the joining loop reuses the main loop's index variable (line 51 in assign_streams.cpp), which could cause issues if any code depends on the loop index value after the loop completes. This needs verification or fixing. The copyright year typo is a minor style issue.
  • Pay close attention to csrc/host_ir/assign_streams.cpp - specifically the index variable reuse in the joining loop

Important Files Changed

File Analysis

  • csrc/host_ir/assign_streams.cpp (3/5): New file implementing stream parallelization pass. Contains potential bug with loop index reuse in joining loop.
  • csrc/host_ir/assign_streams.h (4/5): Header file with copyright year typo (2026 instead of 2025).
  • tests/python/multidevice/test_overlap.py (4/5): Refactored test to extract fusion definition, added benchmark test. Changed from torch.randint to torch.testing.make_tensor and torch.randn for test data generation.

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant Pass as AssignStreams Pass
    participant ForLoop as Stream Parallel ForLoop
    participant W0 as Worker Stream 0
    participant W1 as Worker Stream 1
    participant WN as Worker Stream N
    
    Pass->>Main: GetCurrentStream
    Note over Pass: Capture main stream before loop
    Pass->>ForLoop: Transform loop structure
    
    Note over ForLoop,WN: Main Loop Execution
    ForLoop->>W0: SetCurrentStream worker 0
    W0->>Main: Synchronize with main stream
    W0->>W0: Execute iteration 0 body
    
    ForLoop->>W1: SetCurrentStream worker 1
    W1->>Main: Synchronize with main stream
    W1->>W1: Execute iteration 1 body
    
    ForLoop->>WN: SetCurrentStream worker N
    WN->>Main: Synchronize with main stream
    WN->>WN: Execute iteration N body
    
    Pass->>Main: SetCurrentStream back to main
    
    Note over ForLoop,WN: Joining Loop Execution
    Main->>W0: Synchronize worker stream 0
    Main->>W1: Synchronize worker stream 1
    Main->>WN: Synchronize worker stream N
    
    Note over Main: All workers joined to main stream

@@ -0,0 +1,26 @@
// clang-format off
/*
* SPDX-FileCopyrightText: Copyright (c) 2026-present NVIDIA CORPORATION & AFFILIATES.
Contributor

copyright year is 2026 (future year) - should be 2025

Comment on lines +51 to +52
auto* join_loop = IrBuilder::create<ForLoop>(
for_loop->index(), for_loop->start(), for_loop->stop());
Contributor

reusing for_loop->index() for the joining loop could cause issues. The index variable from the main loop will be overwritten during the join loop execution. If any code after the loops depends on the loop index value, this could lead to bugs. Consider creating a new index variable for the joining loop.

@wujingyue
Collaborator Author

!test

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Implements stream parallelization for loops in nvFuser's host IR to enable overlapping of computation and communication operations. Adds an AssignStreams optimization pass that transforms loops by capturing the main stream, assigning worker streams to each iteration with proper synchronization, and creating a joining loop to synchronize all worker streams back to the main stream. Includes comprehensive test coverage with benchmarks showing slight performance improvements from overlapping.

Confidence Score: 4/5

  • Safe to merge with minor style improvement suggested for iterator handling
  • The implementation correctly achieves stream parallelization with proper synchronization semantics as verified by tests. The core logic is sound: capturing the main stream, assigning worker streams per iteration, and synchronizing back via a joining loop. One style issue noted regarding iterator invalidation (lines 44-46 in assign_streams.cpp) that works in practice but could be clearer. No functional bugs or security concerns identified.
  • csrc/host_ir/assign_streams.cpp for iterator handling style improvement

Important Files Changed

File Analysis

  • csrc/host_ir/assign_streams.cpp (4/5): Adds stream assignment pass to parallelize loops across streams for overlapping computation and communication
  • csrc/host_ir/passes.cpp (5/5): Integrates AssignStreams pass into the host IR optimization pipeline
  • tests/python/multidevice/test_overlap.py (4/5): Refactors row parallel linear forward test, adds benchmark tests and updates expected host IR comments

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant W0 as Worker Stream 0
    participant W1 as Worker Stream 1
    participant W2 as Worker Stream 2
    
    Note over Main: GetCurrentStream
    
    rect rgb(200, 220, 255)
        Note over Main,W2: Main Loop (3 iterations)
        
        Main->>W0: SetCurrentStream(0)
        W0->>Main: Synchronize(main)
        Note over W0: Compute + AllReduce
        
        par Parallel Execution
            Main->>W1: SetCurrentStream(1)
            W1->>Main: Synchronize(main)
            Note over W1: Compute + AllReduce
        and
            Note over W0: Continue async work
        end
        
        par Parallel Execution
            Main->>W2: SetCurrentStream(2)
            W2->>Main: Synchronize(main)
            Note over W2: Compute + AllReduce
        and
            Note over W0,W1: Continue async work
        end
    end
    
    rect rgb(255, 220, 200)
        Note over Main,W2: Joining Loop (synchronize workers)
        Main->>W0: Synchronize(worker 0)
        Main->>W1: Synchronize(worker 1)
        Main->>W2: Synchronize(worker 2)
    end
    
    Main->>Main: SetCurrentStream(main)

Comment on lines +44 to +46
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Contributor

The insertion order places sync_main before set_stream due to how insert works with old_begin. After the first insert at old_begin, set_stream is inserted. When the second insert happens at the same old_begin (which still points to the original beginning), sync_main gets inserted before set_stream.

This creates the sequence: sync_main → set_stream → rest of body. However, the correct order should be set_stream → sync_main because you need to set the worker stream first before synchronizing with the main stream on that worker stream.

Suggested change
Original:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
auto* worker_stream = IrBuilder::create<Stream>(for_loop->index());
auto* set_stream = IrBuilder::create<SetCurrentStream>(worker_stream);
auto* sync_main = IrBuilder::create<Synchronize>(main_stream);
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
auto new_begin = for_loop->body().exprs().begin();
for_loop->body().insert(std::next(new_begin), sync_main);

Comment on lines +44 to +46
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Contributor

Using old_begin after the first insert relies on iterator invalidation rules that may lead to undefined behavior. While this works in practice (confirmed by tests), consider capturing the return value for clarity:

Suggested change
Original:
auto old_begin = for_loop->body().exprs().begin();
for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(old_begin, sync_main);
Suggested:
auto old_begin = for_loop->body().exprs().begin();
auto it = for_loop->body().insert(old_begin, set_stream);
for_loop->body().insert(std::next(it), sync_main);

This makes the intent clearer and avoids potential issues with iterator invalidation.

@wujingyue wujingyue requested a review from Priya2698 January 9, 2026 15:03
@wujingyue
Collaborator Author

!test

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR implements stream parallelization for loops in nvFuser's host IR to enable overlapping of computation and communication operations. The core implementation adds a new AssignStreams optimization pass that transforms stream-parallel loops by:

  1. Capturing the main stream before the loop begins
  2. Setting worker streams at the start of each loop iteration and synchronizing with the main stream to ensure proper ordering
  3. Creating a join loop after the main loop that synchronizes all worker streams back to the main stream

The implementation correctly handles:

  • Stream assignment and synchronization order (SetCurrentStream → Synchronize → work)
  • Reusing the loop index variable for both the main and join loops (as shown in expected IR)
  • Integration into the host IR pass pipeline after allocation/deallocation

Additional changes include:

  • Test refactoring to use outer_split() API instead of split(axis, factor, inner_split=False)
  • New benchmark test for performance measurement with different chunk sizes
  • Documentation updates for profiling multi-process applications
  • Cleanup of unused includes and test fixture parameters

Benchmark results show slight performance improvements from overlapping, with the implementation verified via nsys profiling showing proper stream assignment and overlapping of matmul and allreduce operations.

Confidence Score: 5/5

  • Safe to merge - implementation is correct with proper stream synchronization semantics
  • The implementation correctly follows CUDA stream semantics for parallelization. The insertion order of SetCurrentStream followed by Synchronize is correct (verified by tracing through std::vector::insert semantics). The reuse of loop index variables between main and join loops is intentional and matches the expected IR output documented in tests. All changes are well-tested with both functional tests and benchmarks.
  • No files require special attention

Important Files Changed

File Analysis

  • csrc/host_ir/assign_streams.cpp (5/5): New pass implementing stream parallelization for loops - correctly inserts stream setup/synchronization and creates join loop for cleanup
  • csrc/host_ir/assign_streams.h (5/5): Header for AssignStreams optimization pass - standard pass declaration with no issues
  • csrc/host_ir/passes.cpp (5/5): Integrates AssignStreams pass into host IR pipeline after AllocateAndDeallocate
  • tests/python/multidevice/test_overlap.py (5/5): Refactors test into reusable function, uses outer_split API, adds benchmark test, and updates expected host IR documentation

Sequence Diagram

sequenceDiagram
    participant Main as Main Stream
    participant Pass as AssignStreams Pass
    participant W0 as Worker Stream 0
    participant W1 as Worker Stream 1
    participant WN as Worker Stream N

    Pass->>Main: GetCurrentStream()
    Note over Pass: Transform loop body
    
    Note over Pass,WN: Original Loop Iterations
    Pass->>W0: SetCurrentStream(Stream 0)
    W0->>Main: Synchronize(main_stream)
    W0->>W0: Execute iteration 0 work (matmul + allreduce)
    
    Pass->>W1: SetCurrentStream(Stream 1)
    W1->>Main: Synchronize(main_stream)
    W1->>W1: Execute iteration 1 work (matmul + allreduce)
    
    Pass->>WN: SetCurrentStream(Stream N)
    WN->>Main: Synchronize(main_stream)
    WN->>WN: Execute iteration N work (matmul + allreduce)
    
    Pass->>Main: SetCurrentStream(main_stream)
    
    Note over Pass,WN: Join Loop - Synchronize All Workers
    Main->>W0: Synchronize(Stream 0)
    Main->>W1: Synchronize(Stream 1)
    Main->>WN: Synchronize(Stream N)

@wujingyue wujingyue merged commit 1786619 into main Jan 10, 2026
62 of 63 checks passed
@wujingyue wujingyue deleted the wjy/stream branch January 10, 2026 02:40


Development

Successfully merging this pull request may close these issues.

Support circular buffering in host IR lowering to overlap matmul and allreduce
