
Conversation

Priya2698 (Collaborator) commented Nov 20, 2025

Some of the created TensorViews were not sharded consistently, which led to more communication than necessary.

github-actions bot commented Nov 20, 2025

Review updated until commit 97af3bf

Description

  • Fix decomposeLinearWithBias to shard all created TensorViews consistently (sketched below)

  • Add backpropagation of shardings to the intermediate expressions between the bias and the output

  • Update the broadcast mapping in propagation so broadcast dimensions are no longer mapped

  • Enhance test coverage with profiler validation for communication kernels
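
As a quick orientation for the first bullet, here is a minimal sketch of what the decomposition looks like, reconstructed from the sequence diagram and code excerpts later in this review. The names without_bias, broadcasted_bias, and new_out and the two selfReplay calls come from the PR itself; the op calls, their signatures, and the broadcast flags are illustrative assumptions, not the actual pass.

    // Hedged sketch of the decomposition performed by decomposeLinearWithBias,
    // reconstructed from the sequence diagram and diff excerpts in this review.
    // (nvFuser headers omitted; op signatures and broadcast flags are assumed.)
    TensorView* decomposeLinearWithBiasSketch(
        TensorView* in,
        TensorView* weight,
        TensorView* bias,
        TensorView* out /* original linear-with-bias output */) {
      // 1. Linear without the bias; with a row-parallel weight this produces
      //    partial sums that are later reduce-scattered.
      TensorView* without_bias = linear(in, weight);

      // 2. Bias broadcast up to the output rank (flags are illustrative).
      TensorView* broadcasted_bias = broadcast(bias, {true, false});

      // 3. The pass replaces the original output with new_out in the fusion
      //    (the rewiring itself is not shown here).
      TensorView* new_out = add(without_bias, broadcasted_bias);

      // 4. After rewiring the IR, replay out's transforms onto the new
      //    TensorViews so they are sharded like the original output
      //    (these two calls are quoted verbatim in the reviewer guide below).
      TransformReplay::selfReplay(out->domain(), without_bias->domain());
      TransformReplay::selfReplay(out->domain(), new_out->domain());

      // 5. A backward pass over the expressions between {without_bias,
      //    broadcasted_bias} and new_out then shards every intermediate
      //    TensorView consistently; see the loop quoted in the reviewer guide.
      return new_out;
    }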

Changes walkthrough

Relevant files

Bug fix: csrc/preseg_passes/decompose_reshardings.cpp (+22/-2)
Fix sharding consistency in decomposeLinearWithBias

  • Remove redundant selfReplay calls
  • Add a backpropagation loop to shard intermediate expressions consistently
  • Use shardLoopLike with backward propagation for proper sharding
  • Ensure all TensorViews between the bias and the output are properly sharded
Enhancement: csrc/multidevice/propagation.cpp (+6/-2)
Enhance broadcast mapping in propagation (a sketch follows this walkthrough)

  • Add mapBroadcast(false) to the forward direction mapping
  • Add mapBroadcast(false) to the backward direction mapping
  • Improve broadcast dimension handling in the logical domain mapping
Enhancement: csrc/runtime/communication_executor.cpp (+1/-1)
Update the communication executor profiler

  • Change the profiler scheduler type from ExprEval to Communication
  • Update kernel profiling to reflect communication operations
Tests: tests/python/multidevice/test_matmul.py (+23/-21)
Enhance the linear reduce-scatter test with bias and profiling

  • Update test_linear_reduce_scatter with a bias parameter and bfloat16
  • Add profiler validation of communication kernel scheduling
  • Modify tensor dimensions and initialization for better test coverage
  • Ensure a single communication kernel is executed
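
For the propagation.cpp entry above, here is a hedged sketch of how a getRef2TargetMap built on PairwiseLogicalDomainMap looks once .mapBroadcast(false) is applied in both directions. Only PairwiseLogicalDomainMap, .mapBroadcast(false), and PropagateDirection::kBackward appear in this PR; the function shape and the mapProducerToConsumer/mapConsumerToProducer calls are assumptions for illustration.

    // Hedged sketch, not the actual nvFuser source. (nvFuser headers omitted.)
    #include <unordered_map>

    std::unordered_map<IterDomain*, IterDomain*> getRef2TargetMapSketch(
        TensorView* producer,
        TensorView* consumer,
        PropagateDirection direction) {
      if (direction == PropagateDirection::kBackward) {
        // Backward: map consumer IterDomains back to the producer, skipping
        // broadcast IDs so a broadcast dimension never receives (or donates)
        // a device/stream parallel type.
        return PairwiseLogicalDomainMap(producer, consumer)
            .mapBroadcast(false)
            .mapConsumerToProducer();
      }
      // Forward: map producer IterDomains to their consumer counterparts,
      // again with broadcast dimensions excluded.
      return PairwiseLogicalDomainMap(producer, consumer)
          .mapBroadcast(false)
          .mapProducerToConsumer();
    }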

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Complex Sharding Logic

The new backpropagation loop for sharding intermediate expressions is complex and could potentially miss edge cases. The nested loops over expressions and TensorViews need careful validation to ensure all intermediate TensorViews are properly sharded without creating inconsistencies.

    for (Expr* expr : StmtSort::getExprsBetween(
                          {without_bias, broadcasted_bias}, {new_out}) |
         std::views::reverse) {
      for (auto* output : ir_utils::filterByType<TensorView>(expr->outputs())) {
        for (auto* input : ir_utils::filterByType<TensorView>(expr->inputs())) {
          shardLoopLike(
              /*ref=*/output,
              /*target=*/input,
              deviceAndStreamParallelTypes(),
              PropagateDirection::kBackward);
        }
      }
    }
TransformReplay Calls

Multiple TransformReplay::selfReplay calls are added, which could impact performance. The order and necessity of these replay operations should be validated to ensure they don't introduce unnecessary computational overhead.

    TransformReplay::selfReplay(out->domain(), without_bias->domain());
    TransformReplay::selfReplay(out->domain(), new_out->domain());

greptile-apps bot commented Nov 20, 2025

Greptile Overview

Greptile Summary

Fixed inconsistent sharding in decomposeLinearWithBias that caused unnecessary communication operations. The change ensures all intermediate TensorViews created during the decomposition are properly sharded by adding backward propagation logic and preventing broadcast dimensions from being mapped during sharding propagation.

Key improvements:

  • Reordered TransformReplay calls to apply after the IR graph modification
  • Added a backward propagation loop through the intermediate expressions to consistently shard all TensorViews between the bias and the output
  • Modified getRef2TargetMap to exclude broadcast dimensions (.mapBroadcast(false)), preventing incorrect dimension mapping
  • Fixed the profiler to report the correct communication scheduler type
  • Enhanced the test to validate that only one reduce-scatter operation is scheduled, confirming the fix reduces communication overhead

Confidence Score: 4/5

  • This PR is safe to merge with minor verification recommended.
  • The changes correctly address the sharding consistency issue with a well-structured solution. The backward propagation logic follows existing patterns in the codebase, and the test enhancement validates that the fix reduces communication overhead. The score reflects a solid implementation with comprehensive test coverage, though additional validation on diverse model architectures would provide extra confidence.
  • No files require special attention.

Important Files Changed

File Analysis

  • csrc/preseg_passes/decompose_reshardings.cpp (4/5): Fixed sharding propagation in decomposeRowParallelLinearWithBias by adding a backward propagation loop and reordering TransformReplay calls to ensure all intermediate TensorViews are consistently sharded
  • csrc/multidevice/propagation.cpp (5/5): Added .mapBroadcast(false) to the PairwiseLogicalDomainMap calls in getRef2TargetMap to prevent broadcast dimensions from being incorrectly mapped during sharding propagation
  • csrc/runtime/communication_executor.cpp (5/5): Corrected the scheduler type from ExprEval to Communication in the profiler for communication kernels
  • tests/python/multidevice/test_matmul.py (5/5): Enhanced test_linear_reduce_scatter to validate correct sharding by adding a bias parameter, profiling, and an assertion that only one communication kernel is scheduled

Sequence Diagram

    sequenceDiagram
        participant User as User Code
        participant Linear as decomposeRowParallelLinearWithBias
        participant Fusion as Fusion IR
        participant Propagate as shardLoopLike
        participant Map as PairwiseLogicalDomainMap
        
        User->>Linear: linear_with_bias operation
        Linear->>Fusion: Create without_bias = linear(A, B)
        Linear->>Fusion: Create broadcasted_bias
        Linear->>Fusion: Create new_out = add(without_bias, broadcasted_bias)
        Linear->>Fusion: Replace old out with new_out
        
        Note over Linear: Apply sharding transformations
        Linear->>Fusion: TransformReplay on without_bias
        Linear->>Fusion: TransformReplay on new_out
        
        Note over Linear: Backward propagate shardings
        loop For each expr (reverse order)
            loop For each output TV
                loop For each input TV
                    Linear->>Propagate: shardLoopLike(output, input, kBackward)
                    Propagate->>Map: getRef2TargetMap with mapBroadcast(false)
                    Map-->>Propagate: Domain mapping (excluding broadcasts)
                    Propagate->>Fusion: Apply sharding to input TV
                end
            end
        end
        
        Linear-->>User: Consistently sharded computation graph
    

greptile-apps bot left a comment

3 files reviewed, no comments

Priya2698 (Collaborator, Author) commented on csrc/runtime/communication_executor.cpp (quoted hunk):

        group_id_);
    SegmentProfiler& sprof = FusionProfiler::segment(group_id_);
    sprof.inputBytesAccessed(computeBytes(args));
    sprof.scheduler(toString(SchedulerType::ExprEval));

This caused the wrong scheduler name in the profiler output.
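
For reference, the +1/-1 change listed in the walkthrough for this file swaps the scheduler tag on the last quoted line; a sketch of the corrected hunk, assuming SchedulerType::Communication is the enum value the walkthrough refers to:

    // Sketch of the corrected hunk (only the scheduler tag changes; the
    // surrounding lines are quoted above).
    SegmentProfiler& sprof = FusionProfiler::segment(group_id_);
    sprof.inputBytesAccessed(computeBytes(args));
    sprof.scheduler(toString(SchedulerType::Communication));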

Priya2698 (Collaborator, Author) commented:

!test


Review thread on tests/python/multidevice/test_matmul.py (quoted diff hunk):

    (out,) = fd.execute([inp, weight])
    with PythonProfiler() as prof:
        (out,) = fd.execute([inp, weight, bias.cuda()])

A collaborator commented:

Does this synchronize? Could we miss kernels?

Priya2698 (Collaborator, Author) replied on Nov 21, 2025:

fd.execute should not return until the kernels have completed. There is a cudaStreamSynchronize at the end of the nsys trace too.

Is this what you are referring to?

wujingyue (Collaborator) replied on Nov 21, 2025:

There is a difference between cudaStreamSynchronize and cudaDeviceSynchronize, though: the former waits only for the work enqueued on that one stream, while the latter blocks the host until all streams on the device have finished.

Priya2698 (Collaborator, Author) replied on Nov 21, 2025:

You're right. I assumed cudaStreamSynchronize would be enough here, but the pointwise kernel and the NCCL call run on different streams.

FusionProfiler/PythonProfiler synchronize at start but not on stop, so I will add an explicit call here.

Note for myself: see whether FusionProfiler should synchronize before reading data.
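
A minimal, standalone CUDA runtime example of the distinction discussed in this thread (the buffer and stream names are made up; this is not nvFuser code): synchronizing one stream does not wait for work enqueued on another stream, whereas cudaDeviceSynchronize blocks the host until every stream on the device has drained.

    #include <cuda_runtime.h>

    int main() {
      float* buf = nullptr;
      cudaMalloc(&buf, 2 * sizeof(float));

      cudaStream_t compute_stream, comm_stream;
      cudaStreamCreate(&compute_stream);
      cudaStreamCreate(&comm_stream);

      // Stand-ins for the pointwise kernel and the NCCL collective, which run
      // on different streams in the scenario above.
      cudaMemsetAsync(buf, 0, sizeof(float), compute_stream);
      cudaMemsetAsync(buf + 1, 0, sizeof(float), comm_stream);

      // Waits only for compute_stream; the work on comm_stream may still be in
      // flight, so reading results or profiler data here could miss it.
      cudaStreamSynchronize(compute_stream);

      // Blocks the host until all streams on the device are idle.
      cudaDeviceSynchronize();

      cudaStreamDestroy(compute_stream);
      cudaStreamDestroy(comm_stream);
      cudaFree(buf);
      return 0;
    }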

Priya2698 (Collaborator, Author) commented:

!test

greptile-apps bot left a comment

4 files reviewed, no comments

Priya2698 (Collaborator, Author) commented:

!test

greptile-apps bot left a comment

4 files reviewed, no comments

Priya2698 merged commit 5d8efce into main on Dec 4, 2025. 59 of 60 checks passed.
Priya2698 deleted the pm/decompose_linear branch on December 4, 2025 at 23:30.