Fix decomposeLinearWithBias to shard all created tensorviews
#5563
Conversation

Review updated until commit 97af3bf

Description

Relevant files: Bug fix | Enhancement | Tests

PR Reviewer Guide

Here are some key observations to aid the review process:

- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Complex Sharding Logic
Greptile Overview

Greptile Summary: Fixed inconsistent sharding in decomposeLinearWithBias so that all TensorViews created during the decomposition are sharded to match the output, avoiding unnecessary communication.

Confidence Score: 4/5

Important Files Changed / File Analysis
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant Linear as decomposeRowParallelLinearWithBias
    participant Fusion as Fusion IR
    participant Propagate as shardLoopLike
    participant Map as PairwiseLogicalDomainMap
    User->>Linear: linear_with_bias operation
    Linear->>Fusion: Create without_bias = linear(A, B)
    Linear->>Fusion: Create broadcasted_bias
    Linear->>Fusion: Create new_out = add(without_bias, broadcasted_bias)
    Linear->>Fusion: Replace old out with new_out
    Note over Linear: Apply sharding transformations
    Linear->>Fusion: TransformReplay on without_bias
    Linear->>Fusion: TransformReplay on new_out
    Note over Linear: Backward propagate shardings
    loop For each expr (reverse order)
        loop For each output TV
            loop For each input TV
                Linear->>Propagate: shardLoopLike(output, input, kBackward)
                Propagate->>Map: getRef2TargetMap with mapBroadcast(false)
                Map-->>Propagate: Domain mapping (excluding broadcasts)
                Propagate->>Fusion: Apply sharding to input TV
            end
        end
    end
    Linear-->>User: Consistently sharded computation graph
```
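The backward propagation loop in the diagram above can be sketched with a toy model. This is not nvFuser's API: the tensors are plain names mapped to sharding specs, and `shard_loop_like` copies the reference's sharding onto the target while skipping broadcast axes, the analogue of `PairwiseLogicalDomainMap` with `mapBroadcast(false)`.

```python
# Toy sketch (not nvFuser's real API): backward sharding propagation.
def shard_loop_like(ref, target, sharding, broadcast_axes):
    # Copy every sharded axis of `ref` onto `target`, unless that axis
    # is a broadcast axis of `target` (broadcasts are not mapped).
    for axis, mesh_dim in sharding.get(ref, {}).items():
        if axis in broadcast_axes.get(target, set()):
            continue
        sharding.setdefault(target, {})[axis] = mesh_dim

# Expressions in topological order: (output, inputs), mirroring the
# decomposition created above.
exprs = [
    ("without_bias", ["A", "B"]),
    ("broadcasted_bias", ["bias"]),
    ("new_out", ["without_bias", "broadcasted_bias"]),
]
sharding = {"new_out": {0: "DIDx"}}          # only the final output starts sharded
broadcast_axes = {"broadcasted_bias": {0}}   # bias was broadcast on axis 0

# Backward propagation: walk exprs in reverse, outputs -> inputs.
for out, ins in reversed(exprs):
    for inp in ins:
        shard_loop_like(out, inp, sharding, broadcast_axes)

# without_bias (and the linear inputs) now carry new_out's sharding,
# while the broadcast axis of broadcasted_bias was skipped.
```

The point of skipping broadcast axes is that a broadcast dimension has no data to shard; propagating a device dimension onto it would create a bogus sharding that later forces communication.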
3 files reviewed, no comments
```cpp
      group_id_);
  SegmentProfiler& sprof = FusionProfiler::segment(group_id_);
  sprof.inputBytesAccessed(computeBytes(args));
  sprof.scheduler(toString(SchedulerType::ExprEval));
```
Caused the wrong scheduler name in profiler output.
!test
```python
(out,) = fd.execute([inp, weight])
with PythonProfiler() as prof:
    (out,) = fd.execute([inp, weight, bias.cuda()])
```
Does this synchronize? Could we miss kernels?
fd.execute should not return until kernels have completed.
There is also a cudaStreamSynchronize at the end of the nsys trace.
Is that what you are referring to?
There's a difference between cudaStreamSynchronize and cudaDeviceSynchronize, though: the former blocks the host only until that one stream's work completes, while the latter blocks until all streams on the device have drained.
You're right. I assumed cudaStreamSynchronize would be enough here, but the pointwise kernel and the NCCL call run on different streams.
FusionProfiler/PythonProfiler synchronize at start but not on stop, so I will add an explicit call here.
Note to self: check whether FusionProfiler should synchronize before reading data.
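The stream-vs-device distinction discussed above can be shown with a pure-Python toy model (no CUDA involved): synchronizing one stream leaves work on other streams in flight, while a device-wide synchronize drains every stream.

```python
# Toy model, not real CUDA: streams track launched-but-unfinished kernels.
class FakeStream:
    def __init__(self):
        self.pending = []            # kernels launched but not yet finished

    def launch(self, kernel):
        self.pending.append(kernel)

    def synchronize(self):           # analogue of cudaStreamSynchronize
        self.pending.clear()

def device_synchronize(streams):     # analogue of cudaDeviceSynchronize
    for s in streams:
        s.synchronize()

compute, comm = FakeStream(), FakeStream()
compute.launch("pointwise")
comm.launch("nccl_allreduce")        # NCCL work runs on its own stream

compute.synchronize()                # only the compute stream is drained
leftover = list(comm.pending)        # the collective is still in flight

device_synchronize([compute, comm])  # drains every stream
```

This is why a profiler that only synchronizes the default stream before reading counters can miss kernels launched on a separate communication stream.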
!test
4 files reviewed, no comments
!test
4 files reviewed, no comments
Some of the created TensorViews were not sharded consistently, which led to more communication than needed.
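A hypothetical illustration of that cost (the names and counting rule are mine, not nvFuser's): treat a producer/consumer sharding mismatch as one forced resharding (communication). Leaving an intermediate TensorView unsharded between two sharded tensors creates two such mismatches; sharding it consistently removes both.

```python
# Hypothetical model: each producer/consumer sharding mismatch costs
# one resharding (communication) at that edge.
def count_reshardings(exprs, sharding):
    comms = 0
    for out, ins in exprs:
        for inp in ins:
            if sharding.get(inp) != sharding.get(out):
                comms += 1
    return comms

# A -> without_bias -> new_out, as in the bias decomposition.
exprs = [
    ("without_bias", ["A"]),
    ("new_out", ["without_bias"]),
]

# Before the fix: the intermediate without_bias is left unsharded.
before = count_reshardings(exprs, {"A": "DIDx", "new_out": "DIDx"})

# After the fix: every created TensorView shares the output's sharding.
after = count_reshardings(
    exprs, {"A": "DIDx", "without_bias": "DIDx", "new_out": "DIDx"}
)
```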