Use meta device tensor to infer contiguity for expr-eval segments #5772
Conversation
Review updated until commit 9665cf0

Description

| Relevant files | |
|---|---|
| Enhancement | 8 files |
| Tests | 11 files |
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests
⚡ Recommended focus areas for review
Function naming inconsistency
The function was renamed from `inferOutputShapeAndContiguousStrides` to `inferContiguousOutputMetaTensor` in the implementation, but the declaration in the header file still uses the old name, creating a mismatch between declaration and definition.
…andling

- Renamed `inferOutputShapeAndContiguousStrides` to `inferContiguousOutputMetaTensor` for clarity.
- Updated function signatures to remove unnecessary parameters.
- Introduced `inferOutputMetaTensor` in `FusionKernelRuntime` to handle output shape inference for segmented groups.
- Enhanced `updateWithSegmentOutputs` to streamline output management without updating contiguity directly.
- Improved overall code organization and readability.
!test

!test

!test

!test

2 similar comments

!test

!test

!test

1 similar comment

!test

Can you merge #5842 into this PR? LGTM otherwise!

!test
```cpp
TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
  EnableOptionsGuard opt_guard;
  EnableOptionsGuard::getCurOptions().unset(EnableOption::InferContiguity);
```
Is this a no-op? AliasTest doesn't seem to enable InferContiguity.
AliasTest is NVFuserTest, which enables InferContiguity.
Got it -- I didn't realize NVFuserTest enables the option by default. Then, why do some tests enable it again, e.g., https://github.com/NVIDIA/Fuser/pull/5772/files#diff-3675636f2228bd2f8c3f308c28fa88f1d659d8eb3d869570dcfdf013f77908aaR29?
```cpp
 protected:
  void SetUp() override {
    BlackwellBase::SetUp();
    EnableOptionsGuard::getCurOptions().unset(EnableOption::InferContiguity);
```
https://google.github.io/googletest/faq.html#CtorVsSetUp
Prefer constructor
Thanks for sharing. I think we should migrate our entire codebase to prefer the ctor, instead of changing just this specific test. I don't think it is a good idea to mix ctor and SetUp. Because NVFuserTest sets up InferContiguity and everything else in SetUp, we need to be consistent here; otherwise, whatever we set in the ctor will be overridden by NVFuserTest::SetUp.
> I think we should migrate our entire codebase to prefer ctor, instead of just this specific test
I actually did that for NVFuserTest. In fact, most setup code for NVFuserTest is already in its constructor, except this line (https://github.com/NVIDIA/Fuser/pull/5772/files#diff-16f891fd5f846480392227c6bbf81ead352f59fdc9964e5d6e4dc6089bb622c5R61), which was added later without my noticing. The only thing that should be kept in NVFuserTest::SetUp at this moment is GTEST_SKIP.
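As a side note on the ordering concern above, here is a minimal standalone gtest illustration (BaseTest/DerivedTest are hypothetical class names, not nvFuser fixtures) of why a value set in a derived fixture's constructor is overridden when the base class's SetUp() runs:

```cpp
// A minimal standalone gtest illustration of the ordering concern above.
// BaseTest/DerivedTest are hypothetical names, not nvFuser fixtures.
#include <gtest/gtest.h>

class BaseTest : public ::testing::Test {
 protected:
  void SetUp() override {
    option_enabled_ = true;  // stands in for NVFuserTest enabling InferContiguity
  }
  bool option_enabled_ = false;
};

class DerivedTest : public BaseTest {
 protected:
  DerivedTest() {
    option_enabled_ = false;  // runs first, before BaseTest::SetUp()
  }
};

TEST_F(DerivedTest, CtorIsOverriddenBySetUp) {
  // Execution order per test: DerivedTest() ctor -> BaseTest::SetUp() -> body,
  // so the ctor's change does not survive.
  EXPECT_TRUE(option_enabled_);
}
```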
csrc/runtime/fusion_cache_utils.cpp (outdated)
```cpp
#include <fusion_segmenter.h>
#include <ir/all_nodes.h>
#include <ir/utils.h>
```
Is this used?
Looks like this is an artifact of rebase. Removed.
!test
19 files reviewed, 1 comment
```cpp
  }
  KernelArgumentHolder group_runtime_outputs;
  for (Val* v : fusion_to_run->outputs()) {
    auto result = eval_fusion.evaluate(v);
```
logic: there is no error handling for an evaluate() failure; if evaluation fails or returns an invalid result, execution silently continues.
Consider validating the result or wrapping the call in a try-catch, especially since the PR description mentions that some ATen ops on the meta device can hang due to GIL issues.
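A hedged sketch of the suggested validation, reusing the names from the quoted hunk; the exact PolymorphicValue and KernelArgumentHolder calls are assumptions, not the actual fix:

```cpp
// Sketch only: surface evaluator failures instead of silently continuing.
// NVF_ERROR and the PolymorphicValue/KernelArgumentHolder calls below are
// assumed to match nvFuser's current API.
for (Val* v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
  NVF_ERROR(
      result.hasValue(),
      "ExpressionEvaluator failed to evaluate segment output: ",
      v->toString());
  group_runtime_outputs.push(result);
}
```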
Fixes #4888
Stacked on #5766
I used to work on #5082 for the fix, but I hit too many blockers, because that PR could interact with many new assumptions/hacks/unfinalized designs on things like allocation domain, stream-sharded tensors, multidevice, etc., and new changes kept landing on the main branch that broke #5082. This delayed the PR for a very long time, so I recreated it as this PR, which is more friendly to incremental development.
Today, in the main branch, `FusionExecutorCache` assumes that fusion segments always generate contiguous tensors. This is not true for `ExpressionEvaluator` segments. For example, ATen's slice op returns non-contiguous tensors. It is worth mentioning that, because segmentation and scheduler selection depend on inputs, the contiguity of intermediate results also depends on inputs.
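For instance, the following standalone ATen sketch (not nvFuser code) shows that slicing the inner dimension of a tensor keeps the parent's strides, so the resulting view is not contiguous:

```cpp
// Standalone ATen example (not nvFuser code): a slice is a view that keeps
// the parent's strides, so it is generally non-contiguous.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor t = at::randn({4, 8});                           // contiguous, strides {8, 1}
  at::Tensor s = t.slice(/*dim=*/1, /*start=*/2, /*end=*/6);  // sizes {4, 4}
  std::cout << "stride(0)=" << s.stride(0)                    // still 8
            << " stride(1)=" << s.stride(1)                   // still 1
            << " contiguous=" << std::boolalpha << s.is_contiguous()  // false
            << std::endl;
  return 0;
}
```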
This PR adds `FusionKernelRuntime::inferOutputMetaTensor()`, which replaces `inferOutputShapeAndContiguousStrides`, to infer the output shape and stride of each segment. Both `FusionKernelRuntime::inferOutputMetaTensor()` and `inferOutputShapeAndContiguousStrides` store their result as a tensor on the meta device. The difference is that `FusionKernelRuntime::inferOutputMetaTensor()` actually runs the segment on device type meta if the segment is scheduled to run by `ExpressionEvaluator`, while `inferOutputShapeAndContiguousStrides` just assumes the output to be contiguous.
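Conceptually, the meta-device inference looks roughly like the sketch below: mirror each real input as a meta tensor with the same sizes and strides, run the segment's ATen ops on the meta device, and read the output's metadata without allocating real memory. This is a simplified illustration with an assumed helper name and a hard-coded slice, not the actual `FusionKernelRuntime` code.

```cpp
// Simplified illustration (assumed helper name, hard-coded slice), not the
// actual FusionKernelRuntime code: mirror the real input's metadata on the
// meta device, run the op there, and read the output's sizes/strides.
#include <ATen/ATen.h>

at::Tensor infer_output_meta(const at::Tensor& real_input) {
  at::Tensor meta_in = at::empty_strided(
      real_input.sizes(),
      real_input.strides(),
      real_input.options().device(at::kMeta));
  // Ops dispatched on meta tensors compute only metadata; no data is touched.
  at::Tensor meta_out = meta_in.slice(/*dim=*/0, /*start=*/1, /*end=*/3);
  return meta_out;  // carries the inferred sizes, strides, and dtype
}
```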
Because `FusionKernelRuntime::inferOutputMetaTensor()` runs the segment on device type meta, the related ops' `MyOp::evaluate` must work for device type meta. There is good news and bad news for this design. The good news is that most `MyOp::evaluate` implementations just call `at::` ops, which usually already support the meta device, and PyTorch designed the meta device to keep its behavior on par with CUDA. The bad news is that, because many ops' meta-device implementations are written in Python, running an `at::` op for these ops can hang due to the inability to grab Python's GIL (thanks @naoyam for help debugging!). In that case, the corresponding `MyOp::evaluate` must manually compute the shape and stride and use `at::empty_strided(device=meta)` to create the result.
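For ops that hit the GIL problem, the workaround looks roughly like this hypothetical example for a 2-D transpose; the real `evaluate` functions do their own op-specific shape and stride computation.

```cpp
// Hypothetical workaround for an op whose meta kernel lives in Python:
// compute the output shape/stride by hand and build the meta tensor directly
// instead of dispatching the ATen op. A 2-D transpose is used as the example.
#include <ATen/ATen.h>
#include <vector>

at::Tensor transpose2d_meta(const at::Tensor& meta_in) {
  std::vector<int64_t> sizes = {meta_in.size(1), meta_in.size(0)};
  std::vector<int64_t> strides = {meta_in.stride(1), meta_in.stride(0)};
  return at::empty_strided(
      sizes, strides, meta_in.options().device(at::kMeta));
}
```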
Besides `FusionKernelRuntime::inferOutputMetaTensor()`, this PR also adds `FusionKernelRuntime::updateContiguityOfSegmentOutputs()`, which updates the segment output `TensorView`s' contiguity based on the inferred shape and stride.
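The contiguity derived from the inferred metadata follows the usual stride check, sketched below; the real update also has to account for allocation domains, broadcast dimensions, and similar details, so treat this only as the basic idea.

```cpp
// Basic idea only: mark a dimension contiguous when its stride equals the
// product of all inner extents. The real update also has to respect
// allocation domains, broadcast dimensions, and similar details.
#include <cstdint>
#include <vector>

std::vector<bool> contiguity_from_strides(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  std::vector<bool> contiguity(sizes.size(), false);
  int64_t expected_stride = 1;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    contiguity[i] = (strides[i] == expected_stride);
    expected_stride *= sizes[i];
  }
  return contiguity;
}
```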
This PR adds an enable option, "infer-contiguity", to roll out this feature incrementally. When "infer-contiguity" is disabled, `FusionKernelRuntime::inferOutputMetaTensor()` falls back to the behavior of `inferOutputShapeAndContiguousStrides`, and `FusionKernelRuntime::updateContiguityOfSegmentOutputs()` is a no-op. The plan is to merge this PR without enabling "infer-contiguity" for the currently failing tests; I will write new PRs fixing the failing tests one by one.