
Conversation

Collaborator

@zasdfgbnm zasdfgbnm commented Jan 7, 2026

Fixes #4888

Stacked on #5766

I originally worked on the fix in #5082, but I hit too many blockers: that PR interacts with many new assumptions/hacks/unfinalized designs around allocation domain, stream-sharded tensors, multidevice, etc., and new commits to the main branch kept breaking it. This delayed the fix for a very long time, so I recreated it as this PR, which is more friendly to incremental development.

Today, in the main branch, FusionExecutorCache assumes that fusion segments always generate contiguous tensors. This is not true for ExpressionEvaluator segments: for example, ATen's slice op returns non-contiguous tensors. It is worth noting that, because segmentation and scheduler selection depend on the inputs, the contiguity of intermediate results also depends on the inputs.
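As a standalone illustration of that claim (plain ATen, not code from this PR): slicing the innermost dimension keeps the original strides, so the result is a non-contiguous view of the same storage.

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Contiguous [4096, 4097] input, the same shape as in the fusion from #4888.
  at::Tensor t = at::randn({4096, 4097});
  // Slice the last dimension down to 4096 columns. The result keeps the
  // row stride of 4097, so it is a non-contiguous view of t's storage.
  at::Tensor s = t.slice(/*dim=*/1, /*start=*/0, /*end=*/4096);
  std::cout << s.is_contiguous() << "\n";                  // prints 0
  std::cout << s.stride(0) << " " << s.stride(1) << "\n";  // prints 4097 1
  return 0;
}
```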

This PR adds FusionKernelRuntime::inferOutputMetaTensor(), which replaces inferOutputShapeAndContiguousStrides for inferring the output shape and stride of each segment. Both functions store their result as a tensor on the meta device. The difference is that inferOutputMetaTensor() actually runs the segment on the meta device when the segment is scheduled to run by ExpressionEvaluator, while inferOutputShapeAndContiguousStrides just assumes the output is contiguous.
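Here is a minimal sketch of why running on the meta device is enough to recover the real layout (again plain ATen, not the PR's inferOutputMetaTensor): a meta tensor carries dtype, sizes, and strides but allocates no storage, and ops executed on it report the same layout they would produce on CUDA.

```cpp
#include <ATen/ATen.h>

// Illustrative only: infer a segment output's layout by executing the op on
// the meta device. Nothing is allocated; the returned meta tensor just
// records the sizes/strides the real execution would produce. The old
// inferOutputShapeAndContiguousStrides would instead assume contiguity here.
at::Tensor infer_layout_via_meta() {
  at::Tensor meta_in = at::empty_strided(
      {1, 1, 4096, 4097},
      {4097L * 4096, 4097L * 4096, 4097, 1},
      at::TensorOptions().dtype(at::kBFloat16).device(c10::DeviceType::Meta));
  // The sliced result keeps a row stride of 4097, i.e. it is non-contiguous.
  return meta_in.slice(/*dim=*/3, /*start=*/0, /*end=*/4096);
}
```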

Because FusionKernelRuntime::inferOutputMetaTensor() runs the segment on the meta device, the related ops' MyOp::evaluate must work for the meta device. There is good news and bad news with this design. The good news is that most MyOp::evaluate implementations just call at:: ops, which usually already support the meta device, and PyTorch designed the meta device to behave on par with CUDA. The bad news is that many ops' meta-device implementations are in Python, so running at::op for those ops would hang due to the inability to grab Python's GIL (thanks @naoyam for help debugging!). In that case, the corresponding MyOp::evaluate must manually compute the shape and stride and use at::empty_strided(device=meta) to create the result.
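Below is a hedged sketch of that fallback; the op and its output-layout rule are hypothetical placeholders, not an op touched by this PR. The idea is to compute sizes and strides by hand and materialize them with at::empty_strided on the meta device instead of dispatching to an ATen op whose meta kernel lives in Python.

```cpp
#include <ATen/ATen.h>
#include <vector>

// Hypothetical MyOp::evaluate fallback for the meta device. Dispatching the
// real at:: op here could hang trying to acquire Python's GIL if its meta
// kernel is implemented in Python, so we compute the layout ourselves.
at::Tensor my_op_evaluate(const at::Tensor& input) {
  if (input.device().is_meta()) {
    // Assumed rule for this placeholder op: same sizes as the input,
    // contiguous output. A real op would encode its own shape/stride logic.
    std::vector<int64_t> sizes = input.sizes().vec();
    std::vector<int64_t> strides(sizes.size(), 1);
    for (int64_t i = static_cast<int64_t>(sizes.size()) - 2; i >= 0; --i) {
      strides[i] = strides[i + 1] * sizes[i + 1];
    }
    // input.options() keeps the meta device and dtype, so nothing is allocated.
    return at::empty_strided(sizes, strides, input.options());
  }
  // On real devices, defer to the regular ATen implementation (placeholder).
  return input.clone();
}
```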

Besides FusionKernelRuntime::inferOutputMetaTensor(), this PR also adds FusionKernelRuntime::updateContiguityOfSegmentOutputs(), which updates the segment output TensorViews' contiguity based on the inferred shapes and strides.
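To make the contiguity update concrete, here is a minimal standalone helper in the same spirit; it is not the PR's updateContiguityOfSegmentOutputs and it ignores broadcast/size-1 subtleties. A dimension is marked contiguous when stepping it once jumps exactly over the extent of the next inner dimension.

```cpp
#include <ATen/ATen.h>
#include <vector>

// Illustrative only: derive per-dimension contiguity flags from an inferred
// meta tensor's sizes and strides.
std::vector<bool> contiguityFromMetaTensor(const at::Tensor& meta) {
  const int64_t ndim = meta.dim();
  std::vector<bool> contiguity(ndim, false);
  for (int64_t i = 0; i < ndim; ++i) {
    // The innermost dimension is contiguous iff its stride is 1; an outer
    // dimension is contiguous iff its stride equals stride * size of the
    // next inner dimension.
    const int64_t expected =
        (i == ndim - 1) ? 1 : meta.stride(i + 1) * meta.size(i + 1);
    contiguity[i] = (meta.stride(i) == expected);
  }
  return contiguity;
}
```

For the sliced meta tensor from the earlier sketch, the innermost dimension stays contiguous (stride 1), while the 4096-wide dimension does not (stride 4097 instead of 4096), which is exactly the information the old contiguous-stride assumption threw away.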

This PR adds an enable option, "infer-contiguity", to roll the feature out incrementally. When "infer-contiguity" is disabled, FusionKernelRuntime::inferOutputMetaTensor() falls back to the behavior of inferOutputShapeAndContiguousStrides, and FusionKernelRuntime::updateContiguityOfSegmentOutputs() is a no-op. The plan is to merge this PR without setting "infer-contiguity" for the currently failing tests, and then to fix those tests one by one in follow-up PRs.
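For context, the review threads below show C++ tests opting out of the feature through EnableOptionsGuard; the mirror-image opt-in would look roughly like the fragment below. The .set() call is assumed to be the counterpart of the .unset() shown later, and includes are omitted.

```cpp
// Fragment, not a complete test file: based on the EnableOptionsGuard usage
// visible in the review threads below.
TEST_F(NVFuserTest, ContiguityInferenceExample) {
  EnableOptionsGuard opt_guard;
  EnableOptionsGuard::getCurOptions().set(EnableOption::InferContiguity);
  // ... define a fusion whose ExpressionEvaluator segment slices a tensor;
  // downstream segments now see the real (possibly non-contiguous) strides.
}
```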


github-actions bot commented Jan 7, 2026

Review updated until commit 9665cf0

Description

  • Add new EnableOption::InferContiguity feature flag for contiguity inference

  • Replace inferOutputShapeAndContiguousStrides with inferContiguousOutputMetaTensor

  • Implement FusionKernelRuntime::inferOutputMetaTensor for expr-eval segments

  • Update test files to enable/disable new contiguity inference option

  • Add test_issue4888 to validate the fix for contiguity handling

Changes walkthrough

Relevant files

Enhancement (8 files)

  • options.cpp: Add InferContiguity option to available options map (+1/-0)
  • options.h: Add InferContiguity enum value to EnableOption (+1/-0)
  • allocations.cpp: Rename function to inferContiguousOutputMetaTensor (+2/-2)
  • allocations.h: Update function signature for inferContiguousOutputMetaTensor (+1/-1)
  • fusion_kernel_runtime.cpp: Implement inferOutputMetaTensor and update prepareInputs calls (+48/-8)
  • fusion_kernel_runtime.h: Add inferOutputMetaTensor method declaration (+10/-0)
  • conftest.py: Add enable_options and disable_options parameters to exec_nvfuser (+11/-1)
  • utils.py: Update check_captured_python_definition to handle enable/disable options (+18/-2)

Tests (11 files)

  • test_alias.cpp: Disable InferContiguity option for specific test case (+3/-0)
  • test_indexing_advanced.cpp: Enable InferContiguity option in test setup (+2/-0)
  • test_layout_op.cpp: Disable InferContiguity option in test setup (+1/-0)
  • test_loop_domain_scheduling.cpp: Enable InferContiguity option in test setup (+1/-0)
  • test_low_precision_recipe.cpp: Disable InferContiguity option in test setup (+7/-1)
  • test_matmul_aten_evaluation.cpp: Remove OutputStrides test case (+0/-33)
  • test_matmul_scheduler.cpp: Enable InferContiguity option in test setup (+1/-0)
  • test_pointwise.cpp: Enable InferContiguity option in test setup (+1/-0)
  • test_rng.cpp: Enable InferContiguity option in test setup (+1/-0)
  • utils.cpp: Enable InferContiguity option in NVFuserTest setup (+1/-0)
  • test_python_frontend.py: Add test_issue4888 to validate contiguity inference fix (+98/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Function naming inconsistency

The function name was changed from inferOutputShapeAndContiguousStrides to inferContiguousOutputMetaTensor in the implementation, but the declaration in the header file still uses the old name. This creates a mismatch between declaration and definition.

KernelArgumentHolder inferContiguousOutputMetaTensor(
    Fusion* fusion,
    const KernelArgumentHolder& args,
    PrecomputedValues* evaluator_precomputed_values) {
Missing fusion creation logic

The old code created a fusion for the segment using segmented_fusion_->makeFusion(group_to_run).second but this logic is completely removed in the new code. The new inferOutputMetaTensor method receives the fusion directly via group_to_run->getFusion(), but it's unclear if this provides the same level of isolation as the previous approach.

auto group_runtime_outputs = inferOutputMetaTensor(
    heuristics_.get(), group_to_run, group_runtime_inputs);
Test complexity concern

The new test test_issue4888 is very complex with many tensor operations. While it tests the specific issue, such complexity might make it harder to debug if the test fails. Consider if a simpler test case could demonstrate the same contiguity inference behavior.

def test_issue4888(nvfuser_direct_test):
    # https://github.com/NVIDIA/Fuser/issues/4888
    def nvfuser_fusion_id2(fd: FusionDefinition) -> None:
        T0 = fd.define_tensor(
            shape=[4096, 4097],
            contiguity=[True, True],
            dtype=DataType.BFloat16,
            is_cpu=False,
            stride_order=[1, 0],
        )
        T1 = fd.define_tensor(
            shape=[4096, 4097],
            contiguity=[True, True],
            dtype=DataType.Bool,
            is_cpu=False,
            stride_order=[1, 0],
        )
        T2 = fd.define_tensor(
            shape=[4096, 4097],
            contiguity=[True, True],
            dtype=DataType.Bool,
            is_cpu=False,
            stride_order=[1, 0],
        )
        T3 = fd.define_tensor(
            shape=[1, 32, 4096, 4096],
            contiguity=[None, True, True, True],
            dtype=DataType.BFloat16,
            is_cpu=False,
            stride_order=[3, 2, 1, 0],
        )
        T4 = fd.ops.cast(T0, dtype=DataType.Float)
        T5 = fd.ops.bitwise_or(T1, T2)
        T6 = fd.ops.set(T5)
        fd.add_output(T6, T1)
        T7 = fd.ops.cast(T6, dtype=DataType.Float)
        T8 = fd.ops.mul(T4, T7)
        T9 = fd.ops.cast(T8, dtype=DataType.BFloat16)
        T10 = fd.ops.set(T9)
        fd.add_output(T10, T0)
        T15 = fd.ops.broadcast_in_dim(T10, shape=[1, 4096, 4097], broadcast_dims=[1, 2])
        T21 = fd.ops.broadcast_in_dim(
            T15, shape=[1, 1, 4096, 4097], broadcast_dims=[0, 2, 3]
        )
        T27 = fd.ops.broadcast_in_dim(
            T21, shape=[1, 1, 4096, 4097], broadcast_dims=[0, 1, 2, 3]
        )
        T43 = fd.ops.slice(
            T27,
            start_indices=[0, 0, 0, 0],
            end_indices=[1, 1, 4096, 4096],
            strides=[1, 1, 1, 1],
            manual_normalization=0,
        )
        T49 = fd.ops.broadcast_in_dim(
            T43, shape=[1, 32, 4096, 4096], broadcast_dims=[0, 1, 2, 3]
        )
        T50 = fd.ops.cast(T49, dtype=DataType.Float)
        T51 = fd.ops.cast(T3, dtype=DataType.Float)
        S52 = fd.define_scalar(0.0883883, dtype=DataType.Double)
        T53 = fd.ops.mul(T51, S52)
        T54 = fd.ops.add(T53, T50)
        T55 = fd.ops.max(T54, dims=[3], keepdim=False, dtype=DataType.Null)
        T61 = fd.ops.broadcast_in_dim(
            T55, shape=[1, 32, 4096, 1], broadcast_dims=[0, 1, 2]
        )
        T67 = fd.ops.broadcast_in_dim(
            T61, shape=[1, 32, 4096, 4096], broadcast_dims=[0, 1, 2, 3]
        )
        T68 = fd.ops.sub(T54, T67)
        T69 = fd.ops.exp(T68)
        T70 = fd.ops.sum(T69, dims=[3], keepdim=False, dtype=DataType.Null)
        T76 = fd.ops.broadcast_in_dim(
            T70, shape=[1, 32, 4096, 1], broadcast_dims=[0, 1, 2]
        )
        T82 = fd.ops.broadcast_in_dim(
            T76, shape=[1, 32, 4096, 4096], broadcast_dims=[0, 1, 2, 3]
        )
        T83 = fd.ops.reciprocal(T82)
        T84 = fd.ops.mul(T69, T83)
        T85 = fd.ops.cast(T84, dtype=DataType.BFloat16)
        fd.add_output(T49)
        fd.add_output(T84)
        fd.add_output(T85)

    inputs = [
        torch.testing.make_tensor((4096, 4097), dtype=torch.bfloat16, device="cuda:0"),
        torch.testing.make_tensor((4096, 4097), dtype=torch.bool, device="cuda:0"),
        torch.testing.make_tensor((4096, 4097), dtype=torch.bool, device="cuda:0"),
        torch.testing.make_tensor(
            (1, 32, 4096, 4096), dtype=torch.bfloat16, device="cuda:0"
        ),
    ]
    nvfuser_direct_test.exec_nvfuser(
        nvfuser_fusion_id2, inputs, enable_options=["infer_contiguity"]
    )

zasdfgbnm and others added 15 commits January 13, 2026 09:43
…andling

- Renamed `inferOutputShapeAndContiguousStrides` to `inferContiguousOutputMetaTensor` for clarity.
- Updated function signatures to remove unnecessary parameters.
- Introduced `inferOutputMetaTensor` in `FusionKernelRuntime` to handle output shape inference for segmented groups.
- Enhanced `updateWithSegmentOutputs` to streamline output management without updating contiguity directly.
- Improved overall code organization and readability.
@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm zasdfgbnm requested a review from wujingyue January 15, 2026 22:56
@zasdfgbnm
Collaborator Author

!test

2 similar comments
@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

1 similar comment
@zasdfgbnm
Collaborator Author

!test

@wujingyue
Collaborator

Can you merge #5842 into this PR? LGTM otherwise!

@zasdfgbnm zasdfgbnm changed the base branch from resetContiguityFromTensor to main January 21, 2026 20:39
@zasdfgbnm
Collaborator Author

!test


TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
  EnableOptionsGuard opt_guard;
  EnableOptionsGuard::getCurOptions().unset(EnableOption::InferContiguity);
Collaborator


Is this a no-op? AliasTest doesn't seem to enable InferContiguity.

Collaborator Author


AliasTest derives from NVFuserTest, which enables InferContiguity.

Collaborator


Got it -- I didn't realize NVFuserTest enables the option by default. Then, why do some tests enable it again, e.g., https://github.com/NVIDIA/Fuser/pull/5772/files#diff-3675636f2228bd2f8c3f308c28fa88f1d659d8eb3d869570dcfdf013f77908aaR29?

 protected:
  void SetUp() override {
    BlackwellBase::SetUp();
    EnableOptionsGuard::getCurOptions().unset(EnableOption::InferContiguity);
Collaborator


Collaborator Author


Thanks for sharing. I think we should migrate our entire codebase to prefer the ctor, instead of just this specific test. I don't think it is a good idea to mix the ctor and SetUp: because NVFuserTest sets up InferContiguity and everything else in SetUp, we need to be consistent here; otherwise, whatever we set in the ctor will be overridden by NVFuserTest::SetUp.

Collaborator


I think we should migrate our entire codebase to prefer ctor, instead of just this specific test

I actually did that for NVFuserTest. In fact, most setup code for NVFuserTest is already in its constructor, except this line (https://github.com/NVIDIA/Fuser/pull/5772/files#diff-16f891fd5f846480392227c6bbf81ead352f59fdc9964e5d6e4dc6089bb622c5R61), which was added later without my noticing. The only thing that should be kept in NVFuserTest::SetUp at this moment is GTEST_SKIP.


#include <fusion_segmenter.h>
#include <ir/all_nodes.h>
#include <ir/utils.h>
Collaborator


Is this used?

Collaborator Author


Looks like this is an artifact of rebase. Removed.

@zasdfgbnm
Collaborator Author

!test

Contributor

@greptile-apps greptile-apps bot left a comment


19 files reviewed, 1 comment


}
KernelArgumentHolder group_runtime_outputs;
for (Val* v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
Contributor


logic: no error handling for evaluate() failure - if evaluation fails or returns an invalid result, it will silently continue

Consider validating the result or wrapping in try-catch, especially since the PR description mentions some ATen ops on meta device can hang due to GIL issues
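A hedged sketch of the kind of check this comment asks for; PolymorphicValue::hasValue, NVF_ERROR, and the push into group_runtime_outputs are assumed from nvFuser's existing APIs rather than taken from this PR's diff.

```cpp
// Validate each meta-evaluation result before collecting it, so a failed or
// empty evaluation surfaces as an error instead of propagating silently.
for (Val* v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
  NVF_ERROR(
      result.hasValue(),
      "Meta-device evaluation produced no value for output: ",
      v->toString());
  group_runtime_outputs.push(result);
}
```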

@zasdfgbnm zasdfgbnm merged commit 9266379 into main Jan 23, 2026
66 of 67 checks passed
@zasdfgbnm zasdfgbnm deleted the meta-eval branch January 23, 2026 20:50

Development

Successfully merging this pull request may close these issues.

FusionKernelRuntime::getMaybeHeuristicsFor computes the wrong strides.

2 participants