
Conversation

Collaborator

@zasdfgbnm zasdfgbnm commented Jan 7, 2026

Fixes #4888

Stacked on #5766

I originally worked on the fix in #5082, but I hit too many blockers: that PR interacts with many new assumptions/hacks/unfinalized designs around allocation domain, stream-sharded tensors, multidevice, etc., and new commits to the main branch kept breaking it. This delayed the fix for a very long time, so I recreated it as this PR, which is more friendly to incremental development.

Today, in the main branch, FusionExecutorCache assumes that fusion segments always generate contiguous tensors. This is not true for ExpressionEvaluator segments: for example, ATen's slice op returns non-contiguous tensors. It is worth noting that, because segmentation and scheduler selection depend on the inputs, the contiguity of intermediate results also depends on the inputs.
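As a standalone illustration of that claim (plain ATen, not code from this PR): slicing the innermost dimension keeps the original strides, so the result is a non-contiguous view of the same storage.

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Contiguous [4096, 4097] input, the same shape as in the fusion from #4888.
  at::Tensor t = at::randn({4096, 4097});
  // Slice the last dimension down to 4096 columns. The result keeps the
  // row stride of 4097, so it is a non-contiguous view of t's storage.
  at::Tensor s = t.slice(/*dim=*/1, /*start=*/0, /*end=*/4096);
  std::cout << s.is_contiguous() << "\n";                  // prints 0
  std::cout << s.stride(0) << " " << s.stride(1) << "\n";  // prints 4097 1
  return 0;
}
```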

This PR adds FusionKernelRuntime::inferOutputMetaTensor(), which replaces inferOutputShapeAndContiguousStrides for inferring the output shape and stride of each segment. Both functions store their result as a tensor on the meta device. The difference is that inferOutputMetaTensor() actually runs the segment on the meta device when the segment is scheduled to run by ExpressionEvaluator, while inferOutputShapeAndContiguousStrides just assumes the output is contiguous.
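Here is a minimal sketch of why running on the meta device is enough to recover the real layout (again plain ATen, not the PR's inferOutputMetaTensor): a meta tensor carries dtype, sizes, and strides but allocates no storage, and ops executed on it report the same layout they would produce on CUDA.

```cpp
#include <ATen/ATen.h>

// Illustrative only: infer a segment output's layout by executing the op on
// the meta device. Nothing is allocated; the returned meta tensor just
// records the sizes/strides the real execution would produce. The old
// inferOutputShapeAndContiguousStrides would instead assume contiguity here.
at::Tensor infer_layout_via_meta() {
  at::Tensor meta_in = at::empty_strided(
      {1, 1, 4096, 4097},
      {4097L * 4096, 4097L * 4096, 4097, 1},
      at::TensorOptions().dtype(at::kBFloat16).device(c10::DeviceType::Meta));
  // The sliced result keeps a row stride of 4097, i.e. it is non-contiguous.
  return meta_in.slice(/*dim=*/3, /*start=*/0, /*end=*/4096);
}
```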

Because FusionKernelRuntime::inferOutputMetaTensor() runs the segment on the meta device, the related ops' MyOp::evaluate must work for the meta device. There is good news and bad news with this design. The good news is that most MyOp::evaluate implementations just call at:: ops, which usually already support the meta device, and PyTorch designed the meta device to behave on par with CUDA. The bad news is that many ops' meta-device implementations are in Python, so running at::op for those ops would hang due to the inability to grab Python's GIL (thanks @naoyam for help debugging!). In that case, the corresponding MyOp::evaluate must manually compute the shape and stride and use at::empty_strided(device=meta) to create the result.
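Below is a hedged sketch of that fallback; the op and its output-layout rule are hypothetical placeholders, not an op touched by this PR. The idea is to compute sizes and strides by hand and materialize them with at::empty_strided on the meta device instead of dispatching to an ATen op whose meta kernel lives in Python.

```cpp
#include <ATen/ATen.h>
#include <vector>

// Hypothetical MyOp::evaluate fallback for the meta device. Dispatching the
// real at:: op here could hang trying to acquire Python's GIL if its meta
// kernel is implemented in Python, so we compute the layout ourselves.
at::Tensor my_op_evaluate(const at::Tensor& input) {
  if (input.device().is_meta()) {
    // Assumed rule for this placeholder op: same sizes as the input,
    // contiguous output. A real op would encode its own shape/stride logic.
    std::vector<int64_t> sizes = input.sizes().vec();
    std::vector<int64_t> strides(sizes.size(), 1);
    for (int64_t i = static_cast<int64_t>(sizes.size()) - 2; i >= 0; --i) {
      strides[i] = strides[i + 1] * sizes[i + 1];
    }
    // input.options() keeps the meta device and dtype, so nothing is allocated.
    return at::empty_strided(sizes, strides, input.options());
  }
  // On real devices, defer to the regular ATen implementation (placeholder).
  return input.clone();
}
```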

Besides FusionKernelRuntime::inferOutputMetaTensor(), this PR also adds FusionKernelRuntime::updateContiguityOfSegmentOutputs(), which updates the segment output TensorViews' contiguity based on the inferred shapes and strides.
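To make the contiguity update concrete, here is a minimal standalone helper in the same spirit; it is not the PR's updateContiguityOfSegmentOutputs and it ignores broadcast/size-1 subtleties. A dimension is marked contiguous when stepping it once jumps exactly over the extent of the next inner dimension.

```cpp
#include <ATen/ATen.h>
#include <vector>

// Illustrative only: derive per-dimension contiguity flags from an inferred
// meta tensor's sizes and strides.
std::vector<bool> contiguityFromMetaTensor(const at::Tensor& meta) {
  const int64_t ndim = meta.dim();
  std::vector<bool> contiguity(ndim, false);
  for (int64_t i = 0; i < ndim; ++i) {
    // The innermost dimension is contiguous iff its stride is 1; an outer
    // dimension is contiguous iff its stride equals stride * size of the
    // next inner dimension.
    const int64_t expected =
        (i == ndim - 1) ? 1 : meta.stride(i + 1) * meta.size(i + 1);
    contiguity[i] = (meta.stride(i) == expected);
  }
  return contiguity;
}
```

For the sliced meta tensor from the earlier sketch, the innermost dimension stays contiguous (stride 1), while the 4096-wide dimension does not (stride 4097 instead of 4096), which is exactly the information the old contiguous-stride assumption threw away.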

This PR adds an enable option, "infer-contiguity", to roll the feature out incrementally. When "infer-contiguity" is disabled, FusionKernelRuntime::inferOutputMetaTensor() falls back to the behavior of inferOutputShapeAndContiguousStrides, and FusionKernelRuntime::updateContiguityOfSegmentOutputs() is a no-op. The plan is to merge this PR without setting "infer-contiguity" for the currently failing tests, and then to fix those tests one by one in follow-up PRs.
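For context, the review threads below show C++ tests opting out of the feature through EnableOptionsGuard; the mirror-image opt-in would look roughly like the fragment below. The .set() call is assumed to be the counterpart of the .unset() shown later, and includes are omitted.

```cpp
// Fragment, not a complete test file: based on the EnableOptionsGuard usage
// visible in the review threads below.
TEST_F(NVFuserTest, ContiguityInferenceExample) {
  EnableOptionsGuard opt_guard;
  EnableOptionsGuard::getCurOptions().set(EnableOption::InferContiguity);
  // ... define a fusion whose ExpressionEvaluator segment slices a tensor;
  // downstream segments now see the real (possibly non-contiguous) strides.
}
```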


github-actions bot commented Jan 7, 2026

Review updated until commit 9665cf0

Description

  • Add new EnableOption::InferContiguity feature flag for contiguity inference

  • Replace inferOutputShapeAndContiguousStrides with inferContiguousOutputMetaTensor

  • Implement FusionKernelRuntime::inferOutputMetaTensor for expr-eval segments

  • Update test files to enable/disable new contiguity inference option

  • Add test_issue4888 to validate the fix for contiguity handling

Changes walkthrough

Relevant files

Enhancement (8 files)

  • options.cpp: Add InferContiguity option to available options map (+1/-0)
  • options.h: Add InferContiguity enum value to EnableOption (+1/-0)
  • allocations.cpp: Rename function to inferContiguousOutputMetaTensor (+2/-2)
  • allocations.h: Update function signature for inferContiguousOutputMetaTensor (+1/-1)
  • fusion_kernel_runtime.cpp: Implement inferOutputMetaTensor and update prepareInputs calls (+48/-8)
  • fusion_kernel_runtime.h: Add inferOutputMetaTensor method declaration (+10/-0)
  • conftest.py: Add enable_options and disable_options parameters to exec_nvfuser (+11/-1)
  • utils.py: Update check_captured_python_definition to handle enable/disable options (+18/-2)

Tests (11 files)

  • test_alias.cpp: Disable InferContiguity option for specific test case (+3/-0)
  • test_indexing_advanced.cpp: Enable InferContiguity option in test setup (+2/-0)
  • test_layout_op.cpp: Disable InferContiguity option in test setup (+1/-0)
  • test_loop_domain_scheduling.cpp: Enable InferContiguity option in test setup (+1/-0)
  • test_low_precision_recipe.cpp: Disable InferContiguity option in test setup (+7/-1)
  • test_matmul_aten_evaluation.cpp: Remove OutputStrides test case (+0/-33)
  • test_matmul_scheduler.cpp: Enable InferContiguity option in test setup (+1/-0)
  • test_pointwise.cpp: Enable InferContiguity option in test setup (+1/-0)
  • test_rng.cpp: Enable InferContiguity option in test setup (+1/-0)
  • utils.cpp: Enable InferContiguity option in NVFuserTest setup (+1/-0)
  • test_python_frontend.py: Add test_issue4888 to validate contiguity inference fix (+98/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Function naming inconsistency

The function name was changed from inferOutputShapeAndContiguousStrides to inferContiguousOutputMetaTensor in the implementation, but the declaration in the header file still uses the old name. This creates a mismatch between declaration and definition.

KernelArgumentHolder inferContiguousOutputMetaTensor(
    Fusion* fusion,
    const KernelArgumentHolder& args,
    PrecomputedValues* evaluator_precomputed_values) {
Missing fusion creation logic

The old code created a fusion for the segment using segmented_fusion_->makeFusion(group_to_run).second but this logic is completely removed in the new code. The new inferOutputMetaTensor method receives the fusion directly via group_to_run->getFusion(), but it's unclear if this provides the same level of isolation as the previous approach.

auto group_runtime_outputs = inferOutputMetaTensor(
    heuristics_.get(), group_to_run, group_runtime_inputs);
Test complexity concern

The new test test_issue4888 is very complex with many tensor operations. While it tests the specific issue, such complexity might make it harder to debug if the test fails. Consider if a simpler test case could demonstrate the same contiguity inference behavior.

def test_issue4888(nvfuser_direct_test):
    # https://github.com/NVIDIA/Fuser/issues/4888
    def nvfuser_fusion_id2(fd: FusionDefinition) -> None:
        T0 = fd.define_tensor(
            shape=[4096, 4097],
            contiguity=[True, True],
            dtype=DataType.BFloat16,
            is_cpu=False,
            stride_order=[1, 0],
        )
        T1 = fd.define_tensor(
            shape=[4096, 4097],
            contiguity=[True, True],
            dtype=DataType.Bool,
            is_cpu=False,
            stride_order=[1, 0],
        )
        T2 = fd.define_tensor(
            shape=[4096, 4097],
            contiguity=[True, True],
            dtype=DataType.Bool,
            is_cpu=False,
            stride_order=[1, 0],
        )
        T3 = fd.define_tensor(
            shape=[1, 32, 4096, 4096],
            contiguity=[None, True, True, True],
            dtype=DataType.BFloat16,
            is_cpu=False,
            stride_order=[3, 2, 1, 0],
        )
        T4 = fd.ops.cast(T0, dtype=DataType.Float)
        T5 = fd.ops.bitwise_or(T1, T2)
        T6 = fd.ops.set(T5)
        fd.add_output(T6, T1)
        T7 = fd.ops.cast(T6, dtype=DataType.Float)
        T8 = fd.ops.mul(T4, T7)
        T9 = fd.ops.cast(T8, dtype=DataType.BFloat16)
        T10 = fd.ops.set(T9)
        fd.add_output(T10, T0)
        T15 = fd.ops.broadcast_in_dim(T10, shape=[1, 4096, 4097], broadcast_dims=[1, 2])
        T21 = fd.ops.broadcast_in_dim(
            T15, shape=[1, 1, 4096, 4097], broadcast_dims=[0, 2, 3]
        )
        T27 = fd.ops.broadcast_in_dim(
            T21, shape=[1, 1, 4096, 4097], broadcast_dims=[0, 1, 2, 3]
        )
        T43 = fd.ops.slice(
            T27,
            start_indices=[0, 0, 0, 0],
            end_indices=[1, 1, 4096, 4096],
            strides=[1, 1, 1, 1],
            manual_normalization=0,
        )
        T49 = fd.ops.broadcast_in_dim(
            T43, shape=[1, 32, 4096, 4096], broadcast_dims=[0, 1, 2, 3]
        )
        T50 = fd.ops.cast(T49, dtype=DataType.Float)
        T51 = fd.ops.cast(T3, dtype=DataType.Float)
        S52 = fd.define_scalar(0.0883883, dtype=DataType.Double)
        T53 = fd.ops.mul(T51, S52)
        T54 = fd.ops.add(T53, T50)
        T55 = fd.ops.max(T54, dims=[3], keepdim=False, dtype=DataType.Null)
        T61 = fd.ops.broadcast_in_dim(
            T55, shape=[1, 32, 4096, 1], broadcast_dims=[0, 1, 2]
        )
        T67 = fd.ops.broadcast_in_dim(
            T61, shape=[1, 32, 4096, 4096], broadcast_dims=[0, 1, 2, 3]
        )
        T68 = fd.ops.sub(T54, T67)
        T69 = fd.ops.exp(T68)
        T70 = fd.ops.sum(T69, dims=[3], keepdim=False, dtype=DataType.Null)
        T76 = fd.ops.broadcast_in_dim(
            T70, shape=[1, 32, 4096, 1], broadcast_dims=[0, 1, 2]
        )
        T82 = fd.ops.broadcast_in_dim(
            T76, shape=[1, 32, 4096, 4096], broadcast_dims=[0, 1, 2, 3]
        )
        T83 = fd.ops.reciprocal(T82)
        T84 = fd.ops.mul(T69, T83)
        T85 = fd.ops.cast(T84, dtype=DataType.BFloat16)
        fd.add_output(T49)
        fd.add_output(T84)
        fd.add_output(T85)

    inputs = [
        torch.testing.make_tensor((4096, 4097), dtype=torch.bfloat16, device="cuda:0"),
        torch.testing.make_tensor((4096, 4097), dtype=torch.bool, device="cuda:0"),
        torch.testing.make_tensor((4096, 4097), dtype=torch.bool, device="cuda:0"),
        torch.testing.make_tensor(
            (1, 32, 4096, 4096), dtype=torch.bfloat16, device="cuda:0"
        ),
    ]
    nvfuser_direct_test.exec_nvfuser(
        nvfuser_fusion_id2, inputs, enable_options=["infer_contiguity"]
    )

zasdfgbnm and others added 15 commits January 13, 2026 09:43
…andling

- Renamed `inferOutputShapeAndContiguousStrides` to `inferContiguousOutputMetaTensor` for clarity.
- Updated function signatures to remove unnecessary parameters.
- Introduced `inferOutputMetaTensor` in `FusionKernelRuntime` to handle output shape inference for segmented groups.
- Enhanced `updateWithSegmentOutputs` to streamline output management without updating contiguity directly.
- Improved overall code organization and readability.
@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm zasdfgbnm requested a review from wujingyue January 15, 2026 22:56
@zasdfgbnm
Collaborator Author

!test

2 similar comments
@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

@zasdfgbnm
Collaborator Author

!test

1 similar comment
@zasdfgbnm
Collaborator Author

!test

@wujingyue
Collaborator

Can you merge #5842 into this PR? LGTM otherwise!

@zasdfgbnm zasdfgbnm changed the base branch from resetContiguityFromTensor to main January 21, 2026 20:39
@zasdfgbnm
Collaborator Author

!test


TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
  EnableOptionsGuard opt_guard;
  EnableOptionsGuard::getCurOptions().unset(EnableOption::InferContiguity);
Collaborator


Is this a no-op? AliasTest doesn't seem to enable InferContiguity.

Collaborator Author


AliasTest derives from NVFuserTest, which enables InferContiguity.

Collaborator


Got it -- I didn't realize NVFuserTest enables the option by default. Then, why do some tests enable it again, e.g., https://github.com/NVIDIA/Fuser/pull/5772/files#diff-3675636f2228bd2f8c3f308c28fa88f1d659d8eb3d869570dcfdf013f77908aaR29?

 protected:
  void SetUp() override {
    BlackwellBase::SetUp();
    EnableOptionsGuard::getCurOptions().unset(EnableOption::InferContiguity);
Collaborator


Collaborator Author


Thanks for sharing. I think we should migrate our entire codebase to prefer the ctor, instead of just this specific test. I don't think it is a good idea to mix the ctor and SetUp: because NVFuserTest sets up InferContiguity and everything else in SetUp, we need to be consistent here; otherwise, whatever we set in the ctor will be overridden by NVFuserTest::SetUp.

Collaborator


I think we should migrate our entire codebase to prefer ctor, instead of just this specific test

I actually did that for NVFuserTest. In fact, most setup code for NVFuserTest is already in its constructor, except this line (https://github.com/NVIDIA/Fuser/pull/5772/files#diff-16f891fd5f846480392227c6bbf81ead352f59fdc9964e5d6e4dc6089bb622c5R61), which was added later without my noticing. The only thing that should be kept in NVFuserTest::SetUp at this moment is GTEST_SKIP.


#include <fusion_segmenter.h>
#include <ir/all_nodes.h>
#include <ir/utils.h>
Collaborator


Is this used?

Collaborator Author


Looks like this is an artifact of rebase. Removed.

@zasdfgbnm
Collaborator Author

!test

Contributor

@greptile-apps greptile-apps bot left a comment


19 files reviewed, 1 comment


}
KernelArgumentHolder group_runtime_outputs;
for (Val* v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
Contributor


logic: no error handling for evaluate() failure - if evaluation fails or returns an invalid result, it will silently continue

Consider validating the result or wrapping in try-catch, especially since the PR description mentions some ATen ops on meta device can hang due to GIL issues
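A hedged sketch of the kind of check this comment asks for; PolymorphicValue::hasValue, NVF_ERROR, and the push into group_runtime_outputs are assumed from nvFuser's existing APIs rather than taken from this PR's diff.

```cpp
// Validate each meta-evaluation result before collecting it, so a failed or
// empty evaluation surfaces as an error instead of propagating silently.
for (Val* v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
  NVF_ERROR(
      result.hasValue(),
      "Meta-device evaluation produced no value for output: ",
      v->toString());
  group_runtime_outputs.push(result);
}
```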

@zasdfgbnm zasdfgbnm merged commit 9266379 into main Jan 23, 2026
66 of 67 checks passed
@zasdfgbnm zasdfgbnm deleted the meta-eval branch January 23, 2026 20:50

Development

Successfully merging this pull request may close these issues.

FusionKernelRuntime::getMaybeHeuristicsFor computes the wrong strides.

2 participants