Use meta device tensor to infer contiguity for expr-eval segments #5772
base: resetContiguityFromTensor
Review updated until commit 4afe5b1

Description

| Relevant files | |
|---|---|
| Configuration changes | |
| Enhancement | 5 files |
| Miscellaneous | 1 file |
| Tests | 7 files |

PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review

Performance Impact: The `inferOutputMetaTensor` method creates meta-device tensors for expr-eval segments. This gives better contiguity inference, but we should verify that the overhead of creating these meta tensors is acceptable, especially for small segments and when the option is disabled.
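
To make the idea behind this focus area concrete: a meta-device tensor carries sizes and strides but no data, so per-dimension contiguity can be read off its layout without allocating memory. The sketch below is a simplified illustration in plain Python, not nvFuser's implementation. The function name `infer_contiguity` and the rule it applies are assumptions made here for illustration; it ignores broadcast or expanded dimensions and any allocation-domain permutation.

```python
from typing import List, Sequence


def infer_contiguity(sizes: Sequence[int], strides: Sequence[int]) -> List[bool]:
    """Hypothetical helper: return one flag per dimension.

    A dimension is flagged contiguous when its stride equals the extent
    covered by the nearest inner dimension of size greater than 1
    (or 1 for the innermost such dimension).
    """
    if len(sizes) != len(strides):
        raise ValueError("sizes and strides must have the same length")

    flags = [True] * len(sizes)
    expected = 1  # stride a dense dimension would have at this position
    for i in reversed(range(len(sizes))):
        if sizes[i] == 1:
            # Size-1 dimensions do not constrain the layout in this sketch.
            continue
        flags[i] = strides[i] == expected
        # The next outer dimension is dense only if it steps over
        # this whole dimension, whatever its actual stride is.
        expected = strides[i] * sizes[i]
    return flags


# Row-major 2x3x4 tensor: every dimension is contiguous.
row_major = infer_contiguity((2, 3, 4), (12, 4, 1))
# 2x3 view into a buffer padded to 4 elements per row:
# the inner dimension is dense, the outer one is not.
padded = infer_contiguity((2, 3), (4, 1))
# Transposed view of a 2x3 row-major tensor: neither dimension is dense.
transposed = infer_contiguity((3, 2), (1, 3))
```

Because only sizes and strides are inspected, the cost of a check like this is independent of the tensor's element count, which is the property that makes meta tensors attractive for layout inference.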
Test failures (partial, pipeline still running)
- (High, 6) nvFuser evaluator_common internal assert in multidevice overlap (row_parallel_linear_forward) tests

  | Test Name | A100 (dist.) | GB200 (dist.) | Source |
  |---|---|---|---|
  | tests.python.multidevice.test_overlap.test_row_parallel_linear_forward | ❌ | ❌ | |
  | tests.python.multidevice.test_overlap.test_row_parallel_linear_forward_benchmark[s=2] | ❌ | ❌ | |
  | tests.python.multidevice.test_overlap.test_row_parallel_linear_forward_benchmark[s=4] | ❌ | ❌ | |

- (Medium, 16) NVFuser internal assert on block_scaling_factor contiguity in BlockQuantizationSchedulingTestSuite

  | Test Name | GB200 | Source |
  |---|---|---|
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_NoGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_WithGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_128x64_NoGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_128x64_WithGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x128_NoGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x128_WithGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_NoGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_WithGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_1024x1024_NoGlobalScale_WithSwizzle | ❌ | Link |
  | BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_1024x1024_WithGlobalScale_WithSwizzle | ❌ | Link |

  ... with 6 more test failures omitted. Check internal logs.

- (Medium, 8) nvFuser stride/contiguity mismatch in multidevice SDPA & Transformer tests (BSHE layout)

  | Test Name | A100 (dist.) | GB200 (dist.) | Source |
  |---|---|---|---|
  | tests.python.multidevice.test_multidevice.test_sdpa[qkv_format=QkvFormat.BSHE] | ❌ | ❌ | |
  | tests.python.multidevice.test_multidevice.test_sdpa_loop_split[qkv_format=QkvFormat.BSHE] | ❌ | ❌ | |
  | tests.python.multidevice.test_transformer.test_transformer_forward[SEQUENCE_PARALLEL] | ❌ | ❌ | |
  | tests.python.multidevice.test_transformer.test_transformer_forward[TENSOR_PARALLEL] | ❌ | ❌ | |

- (Medium, 6) nvFuser allocation-domain assertion failures in LayoutOpTest suite

  | Test Name | A100 | GB200 | Source |
  |---|---|---|---|
  | LayoutOpTest.SchedulerKernel | ❌ | ❌ | Link |
  | LayoutOpTest.SchedulerKernelWithExplicitQuantizationPattern | ❌ | ❌ | Link |
  | LayoutOpTest.SchedulerKernelWithOffsetsProducer | ❌ | ❌ | Link |

- (Medium, 2) nvFuser segmentation logic mismatch in RevertPrivatizedUpcast test suite

  | Test Name | A100 | GB200 | Source |
  |---|---|---|---|
  | SegmentationTest.RevertPrivatizedUpcast | ❌ | ❌ | Link |

- (Medium, 2) nvFuser AliasTest aliasing mismatch in test_alias.cpp across multiple runners

  | Test Name | A100 | GB200 | Source |
  |---|---|---|---|
  | AliasTest.AliasOutputBeforeNonAliasOutput | ❌ | ❌ | Link |

- (Medium, 2) nvFuser matmul output stride mismatch in MatmulNodeTest

  | Test Name | A100 | GB200 | Source |
  |---|---|---|---|
  | MatmulNodeTest.OutputStrides | ❌ | ❌ | Link |

…andling

- Renamed `inferOutputShapeAndContiguousStrides` to `inferContiguousOutputMetaTensor` for clarity.
- Updated function signatures to remove unnecessary parameters.
- Introduced `inferOutputMetaTensor` in `FusionKernelRuntime` to handle output shape inference for segmented groups.
- Enhanced `updateWithSegmentOutputs` to streamline output management without updating contiguity directly.
- Improved overall code organization and readability.
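
The name `inferContiguousOutputMetaTensor` suggests a path that pairs an inferred output shape with contiguous strides. As a rough illustration of what "contiguous strides" means for a given shape, here is a minimal Python sketch. The helper name `contiguous_strides` is hypothetical and this is not the code from the PR; it only shows the row-major stride rule, with empty dimensions treated as size 1 as an assumption of the sketch.

```python
from typing import List, Sequence


def contiguous_strides(sizes: Sequence[int]) -> List[int]:
    """Hypothetical helper: row-major (innermost-fastest) strides for a shape."""
    strides = [0] * len(sizes)
    running = 1  # number of elements spanned by all inner dimensions
    for i in reversed(range(len(sizes))):
        strides[i] = running
        # Assumption of this sketch: treat empty dimensions as size 1
        # so that the computed strides stay positive.
        running *= max(sizes[i], 1)
    return strides


# A 2x3x4 output laid out densely in row-major order.
dense = contiguous_strides((2, 3, 4))
```

An output whose actual strides differ from this result (for example a transposed or padded layout) is the case where inferring the layout from a meta tensor, rather than assuming contiguity, would matter.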
!test

!test

!test
No description provided.