
Conversation

@jjsjann123 (Collaborator) commented Sep 4, 2025

#5118 PR3: enable codegen for layout op
#5115 PR2: add layout op runtime function <- this PR
#5114 PR1: add layout op

Runtime function signature.

  template <
      typename T,
      typename Index_T,
      int BLOCK_ROW_OUTER,
      int BLOCK_ROW_INNER,
      int BLOCK_COL,
      int UNROLL_FACTOR>
  __device__ void preprocessGroupedMatmulInputSf(
      T* output,
      const T* input,
      const nvfuser_index_t row_idx,
      const nvfuser_index_t col_idx,
      const Index_T* input_offsets,
      const Index_T* output_offsets,
      const nvfuser_index_t col_size,
      const nvfuser_index_t group_size)

where:
  BLOCK_ROW_OUTER, BLOCK_ROW_INNER and BLOCK_COL are translated from
    BlockScalingFactorLayout, e.g. Block128x4 translates to 32, 4, 4 (this
    decomposition is used in the sketch below).

This function will be used by codegen for `PreprocessGroupedMatmulInputSf`.
  `output` is expected to be the beginning of the output buffer;
     indexing is done inside the function template with the help of `row_idx`,
     `col_idx`, `input_offsets`, `output_offsets` and `col_size`.
  Meanwhile, indexing on `input` will have been resolved during device lowering.

Todo for future PRs:
Add vectorization support.
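
For intuition on the swizzle, below is a minimal sketch of what the offset computation behind `outputOffsetAfterSwizzlePadding` could look like for the Block128x4 case, assuming the standard cutlass scale-factor tiling. The actual helper in `runtime/block_layout.cu` may differ, so treat this as an illustration rather than the shipped implementation (`nvfuser_index_t` comes from the nvFuser runtime prelude).

  // Hedged sketch, not the shipped code: one plausible swizzle for a
  // 128x4 scale-factor tile, decomposed as BLOCK_ROW_OUTER(32) x
  // BLOCK_ROW_INNER(4) x BLOCK_COL(4), with columns padded to padded_col_size.
  template <int BLOCK_ROW_OUTER, int BLOCK_ROW_INNER, int BLOCK_COL>
  __device__ inline nvfuser_index_t outputOffsetAfterSwizzlePadding(
      nvfuser_index_t row_idx,
      nvfuser_index_t col_idx,
      nvfuser_index_t padded_col_size) {
    constexpr int tile_rows = BLOCK_ROW_OUTER * BLOCK_ROW_INNER; // 128
    // Which 128-row x 4-col tile the element falls into.
    nvfuser_index_t row_tile = row_idx / tile_rows;
    nvfuser_index_t col_tile = col_idx / BLOCK_COL;
    nvfuser_index_t tile_id =
        row_tile * (padded_col_size / BLOCK_COL) + col_tile;
    // Position inside the tile: rows split into 32 (outer) x 4 (inner),
    // columns are the fastest-varying dimension.
    nvfuser_index_t in_tile =
        (row_idx % BLOCK_ROW_OUTER) * (BLOCK_ROW_INNER * BLOCK_COL) +
        ((row_idx / BLOCK_ROW_OUTER) % BLOCK_ROW_INNER) * BLOCK_COL +
        (col_idx % BLOCK_COL);
    return tile_id * (tile_rows * BLOCK_COL) + in_tile;
  }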


github-actions bot commented Sep 4, 2025

Review updated until commit 967b7f5

Description

  • Add runtime function for grouped block layout

  • Include block layout resource in kernel code

  • Update kernel summary with layout op support

  • Implement swizzled memory layout transformation


Changes walkthrough 📝

Relevant files

Enhancement

kernel.cpp (csrc/kernel.cpp): Handle layout op in kernel IR scanner

  • Added handler for PreprocessGroupedMatmulInputSf op
  • Updates kernel IR scanner to track layout op usage
  • +4/-0

compiled_kernel.cpp (csrc/runtime/compiled_kernel.cpp): Integrate block layout resource in kernel codegen

  • Added has_block_layout parameter to _getStructuredCode
  • Include block_layout_cu resource if layout op is used
  • Pass layout op flag from kernel summary
  • +8/-2

block_layout.cu (runtime/block_layout.cu): Implement grouped block layout preprocessing

  • Implemented preprocessGroupedMatmulInputSf device function
  • Added swizzle padding logic for memory layout
  • Supports grouped matmul input preprocessing
  • Includes TODO for vectorization support
  • +102/-0

kernel.h (csrc/kernel.h): Add layout op flag in kernel summary

  • Added has_preprocess_grouped_matmul_input_sf flag
  • Tracks layout op usage in kernel summary
  • +3/-0

Configuration changes

CMakeLists.txt: Include block layout in build configuration

  • Added block_layout.cu to runtime files list
  • Ensures layout code is compiled into runtime
  • +1/-0
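
As a rough illustration of the compiled_kernel.cpp change, the resource splice presumably mirrors how the other runtime files are appended to the generated source; the snippet below is a sketch under that assumption (the `nvfuser_resources::block_layout_cu` identifier is inferred from the walkthrough above, not copied from the diff).

  // Hedged sketch of the conditional resource inclusion described above.
  // `code` is the structured kernel source being assembled; the resource is
  // only appended when the kernel summary reports a layout op.
  std::string appendBlockLayoutResource(std::string code, bool has_block_layout) {
    if (has_block_layout) {
      // block_layout.cu is assumed to be stringified into nvfuser_resources at
      // build time, following the pattern of the other runtime/*.cu files.
      code += nvfuser_resources::block_layout_cu;
    }
    return code;
  }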

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The loop in preprocessGroupedMatmulInputSf uses input[i] for loading values, but it should index into the correct position within the current expert's input data, likely using input + input_offsets[expert_id] + c_row_idx * col_size + col_idx + i or similar, depending on layout.

    for (int i = 0; i < UNROLL_FACTOR && col_idx + i < col_size; ++i) {
      nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
          BLOCK_ROW_OUTER,
          BLOCK_ROW_INNER,
          BLOCK_COL>(c_row_idx, col_idx + i, padded_col_size);
      out_group_offset[index] = input[i];
    }
    Performance Issue

    The function uses a loop for unrolling with no vectorization, which may lead to suboptimal memory throughput. The comment indicates awareness, but without vectorized loads/stores, performance could be significantly impacted, especially for small UNROLL_FACTOR values.

      // TODO: vectorized load/store instead of for loop
      for (int i = 0; i < UNROLL_FACTOR && col_idx + i < col_size; ++i) {
        nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
            BLOCK_ROW_OUTER,
            BLOCK_ROW_INNER,
            BLOCK_COL>(c_row_idx, col_idx + i, padded_col_size);
        out_group_offset[index] = input[i];
      }
    }
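
    On the vectorization point, one plausible follow-up (not part of this PR) is a wide store on the aligned fast path: when UNROLL_FACTOR equals BLOCK_COL and the swizzle keeps the column as the fastest-varying dimension (as in the sketch earlier), the BLOCK_COL destinations are contiguous, so the four 1-byte scale factors could be written as a single 32-bit transaction. A hedged sketch of such a drop-in replacement for the loop above, assuming sizeof(T) == 1 and that both pointers end up 4-byte aligned:

      if (UNROLL_FACTOR == BLOCK_COL && col_idx % BLOCK_COL == 0 &&
          col_idx + BLOCK_COL <= col_size) {
        // Aligned fast path: the BLOCK_COL destinations are contiguous after
        // the swizzle, so store them with one 32-bit write.
        nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
            BLOCK_ROW_OUTER,
            BLOCK_ROW_INNER,
            BLOCK_COL>(c_row_idx, col_idx, padded_col_size);
        *reinterpret_cast<uint32_t*>(&out_group_offset[index]) =
            *reinterpret_cast<const uint32_t*>(input);
      } else {
        // Fallback: original scalar loop for tails and unaligned cases.
        for (int i = 0; i < UNROLL_FACTOR && col_idx + i < col_size; ++i) {
          nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
              BLOCK_ROW_OUTER,
              BLOCK_ROW_INNER,
              BLOCK_COL>(c_row_idx, col_idx + i, padded_col_size);
          out_group_offset[index] = input[i];
        }
      }
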
    Inconsistent Flag Usage

    The has_block_layout parameter in _getStructuredCode is set using has_preprocess_grouped_matmul_input_sf, suggesting a mismatch between the operation name and the resource inclusion flag, which could lead to confusion or incorrect code generation in the future.

    kernel()->summary().has_preprocess_grouped_matmul_input_sf);

    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from 2c992ab to 1d72d32 on September 4, 2025 18:28
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from 3ccfbde to 1d72d32 on September 4, 2025 18:30
    @jjsjann123 jjsjann123 mentioned this pull request Sep 4, 2025
    @jjsjann123 jjsjann123 changed the title from "block layout op runtime function added" to "add layout op runtime function" Sep 4, 2025
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from ebd03f4 to 5ec9d72 on September 4, 2025 19:12
    @jjsjann123 (Collaborator Author) commented:

    !test

    1 similar comment
    @jjsjann123 (Collaborator Author) commented:

    !test

    @jjsjann123 jjsjann123 marked this pull request as ready for review September 4, 2025 19:24
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from 298ea2f to a86508c on September 4, 2025 19:45
    @jjsjann123 (Collaborator Author) commented:

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from a86508c to 7c327f6 on September 4, 2025 22:11
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from c4e65d3 to 4deb4a9 on September 4, 2025 22:32
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch 3 times, most recently from f1709fb to f5b464f, on September 5, 2025 00:14
    @jjsjann123 (Collaborator Author) commented:

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from a3afee4 to 150d3ee on September 5, 2025 17:47
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from f5b464f to c340720 on September 5, 2025 17:48
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from 150d3ee to 8321654 on September 8, 2025 15:59
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from c340720 to 19fa2e0 on September 8, 2025 16:02
    @jjsjann123 (Collaborator Author) commented:

    !test

    jjsjann123 added a commit that referenced this pull request Sep 8, 2025
    #5118 PR3: enable codegen for layout op
    #5115 PR2: add layout op runtime function
    #5114 PR1: add layout op <- this PR
    
    ### Motivation
    
    The operation supports the layout requirement of the cutlass grouped_mm
    kernel. The use case:
    
    ```
    QuantizationOp(activation_bf16) -> TensorView* fp4_activation, TensorView* fp8_block_sf
    ```
    
    Before feeding both inputs to cutlass gemm, we need to update the block
    scaling factor's layout in order to satisfy the requirement of the gemm
    kernel. For details see:
    https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html#scale-factor-layouts
    
    ```
    preprocessGroupedMatmulInputSf(fp8_block_sf, ...) -> TensorView* fp8_block_sf_layout_fixed
    cutlassGroupedGemm(fp4_activation, fp8_block_sf_layout_fixed, ...)
    ```
    
    ### Code Change
    
        1. adding Fusion node `PreprocessGroupedMatmulInputSf`
    
        PreprocessGroupedMatmulInputSf
          [output]
            Val* output (2d tensor)
    
          [input]
            TensorView* input (2d tensor)
            TensorView* input_offsets (vector)
            TensorView* output_offsets (vector)
            Val* k  (scalar)
            Val* g  (scalar)
    
          [attribute]
            BlockScalingFactorLayout layout
    
        2. adding cpp api `preprocessGroupedMatmulInputSf`
        
        TensorView* preprocessGroupedMatmulInputSf(
            TensorView* input,
            TensorView* input_offsets,
            TensorView* output_offsets,
            BlockScalingFactorLayout layout);
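
    As an illustration, a hedged sketch of how this api might be exercised when building a fusion; the factory helpers and dtype enumerators used here (makeSymbolicTensor, DataType::Float8_e4m3fn, DataType::Int32) are assumptions about the surrounding nvFuser C++ API rather than part of this PR:

        // Hedged usage sketch; helper and dtype names are assumptions.
        Fusion fusion;
        FusionGuard fg(&fusion);

        // 2D block scaling factors produced by an upstream quantization op.
        TensorView* sf = makeSymbolicTensor(2, DataType::Float8_e4m3fn);
        // Per-group offsets into the input rows and the output buffer.
        TensorView* input_offsets = makeSymbolicTensor(1, DataType::Int32);
        TensorView* output_offsets = makeSymbolicTensor(1, DataType::Int32);
        fusion.addInput(sf);
        fusion.addInput(input_offsets);
        fusion.addInput(output_offsets);

        // Reorder the scaling factors into the layout expected by the cutlass
        // grouped gemm kernel.
        TensorView* sf_fixed = preprocessGroupedMatmulInputSf(
            sf, input_offsets, output_offsets,
            BlockScalingFactorLayout::Block128x4);
        fusion.addOutput(sf_fixed);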
    
    Design topics on the layout op:
    
    1. I chose to match the output's root/loop domain with the logical
    domain of the input. This basically categorizes the operation as a pointwise
    op.
    2. The padding requirement is explicitly represented in the fusion IR.
    In order to work around the data-dependent padding size, I'm opting for
    allocating the maximum padding size.
    3. Indexing on output is done in the runtime function, so we don't need
    to map anything to the logical/allocation domain of the output.
    Base automatically changed from jj/layout_op_PR0_ir_node to main September 8, 2025 23:45
    Runtime function signature.
    
      template <
          typename T,
          typename Index_T,
          int BLOCK_ROW_OUTER,
          int BLOCK_ROW_INNER,
          int BLOCK_COL,
          int UNROLL_FACTOR>
      __device__ void groupedBlockLayout(
          T* output,
          const T* input,
          const nvfuser_index_t row_idx,
          const nvfuser_index_t col_idx,
          const Index_T* expert_offsets,
          const Index_T* output_offsets,
          const nvfuser_index_t col_size,
          const nvfuser_index_t group_size)
    
    where:
      BLOCK_ROW_OUTER, BLOCK_ROW_INNER, BLOCK_COL will be translated from BlockScalingFactorLayout, e.g. Block128x4 is translated to 32, 4, 4.
    
    This function will be used by codegen for `GroupedBlockScalingFactorLayoutOp`
      `output` is expected to be the beginning of output buffer,
         indexing will be done inside the function template with help of `row_idx`,
         `col_idx`, `expert_offsets`, `output_offsets` and `col_size`
      Meanwhil, indexing on `input` would have been resolved during device lowering.
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from 19fa2e0 to 967b7f5 on September 8, 2025 23:57
    @jjsjann123 (Collaborator Author) commented:

    !test

    @naoyam (Collaborator) left a comment:


    Is there any not-so-tedious way to verify the result?

    rdspring1 pushed a commit that referenced this pull request Sep 9, 2025
    @naoyam (Collaborator) left a comment:


    Stamping

    @jjsjann123 jjsjann123 merged commit 08ad4ff into main Sep 9, 2025
    48 of 52 checks passed
    @jjsjann123 jjsjann123 deleted the jj/layout_op_PR1_runtime_function branch September 9, 2025 18:33
    jjsjann123 added a commit that referenced this pull request Sep 15, 2025
    #5118 PR3: enable codegen for layout op <- this PR
    #5115 PR2: add layout op runtime function
    #5114 PR1: add layout op
    
    1. Add indexing lowering for `PreprocessGroupedMatmulInputSf`:
       we resolve indexing for the input TV; we compute the logical index for
       `row_idx` and `col_idx` and feed them as op attributes in the index pass
       during device lowering;

    2. Add codegen for `PreprocessGroupedMatmulInputSf`:
       the operation adds lowering logic that calls the runtime function; the
       codegen utilizes the extra indexing bits added during index lowering
       (a hypothetical sketch of the emitted call is shown after this commit
       message);

    3. Skip domain validation in `OptOutMutator::mutate(TensorDomain*)`:
       mutating the domain shouldn't try to validate the coverage, because there
       is no guarantee that TensorDomain entries match identically (e.g. the
       layout op as well as the scatter op);

    4. Add a cpp test with a manual kernel to validate the correctness of the
       layout op.

    5. Refactor `PreprocessGroupedMatmulInputSf` to use the allocation domain to
       represent the padding logic (instead of the logical domain in #5114).
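
    For item 2, the emitted call could plausibly take the following shape; the tensor names, the resolved input index, and the SFType placeholder are illustrative assumptions, while the argument list follows the runtime function signature from #5115:

        // Hypothetical codegen output for PreprocessGroupedMatmulInputSf.
        // SFType is a stand-in for the fp8 scale-factor element type; the
        // template ints correspond to Block128x4 (32, 4, 4) plus an unroll
        // factor chosen by the scheduler.
        preprocessGroupedMatmulInputSf<SFType, int32_t, 32, 4, 4, 4>(
            &T4[0],                    // output: base of the output buffer
            &T0[resolved_input_index], // input index resolved in device lowering
            row_idx,                   // logical row index from the index pass
            col_idx,                   // logical column index from the index pass
            &T1[0],                    // input_offsets
            &T2[0],                    // output_offsets
            col_size,
            group_size);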