
Conversation

@jjsjann123 (Collaborator) commented Sep 4, 2025

#5118 PR3: enable codegen for layout op
#5115 PR2: add layout op runtime function <- this PR
#5114 PR1: add layout op

Runtime function signature.

  template <
      typename T,
      typename Index_T,
      int BLOCK_ROW_OUTER,
      int BLOCK_ROW_INNER,
      int BLOCK_COL,
      int UNROLL_FACTOR>
  __device__ void preprocessGroupedMatmulInputSf(
      T* output,
      const T* input,
      const nvfuser_index_t row_idx,
      const nvfuser_index_t col_idx,
      const Index_T* input_offsets,
      const Index_T* output_offsets,
      const nvfuser_index_t col_size,
      const nvfuser_index_t group_size)

where:
  BLOCK_ROW_OUTER, BLOCK_ROW_INNER and BLOCK_COL are translated from
    BlockScalingFactorLayout, e.g. Block128x4 translates to 32, 4, 4 (this
    decomposition is used in the sketch below).

This function will be used by codegen for `PreprocessGroupedMatmulInputSf`.
  `output` is expected to be the beginning of the output buffer;
     indexing is done inside the function template with the help of `row_idx`,
     `col_idx`, `input_offsets`, `output_offsets` and `col_size`.
  Meanwhile, indexing on `input` will have been resolved during device lowering.

Todo for future PRs:
Add vectorization support.
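
For intuition on the swizzle, below is a minimal sketch of what the offset computation behind `outputOffsetAfterSwizzlePadding` could look like for the Block128x4 case, assuming the standard cutlass scale-factor tiling. The actual helper in `runtime/block_layout.cu` may differ, so treat this as an illustration rather than the shipped implementation (`nvfuser_index_t` comes from the nvFuser runtime prelude).

  // Hedged sketch, not the shipped code: one plausible swizzle for a
  // 128x4 scale-factor tile, decomposed as BLOCK_ROW_OUTER(32) x
  // BLOCK_ROW_INNER(4) x BLOCK_COL(4), with columns padded to padded_col_size.
  template <int BLOCK_ROW_OUTER, int BLOCK_ROW_INNER, int BLOCK_COL>
  __device__ inline nvfuser_index_t outputOffsetAfterSwizzlePadding(
      nvfuser_index_t row_idx,
      nvfuser_index_t col_idx,
      nvfuser_index_t padded_col_size) {
    constexpr int tile_rows = BLOCK_ROW_OUTER * BLOCK_ROW_INNER; // 128
    // Which 128-row x 4-col tile the element falls into.
    nvfuser_index_t row_tile = row_idx / tile_rows;
    nvfuser_index_t col_tile = col_idx / BLOCK_COL;
    nvfuser_index_t tile_id =
        row_tile * (padded_col_size / BLOCK_COL) + col_tile;
    // Position inside the tile: rows split into 32 (outer) x 4 (inner),
    // columns are the fastest-varying dimension.
    nvfuser_index_t in_tile =
        (row_idx % BLOCK_ROW_OUTER) * (BLOCK_ROW_INNER * BLOCK_COL) +
        ((row_idx / BLOCK_ROW_OUTER) % BLOCK_ROW_INNER) * BLOCK_COL +
        (col_idx % BLOCK_COL);
    return tile_id * (tile_rows * BLOCK_COL) + in_tile;
  }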


github-actions bot commented Sep 4, 2025

Review updated until commit 967b7f5

Description

  • Add runtime function for grouped block layout

  • Include block layout resource in kernel code

  • Update kernel summary with layout op support

  • Implement swizzled memory layout transformation


Changes walkthrough 📝

Relevant files

Enhancement

kernel.cpp (csrc/kernel.cpp): Handle layout op in kernel IR scanner

  • Added handler for PreprocessGroupedMatmulInputSf op
  • Updates kernel IR scanner to track layout op usage
  • +4/-0

compiled_kernel.cpp (csrc/runtime/compiled_kernel.cpp): Integrate block layout resource in kernel codegen

  • Added has_block_layout parameter to _getStructuredCode
  • Include block_layout_cu resource if layout op is used
  • Pass layout op flag from kernel summary
  • +8/-2

block_layout.cu (runtime/block_layout.cu): Implement grouped block layout preprocessing

  • Implemented preprocessGroupedMatmulInputSf device function
  • Added swizzle padding logic for memory layout
  • Supports grouped matmul input preprocessing
  • Includes TODO for vectorization support
  • +102/-0

kernel.h (csrc/kernel.h): Add layout op flag in kernel summary

  • Added has_preprocess_grouped_matmul_input_sf flag
  • Tracks layout op usage in kernel summary
  • +3/-0

Configuration changes

CMakeLists.txt: Include block layout in build configuration

  • Added block_layout.cu to runtime files list
  • Ensures layout code is compiled into runtime
  • +1/-0
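
As a rough illustration of the compiled_kernel.cpp change, the resource splice presumably mirrors how the other runtime files are appended to the generated source; the snippet below is a sketch under that assumption (the `nvfuser_resources::block_layout_cu` identifier is inferred from the walkthrough above, not copied from the diff).

  // Hedged sketch of the conditional resource inclusion described above.
  // `code` is the structured kernel source being assembled; the resource is
  // only appended when the kernel summary reports a layout op.
  std::string appendBlockLayoutResource(std::string code, bool has_block_layout) {
    if (has_block_layout) {
      // block_layout.cu is assumed to be stringified into nvfuser_resources at
      // build time, following the pattern of the other runtime/*.cu files.
      code += nvfuser_resources::block_layout_cu;
    }
    return code;
  }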

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The loop in preprocessGroupedMatmulInputSf uses input[i] for loading values, but it should index into the correct position within the current expert's input data, likely using input + input_offsets[expert_id] + c_row_idx * col_size + col_idx + i or similar, depending on layout.

    for (int i = 0; i < UNROLL_FACTOR && col_idx + i < col_size; ++i) {
      nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
          BLOCK_ROW_OUTER,
          BLOCK_ROW_INNER,
          BLOCK_COL>(c_row_idx, col_idx + i, padded_col_size);
      out_group_offset[index] = input[i];
    }
    Performance Issue

    The function uses a loop for unrolling with no vectorization, which may lead to suboptimal memory throughput. The comment indicates awareness, but without vectorized loads/stores, performance could be significantly impacted, especially for small UNROLL_FACTOR values.

      // TODO: vectorized load/store instead of for loop
      for (int i = 0; i < UNROLL_FACTOR && col_idx + i < col_size; ++i) {
        nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
            BLOCK_ROW_OUTER,
            BLOCK_ROW_INNER,
            BLOCK_COL>(c_row_idx, col_idx + i, padded_col_size);
        out_group_offset[index] = input[i];
      }
    }
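
    On the vectorization point, one plausible follow-up (not part of this PR) is a wide store on the aligned fast path: when UNROLL_FACTOR equals BLOCK_COL and the swizzle keeps the column as the fastest-varying dimension (as in the sketch earlier), the BLOCK_COL destinations are contiguous, so the four 1-byte scale factors could be written as a single 32-bit transaction. A hedged sketch of such a drop-in replacement for the loop above, assuming sizeof(T) == 1 and that both pointers end up 4-byte aligned:

      if (UNROLL_FACTOR == BLOCK_COL && col_idx % BLOCK_COL == 0 &&
          col_idx + BLOCK_COL <= col_size) {
        // Aligned fast path: the BLOCK_COL destinations are contiguous after
        // the swizzle, so store them with one 32-bit write.
        nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
            BLOCK_ROW_OUTER,
            BLOCK_ROW_INNER,
            BLOCK_COL>(c_row_idx, col_idx, padded_col_size);
        *reinterpret_cast<uint32_t*>(&out_group_offset[index]) =
            *reinterpret_cast<const uint32_t*>(input);
      } else {
        // Fallback: original scalar loop for tails and unaligned cases.
        for (int i = 0; i < UNROLL_FACTOR && col_idx + i < col_size; ++i) {
          nvfuser_index_t index = outputOffsetAfterSwizzlePadding<
              BLOCK_ROW_OUTER,
              BLOCK_ROW_INNER,
              BLOCK_COL>(c_row_idx, col_idx + i, padded_col_size);
          out_group_offset[index] = input[i];
        }
      }
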
    Inconsistent Flag Usage

    The has_block_layout parameter in _getStructuredCode is set using has_preprocess_grouped_matmul_input_sf, suggesting a mismatch between the operation name and the resource inclusion flag, which could lead to confusion or incorrect code generation in the future.

    kernel()->summary().has_preprocess_grouped_matmul_input_sf);

    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from 2c992ab to 1d72d32 on September 4, 2025 18:28
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from 3ccfbde to 1d72d32 on September 4, 2025 18:30
    @jjsjann123 jjsjann123 mentioned this pull request Sep 4, 2025
    @jjsjann123 jjsjann123 changed the title from "block layout op runtime function added" to "add layout op runtime function" Sep 4, 2025
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from ebd03f4 to 5ec9d72 on September 4, 2025 19:12
    @jjsjann123 (Collaborator Author) commented:

    !test

    1 similar comment
    @jjsjann123 (Collaborator Author) commented:

    !test

    @jjsjann123 jjsjann123 marked this pull request as ready for review September 4, 2025 19:24
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from 298ea2f to a86508c on September 4, 2025 19:45
    @jjsjann123 (Collaborator Author) commented:

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from a86508c to 7c327f6 on September 4, 2025 22:11
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from c4e65d3 to 4deb4a9 on September 4, 2025 22:32
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch 3 times, most recently from f1709fb to f5b464f, on September 5, 2025 00:14
    @jjsjann123 (Collaborator Author) commented:

    !test

    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from a3afee4 to 150d3ee on September 5, 2025 17:47
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from f5b464f to c340720 on September 5, 2025 17:48
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR0_ir_node branch from 150d3ee to 8321654 on September 8, 2025 15:59
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from c340720 to 19fa2e0 on September 8, 2025 16:02
    @jjsjann123 (Collaborator Author) commented:

    !test

    jjsjann123 added a commit that referenced this pull request Sep 8, 2025
    #5118 PR3: enable codegen for layout op
    #5115 PR2: add layout op runtime function
    #5114 PR1: add layout op <- this PR
    
    ### Motivation
    
    The operation supports the layout requirement of the cutlass grouped_mm
    kernel. The use case:
    
    ```
    QuantizationOp(activation_bf16) -> TensorView* fp4_activation, TensorView* fp8_block_sf
    ```
    
    Before feeding both inputs to cutlass gemm, we need to update the block
    scaling factor's layout in order to satisfy the requirement of the gemm
    kernel. For details see:
    https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html#scale-factor-layouts
    
    ```
    preprocessGroupedMatmulInputSf(fp8_block_sf, ...) -> TensorView* fp8_block_sf_layout_fixed
    cutlassGroupedGemm(fp4_activation, fp8_block_sf_layout_fixed, ...)
    ```
    
    ### Code Change
    
        1. adding Fusion node `PreprocessGroupedMatmulInputSf`
    
        PreprocessGroupedMatmulInputSf
          [output]
            Val* output (2d tensor)
    
          [input]
            TensorView* input (2d tensor)
            TensorView* input_offsets (vector)
            TensorView* output_offsets (vector)
            Val* k  (scalar)
            Val* g  (scalar)
    
          [attribute]
            BlockScalingFactorLayout layout
    
        2. adding cpp api `preprocessGroupedMatmulInputSf`
        
        TensorView* preprocessGroupedMatmulInputSf(
            TensorView* input,
            TensorView* input_offsets,
            TensorView* output_offsets,
            BlockScalingFactorLayout layout);
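
    As an illustration, a hedged sketch of how this api might be exercised when building a fusion; the factory helpers and dtype enumerators used here (makeSymbolicTensor, DataType::Float8_e4m3fn, DataType::Int32) are assumptions about the surrounding nvFuser C++ API rather than part of this PR:

        // Hedged usage sketch; helper and dtype names are assumptions.
        Fusion fusion;
        FusionGuard fg(&fusion);

        // 2D block scaling factors produced by an upstream quantization op.
        TensorView* sf = makeSymbolicTensor(2, DataType::Float8_e4m3fn);
        // Per-group offsets into the input rows and the output buffer.
        TensorView* input_offsets = makeSymbolicTensor(1, DataType::Int32);
        TensorView* output_offsets = makeSymbolicTensor(1, DataType::Int32);
        fusion.addInput(sf);
        fusion.addInput(input_offsets);
        fusion.addInput(output_offsets);

        // Reorder the scaling factors into the layout expected by the cutlass
        // grouped gemm kernel.
        TensorView* sf_fixed = preprocessGroupedMatmulInputSf(
            sf, input_offsets, output_offsets,
            BlockScalingFactorLayout::Block128x4);
        fusion.addOutput(sf_fixed);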
    
    Design topics on the layout op:
    
    1. I chose to match the output's root/loop domain with the logical
    domain of the input. This basically categorizes the operation as a pointwise
    op.
    2. The padding requirement is explicitly represented in the fusion IR.
    In order to work around the data-dependent padding size, I'm opting for
    allocating the maximum padding size.
    3. Indexing on output is done in the runtime function, so we don't need
    to map anything to the logical/allocation domain of the output.
    Base automatically changed from jj/layout_op_PR0_ir_node to main September 8, 2025 23:45
    Runtime function signature.
    
      template <
          typename T,
          typename Index_T,
          int BLOCK_ROW_OUTER,
          int BLOCK_ROW_INNER,
          int BLOCK_COL,
          int UNROLL_FACTOR>
      __device__ void groupedBlockLayout(
          T* output,
          const T* input,
          const nvfuser_index_t row_idx,
          const nvfuser_index_t col_idx,
          const Index_T* expert_offsets,
          const Index_T* output_offsets,
          const nvfuser_index_t col_size,
          const nvfuser_index_t group_size)
    
    where:
      BLOCK_ROW_OUTER, BLOCK_ROW_INNER, BLOCK_COL will be translated from BlockScalingFactorLayout, e.g. Block128x4 is translated to 32, 4, 4.
    
    This function will be used by codegen for `GroupedBlockScalingFactorLayoutOp`
      `output` is expected to be the beginning of output buffer,
         indexing will be done inside the function template with help of `row_idx`,
         `col_idx`, `expert_offsets`, `output_offsets` and `col_size`
      Meanwhil, indexing on `input` would have been resolved during device lowering.
    @jjsjann123 jjsjann123 force-pushed the jj/layout_op_PR1_runtime_function branch from 19fa2e0 to 967b7f5 on September 8, 2025 23:57
    @jjsjann123 (Collaborator Author) commented:

    !test

    @naoyam (Collaborator) left a comment:


    Is there any not-so-tedious way to verify the result?

    rdspring1 pushed a commit that referenced this pull request Sep 9, 2025
    @naoyam (Collaborator) left a comment:


    Stamping

    @jjsjann123 jjsjann123 merged commit 08ad4ff into main Sep 9, 2025
    48 of 52 checks passed
    @jjsjann123 jjsjann123 deleted the jj/layout_op_PR1_runtime_function branch September 9, 2025 18:33
    jjsjann123 added a commit that referenced this pull request Sep 15, 2025
    #5118 PR3: enable codegen for layout op <- this PR
    #5115 PR2: add layout op runtime function
    #5114 PR1: add layout op
    
    1. Add indexing lowering for `PreprocessGroupedMatmulInputSf`:
       we resolve indexing for the input TV; we compute the logical index for
       `row_idx` and `col_idx` and feed them as op attributes in the index pass
       during device lowering;

    2. Add codegen for `PreprocessGroupedMatmulInputSf`:
       the operation adds lowering logic that calls the runtime function; the
       codegen utilizes the extra indexing bits added during index lowering
       (a hypothetical sketch of the emitted call is shown after this commit
       message);

    3. Skip domain validation in `OptOutMutator::mutate(TensorDomain*)`:
       mutating the domain shouldn't try to validate the coverage, because there
       is no guarantee that TensorDomain entries match identically (e.g. the
       layout op as well as the scatter op);

    4. Add a cpp test with a manual kernel to validate the correctness of the
       layout op.

    5. Refactor `PreprocessGroupedMatmulInputSf` to use the allocation domain to
       represent the padding logic (instead of the logical domain in #5114).
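
    For item 2, the emitted call could plausibly take the following shape; the tensor names, the resolved input index, and the SFType placeholder are illustrative assumptions, while the argument list follows the runtime function signature from #5115:

        // Hypothetical codegen output for PreprocessGroupedMatmulInputSf.
        // SFType is a stand-in for the fp8 scale-factor element type; the
        // template ints correspond to Block128x4 (32, 4, 4) plus an unroll
        // factor chosen by the scheduler.
        preprocessGroupedMatmulInputSf<SFType, int32_t, 32, 4, 4, 4>(
            &T4[0],                    // output: base of the output buffer
            &T0[resolved_input_index], // input index resolved in device lowering
            row_idx,                   // logical row index from the index pass
            col_idx,                   // logical column index from the index pass
            &T1[0],                    // input_offsets
            &T2[0],                    // output_offsets
            col_size,
            group_size);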