add layout op runtime function #5115
#5118 PR3: enable codegen for layout op
#5115 PR2: add layout op runtime function
#5114 PR1: add layout op <- this PR

### Motivation

The operation supports the layout requirement of the cutlass grouped_mm kernel. The use case:

```
QuantizationOp(activation_bf16) -> TensorView* fp4_activation, TensorView* fp8_block_sf
```

Before feeding both inputs to the cutlass gemm, we need to update the block scaling factor's layout to satisfy the requirement of the gemm kernel. For details see: https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html#scale-factor-layouts

```
preprocessGroupedMatmulInputSf(fp8_block_sf, ...) -> TensorView* fp8_block_sf_layout_fixed
cutlassGroupedGemm(fp4_activation, fp8_block_sf_layout_fixed, ...)
```

### Code Change

1. Add Fusion node `PreprocessGroupedMatmulInputSf`:

   ```
   PreprocessGroupedMatmulInputSf
     [output]
       Val* output (2d tensor)
     [input]
       TensorView* input (2d tensor)
       TensorView* input_offsets (vector)
       TensorView* output_offsets (vector)
       Val* k (scalar)
       Val* g (scalar)
     [attribute]
       BlockScalingFactorLayout layout
   ```

2. Add cpp api `preprocessGroupedMatmulInputSf` (see the usage sketch below):

   ```
   TensorView* preprocessGroupedMatmulInputSf(
       TensorView* input,
       TensorView* input_offsets,
       TensorView* output_offsets,
       BlockScalingFactorLayout layout);
   ```

Design topics on the layout op:

1. I chose to match the output's root/loop domain with the logical domain of the input. This basically categorizes the operation as a pointwise op.
2. The padding requirement is explicitly represented in the fusion IR. To work around the data-dependent padding size, I opt to allocate the maximum padding size.
3. Indexing on the output is done in the runtime function, so we don't need to map anything to the logical/allocation domain of the output.
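For illustration, here is a minimal sketch of how the new cpp api might be used when building a fusion. It is not code from the PR: the header choices, the `TensorViewBuilder` usage, the tensor names, and the dtypes (fp8 e4m3 scale factors, int32 offsets) are assumptions.

```
// Sketch only: assumes the new op is exposed alongside the other nvFuser ops.
#include <fusion.h>
#include <ops/all_ops.h>

using namespace nvfuser;

void buildLayoutFusion(Fusion* fusion) {
  FusionGuard fg(fusion);

  // 2d block scaling factor produced by the quantization op (dtype assumed).
  TensorView* fp8_block_sf =
      TensorViewBuilder().ndims(2).dtype(DataType::Float8_e4m3fn).build();
  // Per-group offsets into the input rows and the (padded) output rows.
  TensorView* input_offsets =
      TensorViewBuilder().ndims(1).dtype(DataType::Int32).build();
  TensorView* output_offsets =
      TensorViewBuilder().ndims(1).dtype(DataType::Int32).build();

  fusion->addInput(fp8_block_sf);
  fusion->addInput(input_offsets);
  fusion->addInput(output_offsets);

  // Rewrite the scale factors into the layout expected by the cutlass
  // grouped gemm kernel.
  TensorView* fp8_block_sf_layout_fixed = preprocessGroupedMatmulInputSf(
      fp8_block_sf,
      input_offsets,
      output_offsets,
      BlockScalingFactorLayout::Block128x4);

  fusion->addOutput(fp8_block_sf_layout_fixed);
}
```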
Runtime function signature:

```
template <
    typename T,
    typename Index_T,
    int BLOCK_ROW_OUTER,
    int BLOCK_ROW_INNER,
    int BLOCK_COL,
    int UNROLL_FACTOR>
__device__ void groupedBlockLayout(
    T* output,
    const T* input,
    const nvfuser_index_t row_idx,
    const nvfuser_index_t col_idx,
    const Index_T* expert_offsets,
    const Index_T* output_offsets,
    const nvfuser_index_t col_size,
    const nvfuser_index_t group_size)
```

where:

- `BLOCK_ROW_OUTER`, `BLOCK_ROW_INNER`, `BLOCK_COL` will be translated from `BlockScalingFactorLayout`, e.g. Block128x4 is translated to 32, 4, 4.
- This function will be used by codegen for `GroupedBlockScalingFactorLayoutOp`.
- `output` is expected to be the beginning of the output buffer; indexing will be done inside the function template with the help of `row_idx`, `col_idx`, `expert_offsets`, `output_offsets` and `col_size`. Meanwhile, indexing on `input` would have been resolved during device lowering. (A hand-written call-site sketch follows below.)
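As a concrete illustration of the call contract described above, here is a hypothetical hand-written kernel that invokes the runtime function directly, using the Block128x4 translation (32, 4, 4). This is a sketch, not the PR's code: the kernel name, the thread-to-element mapping, the row-major input indexing, and `UNROLL_FACTOR = 1` are my assumptions; `groupedBlockLayout` and `nvfuser_index_t` are expected to come from the nvFuser runtime headers.

```
// Hypothetical harness (sketch): one thread per input element (row_idx, col_idx).
// Assumes the nvFuser runtime header defining groupedBlockLayout and
// nvfuser_index_t is in scope.
template <typename T, typename Index_T>
__global__ void manualLayoutKernel(
    T* output,
    const T* input,
    const Index_T* expert_offsets,
    const Index_T* output_offsets,
    nvfuser_index_t row_size,
    nvfuser_index_t col_size,
    nvfuser_index_t group_size) {
  const nvfuser_index_t row_idx = blockIdx.y * blockDim.y + threadIdx.y;
  const nvfuser_index_t col_idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (row_idx >= row_size || col_idx >= col_size) {
    return;
  }
  // Input indexing is resolved at the call site (mirroring what device
  // lowering would do, assuming a contiguous row-major input); output
  // indexing happens inside the runtime function via row_idx, col_idx,
  // expert_offsets, output_offsets and col_size.
  groupedBlockLayout<
      T,
      Index_T,
      /*BLOCK_ROW_OUTER=*/32,
      /*BLOCK_ROW_INNER=*/4,
      /*BLOCK_COL=*/4,
      /*UNROLL_FACTOR=*/1>(
      output,
      input + row_idx * col_size + col_idx,
      row_idx,
      col_idx,
      expert_offsets,
      output_offsets,
      col_size,
      group_size);
}
```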
Is there any not-so-tedious way to verify the result?
Stamping
#5118 PR3: enable codegen for layout op <- this PR
#5115 PR2: add layout op runtime function
#5114 PR1: add layout op

1. Add indexing lowering for `PreprocessGroupedMatmulInputSf`: we resolve indexing for the input TV, and we compute the logical indices for `row_idx` and `col_idx` and feed them as op attributes in the index pass during device lowering.
2. Add codegen for `PreprocessGroupedMatmulInputSf`: the op's lowering emits a call to the runtime function, and the codegen utilizes the extra indexing bits added during index lowering.
3. Skip domain validation in `OptOutMutator::mutate(TensorDomain*)`: mutating the domain shouldn't try to validate the coverage, because there is no guarantee that TensorDomain entries match identically (e.g. the layout op as well as the scatter op).
4. Add a cpp test with a manual kernel to validate the correctness of the layout op (see the offset-padding sketch below).
5. Refactor `PreprocessGroupedMatmulInputSf` to use the allocation domain to represent the padding logic (instead of the logical domain in #5114).
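For setting up inputs to such a test, the output offsets must account for per-group padding. Below is a hypothetical host-side helper (not from the PR) that rounds each group's row count up to a multiple of 128, the row extent of Block128x4; the prefix-sum interpretation of the offset vectors and the 128-row granularity are my assumptions based on the cutlass scale-factor layout docs.

```
// Hypothetical helper: build padded output offsets from input offsets.
// Assumes offsets are exclusive prefix sums of per-group row counts
// (length g + 1, starting at 0) and that rows are padded per group to a
// multiple of 128 (the Block128x4 row extent).
#include <cstdint>
#include <vector>

std::vector<int32_t> padOutputOffsets(const std::vector<int32_t>& input_offsets) {
  constexpr int32_t kRowBlock = 128;
  std::vector<int32_t> output_offsets(input_offsets.size(), 0);
  for (size_t g = 1; g < input_offsets.size(); ++g) {
    const int32_t rows = input_offsets[g] - input_offsets[g - 1];
    const int32_t padded_rows = (rows + kRowBlock - 1) / kRowBlock * kRowBlock;
    output_offsets[g] = output_offsets[g - 1] + padded_rows;
  }
  return output_offsets;
}
```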
#5118 PR3: enable codegen for layout op
#5115 PR2: add layout op runtime function <- this PR
#5114 PR1: add layout op
Todo for future PRs:
Add vectorization support.