GroupedBlockQuantizeOp PR0: Adding runtime function #5775
Conversation
1. Refactor the existing block_layout op and block_quantization_kernel to reuse existing runtime functions; 2. add a runtime function for GroupedBlockQuantizeOp.
!test
Review updated until commit e66c444

Description

Relevant files:
- Enhancement
- Configuration changes
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 No relevant tests
- 🔒 No security concerns identified
- ⚡ Recommended focus areas for review
Code Quality

The `grouped_block_quantize_to_nvfp4` function has a linear search loop to find the `expert_id`, which could be inefficient for large group sizes. Consider whether a binary search or other optimization is needed.
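A minimal sketch of the binary-search alternative the reviewer suggests, assuming `input_offsets` is sorted ascending with one entry per group start; the function name and signature here are illustrative, not the PR's actual code:

```cuda
#include <cstdint>

// Find the expert whose row range contains row_idx, i.e. the largest i
// with input_offsets[i] <= row_idx. Assumes input_offsets is sorted
// ascending with input_offsets[0] == 0. O(log group_size) instead of a
// linear scan.
__device__ int findExpertId(
    const int64_t* input_offsets,
    int group_size,
    int64_t row_idx) {
  int lo = 0;
  int hi = group_size - 1;
  while (lo < hi) {
    int mid = (lo + hi + 1) / 2; // bias up so lo always advances
    if (input_offsets[mid] <= row_idx) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}
```

For typical expert counts (a few dozen groups) the linear scan is probably fine; the switch only pays off as group_size grows.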
!test
Greptile Summary

Refactored block quantization runtime code by consolidating the block layout helpers into block_quantization_kernels.cu.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant Kernel as Generated Kernel Code
participant CG as CodeGen (codegen.cpp)
participant CK as CompiledKernel (compiled_kernel.cpp)
participant BQ as block_quantization_kernels.cu
Note over CG,CK: Build Phase
CK->>CK: Check has_block_layout || has_block_quantize_op
CK->>BQ: Include block_quantization_kernels_cu
Note over BQ: Contains merged functionality:<br/>- outputOffsetAfterSwizzlePadding<br/>- preprocessGroupedMatmulInputSf<br/>- block_quantize_to_nvfp4_util<br/>- grouped_block_quantize_to_nvfp4
Note over Kernel,BQ: Runtime Phase
Kernel->>BQ: Call bq::preprocessGroupedMatmulInputSf
Note over BQ: Uses outputOffsetAfterSwizzlePadding<br/>for swizzle layout computation
Kernel->>BQ: Call bq::grouped_block_quantize_to_nvfp4
BQ->>BQ: Find expert_id from input_offsets
BQ->>BQ: Compute output offset with swizzle
BQ->>BQ: Call block_quantize_to_nvfp4_util
Note over BQ: Performs quantization and<br/>writes block scaling factors
```
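A minimal sketch of the build-phase consolidation shown above: one runtime file now serves both op kinds. The function and flag names are illustrative; the real logic lives in csrc/runtime/compiled_kernel.cpp.

```cuda
#include <string>

// Illustrative sketch (not the PR's exact code): with block_layout.cu
// merged away, both op kinds pull in the same stringified runtime file.
std::string gatherRuntimeFiles(
    bool has_block_layout,
    bool has_block_quantize_op,
    const std::string& block_quantization_kernels_cu) {
  std::string code;
  if (has_block_layout || has_block_quantize_op) {
    code += block_quantization_kernels_cu;
  }
  return code;
}
```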
!test
Greptile Overview
Greptile Summary
This PR refactors runtime functions to prepare for grouped block quantization operations. The changes consolidate block layout functions from block_layout.cu into block_quantization_kernels.cu and introduce new runtime functions.
Key changes:
- Moved `outputOffsetAfterSwizzlePadding` and `preprocessGroupedMatmulInputSf` from `block_layout.cu` to `block_quantization_kernels.cu` namespace `bq::`
- Refactored `block_quantize_to_nvfp4` by extracting core logic into `block_quantize_to_nvfp4_util` for reuse (see the sketch after this list)
- Added new `grouped_block_quantize_to_nvfp4` function that combines quantization with grouped layout handling
- Added `block_quantize_to_mxfp8` function for FP8 quantization
- Updated build system to remove `block_layout.cu` and consolidated resource loading
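The extraction follows a common pattern: the original entry point becomes a thin wrapper and the grouped variant reuses the same core. A hedged sketch of that shape (parameter lists are placeholders, not the PR's actual signatures):

```cuda
#include <cstdint>

// Illustrative: a shared utility owns the quantization logic and both
// entry points delegate to it.
__device__ void block_quantize_to_nvfp4_util(
    const float* in,
    uint8_t* out,
    uint8_t* block_scales,
    float global_scale,
    int64_t sf_offset) {
  // Placeholder body: the real utility computes the block max, derives
  // the scale, writes block_scales[sf_offset], and packs nvfp4 values.
}

// Non-grouped path: scale factors start at offset 0.
__device__ void block_quantize_to_nvfp4(
    const float* in, uint8_t* out, uint8_t* block_scales,
    float global_scale) {
  block_quantize_to_nvfp4_util(in, out, block_scales, global_scale, 0);
}

// Grouped path: a caller-computed swizzled offset selects where this
// row's scale factor lands in the per-expert padded layout.
__device__ void grouped_block_quantize_to_nvfp4(
    const float* in, uint8_t* out, uint8_t* block_scales,
    float global_scale, int64_t swizzled_offset) {
  block_quantize_to_nvfp4_util(
      in, out, block_scales, global_scale, swizzled_offset);
}
```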
Issue found:
- Parameter type mismatch in `block_quantize_to_nvfp4_util`: it expects `global_scale` by reference but callers pass by value (illustrated below)
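A standalone illustration (not the PR's code) of why such a mismatch matters: a non-const reference parameter refuses rvalue arguments, and even when an lvalue compiles, the utility can silently mutate the caller's scale.

```cuda
// Minimal repro of the reference/value pitfall, independent of the PR.
void quantize_util(float& global_scale) {
  global_scale *= 0.5f; // mutates the caller's variable
}

int main() {
  float s = 2.0f;
  quantize_util(s);       // compiles: lvalue binds to float&
  // quantize_util(s * 2); // error: an rvalue cannot bind to a
  //                       // non-const lvalue reference
  return 0;
}
```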
Confidence Score: 3/5
- This PR has a parameter type mismatch that needs to be fixed before merging
- The refactoring is well-structured and consolidates related functionality appropriately. However, there's a critical parameter type inconsistency in `block_quantize_to_nvfp4_util` where `global_scale` is declared as a reference but callers pass it by value. This will cause compilation errors or incorrect behavior and must be resolved.
- Pay close attention to `runtime/block_quantization_kernels.cu`: fix the parameter type mismatch on line 202
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| CMakeLists.txt | 5/5 | removed block_layout.cu from build files as the functions were consolidated into block_quantization_kernels.cu |
| csrc/runtime/compiled_kernel.cpp | 5/5 | removed block_layout.h include and consolidated resource loading - block_quantization_kernels_cu now loaded for both has_block_layout and has_block_quantize_op flags |
| runtime/block_quantization_kernels.cu | 3/5 | major refactoring - added helper functions from block_layout.cu, refactored block_quantize_to_nvfp4 into utility function, added new block_quantize_to_mxfp8 and grouped_block_quantize_to_nvfp4 functions; parameter type mismatch found |
Sequence Diagram
```mermaid
sequenceDiagram
participant Caller as Fusion Kernel
participant GBQ as grouped_block_quantize_to_nvfp4
participant Util as block_quantize_to_nvfp4_util
participant Helper as outputOffsetAfterSwizzlePadding
Note over Caller,Helper: New Grouped Block Quantization Flow
Caller->>GBQ: input, row_idx, col_idx, offsets, group_size
Note over GBQ: Find expert_id for current row
GBQ->>GBQ: Search input_offsets array
Note over GBQ: Calculate group-relative indices
GBQ->>GBQ: c_row_idx = row_idx - input_offsets[expert_id]
GBQ->>GBQ: padded_col_size = ceil(col_size/BLOCK_SIZE / BLOCK_COL) * BLOCK_COL
GBQ->>Helper: c_row_idx, col_idx/BLOCK_SIZE, padded_col_size
Helper-->>GBQ: swizzled index
Note over GBQ: Calculate output offset
GBQ->>GBQ: offset = output_offsets[expert_id] * padded_col_size + index
GBQ->>Util: input, output, block_scales, global_scale, offset
Note over Util: Quantization Logic
Util->>Util: Convert to float & compute max
Util->>Util: Reduce across threads
Util->>Util: Scale with global_scale (if enabled)
Util->>Util: Clamp to FP8
Util->>Util: Write block_scales[offset]
Util->>Util: Quantize to nvfp4
Util-->>GBQ: quantized output
GBQ-->>Caller: output with swizzled layout
```
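Read end to end, the diagram amounts to roughly the following per-row control flow. This is a hedged sketch assembled from the diagram's own step names; BLOCK_SIZE/BLOCK_COL values and the helper signatures are illustrative placeholders, with declarations standing in for the PR's definitions (link with relocatable device code).

```cuda
#include <cstdint>

// Illustrative constants: 16-element quantization blocks, scale-factor
// columns padded to a multiple of 4. Values are placeholders.
constexpr int BLOCK_SIZE = 16;
constexpr int BLOCK_COL = 4;

// Helpers from the PR, declared here only; see the diff for the bodies.
__device__ int64_t outputOffsetAfterSwizzlePadding(
    int64_t c_row_idx, int64_t sf_col_idx, int64_t padded_col_size);
__device__ void block_quantize_to_nvfp4_util(
    const float* in, uint8_t* out, uint8_t* block_scales,
    float global_scale, int64_t sf_offset);

// Hedged per-row sketch of grouped_block_quantize_to_nvfp4, following
// the diagram: locate the expert, rebase the row, pad the scale-factor
// columns, swizzle, then delegate to the shared quantization utility.
__device__ void grouped_block_quantize_to_nvfp4_sketch(
    const float* in, uint8_t* out, uint8_t* block_scales,
    float global_scale,
    const int64_t* input_offsets, const int64_t* output_offsets,
    int group_size, int64_t row_idx, int64_t col_idx, int64_t col_size) {
  // 1. Find expert_id: last group whose start offset is <= row_idx
  //    (the PR uses a linear scan here).
  int expert_id = 0;
  for (int g = 0; g < group_size; ++g) {
    if (input_offsets[g] <= row_idx) {
      expert_id = g;
    }
  }
  // 2. Group-relative row index.
  int64_t c_row_idx = row_idx - input_offsets[expert_id];
  // 3. Pad the per-row scale-factor count to a multiple of BLOCK_COL:
  //    padded_col_size = ceil(ceil(col_size / BLOCK_SIZE) / BLOCK_COL) * BLOCK_COL
  int64_t sf_cols = (col_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
  int64_t padded_col_size =
      (sf_cols + BLOCK_COL - 1) / BLOCK_COL * BLOCK_COL;
  // 4. Swizzled index within this expert's scale-factor region.
  int64_t index = outputOffsetAfterSwizzlePadding(
      c_row_idx, col_idx / BLOCK_SIZE, padded_col_size);
  // 5. Absolute offset into the scale-factor output.
  int64_t sf_offset = output_offsets[expert_id] * padded_col_size + index;
  // 6. Shared utility quantizes and writes block_scales[sf_offset].
  block_quantize_to_nvfp4_util(in, out, block_scales, global_scale, sf_offset);
}
```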
protonu left a comment
LGTM
Context
This series of PRs aims to enable a single kernel that handles both quantization and the block scaling factor layout on grouped tensors.
The existing solution for nvfp4 quantization of an activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor needs to be processed by PreprocessGroupedMatmulInputSf in order to satisfy the swizzle layout required by grouped_mm kernels.
This series of PRs merges the two operations into a single one, as sketched below.
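A hedged sketch of the before/after at the kernel-launch level; the kernel names and (empty) signatures are illustrative stand-ins, not nvfuser's generated code.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the generated kernels.
__global__ void blockQuantizationKernel() {}
__global__ void preprocessGroupedMatmulInputSfKernel() {}
__global__ void groupedBlockQuantizeKernel() {}

void existingFlow() {
  dim3 grid(1), block(128);
  // i. quantize: produces scaled_tv and block_scaling_factor
  blockQuantizationKernel<<<grid, block>>>();
  // ii. re-layout block_scaling_factor into the swizzle grouped_mm needs
  preprocessGroupedMatmulInputSfKernel<<<grid, block>>>();
}

void fusedFlow() {
  dim3 grid(1), block(128);
  // Goal of the PR series: one kernel does both steps.
  groupedBlockQuantizeKernel<<<grid, block>>>();
}
```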
Stacked PRs
#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
What's in this PR