base: jj/grouped_block_quantize_op_1
GroupedBlockQuantizeOp PR2: Adding python API and updating llama4 benchmark #5777
Conversation
Review updated until commit 5c905de

Relevant files: Enhancement, Tests
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Performance validation
## Context
This series of PRs enables a single kernel that handles both quantization and the block scaling factor layout for grouped tensors.
The existing solution for nvfp4 quantization of the activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor then needs to be processed by PreprocessGroupedMatmulInputSf to satisfy the swizzle layout required by grouped_mm kernels.
This series of PRs merges the two operations into a single one; a sketch of both paths follows.
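For illustration, here is a minimal sketch of the two paths in the nvFuser Python frontend. Except for `ops.nv_grouped_block_quantize`, which this PR series introduces, the op names and signatures below are assumptions made for this example and may not match the actual frontend API.

```python
# Illustrative sketch only. Except for ops.nv_grouped_block_quantize (added by
# this PR series), the op names and signatures below are assumptions.
from nvfuser import FusionDefinition


def two_step_path(fd: FusionDefinition, activation, offsets):
    # Existing path: BlockQuantizationOp produces the scaled tensor and the
    # block scaling factor ...
    scaled_tv, block_sf = fd.ops.block_quantize(activation)  # hypothetical name
    # ... which PreprocessGroupedMatmulInputSf then rewrites into the swizzled
    # layout required by grouped_mm kernels.
    swizzled_sf = fd.ops.preprocess_grouped_matmul_input_sf(block_sf, offsets)  # hypothetical name
    return scaled_tv, swizzled_sf


def fused_path(fd: FusionDefinition, activation, offsets):
    # Fused path: a single op emits both the quantized tensor and a scaling
    # factor that is already in the swizzled layout grouped_mm expects.
    scaled_tv, swizzled_sf = fd.ops.nv_grouped_block_quantize(activation, offsets)
    return scaled_tv, swizzled_sf
```

Merging the two steps lets the scaling-factor layout be produced directly by the quantization kernel rather than by a separate pass.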
### Stacked PRs
- #5775 GroupedBlockQuantizationOp PR0: Adding runtime function
- #5776 GroupedBlockQuantizationOp PR1: Adding codegen support
- #5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
## What's in this PR
- Added the Python API `ops.nv_grouped_block_quantize` for GroupedBlockQuantizationOp.
- Updated the llama4 benchmark.
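As a rough usage sketch (not the committed API), the new op might be exercised from the Python frontend roughly like this; the call signature, return values, and dtypes of `ops.nv_grouped_block_quantize` are assumptions for illustration.

```python
# Rough usage sketch: the signature and return values of
# ops.nv_grouped_block_quantize are assumptions, not the actual API.
import torch
from nvfuser import FusionDefinition, DataType

with FusionDefinition() as fd:
    # A grouped activation tensor plus per-group offsets, as consumed by
    # grouped_mm in MoE-style workloads such as the llama4 benchmark.
    act = fd.define_tensor(shape=[-1, -1], dtype=DataType.BFloat16)
    offsets = fd.define_tensor(shape=[-1], dtype=DataType.Int32)

    # Assumed to return the quantized tensor and a block scaling factor that
    # is already in the swizzled layout required by grouped_mm kernels.
    scaled, block_sf = fd.ops.nv_grouped_block_quantize(act, offsets)

    fd.add_output(scaled)
    fd.add_output(block_sf)

# Arbitrary example inputs for illustration.
activation = torch.randn(512, 256, dtype=torch.bfloat16, device="cuda")
group_offsets = torch.tensor([0, 128, 384, 512], dtype=torch.int32, device="cuda")
quantized, sf = fd.execute([activation, group_offsets])
```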