GroupedBlockQuantizeOp PR1: Adding codegen support (#5776)
## Context
This series of PRs enables a single kernel that handles both quantization
and the block scaling factor layout for grouped tensors.
The existing solution for nvfp4 quantization of an activation tensor for
grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor needs to be processed by
PreprocessGroupedMatmulInputSf in order to satisfy the swizzle layout
required by grouped_mm kernels.
This series of PRs merges the two operations into a single one; a sketch of
the existing two-step flow follows.
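
For orientation, here is a minimal, hedged sketch of the existing two-step
pattern. The helper names (`blockQuantize`, `preprocessGroupedMatmulInputSf`),
the include paths, the data types, and the result field names are
illustrative assumptions and may not match the actual nvFuser API.

```cpp
// Illustrative sketch only: op helper names and signatures are assumptions,
// not the actual nvFuser API.
#include <fusion.h>       // assumed nvFuser headers
#include <ops/all_ops.h>

using namespace nvfuser;

void buildTwoStepQuantization(Fusion* fusion) {
  FusionGuard fg(fusion);

  // Grouped activation tensor, e.g. [tokens, hidden] in bf16.
  TensorView* in = makeSymbolicTensor(2, DataType::BFloat16);
  // Per-group offsets consumed by the swizzle preprocessing.
  TensorView* offsets = makeSymbolicTensor(1, DataType::Int32);
  fusion->addInput(in);
  fusion->addInput(offsets);

  // Step i: BlockQuantizationOp produces the quantized data (scaled_tv)
  // and a per-block scaling factor (hypothetical helper name).
  auto quant = blockQuantize(in);

  // Step ii: a second kernel rewrites the scaling factor into the swizzled
  // layout that grouped_mm expects (hypothetical helper name).
  TensorView* sf_swizzled =
      preprocessGroupedMatmulInputSf(quant.block_scaling_factor, offsets);

  fusion->addOutput(quant.scaled_tv);
  fusion->addOutput(sf_swizzled);
}
```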
### Stacked PRs
- #5775 GroupedBlockQuantizationOp PR0: Adding runtime function
- #5776 GroupedBlockQuantizationOp PR1: Adding codegen support
- #5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
## What's in this PR
1. Adding the Fusion IR node GroupedBlockQuantizationOp. The operation is a
combination of BlockQuantizationOp and PreprocessGroupedMatmulInputSf,
and it inherits all the validation / checks from the two operations.
The operation is similar to BlockQuantizationOp, with two exceptions:
i. The block scaling factor output doesn't have the swizzle logic
represented as allocation domain transformations;
ii. It takes two additional inputs (input_offsets and output_offsets) to
facilitate group indexing, similar to PreprocessGroupedMatmulInputSf.
A sketch of how the merged op could be built is shown after this list.
2. Adding a C++ test case for GroupedBlockQuantizationOp.
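
A similarly hedged sketch of the merged flow is below; the builder name
`groupedBlockQuantize` and the result field names are assumptions, not the
API added by this PR.

```cpp
// Illustrative sketch only: groupedBlockQuantize and the result field names
// are assumptions; consult the PR for the real builder and test code.
#include <fusion.h>
#include <ops/all_ops.h>

using namespace nvfuser;

void buildGroupedQuantization(Fusion* fusion) {
  FusionGuard fg(fusion);

  TensorView* in = makeSymbolicTensor(2, DataType::BFloat16);
  TensorView* input_offsets = makeSymbolicTensor(1, DataType::Int32);
  TensorView* output_offsets = makeSymbolicTensor(1, DataType::Int32);
  fusion->addInput(in);
  fusion->addInput(input_offsets);
  fusion->addInput(output_offsets);

  // A single IR node emits both the quantized data and the block scaling
  // factor. The swizzle is handled inside the op rather than expressed as
  // allocation domain transforms, and the offset tensors drive per-group
  // indexing.
  auto result = groupedBlockQuantize(in, input_offsets, output_offsets);

  fusion->addOutput(result.scaled_tv);
  fusion->addOutput(result.block_scaling_factor);
}
```

Fusing the swizzle into the quantization op lets a single kernel produce the
scaling factor directly in the layout grouped_mm requires, instead of
materializing an intermediate tensor and launching a second kernel.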