Conversation

@protonu protonu commented Nov 21, 2025

No description provided.

github-actions bot commented Nov 21, 2025

Review updated until commit 5e0b02f

Description

  • Add Python bindings for NVFP4 block quantization via the nv_block_quantize operation

  • Implement memory layout optimization via swizzleBlockScale for improved access patterns

  • Add comprehensive test suite comparing against Transformer Engine and PyTorch references

  • Optimize CUDA kernel performance by replacing division with multiplication by a reciprocal

Changes walkthrough

Relevant files
Enhancement
ops.cpp
Add Python bindings for NVFP4 block quantization                 

python/python_direct/ops.cpp

  • Add swizzleBlockScale helper function to optimize memory layout for
    FP4 quantization
  • Implement bindQuantizationOps function to expose nv_block_quantize to
    Python
  • Add the nv_block_quantize operation with parameters for input tensor,
    global scale, block size, and dtype (see the usage sketch after this
    list)
  • Register the quantization operations in the main binding function
  • +73/-0
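
For orientation, below is a rough sketch of how the new binding might be used from Python. The module path (nvfuser_direct), keyword names (global_scale, block_size, dtype), the DataType enum value, and the two-output return convention are all assumptions inferred from this summary, not taken from the PR.

    # Hypothetical usage sketch of the new nv_block_quantize binding.
    # Module path, keyword names, dtype enum, and the two-output return
    # convention are assumptions based on the summary above, not the PR itself.
    import torch
    from nvfuser_direct import FusionDefinition, DataType  # assumed module path

    inp = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")
    global_scale = torch.tensor([1.0], dtype=torch.float32, device="cuda")

    with FusionDefinition() as fd:
        t_in = fd.from_pytorch(inp)
        t_gs = fd.from_pytorch(global_scale)
        # Assumed to return the FP4-quantized tensor and the per-block scales.
        q, scales = fd.ops.nv_block_quantize(
            t_in,
            global_scale=t_gs,
            block_size=16,                 # NVFP4 commonly uses 16-element blocks
            dtype=DataType.Float4_e2m1fn,  # assumed enum name for the FP4 dtype
        )
        fd.add_output(q)
        fd.add_output(scales)

    quantized, block_scales = fd.execute([inp, global_scale])
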
python_translate.cpp
Map BlockQuantizationOp to the Python frontend

python/python_direct/python_translate.cpp

  • Add a handler for BlockQuantizationOp to map the operation to the
    Python frontend
  • Generate a kwargs-based call to fd.ops.nv_block_quantize with proper
    argument mapping
  • Handle output registration for both the quantized tensor and the block
    scales
  • +27/-0
block_quantization_kernels.cu
Optimize CUDA kernel performance for block quantization

runtime/block_quantization_kernels.cu

  • Replace division with multiplication by a reciprocal for better
    performance
  • Simplify clamping logic and remove redundant conversions
  • Modify global scale application logic for improved numerical stability
  • Change value scaling from division to multiplication for better
    precision
  • +13/-12
Tests
test_narrow_precision.py
Add comprehensive test suite for NVFP4 quantization

tests/python/direct/test_narrow_precision.py

  • Add functional_nvfp4_quantize using the Transformer Engine
    NVFP4Quantizer as a reference
  • Add test_nv_block_quantization_vs_te comparing nvfuser output against
    Transformer Engine
  • Add test_nv_block_quantization_vs_pytorch comparing against a PyTorch
    reference implementation (see the sketch after this list)
  • Add test_scaled_mm_new testing scaled matrix multiplication with NVFP4
    tensors
  • Update utility functions and constants for NVFP4 quantization
  • +357/-3
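
For context on the PyTorch-reference comparison, here is a minimal sketch of an NVFP4-style block quantization reference in plain PyTorch. It assumes 16-element blocks along the last dimension and E4M3 block scales; the helper name and details are illustrative, not the PR's actual test code.

    # Illustrative NVFP4-style block quantization reference in plain PyTorch.
    # Assumes 16-element blocks along the last dim and E4M3 block scales;
    # this is not the PR's actual reference implementation.
    import torch

    FP4_MAX = 6.0  # largest magnitude representable in FP4 (e2m1)

    def ref_nvfp4_block_quantize(x: torch.Tensor, block_size: int = 16):
        """Quantize x into FP4-valued floats plus per-block E4M3 scales."""
        orig_shape = x.shape
        blocks = x.float().reshape(-1, block_size)

        # Per-block scale chosen so the block max maps to FP4_MAX, then rounded
        # through float8_e4m3fn to mimic FP8 scale storage.
        block_max = blocks.abs().amax(dim=1, keepdim=True)
        scales = (block_max / FP4_MAX).to(torch.float8_e4m3fn).float()
        scales = torch.where(scales == 0, torch.ones_like(scales), scales)

        # Scale into FP4 range and round each value to the nearest point on the
        # e2m1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
        scaled = (blocks / scales).clamp(-FP4_MAX, FP4_MAX)
        grid = torch.tensor(
            [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
        idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
        quantized = grid[idx] * scaled.sign()

        return quantized.reshape(orig_shape), scales.reshape(*orig_shape[:-1], -1)
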
Bug fix
tensor_metadata.cpp
Skip validation for empty tensors

csrc/tensor_metadata.cpp

  • Comment out validation of allocation sizes and strides for empty
    tensors
  • Skip validation when the tensor has zero elements to avoid unnecessary
    overhead
  • +4/-3

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Validation Code Removed

The validation of allocation sizes and strides has been commented out in the inferAndValidateAllocationSizesAndStrides function. This could hide memory layout issues; it should be verified that the validation is no longer needed or has been moved elsewhere.

    // if (tensor.numel() != 0) {
    //   validateAllocationSizesAndStrides(tv, allocation_sizes,
    //   allocation_strides);
    // }
Kernel Scaling Logic Changes

The scaling and clamping logic in block_quantize_to_nvfp4 has been significantly modified. The previous implementation used explicit clamping with min/max bounds, while the new implementation relies on the FP8 conversion for clamping. This changes the numerical behavior and should be verified for correctness across the full input range; a small PyTorch model of the new path follows the snippet below.

    // This division should be replaced with a multiplication
    // by a reciprocal for better performance.
    // float scaled_max = block_max / 6.000000000e+00f;
    
    constexpr float rcp_6f = 1.0f / 6.0f;
    
    float scaled_max = 0.0f;
    if constexpr (USE_GLOBAL_SCALE) {
      scaled_max = block_max * global_scale[0] * rcp_6f;
    } else {
      scaled_max = block_max / 6.000000000e+00f;
    }
    
    __e4m3 clamped_max_fp8 = __float2e4m3(scaled_max);
    
    float clamped_max = __e4m32float(clamped_max_fp8);
    
    if constexpr (USE_GLOBAL_SCALE) {
      clamped_max = global_scale[0] / clamped_max;
    }
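
To make the behavioral change easier to reason about, here is a small PyTorch model of the new scaling path, under the assumption that __float2e4m3 and __e4m32float correspond to float32 <-> float8_e4m3fn casts; it mirrors the diff above and is not the kernel itself.

    # Small PyTorch model of the new scaling path shown above, assuming
    # __float2e4m3 / __e4m32float behave like float32 <-> float8_e4m3fn casts.
    import torch

    RCP_6F = 1.0 / 6.0  # reciprocal of the FP4 max magnitude (6.0)

    def new_scale_path(block_max: float, global_scale: float | None) -> float:
        # Division by 6 is replaced with multiplication by a precomputed reciprocal.
        if global_scale is not None:
            scaled_max = block_max * global_scale * RCP_6F
        else:
            scaled_max = block_max / 6.0

        # Clamping is now implicit in the FP8 (e4m3) round trip instead of
        # explicit min/max bounds.
        clamped_max = torch.tensor(scaled_max).to(torch.float8_e4m3fn).float()

        # With a global scale, the per-block multiplier becomes
        # global_scale / clamped_max (IEEE division, so 0 maps to inf).
        if global_scale is not None:
            clamped_max = global_scale / clamped_max
        return float(clamped_max)
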
BFloat16 Conversion Issue

The bfloat16 conversion has been modified into a double round trip (__bfloat2float -> __float2bfloat -> __bfloat2float), which appears redundant and may introduce precision loss. Whether the extra round trip is needed, or the original single conversion was already correct, should be verified; see the check after the snippet below.

    vec_in[i] = __bfloat2float(__float2bfloat(__bfloat2float(input[i])));
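
If __bfloat2float and __float2bfloat are ordinary bfloat16 <-> float32 casts, the extra round trip is redundant but lossless, because every bfloat16 value is exactly representable in float32. A quick PyTorch check of that assumption:

    # Check (under the assumption that the intrinsics behave like ordinary
    # bfloat16 <-> float32 casts): the extra round trip does not change values.
    import torch

    x = torch.randn(1 << 16, dtype=torch.bfloat16)
    once = x.float()                      # single conversion
    twice = x.float().bfloat16().float()  # double round trip, as in the new code
    assert torch.equal(once, twice)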

@protonu protonu closed this Dec 1, 2025
@protonu protonu deleted the pbasu_bq_py_api branch December 3, 2025 19:11