Conversation

@protonu protonu commented Nov 21, 2025

No description provided.

github-actions bot commented Nov 21, 2025

Review updated until commit 5e0b02f

Description

  • Add Python bindings for NVFP4 block quantization via the nv_block_quantize operation

  • Implement memory layout optimization via swizzleBlockScale for improved access patterns

  • Add comprehensive test suite comparing against Transformer Engine and PyTorch references

  • Optimize CUDA kernel performance by replacing division with multiplication by a reciprocal

Changes walkthrough

Relevant files
Enhancement
ops.cpp
Add Python bindings for NVFP4 block quantization                 

python/python_direct/ops.cpp

  • Add swizzleBlockScale helper function to optimize memory layout for
    FP4 quantization
  • Implement bindQuantizationOps function to expose nv_block_quantize to
    Python
  • Add the nv_block_quantize operation with parameters for input tensor,
    global scale, block size, and dtype (see the usage sketch after this
    list)
  • Register the quantization operations in the main binding function
  • +73/-0
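
For orientation, below is a rough sketch of how the new binding might be used from Python. The module path (nvfuser_direct), keyword names (global_scale, block_size, dtype), the DataType enum value, and the two-output return convention are all assumptions inferred from this summary, not taken from the PR.

    # Hypothetical usage sketch of the new nv_block_quantize binding.
    # Module path, keyword names, dtype enum, and the two-output return
    # convention are assumptions based on the summary above, not the PR itself.
    import torch
    from nvfuser_direct import FusionDefinition, DataType  # assumed module path

    inp = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")
    global_scale = torch.tensor([1.0], dtype=torch.float32, device="cuda")

    with FusionDefinition() as fd:
        t_in = fd.from_pytorch(inp)
        t_gs = fd.from_pytorch(global_scale)
        # Assumed to return the FP4-quantized tensor and the per-block scales.
        q, scales = fd.ops.nv_block_quantize(
            t_in,
            global_scale=t_gs,
            block_size=16,                 # NVFP4 commonly uses 16-element blocks
            dtype=DataType.Float4_e2m1fn,  # assumed enum name for the FP4 dtype
        )
        fd.add_output(q)
        fd.add_output(scales)

    quantized, block_scales = fd.execute([inp, global_scale])
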
python_translate.cpp
Map BlockQuantizationOp to the Python frontend

python/python_direct/python_translate.cpp

  • Add a handler for BlockQuantizationOp to map the operation to the
    Python frontend
  • Generate a kwargs-based call to fd.ops.nv_block_quantize with proper
    argument mapping
  • Handle output registration for both the quantized tensor and the block
    scales
  • +27/-0
block_quantization_kernels.cu
Optimize CUDA kernel performance for block quantization

runtime/block_quantization_kernels.cu

  • Replace division with multiplication by a reciprocal for better
    performance
  • Simplify clamping logic and remove redundant conversions
  • Modify global scale application logic for improved numerical stability
  • Change value scaling from division to multiplication for better
    precision
  • +13/-12
Tests
test_narrow_precision.py
Add comprehensive test suite for NVFP4 quantization

tests/python/direct/test_narrow_precision.py

  • Add functional_nvfp4_quantize using the Transformer Engine
    NVFP4Quantizer as a reference
  • Add test_nv_block_quantization_vs_te comparing nvfuser output against
    Transformer Engine
  • Add test_nv_block_quantization_vs_pytorch comparing against a PyTorch
    reference implementation (see the sketch after this list)
  • Add test_scaled_mm_new testing scaled matrix multiplication with NVFP4
    tensors
  • Update utility functions and constants for NVFP4 quantization
  • +357/-3
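
For context on the PyTorch-reference comparison, here is a minimal sketch of an NVFP4-style block quantization reference in plain PyTorch. It assumes 16-element blocks along the last dimension and E4M3 block scales; the helper name and details are illustrative, not the PR's actual test code.

    # Illustrative NVFP4-style block quantization reference in plain PyTorch.
    # Assumes 16-element blocks along the last dim and E4M3 block scales;
    # this is not the PR's actual reference implementation.
    import torch

    FP4_MAX = 6.0  # largest magnitude representable in FP4 (e2m1)

    def ref_nvfp4_block_quantize(x: torch.Tensor, block_size: int = 16):
        """Quantize x into FP4-valued floats plus per-block E4M3 scales."""
        orig_shape = x.shape
        blocks = x.float().reshape(-1, block_size)

        # Per-block scale chosen so the block max maps to FP4_MAX, then rounded
        # through float8_e4m3fn to mimic FP8 scale storage.
        block_max = blocks.abs().amax(dim=1, keepdim=True)
        scales = (block_max / FP4_MAX).to(torch.float8_e4m3fn).float()
        scales = torch.where(scales == 0, torch.ones_like(scales), scales)

        # Scale into FP4 range and round each value to the nearest point on the
        # e2m1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
        scaled = (blocks / scales).clamp(-FP4_MAX, FP4_MAX)
        grid = torch.tensor(
            [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
        idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
        quantized = grid[idx] * scaled.sign()

        return quantized.reshape(orig_shape), scales.reshape(*orig_shape[:-1], -1)
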
Bug fix
tensor_metadata.cpp
Skip validation for empty tensors

csrc/tensor_metadata.cpp

  • Comment out validation of allocation sizes and strides for empty
    tensors
  • Skip validation when the tensor has zero elements to avoid unnecessary
    overhead
  • +4/-3

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Validation Code Removed

The validation of allocation sizes and strides has been commented out in the inferAndValidateAllocationSizesAndStrides function. This could hide memory layout issues; it should be verified that the validation is no longer needed or has been moved elsewhere.

    // if (tensor.numel() != 0) {
    //   validateAllocationSizesAndStrides(tv, allocation_sizes,
    //   allocation_strides);
    // }
Kernel Scaling Logic Changes

The scaling and clamping logic in block_quantize_to_nvfp4 has been significantly modified. The previous implementation used explicit clamping with min/max bounds, while the new implementation relies on the FP8 conversion for clamping. This changes the numerical behavior and should be verified for correctness across the full input range; a small PyTorch model of the new path follows the snippet below.

    // This division should be replaced with a multiplication
    // by a reciprocal for better performance.
    // float scaled_max = block_max / 6.000000000e+00f;
    
    constexpr float rcp_6f = 1.0f / 6.0f;
    
    float scaled_max = 0.0f;
    if constexpr (USE_GLOBAL_SCALE) {
      scaled_max = block_max * global_scale[0] * rcp_6f;
    } else {
      scaled_max = block_max / 6.000000000e+00f;
    }
    
    __e4m3 clamped_max_fp8 = __float2e4m3(scaled_max);
    
    float clamped_max = __e4m32float(clamped_max_fp8);
    
    if constexpr (USE_GLOBAL_SCALE) {
      clamped_max = global_scale[0] / clamped_max;
    }
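
To make the behavioral change easier to reason about, here is a small PyTorch model of the new scaling path, under the assumption that __float2e4m3 and __e4m32float correspond to float32 <-> float8_e4m3fn casts; it mirrors the diff above and is not the kernel itself.

    # Small PyTorch model of the new scaling path shown above, assuming
    # __float2e4m3 / __e4m32float behave like float32 <-> float8_e4m3fn casts.
    import torch

    RCP_6F = 1.0 / 6.0  # reciprocal of the FP4 max magnitude (6.0)

    def new_scale_path(block_max: float, global_scale: float | None) -> float:
        # Division by 6 is replaced with multiplication by a precomputed reciprocal.
        if global_scale is not None:
            scaled_max = block_max * global_scale * RCP_6F
        else:
            scaled_max = block_max / 6.0

        # Clamping is now implicit in the FP8 (e4m3) round trip instead of
        # explicit min/max bounds.
        clamped_max = torch.tensor(scaled_max).to(torch.float8_e4m3fn).float()

        # With a global scale, the per-block multiplier becomes
        # global_scale / clamped_max (IEEE division, so 0 maps to inf).
        if global_scale is not None:
            clamped_max = global_scale / clamped_max
        return float(clamped_max)
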
BFloat16 Conversion Issue

The bfloat16 conversion has been modified into a double round trip (__bfloat2float -> __float2bfloat -> __bfloat2float), which appears redundant and may introduce precision loss. Whether the extra round trip is needed, or the original single conversion was already correct, should be verified; see the check after the snippet below.

    vec_in[i] = __bfloat2float(__float2bfloat(__bfloat2float(input[i])));
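
If __bfloat2float and __float2bfloat are ordinary bfloat16 <-> float32 casts, the extra round trip is redundant but lossless, because every bfloat16 value is exactly representable in float32. A quick PyTorch check of that assumption:

    # Check (under the assumption that the intrinsics behave like ordinary
    # bfloat16 <-> float32 casts): the extra round trip does not change values.
    import torch

    x = torch.randn(1 << 16, dtype=torch.bfloat16)
    once = x.float()                      # single conversion
    twice = x.float().bfloat16().float()  # double round trip, as in the new code
    assert torch.equal(once, twice)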

@protonu protonu closed this Dec 1, 2025
@protonu protonu deleted the pbasu_bq_py_api branch December 3, 2025 19:11