@liqiangxl liqiangxl commented Jan 5, 2026

skip unsupported arches in grouped and scaled matmuls

@liqiangxl
Collaborator Author

!test

@greptile-apps
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile Overview

Greptile Summary

Overview

This PR updates architecture checks in Python tests to skip unsupported GPU architectures for FP4 quantization and GEMM operations. The changes replace module-level skip conditions with per-test @pytest.mark.skipif decorators.

Key Changes

  1. Added is_blackwell() helper in utils.py to identify Blackwell GPU variants
  2. Replaced module-level skip in test_cutlass_nvfp4_gemm.py with per-test decorators
  3. Consolidated skip conditions across test files to use microarchitecture_is(10, 0) for most tests
  4. Simplified dual skip conditions (e.g., is_pre_blackwell() + microarchitecture_is_pre(12) → single check)

Critical Issues Found

1. Incorrect Architecture Definition in is_blackwell()

The new is_blackwell() function includes compute capabilities 12.0 (RTX PRO 6000/RTX 50XX) and 12.1 (DGX Spark) as Blackwell architectures. However, the cutlass scheduler (csrc/scheduler/cutlass.cpp:75-84) explicitly only supports compute capabilities 10.0 and 10.3:

if (device_prop->major != 10 || !(device_prop->minor == 0 || device_prop->minor == 3)) {
  // Error: "Cutlass scheduler only supports GB200 and GB300 (cc 10.0 or 10.3)"
}

Impact: Tests using is_blackwell() will incorrectly run on 12.0/12.1 devices and fail at runtime.
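The mismatch described above can be checked without a GPU. The following is a minimal sketch, not code from the PR: the function names are hypothetical, and the two predicates only restate the capability sets quoted in this review (the scheduler check and the four capabilities listed for is_blackwell()).

```python
# Hypothetical Python mirror of the C++ scheduler check quoted above.
# Capabilities are passed in as (major, minor) so no GPU is required.

def cutlass_scheduler_supports(major: int, minor: int) -> bool:
    """Restates the quoted check: major == 10 and minor is 0 or 3."""
    return major == 10 and minor in (0, 3)


def is_blackwell_as_proposed(major: int, minor: int) -> bool:
    """Restates the capability set that the PR's is_blackwell() accepts."""
    return (major, minor) in {(10, 0), (10, 3), (12, 0), (12, 1)}


def mismatched_capabilities(candidates):
    """Capabilities where a test would run but the scheduler would reject."""
    return [
        cc
        for cc in candidates
        if is_blackwell_as_proposed(*cc) and not cutlass_scheduler_supports(*cc)
    ]


# Illustrative sample of compute capabilities, chosen for this sketch.
CANDIDATES = [(9, 0), (10, 0), (10, 3), (12, 0), (12, 1)]
print(mismatched_capabilities(CANDIDATES))
```

Under these assumptions the only capabilities that pass the proposed helper but fail the scheduler check are 12.0 and 12.1.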

2. Inconsistent Skip Conditions in test_narrow_precision.py

  • test_scaled_mm uses microarchitecture_is(10, 0) - only 10.0
  • test_scaled_mm_nv_quantized uses is_blackwell() - includes unsupported 12.0/12.1
  • test_cutlass_nvfp4_grouped_mm uses microarchitecture_is(10, 0) - only 10.0

All these tests use nvfuser operations that may rely on cutlass scheduling. The inconsistency creates unclear test coverage and will cause failures on 12.0/12.1 devices for test_scaled_mm_nv_quantized.

3. More Restrictive Than Original Logic

The old skip conditions checked:

  • Skip if is_pre_blackwell() (major < 10) or if not microarchitecture_is_pre(12) (major >= 12)
  • Result: Ran on 10.0, 10.3, and any 11.x devices

The new condition microarchitecture_is(10, 0) only runs on exactly 10.0, excluding:

  • 10.3 (B300/GB300) - which cutlass scheduler supports
  • 11.x devices - unclear if these exist or were intended to be supported

Impact: Reduced test coverage, particularly for 10.3 which is explicitly supported by the cutlass scheduler.
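To make the coverage difference concrete, here is a small sketch that evaluates both conditions over a few compute capabilities. It assumes, as this review does, that is_pre_blackwell() means major < 10 and microarchitecture_is_pre(12) means major < 12. The function names are hypothetical, and (11, 0) is included only to illustrate the gap, not because such a device is known to exist.

```python
def old_condition_runs(major: int, minor: int) -> bool:
    """Old decorators: skip if major < 10, and skip if major >= 12."""
    skipped = major < 10 or major >= 12
    return not skipped


def new_condition_runs(major: int, minor: int) -> bool:
    """New decorator: skip unless the device is exactly 10.0."""
    return (major, minor) == (10, 0)


# Illustrative sample of compute capabilities, chosen for this sketch.
SAMPLES = [(9, 0), (10, 0), (10, 3), (11, 0), (12, 0), (12, 1)]

for cc in SAMPLES:
    print(cc, "old:", old_condition_runs(*cc), "new:", new_condition_runs(*cc))

# Capabilities that ran under the old logic but are skipped under the new one.
dropped = [
    cc for cc in SAMPLES
    if old_condition_runs(*cc) and not new_condition_runs(*cc)
]
print("dropped:", dropped)
```

With this sample, the capabilities that lose coverage are 10.3 and the hypothetical 11.0.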

Recommendations

  1. Fix is_blackwell(): Remove 12.0 and 12.1, or rename the function to clarify it's broader than cutlass support
  2. Standardize skip conditions: Decide whether to support only 10.0 or both 10.0 and 10.3, then apply consistently
  3. Update test_scaled_mm_nv_quantized: Change to use microarchitecture_is(10, 0) or the corrected is_blackwell() to prevent runtime failures on unsupported architectures
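One way to standardize the skip conditions is to keep the supported set in a single place and derive the skip reason from it. This is a sketch under stated assumptions, not the PR's implementation: CUTLASS_SUPPORTED and cutlass_skip_reason are hypothetical names, and including 10.3 follows the scheduler check quoted above rather than a decision made in this PR.

```python
from typing import Optional, Tuple

# Assumption: the supported set follows the scheduler check quoted in this
# review (compute capability 10.0 or 10.3). Whether the tests should cover
# 10.3 is the open question raised in the recommendations.
CUTLASS_SUPPORTED = frozenset({(10, 0), (10, 3)})


def cutlass_skip_reason(capability: Tuple[int, int]) -> Optional[str]:
    """Return None if tests should run, otherwise a human-readable reason."""
    if capability in CUTLASS_SUPPORTED:
        return None
    major, minor = capability
    return f"cutlass scheduler does not support compute capability {major}.{minor}"


# Illustrative use with capabilities passed in explicitly.
for cc in [(10, 0), (10, 3), (12, 0)]:
    print(cc, cutlass_skip_reason(cc))
```

In a test module, the result could feed a single decorator, for example `@pytest.mark.skipif(reason is not None, reason=reason or "")`, so that every test shares the same condition and message.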

Confidence Score: 2/5

  • This PR has critical logic errors that will cause test failures on compute capability 12.0/12.1 devices
  • Score of 2 reflects three critical issues: (1) is_blackwell() incorrectly includes unsupported compute capabilities 12.0 and 12.1, which will cause runtime failures when tests attempt to use cutlass operations on these devices; (2) inconsistent skip conditions across tests create unclear behavior and potential failures; (3) the new logic is more restrictive than the original, potentially excluding supported architectures like 10.3. While the module-level to per-test skip migration is correct, the implementation has significant logical flaws that need correction before merge.
  • tests/python/direct_utils/utils.py (fix is_blackwell definition) and tests/python/direct/test_narrow_precision.py (standardize skip conditions)

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| tests/python/direct_utils/utils.py | 3/5 | Added is_blackwell() helper but incorrectly includes compute 12.0 and 12.1 which are not supported by cutlass scheduler |
| tests/python/direct/test_cutlass_nvfp4_gemm.py | 5/5 | Correctly replaced module-level skip with per-test skipif decorators using microarchitecture_is(10, 0) |
| tests/python/direct/test_narrow_precision.py | 2/5 | Inconsistent skip conditions: test_scaled_mm_nv_quantized uses is_blackwell() (includes unsupported 12.0/12.1) while other tests correctly use microarchitecture_is(10, 0) |
| tests/python/direct/test_with_id_model_indexer.py | 5/5 | Correctly consolidated dual skip conditions into single microarchitecture_is(10, 0) check |

Sequence Diagram

sequenceDiagram
    participant Test as Test Runner
    participant Skip as Skip Condition Check
    participant Device as GPU Device
    participant Cutlass as Cutlass Scheduler
    participant Nvfuser as Nvfuser Operations
    
    Test->>Skip: Evaluate @pytest.mark.skipif
    Skip->>Device: Query compute capability
    Device-->>Skip: Returns (major, minor)
    
    alt Old Logic (is_pre_blackwell + microarchitecture_is_pre)
        Skip->>Skip: Check major >= 10 AND major < 12
        Note over Skip: Runs on 10.0, 10.3, 11.x
    else New Logic (microarchitecture_is)
        Skip->>Skip: Check major == 10 AND minor == 0
        Note over Skip: Only runs on 10.0
    else New Logic (is_blackwell - problematic)
        Skip->>Skip: Check 10.0, 10.3, 12.0, 12.1
        Note over Skip: Includes unsupported 12.0/12.1
    end
    
    alt Test NOT Skipped
        Test->>Nvfuser: Execute fusion (scaled_mm, cutlass ops)
        Nvfuser->>Cutlass: Request scheduling
        Cutlass->>Device: Verify compute capability
        alt Device is 10.0 or 10.3
            Cutlass-->>Nvfuser: Schedule operations
            Nvfuser-->>Test: Success
        else Device is 12.0 or 12.1
            Cutlass-->>Nvfuser: Error: Unsupported architecture
            Nvfuser-->>Test: FAIL
        end
    else Test Skipped
        Skip-->>Test: Skip test
    end

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

- @pytest.mark.skipif(
-     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
- )
+ @pytest.mark.skipif(not is_blackwell(), reason="Only supported on blackwell.")

logic: Inconsistent architecture filtering: this test uses is_blackwell() (supports 10.0, 10.3, 12.0, 12.1) while other tests in this same file use microarchitecture_is(10, 0) (only supports 10.0). This creates different test coverage across Blackwell variants.

@github-actions
github-actions bot commented Jan 5, 2026

Review updated until commit 835c7a8

Description

  • Replace compute capability checks with microarchitecture checks for test skipping

  • Add is_blackwell() utility function to identify Blackwell architectures (10.0, 10.3, 12.0, 12.1)

  • Update test decorators to skip unsupported architectures with clearer messaging

  • Remove deprecated is_pre_blackwell() and microarchitecture_is_pre() usage

Changes walkthrough

Relevant files

Tests

test_cutlass_nvfp4_gemm.py: Update NVFP4 GEMM test skipping logic (+13/-8)
tests/python/direct/test_cutlass_nvfp4_gemm.py

  • Remove global compute_cap check at module level
  • Add @pytest.mark.skipif decorators to test functions using microarchitecture_is(10, 0)
  • Import microarchitecture_is utility function
  • Update skip reason to clarify Blackwell compute 12.0 is not supported

test_narrow_precision.py: Update narrow precision test skipping (+9/-15)
tests/python/direct/test_narrow_precision.py

  • Remove is_pre_blackwell() and microarchitecture_is_pre(12) skipif decorators
  • Add is_blackwell() and microarchitecture_is imports
  • Replace with microarchitecture_is(10, 0) checks for most tests
  • Use is_blackwell() for one test requiring broader Blackwell support

test_with_id_model_indexer.py: Update model indexer test skipping (+3/-6)
tests/python/direct/test_with_id_model_indexer.py

  • Remove is_pre_blackwell() and microarchitecture_is_pre(12) skipif decorators
  • Add microarchitecture_is import
  • Add @pytest.mark.skipif decorator using microarchitecture_is(10, 0)

Enhancement

utils.py: Add Blackwell architecture detection utility (+13/-0)
tests/python/direct_utils/utils.py

  • Add is_blackwell() function to identify Blackwell architectures
  • Support microarchitectures 10.0, 10.3, 12.0, and 12.1
  • Provide centralized utility for Blackwell architecture detection

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Architecture Coverage Inconsistency

    The test decorators use microarchitecture_is(10, 0) which only allows compute capability 10.0, but the new is_blackwell() function in utils.py includes 10.3, 12.0, and 12.1 as Blackwell architectures. This suggests the tests might be too restrictive and not testing all intended Blackwell architectures.

    @pytest.mark.skipif(
        not microarchitecture_is(10, 0),
        reason="Does not support blackwell compute 12.0, other arches are not tested.",
    )
    Inconsistent Architecture Checking

    There's inconsistent use of architecture checking functions - some tests use microarchitecture_is(10, 0) while others use is_blackwell(). This inconsistency should be reviewed to ensure all intended architectures are properly tested.

    @pytest.mark.skipif(not is_blackwell(), reason="Only supported on blackwell.")
    New Function Implementation

    The new is_blackwell() function should be validated to ensure it correctly identifies all intended Blackwell architectures (10.0, 10.3, 12.0, 12.1) and that this matches the intended test coverage.

    def is_blackwell():
        return (
            microarchitecture_is(10, 0)
            or microarchitecture_is(10, 3)
            or microarchitecture_is(12, 0)
            or microarchitecture_is(12, 1)
        )

    Test failures

    • (Low, 1) Minor numerical mismatch in thunder instance_norm nvFuser CUDA test on A100

      Test Name A100 Source
      thunder.tests.test_ops.test_core_vs_torch_consistency_instance_norm_nvfuser_cuda_thunder.dtypes.float32

    @naoyam
    Collaborator

    naoyam commented Jan 5, 2026

    The title suggests this has something to do with indexing. Can you clarify what it is?

    @liqiangxl liqiangxl changed the title from "fix test with id model indexer" to "skip unsupported arches in test test_with_id_model_indexer.py" Jan 6, 2026
    @liqiangxl
    Collaborator Author

    > The title suggests this has something to do with indexing. Can you clarify what it is?

    The title was not accurate; the failure is not related to indexing. Revised to "skip unsupported arches in test test_with_id_model_indexer.py".

    @liqiangxl liqiangxl marked this pull request as draft January 6, 2026 14:30
    @liqiangxl
    Collaborator Author

    @greptileai review this draft

    @greptile-apps greptile-apps bot left a comment

    4 files reviewed, 4 comments

    Comment on lines 239 to 242
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    style: Inconsistent architecture filtering within same file: this test uses microarchitecture_is(10, 0) (only supports 10.0), but other tests in this file (lines 66, 173, 320) use is_blackwell() which supports all Blackwell variants (10.0, 10.3, 12.0, 12.1). Are the _scaled_mm tests specifically incompatible with 12.x architectures while quantization tests support them? Do scaled_mm operations have specific incompatibilities with 12.x compute capabilities that quantization operations don't have?

    Comment on lines 448 to 451
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    style: Same inconsistency: microarchitecture_is(10, 0) here vs is_blackwell() at line 320 for similar grouped matmul tests

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Comment on lines 608 to 611
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    style: Same pattern: microarchitecture_is(10, 0) here vs is_blackwell() elsewhere in file

    Comment on lines 651 to 654
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    style: Same pattern: microarchitecture_is(10, 0) here vs is_blackwell() elsewhere in file

    @liqiangxl liqiangxl changed the title from "skip unsupported arches in test test_with_id_model_indexer.py" to "skip unsupported arches in python tests" Jan 6, 2026
    @liqiangxl
    Collaborator Author

    !test

    @liqiangxl
    Collaborator Author

    liqiangxl commented Jan 6, 2026

    Summary of failure reasons:
    The first four tests fail while computing the reference output using PyTorch. The last failure is due to nvFuser.

    1. Test: CUDA_LAUNCH_BLOCKING=1 pytest tests/python/direct/test_cutlass_gemm.py -k "bfloat16-tokens_per_expert_neg_one and 115 and 1024" -vvvs
      Err: Exception raised from run_group_mm at /opt/pytorch/nvfuser/cutlass/group_mm.cu:372
      Reason: nvFuser grouped matmul related
      Fix: already skipped in current main branch

    2. Test: CUDA_LAUNCH_BLOCKING=1 pytest tests/python/direct/test_cutlass_nvfp4_gemm.py -k "gemm[shape and 256 and 64 and bfloat16" -vvvs
      Err: Exception raised from runGemm at /opt/pytorch/nvfuser/cutlass/nvfp4_scaled_mm.cu:255
      Reason: nvFuser scaled matmul related

    3. Test: CUDA_LAUNCH_BLOCKING=1 pytest tests/python/direct/test_cutlass_nvfp4_gemm.py -k "grouped_mm and bfloat"
      Err: Exception raised from run_nvfp4_scaled_group_mm at /opt/pytorch/nvfuser/cutlass/nvfp4_scaled_group_mm.cu:518
      Reason: nvFuser scaled-grouped matmul related

    4. Test: CUDA_LAUNCH_BLOCKING=1 pytest tests/python/direct/test_narrow_precision.py -k "scaled_mm" -vvvs
      Reason: nvFuser scaled matmul related

    5. Test: NVFUSER_DISABLE=parallel_compile pytest tests/python/direct/test_with_id_model_indexer.py -k "eager" -vvvs
      Err: INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp:559
      Reason: seems nvFuser related.

    @liqiangxl liqiangxl marked this pull request as ready for review January 13, 2026 13:51
    @liqiangxl
    Collaborator Author

    !test

    @greptile-apps greptile-apps bot left a comment

    6 files reviewed, 6 comments

    Comment on lines +42 to +48
    def is_blackwell():
        return (
            microarchitecture_is(10, 0)
            or microarchitecture_is(10, 3)
            or microarchitecture_is(12, 0)
            or microarchitecture_is(12, 1)
        )

    The is_blackwell() function incorrectly includes compute capability 12.0 and 12.1 as "Blackwell" architectures. However, the cutlass scheduler (csrc/scheduler/cutlass.cpp lines 75-84) explicitly only supports compute capabilities 10.0 and 10.3:

    if (device_prop->major != 10 ||
        !(device_prop->minor == 0 || device_prop->minor == 3)) {
      // error: "Cutlass scheduler only supports GB200 and GB300 (cc 10.0 or 10.3)"
    }

    Compute capabilities 12.0 (RTX PRO 6000 and RTX 50XX) and 12.1 (DGX Spark) are NOT supported by the cutlass scheduler, whether or not they are counted as Blackwell variants.

    This function should only return true for 10.0 and 10.3:

    Suggested change
    - def is_blackwell():
    -     return (
    -         microarchitecture_is(10, 0)
    -         or microarchitecture_is(10, 3)
    -         or microarchitecture_is(12, 0)
    -         or microarchitecture_is(12, 1)
    -     )
    + def is_blackwell():
    +     return (
    +         microarchitecture_is(10, 0)
    +         or microarchitecture_is(10, 3)
    +     )

    Alternatively, if 12.0 and 12.1 are legitimately Blackwell variants, the function should be renamed to clarify it's broader than what cutlass supports, and tests should use more specific checks.

    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
    + @pytest.mark.skipif(not is_blackwell(), reason="Only supported on blackwell.")

    Inconsistent architecture check: this test uses is_blackwell() which includes compute capabilities 10.0, 10.3, 12.0, and 12.1. However:

    1. Other similar tests in this file that use fd.ops.scaled_mm (like test_scaled_mm at line 245) use microarchitecture_is(10, 0) - only 10.0
    2. The cutlass scheduler only supports 10.0 and 10.3 (not 12.0 or 12.1)

    This test uses fd.ops.scaled_mm and fd.ops.nv_block_quantize, which may rely on cutlass scheduling. If so, it should use the same skip condition as other tests:

    Suggested change
    - @pytest.mark.skipif(not is_blackwell(), reason="Only supported on blackwell.")
    + @pytest.mark.skipif(not microarchitecture_is(10, 0), reason="Only supported on blackwell.")

    If this test is genuinely intended to support all Blackwell variants (10.0, 10.3, 12.0, 12.1) because scaled_mm uses a different scheduler (nvmmh), then the is_blackwell() function needs to be fixed to exclude 12.0 and 12.1, OR this needs documentation explaining why it's different.

    Comment on lines 245 to 248
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    The skip condition logic has changed significantly. The old conditions were:

    • is_pre_blackwell() → skip if major < 10
    • not microarchitecture_is_pre(12) → skip if major >= 12

    This ran tests on architectures where: 10 <= major < 12 (i.e., 10.0, 10.3, and any 11.x devices)

    The new condition not microarchitecture_is(10, 0) runs tests ONLY on exactly 10.0.

    If the intent is to support 10.3 (B300/GB300) in addition to 10.0, the condition should be:

    Suggested change
    - @pytest.mark.skipif(
    -     not microarchitecture_is(10, 0),
    -     reason="Does not support blackwell compute 12.0, other arches are not tested.",
    - )
    + @pytest.mark.skipif(
    +     not (microarchitecture_is(10, 0) or microarchitecture_is(10, 3)),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
    + )

    However, if 10.3 is deliberately excluded because it hasn't been tested, the current change is correct but note that this is more restrictive than the original logic.

    Comment on lines 454 to 457
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    Same issue as test_scaled_mm: the skip condition logic has changed from supporting 10.x and 11.x architectures to only supporting exactly 10.0.

    Old: skip if is_pre_blackwell() or not microarchitecture_is_pre(12) → run on 10 <= major < 12
    New: skip if not microarchitecture_is(10, 0) → run only on major == 10 and minor == 0

    This test uses fd.ops.cutlass_nvfp4_grouped_mm which is explicitly a cutlass operation. The cutlass scheduler supports both 10.0 and 10.3 according to csrc/scheduler/cutlass.cpp. Consider updating to support both tested architectures if 10.3 should be included.

    Comment on lines 659 to 662
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    Same issue: the skip condition has become more restrictive, changing from supporting 10.x and 11.x to only 10.0. This test also uses fd.ops.cutlass_nvfp4_grouped_mm which should support both 10.0 and 10.3 based on the cutlass scheduler implementation.

    Comment on lines 32 to 35
    - @pytest.mark.skipif(
    -     is_pre_blackwell(), reason="Only supported on blackwell and newer devices."
    - )
      @pytest.mark.skipif(
    -     not microarchitecture_is_pre(12), reason="Does not support blackwell compute 12.0"
    +     not microarchitecture_is(10, 0),
    +     reason="Does not support blackwell compute 12.0, other arches are not tested.",
      )

    The skip condition has changed from supporting 10.x and 11.x architectures (via is_pre_blackwell() AND not microarchitecture_is_pre(12)) to only supporting exactly 10.0. This is consistent with the changes in test_narrow_precision.py but represents a more restrictive test coverage than before. Verify this is intentional.
