@mdavis36
Collaborator

@mdavis36 mdavis36 commented Oct 1, 2025

Summary

The current ./manual-ci.sh test suite fails on Compute Capability (CC) 12 Blackwell cards (such as the RTX 5090). The failures stem from architectural differences between CC 10 and CC 12 cards, primarily the amount of shared memory available per SM.

1. Architecture-specific Test Skips and Guards:

  • Blackwell-specific tests (using SM_100/104 instructions) were causing failures on SM_110+ (including 5090 GPUs). The PR skips these tests for SM_110+ architectures.
  • Additional guards were added to skip or adjust tests that would fail on architectures with either lower shared memory or register availability (e.g., compute capability 12).
  • Some tests now calculate or assign shared memory requirements and will skip if the device can't meet them.
  • Input sizes were reduced for certain tests to ensure they fit within the resource limits of low-capability GPUs.
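A minimal sketch of the shared-memory guard idea described above, not the code from this PR. The helper names (`tile_smem_bytes`, `smem_skip_reason`), the tile-size formula, and the byte limits used in the usage note are all assumptions chosen for illustration; the real tests compute their requirements in C++ and query the device directly.

```python
from typing import Optional


def tile_smem_bytes(tile_m: int, tile_n: int, tile_k: int,
                    stages: int, elem_bytes: int) -> int:
    """Estimate shared memory for a circular-buffered matmul.

    Assumes each stage holds one A tile (m x k) and one B tile (k x n).
    This formula is an illustrative assumption, not taken from the PR.
    """
    return stages * (tile_m * tile_k + tile_k * tile_n) * elem_bytes


def smem_skip_reason(required_bytes: int, available_bytes: int) -> Optional[str]:
    """Return a skip message if the device cannot supply the requested
    shared memory per block, otherwise None (meaning: run the test)."""
    if required_bytes > available_bytes:
        return (f"test needs {required_bytes} B of shared memory per block, "
                f"device offers {available_bytes} B")
    return None
```

With hypothetical numbers, a 128x128x64 tile with 4 stages of 2-byte elements needs 131072 B, so `smem_skip_reason(131072, 101376)` yields a skip message while `smem_skip_reason(131072, 232448)` returns `None`. Returning a reason string instead of skipping directly keeps the check independent of any particular test framework.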

2. Resource Constraint Handling:

  • Tests requiring more than 32GB of memory are skipped (when appropriate), preventing failures on systems without sufficient memory.
  • Tests incompatible with specific compute capabilities (e.g., some Python tests with compute 12) are conditionally skipped.
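The two kinds of skip listed above can be sketched as follows. This is an assumption-laden illustration rather than the utility added to utils.py: `requires_device_memory` and `nvfp4_supported` are hypothetical names, the memory query is passed in as a callable so the sketch runs without a GPU, and the standard library's `unittest.SkipTest` stands in for whatever skip mechanism the test suite actually uses. The capability window mirrors the condition quoted in the review (`compute_cap < (10, 0) or compute_cap >= (12, 0)`).

```python
import functools
import unittest
from typing import Callable, Tuple

GIB = 1024 ** 3


def requires_device_memory(min_gib: float, query_total_bytes: Callable[[], int]):
    """Decorator: skip the wrapped test when the device reports less than
    `min_gib` GiB of total memory. `query_total_bytes` is a hypothetical
    hook; in a real suite it would ask the CUDA runtime."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            total = query_total_bytes()
            if total < min_gib * GIB:
                raise unittest.SkipTest(
                    f"needs >= {min_gib} GiB device memory, "
                    f"found {total / GIB:.1f} GiB")
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


def nvfp4_supported(compute_cap: Tuple[int, int]) -> bool:
    """True only inside the half-open window [(10, 0), (12, 0)).
    Tuples compare lexicographically, so (10, 3) is in and (12, 0) is out."""
    return (10, 0) <= compute_cap < (12, 0)
```

A test decorated with `@requires_device_memory(32, query)` runs normally when `query()` reports at least 32 GiB and raises `unittest.SkipTest` otherwise. Comparing `(major, minor)` tuples avoids the pitfalls of comparing a float such as 12.0 against versions like 10.10.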

3. Test Maintenance and Cleanup:

  • Some missing or obsolete tests were removed to clean up the test suite.

4. Fusion Cache Management Fix:

  • The FusionCache_CUDA test now properly resets the fusion cache when the number of cached fusions exceeds the max_fusions limit, ensuring the cache respects new constraints.
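The reset rule can be illustrated with a small Python analogue. This is a sketch under assumptions, not the C++ change in fusion_cache.cpp: `FusionCacheSketch`, its `get` method, and the choice to adopt the new limit when the existing cache already fits are all hypothetical.

```python
from typing import List, Optional


class FusionCacheSketch:
    """Hypothetical stand-in for a singleton fusion cache."""

    _singleton: Optional["FusionCacheSketch"] = None

    def __init__(self, max_fusions: int) -> None:
        self.max_fusions = max_fusions
        self.fusions: List[str] = []

    @classmethod
    def get(cls, max_fusions: int) -> "FusionCacheSketch":
        current = cls._singleton
        if current is None or len(current.fusions) > max_fusions:
            # No cache yet, or the existing one holds more fusions than the
            # new limit allows: discard it and start from an empty cache.
            cls._singleton = cls(max_fusions)
        else:
            # Existing contents fit; adopt the new limit (a sketch choice).
            current.max_fusions = max_fusions
        return cls._singleton
```

For example, after `a = FusionCacheSketch.get(8)` and adding three fusions to `a.fusions`, a call to `FusionCacheSketch.get(2)` returns a new, empty object rather than `a`. Any caller that kept a handle to `a` still sees the old contents, which is why replacing a singleton in place deserves care.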

@github-actions

github-actions bot commented Oct 1, 2025

Review updated until commit 80fbfdd

Description

  • Skip tests incompatible with compute capability 12+ GPUs

  • Add shared memory and global memory requirement guards

  • Reduce input sizes for tests on low-resource architectures

  • Fix fusion cache reset when exceeding max_fusions limit


Changes walkthrough 📝

Relevant files

Bug fix (1 file)
  • fusion_cache.cpp: Reset fusion cache on max_fusions overflow (+21/-0)

Tests (8 files)
  • test_circular_buffering.cpp: Add SMEM guards for matmul tests (+7/-1)
  • test_combined_inner_outer_reduction.cpp: Adjust warp reduction test for CC 12 (+8/-2)
  • test_memory.cpp: Enforce shared memory requirements in TMA tests (+10/-0)
  • test_cutlass_gemm.py: Skip test for compute 12.0 (+4/-0)
  • test_cutlass_nvfp4_gemm.py: Restrict NVFP4 test to compute < 12.0 (+3/-2)
  • test_narrow_precision.py: Skip NVFP4 tests on compute 12.0 (+7/-0)
  • test_repro.py: Skip large tests if memory < 32GB (+8/-0)
  • utils.h: Refine Blackwell GPU architecture guard (+3/-2)

Enhancement (1 file)
  • utils.py: Add memory-based test skip utility (+13/-0)

Configuration changes (1 file)
  • manual_ci.sh: Update test suite execution command (+1/-6)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Cache Reset Safety

The fusion cache is being deleted and recreated when the deserialized size exceeds max_fusions. This pattern may lead to undefined behavior if other components hold references to the old singleton instance.

delete singleton_;
singleton_ = new FusionCache(max_fusions, selected_device);

Inconsistent Compute Capability Check

The test skips devices with compute capability >= (12, 0), but the PR description mentions fixing issues for compute capability 12 devices. This exclusion may be contradictory to the PR's purpose of supporting CC 12.

if compute_cap < (10, 0) or compute_cap >= (12, 0):
    pytest.skip(
        reason="Nvfp4 Requires compute capability 10.",

Conditional Test Logic

The test now conditionally checks computation_warp_groups only on CC 12+ devices, but skips the check entirely on older architectures. This may hide potential issues on non-CC12 devices.

if (cudaArchGuardShouldSkip(12, 0, 13, 0)) {
  EXPECT_TRUE(rparams->computation_warp_groups > 1);
}

@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch 2 times, most recently from 0ac9f79 to c968853 Compare October 6, 2025 22:12
@mdavis36 mdavis36 closed this Oct 6, 2025
@mdavis36 mdavis36 reopened this Oct 6, 2025
@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from c968853 to 91d2f22 Compare October 6, 2025 22:22
@mdavis36
Collaborator Author

mdavis36 commented Oct 6, 2025

!test

@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 91d2f22 to 587812b Compare October 7, 2025 20:32
@mdavis36
Collaborator Author

mdavis36 commented Oct 7, 2025

!test

@mdavis36 mdavis36 marked this pull request as ready for review October 7, 2025 20:44
@mdavis36 mdavis36 changed the title Skipping blackwell tests for sm_110+ architectures. Fixing tests for compute capability 12 devices. Oct 7, 2025
@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 587812b to 9b5e85c Compare October 8, 2025 00:57
@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from ab61f40 to 5e19036 Compare October 8, 2025 19:03
Collaborator

@liqiangxl liqiangxl left a comment


The fix looks good to me. Thanks!

@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 5e19036 to 4a90bd2 Compare October 8, 2025 20:18
@mdavis36
Collaborator Author

mdavis36 commented Oct 8, 2025

!test

Blackwell-specific tests use instruction sets for sm_100/104; these
tests fail on sm_110+ (like 5090 GPUs).

Calculate or assign shared memory requirements for tests that will fail
to execute properly due to shared memory constraints on certain
architectures.
…capabilities.

These tests will fail on compute capabilities with less shared memory or
fewer registers available per SM, e.g. 12.

Due to the lower shared memory available on SM/CC 12 cards, this test
fails to schedule static warp reductions properly; reducing the input
size allows us to generate the expected pattern.

The FusionCache_CUDA test attempts to reset and get a fresh FusionCache.
If the current cache contains more fusions than the new max_fusions
limit allows, it will fail to enforce the new max_fusions constraint.
@mdavis36 mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 4a90bd2 to 80fbfdd Compare October 8, 2025 21:21
@mdavis36 mdavis36 closed this Oct 8, 2025
@mdavis36 mdavis36 reopened this Oct 8, 2025
@mdavis36
Collaborator Author

mdavis36 commented Oct 8, 2025

!test

@mdavis36 mdavis36 merged commit fb338b7 into main Oct 9, 2025
68 of 73 checks passed
@mdavis36 mdavis36 deleted the bugfix/blackwell-5090-tests branch October 9, 2025 03:10

6 participants