Fixing tests for compute capability 12 devices. #5284

mdavis36 · 2025-10-01T17:06:38Z

Summary

The current ./manual-ci.sh test suite fails on Compute Capability (CC) 12 Blackwell cards (like the RTX 5090). These tests fail due to architectural differences between CC 10 and CC 12 cards, primarily the difference in shared memory available per SM.

1. Architecture-specific Test Skips and Guards:

Blackwell-specific tests (using SM_100/104 instructions) were causing failures on SM_110+ (including 5090 GPUs). The PR skips these tests for SM_110+ architectures.
Additional guards were added to skip or adjust tests that would fail on architectures with less shared memory or fewer registers available per SM (e.g., compute capability 12).
Some tests now calculate or assign shared memory requirements and will skip if the device can't meet them.
Input sizes were reduced for certain tests to ensure they fit within the resource limits of low-capability GPUs.
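The shared memory guard described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual C++ code: `DeviceLimits`, `required_smem_bytes`, and `should_skip` are hypothetical names, and the two device limits are illustrative placeholders rather than values taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceLimits:
    """Hypothetical container for the limit a test guard would query."""

    name: str
    max_shared_memory_per_block: int  # bytes


def required_smem_bytes(tile_m: int, tile_k: int, stages: int, dtype_size: int) -> int:
    """Bytes needed to keep `stages` tiles of shape (tile_m, tile_k) in shared memory."""
    return tile_m * tile_k * stages * dtype_size


def should_skip(required_bytes: int, device: DeviceLimits) -> bool:
    """Skip the test when the device cannot provide the requested shared memory."""
    return required_bytes > device.max_shared_memory_per_block


# Illustrative limits only (assumed for this sketch).
large_smem_device = DeviceLimits("large-smem device", 227 * 1024)
small_smem_device = DeviceLimits("small-smem device", 99 * 1024)

# A 256x64 tile of 2-byte elements with 4 pipeline stages.
needed = required_smem_bytes(tile_m=256, tile_k=64, stages=4, dtype_size=2)

for dev in (large_smem_device, small_smem_device):
    action = "skip" if should_skip(needed, dev) else "run"
    print(f"{dev.name}: need {needed} bytes -> {action}")
```

Computing the requirement from the test's own tile parameters, instead of hard-coding an architecture list, keeps the guard valid for any future device with a small shared memory budget.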

2. Resource Constraint Handling:

Tests requiring more than 32GB of memory are skipped (when appropriate), preventing failures on systems without sufficient memory.
Tests incompatible with specific compute capabilities (e.g., some Python tests with compute 12) are conditionally skipped.
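A memory-based skip utility of the kind described here might look like the sketch below. The helper name `memory_skip_reason` is hypothetical and the PR's real utility in utils.py may differ; the sketch takes the device's total memory as a plain argument so the decision logic can be checked without a GPU.

```python
from typing import Optional

GIB = 1024 ** 3


def memory_skip_reason(total_bytes: int, required_gib: float) -> Optional[str]:
    """Return a skip reason if the device has too little memory, else None.

    `total_bytes` is assumed to come from the caller, for example from
    torch.cuda.get_device_properties(0).total_memory.
    """
    if total_bytes >= required_gib * GIB:
        return None
    available_gib = total_bytes / GIB
    return (
        f"Test requires {required_gib} GiB of device memory, "
        f"but only {available_gib:.1f} GiB is available."
    )


# Example: a 24 GiB device is skipped, an 80 GiB device is not.
print(memory_skip_reason(24 * GIB, 32))
print(memory_skip_reason(80 * GIB, 32))
```

Inside a pytest test, a non-None result would typically be passed to `pytest.skip(reason=...)` before any large allocation is attempted.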

3. Test Maintenance and Cleanup:

References to missing or obsolete tests were removed to clean up the test suite.

4. Fusion Cache Management Fix:

The FusionCache_CUDA test now properly resets the fusion cache when the number of cached fusions exceeds the max_fusions limit, ensuring the cache respects new constraints.
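The reset behavior can be illustrated with a toy model. This is a sketch under assumptions, not nvFuser's implementation: `ToyFusionCache` and its methods are hypothetical, and a plain list of strings stands in for the deserialized fusions.

```python
from typing import List, Optional


class ToyFusionCache:
    """Toy singleton cache that enforces a max_fusions limit on reset."""

    _singleton: Optional["ToyFusionCache"] = None

    def __init__(self, max_fusions: int) -> None:
        self.max_fusions = max_fusions
        self.fusions: List[str] = []

    @classmethod
    def get(cls, max_fusions: int) -> "ToyFusionCache":
        """Return the current singleton, creating it on first use."""
        if cls._singleton is None:
            cls._singleton = cls(max_fusions)
        return cls._singleton

    @classmethod
    def reset(cls, max_fusions: int, restored: List[str]) -> "ToyFusionCache":
        """Replace the singleton, loading `restored` entries if they fit."""
        cache = cls(max_fusions)
        cache.fusions = list(restored)
        if len(cache.fusions) > max_fusions:
            # The restored state violates the new limit, so discard it
            # and start from an empty cache instead.
            cache = cls(max_fusions)
        cls._singleton = cache
        return cache


too_many = ToyFusionCache.reset(max_fusions=2, restored=["f0", "f1", "f2"])
print(len(too_many.fusions))  # restored entries were dropped

fits = ToyFusionCache.reset(max_fusions=4, restored=["f0", "f1", "f2"])
print(len(fits.fusions))  # restored entries were kept
```

Any caller that kept a reference to the previous singleton would still see the old object after `reset`, which is why replacing a singleton needs care.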

github-actions · 2025-10-01T17:07:30Z

Review updated until commit 80fbfdd

Description

Skip tests incompatible with compute capability 12+ GPUs
Add shared memory and global memory requirement guards
Reduce input sizes for tests on low-resource architectures
Fix fusion cache reset when exceeding max_fusions limit

Changes walkthrough 📝

Relevant files

Bug fix

1 files

fusion_cache.cpp `Reset fusion cache on max_fusions overflow`	+21/-0

Tests

8 files

test_circular_buffering.cpp `Add SMEM guards for matmul tests`	+7/-1
test_combined_inner_outer_reduction.cpp `Adjust warp reduction test for CC 12`	+8/-2
test_memory.cpp `Enforce shared memory requirements in TMA tests`	+10/-0
test_cutlass_gemm.py `Skip test for compute 12.0`	+4/-0
test_cutlass_nvfp4_gemm.py `Restrict NVFP4 test to compute < 12.0`	+3/-2
test_narrow_precision.py `Skip NVFP4 tests on compute 12.0`	+7/-0
test_repro.py `Skip large tests if memory < 32GB`	+8/-0
utils.h `Refine Blackwell GPU architecture guard`	+3/-2

Enhancement

1 files

utils.py `Add memory-based test skip utility`	+13/-0

Configuration changes

1 files

manual_ci.sh `Update test suite execution command`	+1/-6

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Cache Reset Safety

The fusion cache is being deleted and recreated when the deserialized size exceeds max_fusions. This pattern may lead to undefined behavior if other components hold references to the old singleton instance.

    delete singleton_;
    singleton_ = new FusionCache(max_fusions, selected_device);

Inconsistent Compute Capability Check

The test skips devices with compute capability >= (12, 0), but the PR description mentions fixing issues for compute capability 12 devices. This exclusion may be contradictory to the PR's purpose of supporting CC 12.

    if compute_cap < (10, 0) or compute_cap >= (12, 0):
        pytest.skip(
            reason="Nvfp4 Requires compute capability 10.",

Conditional Test Logic

The test now conditionally checks computation_warp_groups only on CC 12+ devices, but skips the check entirely on older architectures. This may hide potential issues on non-CC12 devices.

    if (cudaArchGuardShouldSkip(12, 0, 13, 0)) {
      EXPECT_TRUE(rparams->computation_warp_groups > 1);
    }
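To make the compute capability condition quoted in the reviewer guide concrete, the sketch below evaluates the same tuple comparison for a few (major, minor) pairs. The function name `nvfp4_should_skip` is hypothetical; only the boolean expression comes from the quoted snippet.

```python
from typing import Tuple


def nvfp4_should_skip(compute_cap: Tuple[int, int]) -> bool:
    """Mirror the quoted condition: skip below 10.0 and at or above 12.0."""
    # Python compares tuples lexicographically, so (10, 3) < (12, 0).
    return compute_cap < (10, 0) or compute_cap >= (12, 0)


for cap in [(9, 0), (10, 0), (10, 3), (12, 0)]:
    action = "skip" if nvfp4_should_skip(cap) else "run"
    print(f"compute capability {cap[0]}.{cap[1]}: {action}")
```

With this condition the test runs only when the major version is 10 or 11, so a compute capability 12.0 device is skipped, not exercised.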

mdavis36 · 2025-10-06T22:24:57Z

!test

mdavis36 · 2025-10-07T20:34:20Z

!test

tests/cpp/test_combined_inner_outer_reduction.cpp

python/python_frontend/fusion_cache.cpp

tests/cpp/test_memory.cpp

tests/python/direct/test_repro.py

tests/python/direct/conftest.py

tests/python/direct/test_cutlass_nvfp4_gemm.py

tests/python/direct/test_cutlass_gemm.py

tests/python/direct_utils/utils.py

liqiangxl

The fix looks good to me. Thanks!

mdavis36 · 2025-10-08T20:18:33Z

!test

The FusionCache_CUDA test attempts to reset and get a fresh FusionCache. If the current cache contains more fusions than the new max_fusions limit allows, the reset fails to enforce the new max_fusions constraint.

mdavis36 · 2025-10-08T21:22:29Z

!test

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch 2 times, most recently from 0ac9f79 to c968853 Compare October 6, 2025 22:12

mdavis36 closed this Oct 6, 2025

mdavis36 reopened this Oct 6, 2025

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from c968853 to 91d2f22 Compare October 6, 2025 22:22

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 91d2f22 to 587812b Compare October 7, 2025 20:32

mdavis36 marked this pull request as ready for review October 7, 2025 20:44

mdavis36 requested review from jjsjann123, liqiangxl, naoyam, rdspring1 and wujingyue October 7, 2025 20:46

mdavis36 changed the title ~~Skipping blackwell tests for sm_110+ architectures.~~ Fixing tests for compute capability 12 devices. Oct 7, 2025

naoyam reviewed Oct 7, 2025

tests/cpp/test_combined_inner_outer_reduction.cpp (outdated, resolved)

wujingyue approved these changes Oct 7, 2025

rdspring1 reviewed Oct 7, 2025

tests/python/direct/test_cutlass_nvfp4_gemm.py (resolved)

rdspring1 reviewed Oct 7, 2025

tests/python/direct/test_cutlass_gemm.py (outdated, resolved)

wujingyue reviewed Oct 7, 2025

tests/python/direct_utils/utils.py (outdated, resolved)

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 587812b to 9b5e85c Compare October 8, 2025 00:57

wujingyue approved these changes Oct 8, 2025

tests/python/direct_utils/utils.py (outdated, resolved)

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from ab61f40 to 5e19036 Compare October 8, 2025 19:03

liqiangxl approved these changes Oct 8, 2025

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 5e19036 to 4a90bd2 Compare October 8, 2025 20:18

mdavis36 added 3 commits October 8, 2025 14:21

Removing missing tests

a925d63

Skipping blackwell tests for sm_110+ architectures.

f2500ed

Blackwell specific tests use instruction sets for sm_100/104, these tests fail for sm_110+ (like 5090 GPUs).

Shared memory test guard

ab1913f

Calculate or assign shared memory requirements for tests that will fail to execute properly due to shared memory constraints on certain architectures.

mdavis36 added 5 commits October 8, 2025 14:21

Skip tests that require more than 32GB of memory.

b8e4ed6

Skip python tests where compute 12 is incompatible.

af6b510

Arch guards on warp specialization test that fail on certain compute …

cf5622f

…capabilities. These tests will fail on compute capabilities with lower shared memory / registers available per SM e.g. 12

Reduce input size to fit on compute capability 12 architectures.

a80cf77

Due to the lower shared memory available on SM/CC 12 cards this test fails to schedule static warp reductions properly, reducing the input size allows us to generate the expected pattern.

mdavis36 force-pushed the bugfix/blackwell-5090-tests branch from 4a90bd2 to 80fbfdd Compare October 8, 2025 21:21

mdavis36 closed this Oct 8, 2025

mdavis36 reopened this Oct 8, 2025

mdavis36 merged commit fb338b7 into main Oct 9, 2025
68 of 73 checks passed

mdavis36 deleted the bugfix/blackwell-5090-tests branch October 9, 2025 03:10


6 participants
