support swapAB for m_grouped_fp8_gemm_nt_masked #192
Conversation
block_ns.push_back(i);
if (get_env<int>("ENABLE_SWAPAB")) {
    block_ms = std::vector{32};   // 32, 64
    block_ns = std::vector{256};  // 64, 128, 256
Manually set one of them; experiments have found that 256 performs best in most cases.
Thanks! Merging it later.
Force-pushed from f2e2357 to 2991c77.
Hi~ Do you have a plan for merging this? We have already deployed this PR in our online service on H20.
LGTM
Why does the other case with num_groups=1, expected_m_per_group=1024, n=4096, k=7168 also see an improvement? With num_groups=1 this is effectively a single 1024×4096×7168 matrix multiplication, right? What is the advantage of SwapAB here?
Sorry, we will try to merge this by the end of Oct. As swapAB introduces non-batch-invariance and determinism issues, we will consider it more carefully and do some refactoring before merging. Also, since most of the code can be reused, we will refactor the epilogue part so that this feature requires fewer code changes. Thanks for your contribution! We will do the refactoring for you, no change request needed 👍🏻 cc @zheanxu
The swapAB variant "swaps" the WGMMA tile usage, mapping the original problem's M dimension onto WGMMA's N dimension (which must be a multiple of 8). This enables a smaller BLOCK_M (32). The performance advantage primarily comes from finer tiling granularity and better resource utilization.
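To make the comment above concrete, here is a minimal NumPy sketch (illustrative only, not DeepGEMM kernel code) of the identity that operand swapping relies on: computing Bᵀ·Aᵀ and transposing the result gives the same matrix as A·B, which is what lets a small M dimension be tiled as if it were the hardware N dimension.

```python
import numpy as np

# Illustrative sketch only -- the mathematical identity behind operand
# swapping, not DeepGEMM's actual FP8 kernel. Shapes are arbitrary examples.
m, k, n = 48, 7168, 4096              # a small M (M % 64 < 32 is the target case)
A = np.random.randn(m, k)
B = np.random.randn(k, n)

C_normal = A @ B                      # original problem: M maps to the WGMMA M tile
C_swapped = (B.T @ A.T).T             # swapped problem: M now plays the role of N

# Same result; in the swapped form the kernel can use a finer tile
# (a multiple of 8, e.g. BLOCK_M = 32) along the small M dimension.
np.testing.assert_allclose(C_normal, C_swapped, rtol=1e-6)
```

Whether this wins in practice depends on the resulting tile sizes and occupancy, which is what the H20 numbers in this PR measure.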
Thanks~ Looking forward to the release of the new version.
Hi @Wangzheee, I tested it on H100, and SwapAB seems to cause a performance degradation there.
Hi, sorry about not having a lot of context on DeepGEMM. Do you know what the difference is between this and NVIDIA/TensorRT-LLM#4430? That PR also has perf numbers attached. The H20 numbers look aligned between this PR and PR 4430, and for the same case H100 also sees a benefit there. Thanks all!
SwapAB: Significantly improve the performance for M%64<32
Description
How to use
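A minimal usage sketch, assuming the ENABLE_SWAPAB environment variable seen in the diff above is the switch for this code path; the deep_gemm entry point named in the comment is inferred from the PR title and may not match the actual Python API.

```python
import os

# Hedged sketch: turn on the SwapAB code path via the environment variable
# referenced in the diff (get_env<int>("ENABLE_SWAPAB")).
# Set it before the kernels are JIT-compiled.
os.environ["ENABLE_SWAPAB"] = "1"

import deep_gemm  # DeepGEMM's Python package

# Build the FP8 LHS/RHS tensors, the output buffer, masked_m and
# expected_m_per_group exactly as in the existing masked-grouped-GEMM tests,
# then call the masked grouped GEMM. The function name below is taken from
# the PR title and is an assumption about the current API:
# deep_gemm.m_grouped_fp8_gemm_nt_masked(lhs, rhs, out, masked_m, expected_m_per_group)
```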
Improvements (H20)
Aligned M, desired state: masked_m[j] = int(expected_m_per_group * random.uniform(1, 1))
Other case (original test): masked_m[j] = int(expected_m_per_group * random.uniform(0.7, 1.3))
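For clarity, a small sketch of how the two benchmark settings above generate masked_m; num_groups and expected_m_per_group are placeholder values, not the PR's benchmark configurations.

```python
import random

num_groups, expected_m_per_group = 8, 1024   # placeholder values

# "Aligned M, desired state": random.uniform(1, 1) always returns 1.0,
# so every group gets exactly expected_m_per_group valid rows.
masked_m_aligned = [int(expected_m_per_group * random.uniform(1, 1))
                    for _ in range(num_groups)]

# "Other case" (the original test): each group keeps between 70% and 130%
# of expected_m_per_group rows, so many groups fall into the
# M % 64 < 32 regime that SwapAB targets.
masked_m_other = [int(expected_m_per_group * random.uniform(0.7, 1.3))
                  for _ in range(num_groups)]

print(masked_m_aligned)   # [1024, 1024, ...]
print(masked_m_other)     # e.g. [897, 1203, ...]
```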
TODO