[Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #183

Sulfur6 · 2025-09-02T14:32:01Z

1. Motivation

The optimization effect of Two-Batch Overlap (TBO) is suboptimal for the Decode phase on low-compute-power cards (i.e., H20). This is due to two main factors: First, on the Hopper architecture, the WGMMA block_m is 64. Consequently, when TBO is enabled with a small Decode batch size, the MLP GEMM suffers from redundant computations. A positive throughput gain is only observed at larger batch sizes (e.g., 64, 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA guarantees for TPOT/ITL.

Therefore, it is necessary to find a solution that can improve Decode throughput even with small batch sizes. Single Batch Overlap (SBO) presents itself as a viable solution.

We implement SBO for DeepSeek v3/R1 by modifying DeepEP and DeepGEMM, including the overlap of Shared Expert and Dispatch Recv, as well as the overlap of Down GEMM with Combine Send.

The overlap of Down GEMM with Combine Send is implemented by modifying SGlang, DeepEP and DeepGEMM, with the detailed implementation available in the PRs below:

DeepEP: [Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send DeepEP#390.
DeepGEMM: this PR.

We also conducted integration and evaluation in SGLang: sgl-project/sglang#9660.

Since the latest version of SGLang depends on the branch https://github.com/sgl-project/DeepGEMM/tree/sgl, you should not use this branch when starting SGLang. Instead, you should use the branch developed based on the sgl branch https://github.com/Sulfur6/DeepGEMM/tree/sbo.v2.sgl.

2. Overlap Design

SBO implements two overlap for the MoE layers of DeepSeek-V3/R1. One is to overlap the Shared Expert computation with the Dispatch Recv communication, and the other is to overlap the Down GEMM computation with the Combine Send communication.

The interaction between Down GEMM and Combine Send is structured as a producer-consumer model synchronized by signals. For each local expert, a signal unit is allocated for every block_m tokens. The Down GEMM computes the results for these block_m tokens and atomically increments the signaling unit after completing a portion of the work. The Combine Send polls this signaling unit. Once the value reaches a threshold, it sends the corresponding block_m tokens.

3. Modifications

Add the m_grouped_fp8_gemm_nt_signal Python interface to support overlapping Down GEMM with Combine Send.
Add the SM90FP8SignalGemm1D2DRuntime and sm90_m_grouped_fp8_gemm_signal_1d2d to support Signal Down GEMM in SM90.
The sm90_fp8_signal_gemm_1d2d_impl kernel uses atomicAdd to write signal after the corresponding block_m tokens are computed.

4. Evaluation

We integrated the modified DeepEP and DeepGEMM into SGLang for performance evaluation.

4.1. Experiment Setup

5 nodes, with 8 × H20 GPUs per node. Each prefill node uses TP8, and the other 2 decode nodes use DP_Attn 16 + EP 16.
Input length 4096, output length 1536.

4.2. Performance Evaluation

bs 32, origin

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.8
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2359.16
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15672509
Request throughput (req/s):              4.34
Input token throughput (tok/s):          17778.82
Output token throughput (tok/s):         6667.06
Total token throughput (tok/s):          24445.88
Concurrency:                             490.01
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   112892.31
Median E2E Latency (ms):                 113847.19
---------------Time to First Token----------------
Mean TTFT (ms):                          640.62
Median TTFT (ms):                        545.06
P99 TTFT (ms):                           1543.37
---------------Inter-Token Latency----------------
Mean ITL (ms):                           73.11
Median ITL (ms):                         71.81
P95 ITL (ms):                            86.02
P99 ITL (ms):                            155.32
Max ITL (ms):                            1543.26
==================================================

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2357.80
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15673361
Request throughput (req/s):              4.34
Input token throughput (tok/s):          17789.05
Output token throughput (tok/s):         6670.89
Total token throughput (tok/s):          24459.95
Concurrency:                             490.83
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   113015.97
Median E2E Latency (ms):                 113951.58
---------------Time to First Token----------------
Mean TTFT (ms):                          724.98
Median TTFT (ms):                        624.73
P99 TTFT (ms):                           1693.64
---------------Inter-Token Latency----------------
Mean ITL (ms):                           73.13
Median ITL (ms):                         71.84
P95 ITL (ms):                            86.57
P99 ITL (ms):                            155.21
Max ITL (ms):                            1081.95
==================================================

bs 32, sbo

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.8
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2211.76
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15673456
Request throughput (req/s):              4.63
Input token throughput (tok/s):          18963.67
Output token throughput (tok/s):         7111.38
Total token throughput (tok/s):          26075.05
Concurrency:                             481.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104017.64
Median E2E Latency (ms):                 105363.65
---------------Time to First Token----------------
Mean TTFT (ms):                          606.28
Median TTFT (ms):                        508.61
P99 TTFT (ms):                           1475.44
---------------Inter-Token Latency----------------
Mean ITL (ms):                           67.35
Median ITL (ms):                         66.10
P95 ITL (ms):                            81.58
P99 ITL (ms):                            141.96
Max ITL (ms):                            1422.74
==================================================

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2194.12
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15672577
Request throughput (req/s):              4.67
Input token throughput (tok/s):          19116.14
Output token throughput (tok/s):         7168.55
Total token throughput (tok/s):          26284.70
Concurrency:                             487.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104545.42
Median E2E Latency (ms):                 105483.50
---------------Time to First Token----------------
Mean TTFT (ms):                          619.03
Median TTFT (ms):                        511.23
P99 TTFT (ms):                           1504.27
---------------Inter-Token Latency----------------
Mean ITL (ms):                           67.68
Median ITL (ms):                         66.44
P95 ITL (ms):                            82.13
P99 ITL (ms):                            142.48
Max ITL (ms):                            1024.85
==================================================

4.3. Accuracy Tests

bs 32, origin

#python -m benchmark.gsm8k.bench_sglang --port 8000 --num-questions 1000
100%|█████████████████████████████████████████████████████████████| 1000/1000 [01:20<00:00, 12.41it/s]
Accuracy: 0.951
Invalid: 0.000
Latency: 80.802 s
Output throughput: 1183.468 token/s

bs 32, sbo

#python -m benchmark.gsm8k.bench_sglang --port 8000 --num-questions 1000
100%|█████████████████████████████████████████████████████████████| 1000/1000 [01:17<00:00, 12.87it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 78.056 s
Output throughput: 1217.443 token/s

4.4. Repro Script

Please refer to sgl-project/sglang#9660.

Co-authored-by: Zqy11 <[email protected]>

Co-authored-by: Zqy11 <[email protected]> Co-authored-by: AniZpZ <[email protected]>

fzyzcjy

(I will have a more detailed check a bit later)

tests/test_signal_gemm.py

tests/generators.py

tests/test_fp8.py

deep_gemm/include/deep_gemm/impls/sm90_fp8_gemm_1d2d.cuh

fzyzcjy · 2025-09-04T12:11:23Z

deep_gemm/include/deep_gemm/impls/sm90_fp8_gemm_1d2d.cuh

+                __threadfence();
+
+                if (threadIdx.x == 0) {
+                    atomicAdd(signal + scheduler.current_group_idx * ceil_div(shape_m, BLOCK_M) + m_block_idx, 1);


btw I am still a bit worried about this atomicAdd... (the code location issue in sgl-project/sglang#9660 (comment) is already solved and no problem)

EDIT: oh I see the __threadfence that is added in the new code compared with old. then curious whether the following approach will work and whether it is faster or not: remove threadfence + but make atomicAdd a released ordering, similar to my naive attempt here https://github.com/flashinfer-ai/flashinfer/pull/1569/files#diff-26b7ee95d08a959cf95f3a5c1719b5e00a2b0bc596227967de8e0caf74aefdcaR95 (warn again - my impl there has not been tested e2e b/c the e2e code is not ready)

OK, I will conduct further research and testing on these suggestions.

I tried using atom.add.release.gpu.global.s32 instead of __threadfence + atomicAdd. Bench_kineto results showed some performance benefits:

__threadfence + atomicAdd:

Testing m-grouped masked GEMM: Warning: please use at least NVCC 12.9 for the best DeepGEMM performance > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D, enable_overlap=False): 347 us | 216 TFLOPS | 142 GB/s > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D, enable_overlap=False): 159 us | 215 TFLOPS | 213 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D, enable_overlap=False): 348 us | 178 TFLOPS | 216 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D, enable_overlap=False): 159 us | 191 TFLOPS | 291 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D, enable_overlap=False): 348 us | 174 TFLOPS | 384 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D, enable_overlap=False): 217 us | 146 TFLOPS | 353 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=4096, k=7168, 1D2D, enable_overlap=False): 405 us | 153 TFLOPS | 1201 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=7168, k=2048, 1D2D, enable_overlap=False): 172 us | 164 TFLOPS | 1455 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=4096, k=7168, 1D2D, enable_overlap=False): 256 us | 118 TFLOPS | 1865 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=7168, k=2048, 1D2D, enable_overlap=False): 127 us | 115 TFLOPS | 1911 GB/s > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D, enable_overlap=True): 351 us | 194 TFLOPS | 135 GB/s > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D, enable_overlap=True): 124 us | 174 TFLOPS | 216 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D, enable_overlap=True): 351 us | 179 TFLOPS | 215 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D, enable_overlap=True): 165 us | 183 TFLOPS | 281 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D, enable_overlap=True): 351 us | 165 TFLOPS | 378 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D, enable_overlap=True): 165 us | 159 TFLOPS | 444 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=4096, k=7168, 1D2D, enable_overlap=True): 362 us | 159 TFLOPS | 1342 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=7168, k=2048, 1D2D, enable_overlap=True): 215 us | 148 TFLOPS | 1177 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=4096, k=7168, 1D2D, enable_overlap=True): 261 us | 109 TFLOPS | 1829 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=7168, k=2048, 1D2D, enable_overlap=True): 135 us | 108 TFLOPS | 1798 GB/s

atom.add.release.gpu.global.s32:

Testing m-grouped masked GEMM: Warning: please use at least NVCC 12.9 for the best DeepGEMM performance > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D, enable_overlap=False): 347 us | 216 TFLOPS | 142 GB/s > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D, enable_overlap=False): 159 us | 215 TFLOPS | 213 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D, enable_overlap=False): 348 us | 178 TFLOPS | 216 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D, enable_overlap=False): 159 us | 191 TFLOPS | 291 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D, enable_overlap=False): 348 us | 174 TFLOPS | 384 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D, enable_overlap=False): 217 us | 146 TFLOPS | 353 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=4096, k=7168, 1D2D, enable_overlap=False): 405 us | 153 TFLOPS | 1202 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=7168, k=2048, 1D2D, enable_overlap=False): 172 us | 164 TFLOPS | 1456 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=4096, k=7168, 1D2D, enable_overlap=False): 256 us | 118 TFLOPS | 1866 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=7168, k=2048, 1D2D, enable_overlap=False): 127 us | 115 TFLOPS | 1909 GB/s > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D, enable_overlap=True): 350 us | 195 TFLOPS | 136 GB/s > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D, enable_overlap=True): 123 us | 175 TFLOPS | 217 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D, enable_overlap=True): 349 us | 180 TFLOPS | 216 GB/s > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D, enable_overlap=True): 163 us | 184 TFLOPS | 283 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D, enable_overlap=True): 350 us | 165 TFLOPS | 379 GB/s > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D, enable_overlap=True): 164 us | 160 TFLOPS | 448 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=4096, k=7168, 1D2D, enable_overlap=True): 359 us | 160 TFLOPS | 1351 GB/s > Perf (num_groups=16, expected_m_per_group= 64, n=7168, k=2048, 1D2D, enable_overlap=True): 209 us | 152 TFLOPS | 1207 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=4096, k=7168, 1D2D, enable_overlap=True): 259 us | 110 TFLOPS | 1844 GB/s > Perf (num_groups=16, expected_m_per_group= 32, n=7168, k=2048, 1D2D, enable_overlap=True): 132 us | 111 TFLOPS | 1846 GB/s

However, after some research, I concluded that release semantics ensure that all memory writes initiated by the same thread executing the atomic instruction before the atomic instruction are visible to other threads that subsequently observe the results of the atomic operation through an acquire operation. In other words, the guarantee of release semantics is bound to the thread executing the atomic operation. However, the thread initiating the TMA operation and the thread executing the write signal are not necessarily the same, so I'm concerned that this may cause problems.

My naive understanding is the following (please correct me if I am wrong!):

the thread initiating the TMA operation and the thread executing the write signal are not necessarily the same

cutlass::arch::NamedBarrier(kNumMathThreads).sync(); ensures all math threads sync at this point, thus we know all tma_store_wait calls have finished at this point, thus when we do a atom.add.release.gpu.global.s32 on thread 0, we already ensure tma_store_wait from thread 0,1,2,3,... all done.

we may apply cuda::atomic_ref and fetch_add instead of atomicAdd @Sulfur6

fzyzcjy · 2025-09-06T10:40:06Z

deep_gemm/include/deep_gemm/impls/sm90_fp8_gemm_1d2d.cuh

+
+            if constexpr (kEnableOverlap) {
+                if (threadIdx.x < BLOCK_N / TMA_D_BLOCK_N) {
+                    cute::tma_store_wait<0>();


one more naive worry: tma_store_wait seems to correspond to cp.async.bulk.wait_group.read (src: https://github.com/NVIDIA/cutlass/blob/76c96b0be35cb263debe3e3d8418b80911a544ab/include/cute/arch/copy_sm90_tma.hpp#L1251), but it seems that we need cp.async.bulk.wait_group (no ".read"), otherwise the semantics (https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-wait-group) is that, "the tma store has done reading from source, but the Writes being made visible to the executing thread may not have been done".

you are right. we may apply "asm volatile("cp.async.bulk.wait_group 0;\n" ::: "memory")" instead of tma_store_wait here

looking forward to the fix

…ith atomicAdd.

Sulfur6 and others added 30 commits August 28, 2025 17:39

feat: add signal gemm api for SBO (Single Batch Overlap).

9ac3853

Co-authored-by: Zqy11 <[email protected]>

feat: add launch config args for coorperative groups.

79b5a45

Co-authored-by: Zqy11 <[email protected]>

feat: add signal gemm impl&runtime in jit kernels.

6cf6ca1

Co-authored-by: Zqy11 <[email protected]>

feat: add signal gemm kernel.

1e2c135

Co-authored-by: Zqy11 <[email protected]>

feat: add signal gemm import in deep_gemm package.

a5d23b0

Co-authored-by: Zqy11 <[email protected]>

feat: add test for signal gemm.

48c741a

Co-authored-by: Zqy11 <[email protected]>

feat: add signal threshold as a config arg for signal gemm.

9a2d3e5

add comparation test.

03f61d2

avoid using bench kineto

e86f1f9

test: modify generators to fit bs32 down gemm.

e7ebff1

more

8037369

more.

55c7943

more.

99cc43b

fix

4bfed47

more.

4a14a40

more.

701751c

more.

8035121

more.

02edd31

feat: add param max_block_n

0fcd03c

test

2ec7414

more.

a63152f

more.

92b29f6

more.

be21bf6

more.

ffa1140

fix: ensure memory order and send location.

4802c62

Co-authored-by: Zqy11 <[email protected]> Co-authored-by: AniZpZ <[email protected]>

more.

6271138

exp

57bb435

exp

d6e63a3

rollback generator.

a21c96f

remove threadfence.

51740cf

Sulfur6 added 7 commits September 4, 2025 18:24

fix.

debb196

fix

7ee0480

more.

0c0eac8

remove comments.

af77de9

remove code dup.

23be492

remove signal gemm api.

84526c5

fix.

f858cad

fzyzcjy reviewed Sep 4, 2025

View reviewed changes

tests/test_signal_gemm.py Outdated Show resolved Hide resolved

fzyzcjy reviewed Sep 4, 2025

View reviewed changes

tests/generators.py Outdated Show resolved Hide resolved

tests/test_fp8.py Outdated Show resolved Hide resolved

deep_gemm/include/deep_gemm/impls/sm90_fp8_gemm_1d2d.cuh Outdated Show resolved Hide resolved

fzyzcjy reviewed Sep 4, 2025

View reviewed changes

Sulfur6 added 10 commits September 5, 2025 11:19

feat: use NamedBarrier instead of coorperative groups.

ee1a058

refactor: unify test for overlap with test for masked gemm.

ec8b7c1

ref

ad8874a

fix.

59dc1aa

fix.

5edbbc5

fix.

bf0d62a

fix.

b78061b

fix.

848952b

add test for EP16 situation.

bbc09d6

remove code dup.

4011a8a

Sulfur6 marked this pull request as ready for review September 6, 2025 09:28

fzyzcjy reviewed Sep 6, 2025

View reviewed changes

Sulfur6 added 6 commits September 8, 2025 15:51

more.

b9f37f2

more.

07aa61c

try to use atom.add.release.gpu.global.s32 instead of __threadfence w…

379a913

…ith atomicAdd.

remove redundant change.

9f769b4

fix.

eb1091f

fix.

ede008b

Sulfur6 mentioned this pull request Sep 11, 2025

The current strategy in get_best_config is suboptimal in some cases on H20 #191

Open

Merge remote-tracking branch 'origin/main' into sbo.v2.public

d232a36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #183

[Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #183

Uh oh!

Sulfur6 commented Sep 2, 2025 •

edited

Loading

Uh oh!

fzyzcjy left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fzyzcjy Sep 4, 2025 •

edited

Loading

Uh oh!

Sulfur6 Sep 4, 2025

Uh oh!

Sulfur6 Sep 6, 2025

Uh oh!

fzyzcjy Sep 6, 2025

Uh oh!

AniZpZ Sep 8, 2025

Uh oh!

fzyzcjy Sep 6, 2025

Uh oh!

AniZpZ Sep 8, 2025

Uh oh!

fzyzcjy Sep 8, 2025

Uh oh!

Uh oh!

[Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #183

Are you sure you want to change the base?

[Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #183

Uh oh!

Conversation

Sulfur6 commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Motivation

2. Overlap Design

3. Modifications

4. Evaluation

4.1. Experiment Setup

4.2. Performance Evaluation

4.3. Accuracy Tests

4.4. Repro Script

Uh oh!

fzyzcjy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fzyzcjy Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sulfur6 Sep 4, 2025

Choose a reason for hiding this comment

Uh oh!

Sulfur6 Sep 6, 2025

Choose a reason for hiding this comment

Uh oh!

fzyzcjy Sep 6, 2025

Choose a reason for hiding this comment

Uh oh!

AniZpZ Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

fzyzcjy Sep 6, 2025

Choose a reason for hiding this comment

Uh oh!

AniZpZ Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

fzyzcjy Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Sulfur6 commented Sep 2, 2025 •

edited

Loading

fzyzcjy Sep 4, 2025 •

edited

Loading