Init moe support #16
Conversation
Force-pushed from 6589915 to 0004dab (Compare)
mingfeima left a comment
Let's first separate the PR into:
- implement MoE with grouped gemm by cutlass
- enable operator unit benchmarks for CI
flat_topk = topk_ids.flatten()
idxs = flat_topk.argsort()
sorted_expert_ids = flat_topk[idxs]

counts = torch.bincount(sorted_expert_ids, minlength=E)  # [E]
token_idxs = idxs // TopK  # [num_tokens * TopK]
input_A = torch.empty(
    (num_tokens * TopK, K), device=hidden_states.device, dtype=hidden_states.dtype
)
input_A = hidden_states[token_idxs].squeeze(1)
offset = counts.to(torch.int32)
Is this the best we can do with the current cutlass APIs?
If so, do we have a JIRA ticket tracking the real need we have?
This implementation will definitely hurt the perf.
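For context, a hedged reference sketch of what the reshuffle above feeds: the tokens are gathered into expert-contiguous order so that one GEMM per expert group (what a cutlass grouped GEMM would fuse) can consume them. This is an illustrative reference, not the PR's kernel; the `w1` tensor and its `[num_experts, K, N]` layout are assumptions.

```python
import torch

def grouped_moe_ref(hidden_states, w1, topk_ids, num_experts):
    # hidden_states: [num_tokens, K]; topk_ids: [num_tokens, topk]
    # w1: hypothetical [num_experts, K, N] stack of expert weights
    num_tokens, K = hidden_states.shape
    topk = topk_ids.shape[1]

    flat_topk = topk_ids.flatten()               # [num_tokens * topk]
    idxs = flat_topk.argsort()                   # expert-sorted slot order
    counts = torch.bincount(flat_topk[idxs], minlength=num_experts)
    token_idxs = idxs // topk                    # original token of each sorted slot
    input_A = hidden_states[token_idxs]          # expert-contiguous activations

    out = torch.empty(num_tokens * topk, w1.shape[-1],
                      device=hidden_states.device, dtype=hidden_states.dtype)
    start = 0
    for e in range(num_experts):                 # the loop a grouped GEMM replaces
        end = start + int(counts[e])
        if end > start:
            out[start:end] = input_A[start:end] @ w1[e]
        start = end
    return out, idxs                             # idxs maps rows back to (token, expert) slots
```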
@airMeng, do you know how FlashInfer is using a cutlass-based FusedMoE for Nvidia GPUs? I think folks on the vLLM team know.
FlashInfer is using a cutlass-based FusedMoE without activation reshuffle.
mingfeima left a comment
Need some minor changes. Generally LGTM!
# # DeepSeek-V3-0324, tp = 1
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 4096,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
# },
# # DeepSeek-V3-0324, tp = 2
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 2048,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
# },
# # DeepSeek-V3-0324, tp = 4
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 1024,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
# },
# # DeepSeek-V3-0324, tp = 8
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 512,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
Can we use TP as an argument and then, for each configuration, compute shard_intermediate_size = shard_intermediate_size // tp?
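A sketch of what that parameterization could look like; the base values are copied from the tp = 1 entry above, and the helper name is hypothetical.

```python
import torch

# Base values taken from the tp = 1 entry in the commented-out list above.
DEEPSEEK_V3_BASE = {
    "num_experts": 257,
    "topk": 8,
    "hidden_size": 7168,
    "shard_intermediate_size": 4096,  # value at tp = 1
    "dtype": torch.bfloat16,
    "block_shape": [128, 128],
}

def config_for_tp(tp: int) -> dict:
    """Derive the per-rank config from a TP degree instead of listing every shard size."""
    cfg = dict(DEEPSEEK_V3_BASE)
    cfg["shard_intermediate_size"] //= tp
    return cfg

configs = [config_for_tp(tp) for tp in (1, 2, 4, 8)]  # reproduces 4096 / 2048 / 1024 / 512
```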
src/sycl/MoEAlign.cpp (outdated)
  return n + 1;
}

#define CEILDIV(x, y) ((x + y - 1) / y)
This is already defined in include/utils.h.
Removed; using div_up instead.
if (small_batch_expert_mode) {
  const int32_t threads_local = std::max((int32_t)num_experts, sub_group_size);
  auto range = sycl::nd_range<1>(sycl::range<1>(threads_local), sycl::range<1>(threads_local));
  using SmallKernel = MOEAlignBlockSizeSmallBatchExpertFunctor<scalar_t>;
  SmallKernel kernel(
      topk_ids.data_ptr<scalar_t>(),
      sorted_token_ids.data_ptr<int32_t>(),
      experts_ids.data_ptr<int32_t>(),
      num_tokens_post_pad.data_ptr<int32_t>(),
      num_experts,
      block_size,
      topk_ids.numel(),
      pad_sorted_token_ids);
  sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
} else {
  const size_t scan_size = next_pow2(num_experts);
  const size_t shared_mem_size = (num_experts + (num_experts + 1) + scan_size + sub_group_size) * sizeof(int32_t);
  using Kernel = MOEAlignBlockSizeFunctor<scalar_t>;
  Kernel kernel(
      topk_ids.data_ptr<scalar_t>(),
      sorted_token_ids.data_ptr<int32_t>(),
      experts_ids.data_ptr<int32_t>(),
      num_tokens_post_pad.data_ptr<int32_t>(),
      num_experts,
      block_size,
      topk_ids.numel(),
      cumsum_buffer.data_ptr<int32_t>(),
      pad_sorted_token_ids,
      scan_size);
thumbs up!
switch (topk) {
  case 2: {
    DISPATCH_FLOAT_TYPES(input.scalar_type(), "moe_sum", [&] {
      using Kernel = MoeSumKernel<scalar_t, 2>;
      Kernel kernel(output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), hidden_size);
      sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
    });
    break;
  }
  case 3: {
    DISPATCH_FLOAT_TYPES(input.scalar_type(), "moe_sum", [&] {
      using Kernel = MoeSumKernel<scalar_t, 3>;
      Kernel kernel(output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), hidden_size);
      sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
    });
    break;
  }
  case 4: {
    DISPATCH_FLOAT_TYPES(input.scalar_type(), "moe_sum", [&] {
      using Kernel = MoeSumKernel<scalar_t, 4>;
      Kernel kernel(output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), hidden_size);
      sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
    });
    break;
  }
  default:
    at::sum_out(output, input, 1);
    break;
  }
}
Why do we need special treatment for topk = 2, 3, and 4?
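For reference, the reduction itself is simple. My reading is that the templated MoeSumKernel<scalar_t, TOPK> cases exist so the inner loop can be unrolled for the common small topk values, while everything else falls back to at::sum_out. A minimal sketch of the semantics both paths implement (the assertion is only an illustration, not the PR's test):

```python
import torch

def moe_sum_ref(x: torch.Tensor) -> torch.Tensor:
    """x: [num_tokens, topk, hidden_size]; reduce over the topk dimension."""
    return x.sum(dim=1)

x = torch.randn(16, 4, 7168)
assert torch.allclose(moe_sum_ref(x), torch.sum(x, dim=1))  # matches at::sum_out(output, input, 1)
```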
Do we have performance numbers? I don't know if we can compare cutlass kernels with xetla kernels (with the so-called persistent weight). If the comparison makes sense, please collect the data. Otherwise, you can just calculate the memory bandwidth.
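A rough sketch of the bandwidth calculation suggested here; the byte accounting below is an assumption and would need to match what the kernel actually reads and writes.

```python
import torch

def achieved_bandwidth_gbps(bytes_moved: int, elapsed_s: float) -> float:
    """Achieved memory bandwidth in GB/s from total bytes moved and measured kernel time."""
    return bytes_moved / elapsed_s / 1e9

def moe_bytes_moved(num_tokens, topk, hidden_size, shard_intermediate_size,
                    num_experts, dtype=torch.bfloat16) -> int:
    """Rough per-layer byte count for a fused MoE; adjust to what the kernel really touches."""
    b = torch.finfo(dtype).bits // 8
    activations = num_tokens * topk * hidden_size * 2 * b                   # read input + write output
    weights = num_experts * 3 * hidden_size * shard_intermediate_size * b   # gate/up + down weights, roughly
    return activations + weights
```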
    if: github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'run-ci')
    runs-on: sglang-pvc
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Build Docker image
        run: |
          docker build \
            --build-arg SG_LANG_KERNEL_BRANCH=${{ github.head_ref }} \
            --build-arg SG_LANG_KERNEL_REPO=${{ github.event.pull_request.head.repo.clone_url }} \
            --no-cache --progress=plain -f Dockerfile.xpu_kernel -t xpu_sglang:kernel .

      - name: Run container
        run: |
          docker run -dt \
            --device /dev/dri/ \
            --name ci_sglang_xpu \
            -e HF_TOKEN=$(cat ~/huggingface_token.txt) \
            xpu_sglang:kernel

      - name: Install Dependency
        timeout-minutes: 20
        run: |
          docker exec ci_sglang_xpu /miniforge3/envs/py3.10/bin/python3 -m pip install --upgrade pip
          docker exec ci_sglang_xpu /miniforge3/envs/py3.10/bin/python3 -m pip install pytest expecttest ray huggingface_hub
          docker exec ci_sglang_xpu /bin/bash -c '/miniforge3/envs/py3.10/bin/huggingface-cli login --token ${HF_TOKEN}'
          docker exec ci_sglang_xpu /bin/bash -c "ln -sf /miniforge3/envs/py3.10/bin/python3 /usr/bin/python3"

      - name: Run Sglang Kernel Cases
        timeout-minutes: 20
        run: |
          docker exec -w /root/sglang ci_sglang_xpu \
            /bin/bash -c "cd /root/sglang/sgl-kernel-xpu/tests && python3 run_suite.py --suite per-commit"

      - name: Run Sglang Kernel Benchmarks
        timeout-minutes: 20
        run: |
          docker exec -w /root/sglang ci_sglang_xpu \
            /bin/bash -c "cd /root/sglang/sgl-kernel-xpu/benchmark && python3 bench_flash_attn.py && python3 bench_moe_topk_softmax.py && python3 benchmark_fused_moe.py"

      - name: Run E2E Bfloat16 tests
        timeout-minutes: 20
        run: |
          echo "[PlaceHolder for E2E Test...]"

      - name: Run E2E Quantization tests
        timeout-minutes: 20
        run: |
          echo "[PlaceHolder for E2E Test...]"

      - name: Cleanup container
        if: always()
        run: |
          docker rm -f ci_sglang_xpu || true
Check warning
Code scanning / CodeQL: Workflow does not contain permissions (Medium)
Copilot Autofix (AI, 2 days ago)
The best way to fix this problem is to add an explicit permissions block with the minimum set of permissions required for the workflow to function. Since this workflow does not push changes, create releases, or otherwise write to the repository contents, it only needs contents: read permission. For maximum clarity and correctness, insert the following block after the top-level name: key (before or after on: is fine, but prefer after) so it applies to all jobs in the workflow:

permissions:
  contents: read

This ensures that GITHUB_TOKEN will only have read access to repository contents, thus following the principle of least privilege and fixing the CodeQL error. No new imports, methods, or definitions are required; this is just a YAML addition.
@@ -1,5 +1,8 @@
 name: PR Test (XPU)

+permissions:
+  contents: read
+
 on:
   pull_request:
     branches: [main]
kareemshaik80 left a comment
LGTM
@mingfeima @sanchitintel If there are no more comments I will merge the PR.
NP, we don't have to do everything in just one PR.