Init moe support #16
Conversation
Force-pushed from 6589915 to 0004dab (Compare)
mingfeima left a comment
Let's first separate the PR into:
- implement MoE with grouped gemm by cutlass
- enable operator unit benchmarks for CI
flat_topk = topk_ids.flatten()
idxs = flat_topk.argsort()
sorted_expert_ids = flat_topk[idxs]

counts = torch.bincount(sorted_expert_ids, minlength=E)  # [E]
token_idxs = idxs // TopK  # [num_tokens * TopK]
input_A = torch.empty(
    (num_tokens * TopK, K), device=hidden_states.device, dtype=hidden_states.dtype
)
input_A = hidden_states[token_idxs].squeeze(1)
offset = counts.to(torch.int32)
Is this the best we can do with the current cutlass APIs?
If so, do we have a JIRA ticket tracking the real need we have?
This implementation will definitely hurt the perf.
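For context, a hedged reference sketch of what the reshuffle above feeds: the tokens are gathered into expert-contiguous order so that one GEMM per expert group (what a cutlass grouped GEMM would fuse) can consume them. This is an illustrative reference, not the PR's kernel; the `w1` tensor and its `[num_experts, K, N]` layout are assumptions.

```python
import torch

def grouped_moe_ref(hidden_states, w1, topk_ids, num_experts):
    # hidden_states: [num_tokens, K]; topk_ids: [num_tokens, topk]
    # w1: hypothetical [num_experts, K, N] stack of expert weights
    num_tokens, K = hidden_states.shape
    topk = topk_ids.shape[1]

    flat_topk = topk_ids.flatten()               # [num_tokens * topk]
    idxs = flat_topk.argsort()                   # expert-sorted slot order
    counts = torch.bincount(flat_topk[idxs], minlength=num_experts)
    token_idxs = idxs // topk                    # original token of each sorted slot
    input_A = hidden_states[token_idxs]          # expert-contiguous activations

    out = torch.empty(num_tokens * topk, w1.shape[-1],
                      device=hidden_states.device, dtype=hidden_states.dtype)
    start = 0
    for e in range(num_experts):                 # the loop a grouped GEMM replaces
        end = start + int(counts[e])
        if end > start:
            out[start:end] = input_A[start:end] @ w1[e]
        start = end
    return out, idxs                             # idxs maps rows back to (token, expert) slots
```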
@airMeng, do you know how FlashInfer is using a cutlass-based FusedMoE for Nvidia GPUs? I think folks on the vLLM team know.
FlashInfer is using a cutlass-based FusedMoE without activation reshuffle.
mingfeima left a comment
Need some minor changes. Generally LGTM!
# # DeepSeek-V3-0324, tp = 1
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 4096,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
# },
# # DeepSeek-V3-0324, tp = 2
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 2048,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
# },
# # DeepSeek-V3-0324, tp = 4
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 1024,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
# },
# # DeepSeek-V3-0324, tp = 8
# {
#     "num_experts": 257,
#     "topk": 8,
#     "hidden_size": 7168,
#     "shard_intermediate_size": 512,
#     "dtype": torch.bfloat16,
#     "block_shape": [128, 128],
Can we use TP as an argument and then, for each configuration, compute shard_intermediate_size = shard_intermediate_size // tp?
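A sketch of what that parameterization could look like; the base values are copied from the tp = 1 entry above, and the helper name is hypothetical.

```python
import torch

# Base values taken from the tp = 1 entry in the commented-out list above.
DEEPSEEK_V3_BASE = {
    "num_experts": 257,
    "topk": 8,
    "hidden_size": 7168,
    "shard_intermediate_size": 4096,  # value at tp = 1
    "dtype": torch.bfloat16,
    "block_shape": [128, 128],
}

def config_for_tp(tp: int) -> dict:
    """Derive the per-rank config from a TP degree instead of listing every shard size."""
    cfg = dict(DEEPSEEK_V3_BASE)
    cfg["shard_intermediate_size"] //= tp
    return cfg

configs = [config_for_tp(tp) for tp in (1, 2, 4, 8)]  # reproduces 4096 / 2048 / 1024 / 512
```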
src/sycl/MoEAlign.cpp (outdated)
  return n + 1;
}

#define CEILDIV(x, y) ((x + y - 1) / y)
This is already defined in include/utils.h.
Removed; using div_up instead.
if (small_batch_expert_mode) {
  const int32_t threads_local = std::max((int32_t)num_experts, sub_group_size);
  auto range = sycl::nd_range<1>(sycl::range<1>(threads_local), sycl::range<1>(threads_local));
  using SmallKernel = MOEAlignBlockSizeSmallBatchExpertFunctor<scalar_t>;
  SmallKernel kernel(
      topk_ids.data_ptr<scalar_t>(),
      sorted_token_ids.data_ptr<int32_t>(),
      experts_ids.data_ptr<int32_t>(),
      num_tokens_post_pad.data_ptr<int32_t>(),
      num_experts,
      block_size,
      topk_ids.numel(),
      pad_sorted_token_ids);
  sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
} else {
  const size_t scan_size = next_pow2(num_experts);
  const size_t shared_mem_size = (num_experts + (num_experts + 1) + scan_size + sub_group_size) * sizeof(int32_t);
  using Kernel = MOEAlignBlockSizeFunctor<scalar_t>;
  Kernel kernel(
      topk_ids.data_ptr<scalar_t>(),
      sorted_token_ids.data_ptr<int32_t>(),
      experts_ids.data_ptr<int32_t>(),
      num_tokens_post_pad.data_ptr<int32_t>(),
      num_experts,
      block_size,
      topk_ids.numel(),
      cumsum_buffer.data_ptr<int32_t>(),
      pad_sorted_token_ids,
      scan_size);
thumbs up!
switch (topk) {
  case 2: {
    DISPATCH_FLOAT_TYPES(input.scalar_type(), "moe_sum", [&] {
      using Kernel = MoeSumKernel<scalar_t, 2>;
      Kernel kernel(output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), hidden_size);
      sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
    });
    break;
  }
  case 3: {
    DISPATCH_FLOAT_TYPES(input.scalar_type(), "moe_sum", [&] {
      using Kernel = MoeSumKernel<scalar_t, 3>;
      Kernel kernel(output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), hidden_size);
      sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
    });
    break;
  }
  case 4: {
    DISPATCH_FLOAT_TYPES(input.scalar_type(), "moe_sum", [&] {
      using Kernel = MoeSumKernel<scalar_t, 4>;
      Kernel kernel(output.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), hidden_size);
      sycl_kernel_submit(range.get_global_range(), range.get_local_range(), queue, kernel);
    });
    break;
  }
  default:
    at::sum_out(output, input, 1);
    break;
  }
}
Why do we need special treatment for topk = 2, 3, and 4?
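For reference, the reduction itself is simple. My reading is that the templated MoeSumKernel<scalar_t, TOPK> cases exist so the inner loop can be unrolled for the common small topk values, while everything else falls back to at::sum_out. A minimal sketch of the semantics both paths implement (the assertion is only an illustration, not the PR's test):

```python
import torch

def moe_sum_ref(x: torch.Tensor) -> torch.Tensor:
    """x: [num_tokens, topk, hidden_size]; reduce over the topk dimension."""
    return x.sum(dim=1)

x = torch.randn(16, 4, 7168)
assert torch.allclose(moe_sum_ref(x), torch.sum(x, dim=1))  # matches at::sum_out(output, input, 1)
```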
Do we have performance numbers? I don't know if we can compare cutlass kernels with xetla kernels (with the so-called persistent weight). If the comparison makes sense, please collect the data. Otherwise, you can just calculate the memory bandwidth.
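A rough sketch of the bandwidth calculation suggested here; the byte accounting below is an assumption and would need to match what the kernel actually reads and writes.

```python
import torch

def achieved_bandwidth_gbps(bytes_moved: int, elapsed_s: float) -> float:
    """Achieved memory bandwidth in GB/s from total bytes moved and measured kernel time."""
    return bytes_moved / elapsed_s / 1e9

def moe_bytes_moved(num_tokens, topk, hidden_size, shard_intermediate_size,
                    num_experts, dtype=torch.bfloat16) -> int:
    """Rough per-layer byte count for a fused MoE; adjust to what the kernel really touches."""
    b = torch.finfo(dtype).bits // 8
    activations = num_tokens * topk * hidden_size * 2 * b                   # read input + write output
    weights = num_experts * 3 * hidden_size * shard_intermediate_size * b   # gate/up + down weights, roughly
    return activations + weights
```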
    if: github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'run-ci')
    runs-on: sglang-pvc
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Build Docker image
        run: |
          docker build \
            --build-arg SG_LANG_KERNEL_BRANCH=${{ github.head_ref }} \
            --build-arg SG_LANG_KERNEL_REPO=${{ github.event.pull_request.head.repo.clone_url }} \
            --no-cache --progress=plain -f Dockerfile.xpu_kernel -t xpu_sglang:kernel .

      - name: Run container
        run: |
          docker run -dt \
            --device /dev/dri/ \
            --name ci_sglang_xpu \
            -e HF_TOKEN=$(cat ~/huggingface_token.txt) \
            xpu_sglang:kernel

      - name: Install Dependency
        timeout-minutes: 20
        run: |
          docker exec ci_sglang_xpu /miniforge3/envs/py3.10/bin/python3 -m pip install --upgrade pip
          docker exec ci_sglang_xpu /miniforge3/envs/py3.10/bin/python3 -m pip install pytest expecttest ray huggingface_hub
          docker exec ci_sglang_xpu /bin/bash -c '/miniforge3/envs/py3.10/bin/huggingface-cli login --token ${HF_TOKEN}'
          docker exec ci_sglang_xpu /bin/bash -c "ln -sf /miniforge3/envs/py3.10/bin/python3 /usr/bin/python3"

      - name: Run Sglang Kernel Cases
        timeout-minutes: 20
        run: |
          docker exec -w /root/sglang ci_sglang_xpu \
            /bin/bash -c "cd /root/sglang/sgl-kernel-xpu/tests && python3 run_suite.py --suite per-commit"

      - name: Run Sglang Kernel Benchmarks
        timeout-minutes: 20
        run: |
          docker exec -w /root/sglang ci_sglang_xpu \
            /bin/bash -c "cd /root/sglang/sgl-kernel-xpu/benchmark && python3 bench_flash_attn.py && python3 bench_moe_topk_softmax.py && python3 benchmark_fused_moe.py"

      - name: Run E2E Bfloat16 tests
        timeout-minutes: 20
        run: |
          echo "[PlaceHolder for E2E Test...]"

      - name: Run E2E Quantization tests
        timeout-minutes: 20
        run: |
          echo "[PlaceHolder for E2E Test...]"

      - name: Cleanup container
        if: always()
        run: |
          docker rm -f ci_sglang_xpu || true
Check warning
Code scanning / CodeQL: Workflow does not contain permissions (Medium)
Copilot Autofix (AI, 2 days ago)
The best way to fix this problem is to add an explicit permissions block with the minimum set of permissions required for the workflow to function. Since this workflow does not push changes, create releases, or otherwise write to the repository contents, it only needs contents: read permission. For maximum clarity and correctness, insert the following block after the top-level name: key (before or after on: is fine, but prefer after) so it applies to all jobs in the workflow:

permissions:
  contents: read

This ensures that GITHUB_TOKEN will only have read access to repository contents, thus following the principle of least privilege and fixing the CodeQL error. No new imports, methods, or definitions are required; this is just a YAML addition.
@@ -1,5 +1,8 @@
 name: PR Test (XPU)

+permissions:
+  contents: read
+
 on:
   pull_request:
     branches: [main]
kareemshaik80 left a comment
LGTM
@mingfeima @sanchitintel If there are no more comments I will merge the PR.
NP, we don't have to do everything in just one PR.