[Feature, Hardware] Enable DeepseekV3 on AMD GPUs#2601

Merged
HaiShaw merged 27 commits into sgl-project:main from BruceXcluding:main
Jan 3, 2025

Conversation

@BruceXcluding
Contributor

@BruceXcluding BruceXcluding commented Dec 26, 2024

Motivation

  • Support DeepseekV3 on AMD Instinct MI300X GPU

Modifications

  • Add proper fix for AMD FP8 e4m3fnuz to support DeepseekV3 FP8 model
  • Bypass the FlashInfer backend's bmm_fp8 by casting FP8 to BF16 in MLA
  • Add AMD triton stages config

TODO

How to run

build env

cd sglang/docker

docker build -t sglang-rocm:latest -f Dockerfile.rocm .
 
docker run -it --ipc=host \
    --cap-add=SYS_PTRACE \
    --network=host \
    --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined \
    --group-add video \
    --privileged \
    -w /workspace sglang-rocm:latest

offline:

python -m sglang.bench_one_batch --batch-size 32 --input 128 --output 32 --model /data/DeepSeek-V3-Base/ --tp 8 --trust-remote-code

Warmup ...
Prefill. latency: 6.46569 s, throughput:    633.50 token/s
Decode.  latency: 2.58990 s, throughput:     12.36 token/s
Decode.  latency: 0.07421 s, throughput:    431.21 token/s
Decode.  latency: 0.07358 s, throughput:    434.90 token/s
Decode.  latency: 0.07341 s, throughput:    435.91 token/s
Decode.  latency: 0.07385 s, throughput:    433.30 token/s
Decode.  median latency: 0.07383 s, median throughput:    433.44 token/s
Total. latency:  9.498 s, throughput:    458.19 token/s
Benchmark ...
Prefill. latency: 0.54745 s, throughput:   7482.01 token/s
Decode.  latency: 0.07250 s, throughput:    441.41 token/s
Decode.  latency: 0.07399 s, throughput:    432.46 token/s
Decode.  latency: 0.07309 s, throughput:    437.84 token/s
Decode.  latency: 0.07335 s, throughput:    436.27 token/s
Decode.  latency: 0.07333 s, throughput:    436.38 token/s
Decode.  median latency: 0.07358 s, median throughput:    434.88 token/s
Total. latency:  2.828 s, throughput:   1810.38 token/s

server:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --tp 8 --trust-remote-code

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8

Accuracy: 0.950
Invalid: 0.000

Issues

  • If you get an error like raise OutOfResources(self.metadata.shared, max_shared, "shared memory"), the same as [Bug] Deepseek-v2-lite AMD MI300 run failed #2384:
    Solved in python/sglang/srt/layers/attention/triton_ops/decode_attention.py +410
  • If you get an error like ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema', the same as [Bug] SGLang v0.4.0 with AMD MI300X #2530:
    Solved by downgrading vllm
  • If you get an error like RuntimeError: [enforce fail at /app/pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0:
    Solved by checking your interface name with ifconfig and setting export GLOO_SOCKET_IFNAME=<your interface>
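For the last issue, a minimal recipe; the interface name ens51f0 below is just a placeholder, not a name from this PR:

```shell
# List the network interfaces on this machine (there may be no eth0):
ls /sys/class/net

# Tell gloo which one to use before launching sglang;
# replace ens51f0 with the name you found above:
export GLOO_SOCKET_IFNAME=ens51f0
```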

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs zhyncs added the bug and amd labels Dec 26, 2024
@carlushuang

@HaiShaw

@HaiShaw
Collaborator

HaiShaw commented Dec 26, 2024

@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?

@zhyncs
Collaborator

zhyncs commented Dec 26, 2024

@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?

That would be nice. I plan to release v0.4.1.post1 soon to enable users to use AMD MI300X initially.


@HaiShaw HaiShaw left a comment


@BruceXcluding
Some to address, thanks!

Comment thread docker/Dockerfile.rocm Outdated
ENV NCCL_MIN_NCHANNELS=112

ENV MOE_PADDING=1
ENV MOE_PADDING=0
Collaborator


We need to keep MOE_PADDING on for performance; the error it incurs is what we need to fix.

Comment thread docker/Dockerfile.rocm Outdated
logit_cap,
):
BLOCK = 32
BLOCK = 16 if is_hip() else 32
Collaborator


We should not cut BLOCK in half for HIP globally here.

Contributor Author


It doesn't work well in the latest vllm with BLOCK = 32.

Collaborator


We cannot take this part as is; it will cost all other models a large margin of performance.

# WEIGHT
weight_dtype = (
torch.float8_e4m3fn
torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Collaborator


We should not have this, serialized weight is always OCP (torch.float8_e4m3fn)

Contributor Author


It would encounter the error python/sglang/srt/layers/quantization/fp8_kernel.py:176:33: error: Unsupported conversion from 'f8E4M3FN' to 'f16' at accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :] with torch.float8_e4m3fn in w8a8_block_fp8_matmul.

Collaborator


Please check how normalize_e4m3fn_to_e4m3fnuz is used.
Basically, we do not expect a non-OCP dtype (anything other than e4m3fn) in the quantized model.

is_marlin: bool,
) -> Dict[str, int]:
if dtype == "fp8_w8a8":
if dtype == "fp8_w8a8" and not is_hip():
Collaborator


The following block isn't a blocker for HIP.

Comment thread python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py Outdated
Collaborator

@HaiShaw HaiShaw left a comment


@BruceXcluding
Also see this error below with your version of pyproject.toml:

  File "/dockerx/1226/HS/sglang/python/sglang/srt/constrained/outlines_backend.py", line 23, in <module>
    from outlines.fsm.json_schema import build_regex_from_schema
ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema' (/usr/local/lib/python3.12/dist-packages/outlines/fsm/json_schema.py)

@ZJLi2013

The CI failure (PR Test / unit-test-backend-2-gpu) used a lite model, 'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', which doesn't have the FP8 block-level quant feature.

@BruceXcluding BruceXcluding marked this pull request as ready for review December 27, 2024 05:56

if self.quant_config.is_checkpoint_fp8_serialized:
params_dtype = torch.float8_e4m3fn
params_dtype = torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Collaborator


Same problem here; check out the previous usage of normalize_e4m3fn_to_e4m3fnuz.

@BruceXcluding BruceXcluding marked this pull request as draft December 31, 2024 01:36
@BruceXcluding
Contributor Author

@AdjectiveAllison we are targeting the FP8 accuracy issue; do you see garbled output with bf16 as well? We will tune performance with the provided config.json soon. Are you using MI308?

No, output with full bf16 works perfectly. I'm on an 8x MI300X machine, 192 GB of VRAM each.

@AdjectiveAllison Can you try with the latest instructions?

Comment thread python/sglang/srt/server.py Outdated
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
if "GLOO_SOCKET_IFNAME" not in os.environ:
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
# TODO(fix socket error with gpu backend)
Collaborator


Why is this commented out?

Contributor


Is this used for the CPU backend or a specific workstation? We get RuntimeError: [enforce fail at pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0

Collaborator


This is used for multi-node tensor parallelism. Instead of commenting it out, we suggest adding an is_hip flag.

Contributor


I think the value set for the GLOO_SOCKET_IFNAME environment variable should depend on the name of the network interface in each user's system and should not be hard-coded as eth0.

Collaborator


@wufann If the user's value is not eth0, they should specify it explicitly; this applies only when no setting is provided, with eth0 as the default.

Contributor


@zhyncs A different network interface name (e.g. "ens...") may be used. Also, users may test in a single-node environment where IB is not configured; in that case IB should be disabled.
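One way to implement the suggestions above (default only when unset, don't hard-code eth0) is sketched here; the function name is illustrative and this is not the actual sglang change:

```python
import os

def ensure_gloo_socket_ifname():
    """Set GLOO_SOCKET_IFNAME only if the user hasn't, preferring the
    first non-loopback interface over a hard-coded eth0.
    Linux-only sketch (reads /sys/class/net)."""
    if "GLOO_SOCKET_IFNAME" in os.environ:
        return os.environ["GLOO_SOCKET_IFNAME"]
    if not os.path.isdir("/sys/class/net"):
        return None
    for name in sorted(os.listdir("/sys/class/net")):
        if name != "lo":
            os.environ["GLOO_SOCKET_IFNAME"] = name
            return name
    return None
```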

Comment thread sgl-kernel/amd/CMakeLists.txt Outdated
@@ -0,0 +1,51 @@
cmake_minimum_required(VERSION 3.18)
Collaborator

@zhyncs zhyncs Dec 31, 2024


Please remove this, we only use CMakeLists.txt for clangd indexing, so it's not necessary.

Comment thread sgl-kernel/amd/pyproject.toml Outdated
build-backend = "setuptools.build_meta"

[project]
name = "sgl-kernel"
Collaborator


Can we refer to the setup of flash-attention or vllm compatible with NVIDIA and AMD?
https://github.com/Dao-AILab/flash-attention/blob/main/setup.py
https://github.com/vllm-project/vllm/blob/main/setup.py

@zhyncs
Collaborator

zhyncs commented Jan 2, 2025

Hi @BruceXcluding @HaiShaw
#2712
You can now try using moe_align_block_size_triton on AMD.

@BruceXcluding
Contributor Author

Hi @BruceXcluding @HaiShaw #2712 You can now try using moe_align_block_size_triton on AMD.

Tested and works well. We could build sgl-kernel-amd after we add CK kernels.

@HaiShaw
Collaborator

HaiShaw commented Jan 2, 2025

Hi @BruceXcluding @HaiShaw #2712 You can now try using moe_align_block_size_triton on AMD.

Tested and works well. We could build sgl-kernel-amd after we add ck kernels

@BruceXcluding, how was the performance compared to sgl-kernel-amd?

@zhyncs
Collaborator

zhyncs commented Jan 2, 2025

Hi @BruceXcluding @HaiShaw
Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

@zhyncs zhyncs marked this pull request as ready for review January 2, 2025 18:42
@HaiShaw
Collaborator

HaiShaw commented Jan 2, 2025

@zhyncs I am expecting @BruceXcluding to do the final update.
@BruceXcluding can you confirm the decode_attention.py change?

Collaborator

@HaiShaw HaiShaw left a comment


@BruceXcluding thanks!

@HaiShaw HaiShaw dismissed merrymercy’s stale review January 3, 2025 00:23

had been addressed above

@HaiShaw HaiShaw merged commit c7ae474 into sgl-project:main Jan 3, 2025
@BruceXcluding
Contributor Author

Hi @BruceXcluding @HaiShaw Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

Thanks @zhyncs @HaiShaw. We will keep the TODO list on track for performance improvements.

XiaotongJiang pushed a commit to XiaotongJiang/sglang that referenced this pull request Jan 3, 2025
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
@yiakwy-xpu-ml-framework-team
Contributor

yiakwy-xpu-ml-framework-team commented Jan 3, 2025

Hi @BruceXcluding @HaiShaw Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

Thanks @zhyncs @HaiShaw. we will keep the TODO list on track for performance improvement.

Yes, the theoretical throughput is

4800 (memory transaction speed) / 37 * 1.8 (MTP multiplier) ≈ 233 tok/GPU/sec, around 1868 tok/sec for 8 cards.

There is room to improve.
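The arithmetic above works out as follows; the inputs are taken from the comment, not from measurement:

```python
bandwidth = 4800        # memory transaction speed, from the comment
model_factor = 37       # divisor used in the comment
mtp_multiplier = 1.8    # multi-token prediction multiplier

per_gpu = bandwidth / model_factor * mtp_multiplier
print(int(per_gpu))      # ~233 tok/gpu/sec
print(int(per_gpu * 8))  # ~1868 tok/sec across 8 GPUs
```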

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

Labels

amd, bug (Something isn't working), high priority
