[Feature, Hardware] Enable DeepseekV3 on AMD GPUs #2601
HaiShaw merged 27 commits into sgl-project:main from
Conversation
@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?
That would be nice. I plan to release v0.4.1.post1 soon so that users can try AMD MI300X initially.
Hi @BruceXcluding, could you help fix the failing CIs? Ref: https://github.com/sgl-project/sglang/blob/main/docs/references/contributor_guide.md#format-your-code
HaiShaw left a comment:
@BruceXcluding
Some to address, thanks!
```diff
 ENV NCCL_MIN_NCHANNELS=112

-ENV MOE_PADDING=1
+ENV MOE_PADDING=0
```
We need to keep MOE_PADDING on for performance; instead of turning it off, we need to fix the error it incurs.
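For context, MOE_PADDING pads the MoE weights' inner dimension so shapes align with GPU tile sizes; a minimal pure-Python sketch of the idea (the helper names and the multiple of 256 are assumptions, not the actual sglang code):

```python
def pad_to_multiple(dim: int, multiple: int = 256) -> int:
    """Round dim up to the next multiple of `multiple` (hypothetical helper)."""
    return ((dim + multiple - 1) // multiple) * multiple

def padded_moe_weight_shape(num_experts: int, intermediate: int, hidden: int,
                            multiple: int = 256):
    # Pad only the intermediate dimension; the extra rows are zero-filled
    # at load time, so the matmul result is unchanged.
    return (num_experts, pad_to_multiple(intermediate, multiple), hidden)
```

Aligned shapes let the fused MoE kernels use full tiles instead of masked tail iterations, which is where the performance comes from.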
```diff
     logit_cap,
 ):
-    BLOCK = 32
+    BLOCK = 16 if is_hip() else 32
```
We should not cut the block size in half for HIP globally here.
It doesn't work well in the latest vLLM with BLOCK = 32.
We cannot take this part as is: it would cost all other models a large margin of performance.
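One hedged alternative to a global halving: scope the smaller block to the cases that actually overflow shared memory on HIP. A sketch (the policy and the threshold are assumptions; kv_lora_rank = 512 comes from the DeepseekV3 model config):

```python
def choose_decode_block(on_hip: bool, kv_lora_rank: int = 0) -> int:
    """Pick the triton decode-attention BLOCK size (hypothetical policy).

    Keep BLOCK = 32 everywhere, dropping to 16 on HIP only for MLA models
    whose large kv_lora_rank would otherwise trip tl.OutOfResources.
    """
    if on_hip and kv_lora_rank >= 512:
        return 16
    return 32
```

This way non-MLA models on HIP keep the original block size and their performance.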
```diff
 # WEIGHT
 weight_dtype = (
-    torch.float8_e4m3fn
+    torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
```
We should not have this; the serialized weight is always OCP (torch.float8_e4m3fn).
With torch.float8_e4m3fn, w8a8_block_fp8_matmul hits this error:

python/sglang/srt/layers/quantization/fp8_kernel.py:176:33: error: Unsupported conversion from 'f8E4M3FN' to 'f16'
    accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]

Please check how normalize_e4m3fn_to_e4m3fnuz is used. Basically, we do not expect a non-OCP (non-e4m3fn) dtype in the quantized model.
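For reference, normalize_e4m3fn_to_e4m3fnuz (as used in vLLM) keeps the checkpoint in OCP e4m3fn and converts at load time: the bit pattern 0x80 is -0 in e4m3fn but NaN in e4m3fnuz, so it is zeroed out, and the weight scale is doubled because e4m3fnuz represents half the magnitude for the same bits. A pure-Python sketch of the bit manipulation (the real implementation operates on torch tensors):

```python
def normalize_e4m3fn_to_e4m3fnuz(weight_bits, weight_scale):
    """Reinterpret e4m3fn weight bytes under e4m3fnuz semantics.

    weight_bits: list of raw 8-bit patterns (ints in 0..255).
    Returns (new_bits, new_scale): 0x80 (-0 in fn, NaN in fnuz) becomes +0,
    and the dequant scale is doubled to compensate for fnuz's halved range.
    """
    new_bits = [0x00 if b == 0x80 else b for b in weight_bits]
    return new_bits, weight_scale * 2.0
```

With this, the HIP path can consume standard OCP checkpoints without changing the serialized dtype.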
```diff
     is_marlin: bool,
 ) -> Dict[str, int]:
-    if dtype == "fp8_w8a8":
+    if dtype == "fp8_w8a8" and not is_hip():
```
The following block isn't a blocker for HIP.
@BruceXcluding
Also seeing this error with your version of pyproject.toml:

File "/dockerx/1226/HS/sglang/python/sglang/srt/constrained/outlines_backend.py", line 23, in <module>
    from outlines.fsm.json_schema import build_regex_from_schema
ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema' (/usr/local/lib/python3.12/dist-packages/outlines/fsm/json_schema.py)
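One way to avoid this import error is to pin outlines below the release that moved build_regex_from_schema; a sketch of the dependency bound (the exact version range is an assumption — check the pin on sglang main):

```toml
[project]
dependencies = [
    # build_regex_from_schema moved out of outlines.fsm.json_schema in 0.1.x
    "outlines>=0.0.44,<0.1.0",
]
```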
The CI failure (PR Test / unit-test-backend-2-gpu) used a lite model, 'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', which doesn't have the fp8 block-level quant feature.
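A hedged sketch of how a test could skip the fp8 block-quant path for such models: DeepSeek-V3-style checkpoints carry a weight_block_size entry (e.g. [128, 128]) in their quantization_config, while the Lite coder model has no quantization_config at all (the function name is hypothetical):

```python
def has_fp8_block_quant(quantization_config) -> bool:
    """Return True when a HF quantization_config declares block-wise fp8.

    quantization_config: the dict from config.json, or None when absent.
    """
    return bool(quantization_config) and "weight_block_size" in quantization_config
```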
```diff
 if self.quant_config.is_checkpoint_fp8_serialized:
-    params_dtype = torch.float8_e4m3fn
+    params_dtype = torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
```
Same problem here; check the previous usage of normalize_e4m3fn_to_e4m3fnuz.
@AdjectiveAllison Can you try with the latest instructions?
```diff
 os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
-if "GLOO_SOCKET_IFNAME" not in os.environ:
-    os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
+# TODO(fix socket error with gpu backend)
```
Why is this commented out?
Is this used for the CPU backend or for a specific workstation? We get: RuntimeError: [enforce fail at pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0
This is used for multi-node tensor parallelism. Instead of commenting it out, we suggest adding an is_hip flag.
I think the value of the GLOO_SOCKET_IFNAME environment variable should depend on the name of the network interface in each user's system and should not be hard-coded as eth0.
@wufann If the user's interface is not eth0, they should specify it explicitly; the eth0 default applies only when no setting is provided.
@zhyncs A different network interface (e.g. "ens") may be used. They may also test in a single-node environment where IB is not configured; in that case IB should be disabled.
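The suggestion above — respect a user-provided value and only default when nothing is set, without hard-coding eth0 — could look like this (the /sys/class/net scan is an assumption, not the sglang code):

```python
import os

def default_gloo_ifname(sys_net: str = "/sys/class/net") -> str:
    """Respect an existing GLOO_SOCKET_IFNAME; otherwise pick a default.

    Falls back to the first non-loopback interface under /sys/class/net,
    or eth0 when none can be found, so multi-node TP still gets a value.
    """
    if os.environ.get("GLOO_SOCKET_IFNAME"):
        return os.environ["GLOO_SOCKET_IFNAME"]
    try:
        names = sorted(n for n in os.listdir(sys_net) if n != "lo")
    except OSError:
        names = []
    choice = names[0] if names else "eth0"
    os.environ["GLOO_SOCKET_IFNAME"] = choice
    return choice
```

Users with an "ens" NIC, or a single-node setup without IB, can still export GLOO_SOCKET_IFNAME themselves and the function will not override it.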
```diff
@@ -0,0 +1,51 @@
+cmake_minimum_required(VERSION 3.18)
```
Please remove this; we only use CMakeLists.txt for clangd indexing, so it's not necessary.
```diff
 build-backend = "setuptools.build_meta"

 [project]
 name = "sgl-kernel"
```
Can we follow the setup of flash-attention or vllm, which are compatible with both NVIDIA and AMD?
https://github.com/Dao-AILab/flash-attention/blob/main/setup.py
https://github.com/vllm-project/vllm/blob/main/setup.py
Hi @BruceXcluding @HaiShaw
Tested and it works well. We could build sgl-kernel-amd after we add the CK kernels.
@BruceXcluding, how was the performance compared to
@zhyncs I am expecting @BruceXcluding to do the final update. |
Thanks @zhyncs @HaiShaw. We will keep the TODO list on track for performance improvements.
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Bruce Xue <yigex@xilinx.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
Yes, theoretical throughput is
There is room for improvement.
Motivation
Modifications
- e4m3fnuz to support the DeepseekV3 FP8 model
- FlashInfer backend bmm_fp8 to cast FP8 to BF16 in MLA
- triton stages config

TODO
How to run
build env
offline:
server:
Issues
- raise OutOfResources(self.metadata.shared, max_shared, "shared memory"), same as [Bug] Deepseek-v2-lite AMD MI300 run failed #2384. Solved with python/sglang/srt/layers/attention/triton_ops/decode_attention.py +410
- ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema', same as [Bug] SGLang v0.4.0 with AMD MI300X #2530. Solved with a vllm downgrade
- Socket error: solved by checking your eth number with ifconfig and export GLOO_SOCKET_IFNAME=your eth

Checklist