[Don't merge] Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices #4
Conversation
Reviewer's Guide
This PR implements a new one-shot multi-head attention mode for DeepSeek-V2, enriches the fused MoE Triton kernels with descriptor/TMA/filtering support, introduces Triton-based KV buffer operations in the memory pool and utils, updates config generation for down-MoE scenarios, and adds a comprehensive benchmark/tuning script for the fused MoE kernels.
Sequence diagram for one-shot MHA attention path in DeepSeek-V2
sequenceDiagram
participant FB as ForwardBatch
participant Attn as DeepseekV2AttentionMLA
participant KVPool as MLATokenToKVPool
FB->>Attn: forward_prepare(...)
Attn->>FB: _support_mha_one_shot(...)
alt MHA_ONE_SHOT supported
Attn->>Attn: forward_normal_one_shot_prepare(...)
Attn->>FB: fetch_mha_one_shot_kv_indices()
Attn->>KVPool: get_mla_kv_buffer(...)
KVPool-->>Attn: (kv_a, k_pe)
Attn->>Attn: forward_normal_one_shot_core(...)
else fallback
Attn->>Attn: forward_normal_chunked_kv_prepare(...)
end
Sequence diagram for fused MoE Triton kernel invocation with TMA/descriptor support
sequenceDiagram
participant Worker as BenchmarkWorker
participant FusedMoE as FusedMoE
participant Kernel as TritonKernel
Worker->>FusedMoE: benchmark(...)
FusedMoE->>Kernel: invoke_fused_moe_kernel(..., a_desc, b_desc, filter_expert)
Kernel-->>FusedMoE: (results)
FusedMoE-->>Worker: (latency results)
Class diagram for new and updated DeepSeek-V2 attention and MoE classes
classDiagram
class AttnForwardMethod {
+MHA_CHUNKED_KV
+MHA_ONE_SHOT
+MLA_FUSED_ROPE
}
class DeepseekV2AttentionMLA {
+kv_cache_dtype
+forward_normal_one_shot_prepare()
+forward_normal_one_shot_core()
+_set_mla_kv_buffer()
+_get_mla_kv_buffer()
+_concat_and_cast_mha_k()
}
class ForwardBatch {
+mha_one_shot_kv_indices
+mha_one_shot
+fetch_mha_one_shot_kv_indices()
}
class MLATokenToKVPool {
+get_mla_kv_buffer()
}
AttnForwardMethod <|-- DeepseekV2AttentionMLA
DeepseekV2AttentionMLA <.. ForwardBatch
ForwardBatch <.. MLATokenToKVPool
Class diagram for Fused MoE Triton kernel and config changes
classDiagram
class BenchmarkWorker {
+benchmark()
+tune()
}
class BestConfigTrace {
+update()
+total_time
+config_dict()
}
class MoeRunnerConfig {
+inplace
+num_experts
+num_local_experts
}
class FusedMoE {
+fused_experts_impl(..., filter_expert)
}
class FusedMoEConfig {
+get_config_file_name(..., down_moe)
+get_moe_configs(..., down_moe)
+try_get_optimal_moe_config(..., return_down_config)
}
BenchmarkWorker <.. BestConfigTrace
FusedMoEConfig <.. FusedMoE
File-Level Changes
Sorry to bother you, but is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also supported?
Yes, V3 is also supported.
Thanks for your reply. When I try to reproduce your work, I run into a problem with the DeepGEMM library. Could you share the link to the DeepGEMM version you use? In this repository, I tried the
Hello, I have successfully run it with your configuration, and the performance is very good. However, there is an issue: while it is running, the SGLang logs are not printed. Even when I set
Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Introduction
We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and provide access to our code, we have released this pull request in the DeepSeek-R1 repository.
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
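As a minimal sketch, assuming the package above is published to GitHub Container Registry under the antgroup organization, the pull would look like the following (take the exact tag from the package page linked above):
# Pull the image from GHCR; replace <tag> with the tag listed on the package page
docker pull ghcr.io/antgroup/sglang:<tag>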
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
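For example, inside each container you can load the provided variables and review them before launching. This is only a sketch; the grep pattern below is illustrative rather than exhaustive:
# Load the H20-specific environment and list SGLang/DeepEP/NCCL-related variables
source /root/env.sh
env | grep -E 'SGL|DEEPEP|NCCL'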
Launching SGLang
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
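The container launch command is environment-specific and not prescribed by this guide. The following is a rough sketch, assuming host networking, access to all GPUs, and a privileged container for the RDMA NICs used by Mooncake/DeepEP; the image tag, container name, and model mount path are placeholders to adapt:
# Start one container per node (repeat on each prefill/decode machine)
docker run -d --name sglang-node \
  --gpus all --network host --ipc host --privileged \
  --shm-size 32g \
  -v /path/to/DeepSeek-R1:/path/to/DeepSeek-R1 \
  ghcr.io/antgroup/sglang:<tag> \
  sleep infinity
The SGLang launch commands in the following sections are then run inside the corresponding container (for example via docker exec).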
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
2. Launching Decode Nodes
Note:
Set {node_rank} to 0 or 1 for the respective node.
Replace {decode_master_ip} with the IP address of Node 0.
Node-0
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
  --disaggregation-bootstrap-port 9000 \
  --attention-backend flashmla \
  --host 0.0.0.0 \
  --port 61001 \
  --trust-remote-code \
  --dist-init-addr {decode_master_ip}:62001 \
  --nnodes 2 \
  --node-rank {node_rank} \
  --tp-size 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --mem-fraction-static 0.88 \
  --max-running-requests 512 \
  --context-length 65535 \
  --log-level info \
  --decode-log-interval 50 \
  --page-size 64 \
  --schedule-conservativeness 0.3 \
  --enable-cache-report \
  --moe-dense-tp-size 1 \
  --enable-deepep-moe \
  --enable-dp-lm-head \
  --cuda-graph-max-bs 32 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --init-expert-location /root/expert_workload.json \
  --prefill-round-robin-balance \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --deepep-mode low_latency_overlap \
  --enable-single-batch-overlap \
  > /home/admin/logs/stdout.log 2>&1 &
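Before starting the router, it can be useful to confirm that the decode endpoint is reachable. A quick sketch, assuming SGLang's standard /health endpoint and that both decode nodes have finished loading the model:
# Prints the HTTP status code; expect 200 once the decode master node is ready
curl -s -o /dev/null -w '%{http_code}\n' http://{decode_master_ip}:61001/health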
3. Launching SGLang Router
Note:
Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
nohup python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --host 0.0.0.0 \
  --decode http://{decode_master_ip}:61001 \
  --port 8000 \
  --prefill http://{prefill_node_0_ip}:61001 \
  --prefill http://{prefill_node_1_ip}:61001 \
  > /home/admin/logs/router.log 2>&1 &
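Once the router is up, a quick end-to-end smoke test can be sent through it before launching the full benchmark. A minimal sketch, assuming the router forwards SGLang's native /generate endpoint (the same backend the benchmark below uses via --backend sglang); run it on the router node or replace 127.0.0.1 with its IP:
# Send one short request through the router on port 8000
curl -s http://127.0.0.1:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'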
Testing
1. Running the Benchmark
Note:
Because --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful.
Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
nohup python3 -m sglang.bench_serving \
  --host 0.0.0.0 \
  --port 8000 \
  --dataset-path {path-to-shareGPT} \
  --num-prompt 4096 \
  --random-input 4096 \
  --random-output 1536 \
  --request-rate "inf" \
  --max-concurrency 2048 \
  --warmup-requests 0 \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  > /home/local/workspace/bench.log 2>&1 &
2. Observing Logs
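While the benchmark is running, the decode log can also be followed live to watch batch sizes and throughput. A simple sketch, assuming SGLang's output is written to the same log path used in the grep command below:
# Stream decode-batch log lines as they are produced
tail -f /home/admin/logs/sglang.log | grep --line-buffered 'Decode batch'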
To monitor peak performance, filter the logs for entries with running-req: 32:
grep -E 'Decode batch.*running-req: 32' /home/admin/logs/sglang.log
Example Output (for batch size = 32):
Related PRs
Summary by Sourcery
Enable a one-shot multi-head attention path and TMA-based MoE optimizations across SGLang’s DeepSeek and fused MoE kernels, and add a script for auto-tuning Triton MoE configurations.
New Features:
Enhancements:
Chores: