[Don't merge] Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices #11854

Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Introduction
We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and provide access to our code, we have released this pull request in the DeepSeek-R1 repository.
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
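The pull command itself is not spelled out above; based on the package URL, it likely takes the following form. The `latest` tag is a placeholder assumption — check the package page for the actual published tags.

```bash
# Assumed pull command derived from the ghcr.io package URL above.
# The tag "latest" is a placeholder; consult the package page for real tags.
docker pull ghcr.io/antgroup/sglang:latest
```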
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.

Launching SGLang
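Before launching, a quick sanity check that the expected variables are actually present in the shell can save a failed startup. This is a sketch; the variable names are assumptions drawn from the decode launch command later in this document, not an exhaustive list of what /root/env.sh sets.

```bash
# Sketch: report any required variable that is not set in the current shell.
# The variable list below is an assumption taken from the decode launch
# command in this document -- extend it to match your env.sh.
check_env() {
  for v in "$@"; do
    eval "val=\${$v:-}"          # POSIX-compatible indirect expansion
    if [ -z "$val" ]; then
      echo "missing: $v"
    fi
  done
}

check_env SGL_ENABLE_JIT_DEEPGEMM SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK ENABLE_SWAPAB
```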
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
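The exact `docker run` invocation is not given above; the following is a hypothetical sketch for starting one such container. The image tag, container name, and mount path are assumptions, and the GPU/host-network/IPC/privileged flags are typical for RDMA-based multi-node serving — adjust them for your environment.

```bash
# Hypothetical container start -- image tag, name, and mount path are
# assumptions. Host networking, shared IPC, and privileged mode are commonly
# needed for NCCL/RDMA traffic between nodes.
docker run -d --name sglang-decode-0 \
  --gpus all \
  --network host \
  --ipc host \
  --privileged \
  -v /path/to/DeepSeek-R1:/path/to/DeepSeek-R1 \
  ghcr.io/antgroup/sglang:latest \
  sleep infinity
```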
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
2. Launching Decode Nodes
Note:
- Set {node_rank} to 0 or 1 for the respective node.
- Replace {decode_master_ip} with the IP address of Node 0.

Node-0
```bash
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
  --disaggregation-bootstrap-port 9000 \
  --attention-backend flashmla \
  --host 0.0.0.0 \
  --port 61001 \
  --trust-remote-code \
  --dist-init-addr {decode_master_ip}:62001 \
  --nnodes 2 \
  --node-rank {node_rank} \
  --tp-size 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --mem-fraction-static 0.88 \
  --max-running-requests 512 \
  --context-length 65535 \
  --log-level info \
  --decode-log-interval 50 \
  --page-size 64 \
  --schedule-conservativeness 0.3 \
  --enable-cache-report \
  --moe-dense-tp-size 1 \
  --enable-deepep-moe \
  --enable-dp-lm-head \
  --cuda-graph-max-bs 32 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --init-expert-location /root/expert_workload.json \
  --prefill-round-robin-balance \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --deepep-mode low_latency_overlap \
  --enable-single-batch-overlap \
  > /home/admin/logs/stdout.log 2>&1 &
```

3. Launching SGLang Router
Note:
- Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.

```bash
nohup python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --host 0.0.0.0 \
  --decode http://{decode_master_ip}:61001 \
  --port 8000 \
  --prefill http://{prefill_node_0_ip}:61001 \
  --prefill http://{prefill_node_1_ip}:61001 \
  > /home/admin/logs/router.log 2>&1 &
```

Testing
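Before running the full benchmark, a quick smoke test against the router confirms the whole prefill/decode pipeline end to end. This sketch assumes the OpenAI-compatible `/v1/chat/completions` endpoint that SGLang exposes and should be run from the router host; the `model` value is illustrative.

```bash
# Smoke test against the router launched above (run on the router host).
# The "model" field is illustrative -- SGLang serves the single loaded model.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
```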
1. Running the Benchmark
Note:
- Because --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful.
- Replace {path-to-shareGPT} with the path to the ShareGPT dataset.

```bash
nohup python3 -m sglang.bench_serving \
  --host 0.0.0.0 \
  --port 8000 \
  --dataset-path {path-to-shareGPT} \
  --num-prompt 4096 \
  --random-input 4096 \
  --random-output 1536 \
  --request-rate "inf" \
  --max-concurrency 2048 \
  --warmup-requests 0 \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  > /home/local/workspace/bench.log 2>&1 &
```

2. Observing Logs
To monitor peak performance, filter logs for entries with running-req: 32:

```bash
grep -E 'Decode batch.*running-req: 32' /home/admin/logs/sglang.log
```

Example Output (for batch size = 32):
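To aggregate the figures from such lines, the generation throughput can be extracted with a short pipeline. The sample log line in this sketch is an assumed illustration of SGLang's decode-batch log format, not output captured from this deployment.

```bash
# Sketch: pull the generation-throughput value out of a decode-batch log line.
# The sample line is an assumed illustration of the log format, not real output.
line='Decode batch. #running-req: 32, #token: 2048, token usage: 0.02, gen throughput (token/s): 1200.00, #queue-req: 0'
tps=$(echo "$line" | grep -o 'gen throughput (token/s): [0-9.]*' | awk '{print $NF}')
echo "$tps"
```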
Related PRs