-
I have read the blog "Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)". It says that with the small model Llama-8B, SGLang is noticeably faster than vLLM under the settings it reports. However, I just ran the benchmark myself and found that vLLM doesn't perform that badly. I also used Llama-8B and set the same configuration.

My environment:

Benchmark file I used:
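For reference, the kind of environment detail that matters for this comparison (GPU model, driver, and the sglang / vllm / torch versions) can be collected with commands along these lines; this is only a generic sketch, not the exact setup from this run:

# GPU model, driver, and the CUDA version reported by the driver
nvidia-smi
# Installed versions of the two serving stacks being compared
pip show sglang vllm
# PyTorch version and the CUDA version it was built against
python -c "import torch; print(torch.__version__, torch.version.cuda)"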
-
Ref: https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2
-
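# Launch the server for the backend you want to test (one at a time):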
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests
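# Then run the matching benchmark client against the running server: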
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512
python3 -m sglang.bench_serving --backend vllm --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512

And then, compare the output throughput.
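A minimal sketch of scripting that comparison, assuming the benchmark client prints a metric line labeled something like "Output token throughput" (the exact label may vary between sglang versions):

# With the corresponding server already running (as launched above),
# run each benchmark and keep the full client output for later comparison
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 | tee sglang_bench.log
python3 -m sglang.bench_serving --backend vllm --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 | tee vllm_bench.log
# Pull the throughput lines out of both logs; adjust the pattern to the label your version prints
grep -i "output token throughput" sglang_bench.log vllm_bench.log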
-
I'll close this discussion for now. If there are any further questions, it can be reopened. Thanks!