-
I have read the blog "Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)". It says that with the small model Llama-8B, SGLang is noticeably faster than vLLM under the settings it reports. However, I just ran the benchmark myself and found that vLLM doesn't perform that badly. I also used Llama-8B and set the same configuration.

My environment:

Benchmark file I used:
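For reference, the kind of environment detail that matters for this comparison (GPU model, driver, and the sglang / vllm / torch versions) can be collected with commands along these lines; this is only a generic sketch, not the exact setup from this run:

# GPU model, driver, and the CUDA version reported by the driver
nvidia-smi
# Installed versions of the two serving stacks being compared
pip show sglang vllm
# PyTorch version and the CUDA version it was built against
python -c "import torch; print(torch.__version__, torch.version.cuda)"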
-
Ref: https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2
-
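# Launch the server for the backend you want to test (one at a time):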
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests
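# Then run the matching benchmark client against the running server: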
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512
python3 -m sglang.bench_serving --backend vllm --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512

And then, compare the output throughput.
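A minimal sketch of scripting that comparison, assuming the benchmark client prints a metric line labeled something like "Output token throughput" (the exact label may vary between sglang versions):

# With the corresponding server already running (as launched above),
# run each benchmark and keep the full client output for later comparison
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 | tee sglang_bench.log
python3 -m sglang.bench_serving --backend vllm --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 | tee vllm_bench.log
# Pull the throughput lines out of both logs; adjust the pattern to the label your version prints
grep -i "output token throughput" sglang_bench.log vllm_bench.log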
-
I'll close this discussion for now. If there are any further questions, it can be reopened. Thanks!