-
I have a custom test where I send in 200 tokens and receive 200 tokens per request. I run the test 10 times and measure the time it takes the model to process all the requests. These are my throughput (tok/s) results for Llama 3.1 8B.
My understanding of data parallelism is that each prompt is sent to a separate GPU. Under an optimal scenario, this would mean my throughput for tp1 dp4 is 4x that of tp1 dp1, except at low batch sizes.
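For concreteness, a minimal sketch of this kind of test client (a hypothetical reconstruction, not the actual script; it assumes the sglang server exposes its OpenAI-compatible API on the default port 30000, and the model name is assumed):

# Hypothetical reconstruction of the custom test: send BATCH_SIZE concurrent
# requests (~200 tokens in, 200 tokens out), repeat 10 times, and report
# output tokens/second. Assumes an sglang server with its OpenAI-compatible
# endpoint at http://localhost:30000/v1.
import asyncio
import time

from openai import AsyncOpenAI

PROMPT = "word " * 200                       # rough stand-in for a ~200-token prompt
MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed served model name
BATCH_SIZE = 64


async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=200,
        temperature=0.0,
    )
    return resp.usage.completion_tokens


async def measure(client: AsyncOpenAI, batch_size: int) -> float:
    # Fire batch_size requests concurrently and compute output tok/s.
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request(client) for _ in range(batch_size)])
    return sum(tokens) / (time.perf_counter() - start)


async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    for i in range(10):  # 10 repetitions, as in the test described above
        print(f"run {i}: {await measure(client, BATCH_SIZE):.1f} tok/s")


if __name__ == "__main__":
    asyncio.run(main())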
-
Yes, your understanding is correct. However, to saturate the server you need a very large batch size. Can you try making the batch size larger? If you can share the scripts for your custom test, I can take a look and help you debug why you are not getting the optimal speedup.
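One concrete way to check saturation is to sweep the request concurrency and watch where throughput stops scaling. A self-contained sketch along those lines (same assumed endpoint and model name as the client sketch above, not sglang's built-in benchmark):

# Sweep the number of concurrent requests to find where the server saturates.
# Assumes an sglang server with its OpenAI-compatible API on port 30000 and an
# assumed model name; throughput should stop scaling once the server is saturated.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")


def one_request(_) -> int:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed served model name
        messages=[{"role": "user", "content": "word " * 200}],
        max_tokens=200,
        temperature=0.0,
    )
    return resp.usage.completion_tokens


for batch_size in (1, 8, 32, 128, 512):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        tokens = sum(pool.map(one_request, range(batch_size)))
    tput = tokens / (time.perf_counter() - start)
    print(f"batch_size={batch_size:4d} -> {tput:.1f} tok/s")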
-
The workload in your benchmark scripts is too lightweight. There are also some bottlenecks in your benchmark scripts that prevent them from correctly measuring the time cost at large batch sizes. I suggest using the built-in benchmark scripts in sglang. To show the ideal speedup of data parallelism, we need heavy workloads that fully saturate the server. You can try this benchmark command.
With tp=1, dp=1 (python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B), I got
With tp=1, dp=4 (
So it is close to 4x speedup.
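The dp=4 launch command and the benchmark command above were truncated; a hedged guess at what they might look like is shown below. The flag names (--tp-size, --dp-size, and the sglang.bench_serving options) are written from memory and can differ between sglang versions, so check python -m sglang.launch_server --help and python -m sglang.bench_serving --help rather than treating these as the exact commands used.

# assumed dp=4 launch, analogous to the dp=1 command quoted above
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --tp-size 1 --dp-size 4

# illustrative heavy random workload to saturate the server (flag names and values assumed)
python -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 1024 --num-prompts 3000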