-
I have a custom test where I send in 200 tokens and receive 200 tokens per request. I run the test 10 times and measure the time it takes the model to process all the requests. These are my throughput (tok/s) results for Llama 3.1 8B.
My understanding of data parallelism is that each prompt is sent to a separate GPU. Under an optimal scenario, this would mean my throughput for tp1 dp4 is 4x that of tp1 dp1, except at low batch sizes.
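For concreteness, a minimal sketch of this kind of test client (a hypothetical reconstruction, not the actual script; it assumes the sglang server exposes its OpenAI-compatible API on the default port 30000, and the model name is assumed):

# Hypothetical reconstruction of the custom test: send BATCH_SIZE concurrent
# requests (~200 tokens in, 200 tokens out), repeat 10 times, and report
# output tokens/second. Assumes an sglang server with its OpenAI-compatible
# endpoint at http://localhost:30000/v1.
import asyncio
import time

from openai import AsyncOpenAI

PROMPT = "word " * 200                       # rough stand-in for a ~200-token prompt
MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed served model name
BATCH_SIZE = 64


async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=200,
        temperature=0.0,
    )
    return resp.usage.completion_tokens


async def measure(client: AsyncOpenAI, batch_size: int) -> float:
    # Fire batch_size requests concurrently and compute output tok/s.
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request(client) for _ in range(batch_size)])
    return sum(tokens) / (time.perf_counter() - start)


async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    for i in range(10):  # 10 repetitions, as in the test described above
        print(f"run {i}: {await measure(client, BATCH_SIZE):.1f} tok/s")


if __name__ == "__main__":
    asyncio.run(main())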
-
Yes, your understanding is correct. However, to saturate the server you need a very large batch size. Can you try making the batch size larger? If you can share the scripts for your custom test, I can take a look and help you debug why you are not getting the optimal speedup.
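One concrete way to check saturation is to sweep the request concurrency and watch where throughput stops scaling. A self-contained sketch along those lines (same assumed endpoint and model name as the client sketch above, not sglang's built-in benchmark):

# Sweep the number of concurrent requests to find where the server saturates.
# Assumes an sglang server with its OpenAI-compatible API on port 30000 and an
# assumed model name; throughput should stop scaling once the server is saturated.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")


def one_request(_) -> int:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed served model name
        messages=[{"role": "user", "content": "word " * 200}],
        max_tokens=200,
        temperature=0.0,
    )
    return resp.usage.completion_tokens


for batch_size in (1, 8, 32, 128, 512):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        tokens = sum(pool.map(one_request, range(batch_size)))
    tput = tokens / (time.perf_counter() - start)
    print(f"batch_size={batch_size:4d} -> {tput:.1f} tok/s")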
-
The workload in your benchmark scripts is too lightweight. There are also some bottlenecks in your benchmark scripts that prevent them from correctly measuring the time cost at large batch sizes. I suggest using the built-in benchmark scripts in sglang. To show the ideal speedup of data parallelism, we need heavy workloads that fully saturate the server. You can try this benchmark command.
With tp=1, dp=1 (python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B), I got
With tp=1, dp=4 (
So it is close to 4x speedup.
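The dp=4 launch command and the benchmark command above were truncated; a hedged guess at what they might look like is shown below. The flag names (--tp-size, --dp-size, and the sglang.bench_serving options) are written from memory and can differ between sglang versions, so check python -m sglang.launch_server --help and python -m sglang.bench_serving --help rather than treating these as the exact commands used.

# assumed dp=4 launch, analogous to the dp=1 command quoted above
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --tp-size 1 --dp-size 4

# illustrative heavy random workload to saturate the server (flag names and values assumed)
python -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 1024 --num-prompts 3000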