benchmark performance #5

Open
BaiStone2017 opened this issue Nov 28, 2024 · 3 comments

@BaiStone2017

In https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md, was the "Non-disaggregated" performance measured on 2 A10 GPUs?

@ShangmingCai
Collaborator

> In https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md, was the "Non-disaggregated" performance measured on 2 A10 GPUs?

Currently, the non-disaggregated baseline is run on 1 A10. Its purpose is to compare TTFT latency and to verify the feasibility of inter-node disaggregated designs. To compare the total throughput of non-disaggregated and disaggregated designs fairly, we need to run experiments under specific prefill/decode workloads that fully utilize the prefill node. However, we have not yet found a way to fairly compare 2 non-disaggregated instances against 1 prefill + 1 decode instance without running into OOM.
According to the author of PR 8498,

"for disagg prefill it will have lower throughput compared to chunked prefill if the prefill workload / decode workload doesn’t match # of prefill GPUs / # of decode GPUs. In my current implementation, the # of prefill GPU / # of decode GPU is 1:1, but the prefill workload / decode workload is typically a really small number (roughly 0.1 IIRC)."

After we solve the TP problem, we will run a series of experiments with different prefill/decode GPU ratios. If you are interested, you can also join vLLM's Slack channel on prefill disaggregation to get the latest updates.
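To make the ratio argument in the quote above concrete, here is a rough back-of-the-envelope sketch. It is not taken from PR 8498 or from the Mooncake benchmark scripts. It assumes a deliberately simple model: every request costs a fixed amount of prefill GPU-time and decode GPU-time, GPUs are perfectly shareable, and batching effects, KV cache transfer, and memory limits are ignored. The names `estimate`, `prefill_cost`, and `decode_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    disagg_throughput: float      # requests per unit time, disaggregated
    colocated_throughput: float   # requests per unit time, all GPUs shared
    prefill_utilization: float    # busy fraction of the prefill GPUs
    decode_utilization: float     # busy fraction of the decode GPUs


def estimate(prefill_gpus: int, decode_gpus: int,
             prefill_cost: float, decode_cost: float) -> Estimate:
    """Toy throughput model (an assumption, not a measurement).

    prefill_cost / decode_cost are GPU-time units per request, so
    prefill_cost / decode_cost is the "prefill workload / decode workload"
    ratio mentioned in the quote.
    """
    # Each pool can serve at most (number of GPUs / cost per request).
    prefill_capacity = prefill_gpus / prefill_cost
    decode_capacity = decode_gpus / decode_cost

    # A disaggregated pipeline is limited by its slower stage.
    disagg = min(prefill_capacity, decode_capacity)

    # Idealized colocated setup: every GPU can do both kinds of work.
    colocated = (prefill_gpus + decode_gpus) / (prefill_cost + decode_cost)

    return Estimate(
        disagg_throughput=disagg,
        colocated_throughput=colocated,
        prefill_utilization=disagg / prefill_capacity,
        decode_utilization=disagg / decode_capacity,
    )


if __name__ == "__main__":
    # 1 prefill GPU + 1 decode GPU, workload ratio of roughly 0.1.
    print(estimate(prefill_gpus=1, decode_gpus=1,
                   prefill_cost=0.1, decode_cost=1.0))
    # GPU ratio chosen to match the workload ratio (1:10).
    print(estimate(prefill_gpus=1, decode_gpus=10,
                   prefill_cost=0.1, decode_cost=1.0))
```

Under these assumptions, a 1:1 GPU split with a 0.1 workload ratio keeps the prefill GPU busy only about 10% of the time, and the disaggregated setup trails the idealized colocated one. When the GPU ratio matches the workload ratio, the two estimates coincide. Real results will differ, which is why the experiments with different GPU ratios are needed.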

@Edenzzzz

Are there benchmark comparisons against NCCL?

@ShangmingCai
Collaborator

> Are there benchmark comparisons against NCCL?

We currently cannot obtain inter-node disaggregated results with NCCL based on PR 8498, because its parallel_state initialization of disagg_group conflicts with vLLM's process_group. This could be fixed with the help of PR 10072, which has already been merged. We will provide more results once we finish integrating mooncake_transfer_engine with PR 10502.
