Why do we need to disable CUDA graph (enforce_eager) for Tensor parallelism? #64

skyCreateXian · 2025-01-06T07:52:17Z

As the title asks. The Mooncake documentation says:
Option --tensor_parallel_size \ -tp is supported now. But you need to set up --enforce_eager to disable cuda graph. Example: append -tp 2 --enforce_eager to the run command.

My question is why it is necessary to disable CUDA graphs, since doing so costs performance. What are the technical difficulties in supporting static CUDA graphs here?

ShangmingCai · 2025-01-07T02:33:43Z

This is a legacy issue. In earlier versions, with tp=4, enabling the disaggregated prefilling distributed environment and CUDA graph at the same time caused errors during CUDA graph initialization. After testing, I can confirm this has been fixed in recent versions of vLLM.

I will update the document. Thank you for the report.
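
For illustration only, here is a minimal Python sketch of how the two variants of the run command differ. The helper name `with_tensor_parallel` and the base command are hypothetical placeholders, not taken from the Mooncake document; only the `-tp` and `--enforce_eager` flags come from the text quoted above. It assumes a vLLM version recent enough that CUDA graph initialization works with tensor parallelism, so `--enforce_eager` becomes optional.

```python
import shlex


def with_tensor_parallel(base_cmd, tp_size, enforce_eager=False):
    """Append tensor-parallel flags to an existing launch command string.

    Hypothetical helper: it only edits the command line, it does not start vLLM.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be a positive integer")
    args = shlex.split(base_cmd)
    args += ["-tp", str(tp_size)]
    if enforce_eager:
        # Eager mode disables CUDA graph capture (the old workaround).
        args.append("--enforce_eager")
    return shlex.join(args)


# Placeholder base command; substitute the real run command from the docs.
BASE_CMD = "python3 -m vllm.entrypoints.openai.api_server --model my-model"

# Old workaround: tensor parallelism with CUDA graph disabled.
print(with_tensor_parallel(BASE_CMD, 2, enforce_eager=True))
# Recent versions: tensor parallelism with CUDA graph left enabled.
print(with_tensor_parallel(BASE_CMD, 2))
```

Dropping `--enforce_eager` leaves CUDA graph capture on, which is the behavior the documentation change is meant to restore.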

ShangmingCai mentioned this issue Jan 7, 2025

[Doc] Re-enable cuda graph to improve inference performance. #67

Merged
