As quoted in the question, the Mooncake document states: the option --tensor_parallel_size / -tp is now supported, but you need to set --enforce_eager to disable CUDA graph. Example: append -tp 2 --enforce_eager to the run command.
My question is: why is it necessary to disable CUDA graph, given that this costs performance? What are the technical difficulties in getting CUDA graph capture to work in this setup?
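For reference, a minimal sketch of the workaround the document describes, assuming the standard vllm serve entry point; the model name is a placeholder and not taken from the Mooncake document:

```bash
# Illustrative run command only: tensor parallelism across 2 GPUs,
# with CUDA graph disabled via --enforce-eager (eager-mode execution).
# Model name and entry point are assumptions, not from the Mooncake docs.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --enforce-eager
```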
This is a legacy issue. In previous versions, when tp=4 and we enabled the disaggregated prefilling distributed environment and CUDA graph at the same time, errors would come up during CUDA graph initialization. After testing, this issue has been fixed in recent versions of vLLM.
I will update the document accordingly. Thank you for raising this.
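In other words, on recent vLLM versions the same launch should work with CUDA graph left enabled, i.e. without --enforce-eager (same assumptions as the sketch above; the model name is a placeholder):

```bash
# With recent vLLM, CUDA graph can stay enabled: omit --enforce-eager.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2
```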