Anything you want to discuss about vllm.
Issue
I have been experimenting with CUDA Graph-captured generation for my own transformer model implementation, using the custom all-reduce in vLLM as a replacement for the PyTorch all-reduce. CUDA Graph capture worked well until I tried one particular parallel strategy (tensor parallel = pipeline parallel = data parallel = 2, on 8 GPUs). In that configuration, generation randomly hangs when replaying the captured graph. The problem does not appear with any other 8-GPU parallel strategy. Has anyone encountered this before?

I also observed that the custom all-reduce uses `cross_device_reduce_1stage` (rather than `cross_device_reduce_2stage`) only when `world_size=2`, or for small payloads when `world_size>2`. Could this be the root cause of the problem? Thanks in advance for your answers!
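For reference, below is a minimal sketch of the capture/replay pattern I am describing. It uses the stock `torch.distributed` NCCL all-reduce rather than vLLM's custom all-reduce kernel, so it only illustrates the general CUDA Graph + collective pattern, not the exact code path that hangs. The launch method (`torchrun`), buffer size, and warm-up/replay iteration counts are just assumptions for the sketch.

```python
# Minimal sketch: capture an all-reduce into a CUDA Graph and replay it.
# Uses the stock torch.distributed NCCL all-reduce, NOT vLLM's custom all-reduce.
# Launch with, e.g.:  torchrun --nproc_per_node=2 repro_sketch.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    # Static buffer that is reused for capture and for every replay.
    static_buf = torch.ones(1024, device="cuda")

    # Warm up the NCCL communicator on a side stream before capture,
    # as recommended for capturing collectives in CUDA Graphs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            dist.all_reduce(static_buf)
    torch.cuda.current_stream().wait_stream(s)

    # Capture a single all-reduce into a CUDA Graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        dist.all_reduce(static_buf)

    # Replay: copy new data into the static buffer, then replay the graph.
    for step in range(10):
        static_buf.copy_(torch.full((1024,), float(step), device="cuda"))
        graph.replay()
        torch.cuda.synchronize()
        if rank == 0:
            print(f"step {step}: value after all-reduce = {static_buf[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In my actual setup the captured region is the whole decode step of the model, and the all-reduce inside it is vLLM's custom all-reduce rather than the `torch.distributed` call shown here; the hang only shows up with the TP=PP=DP=2 layout described above.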