Anything you want to discuss about vllm.
Issue
I have been experimenting with CUDA Graph-captured generation for my own transformer model implementation, using the custom all-reduce in vLLM as a replacement for the PyTorch all-reduce. CUDA Graph capture worked well until I tried one particular parallel strategy (tensor parallel = pipeline parallel = data parallel = 2, on 8 GPUs). In that configuration, generation randomly hangs when replaying the captured graph. The problem does not appear with any other 8-GPU parallel strategy. Has anyone encountered this before?

I also observed that the custom all-reduce uses `cross_device_reduce_1stage` (rather than `cross_device_reduce_2stage`) only when `world_size=2`, or for small payloads when `world_size>2`. Could this be the root cause of the problem? Thanks in advance for your answers!
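For reference, below is a minimal sketch of the capture/replay pattern I am describing. It uses the stock `torch.distributed` NCCL all-reduce rather than vLLM's custom all-reduce kernel, so it only illustrates the general CUDA Graph + collective pattern, not the exact code path that hangs. The launch method (`torchrun`), buffer size, and warm-up/replay iteration counts are just assumptions for the sketch.

```python
# Minimal sketch: capture an all-reduce into a CUDA Graph and replay it.
# Uses the stock torch.distributed NCCL all-reduce, NOT vLLM's custom all-reduce.
# Launch with, e.g.:  torchrun --nproc_per_node=2 repro_sketch.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    # Static buffer that is reused for capture and for every replay.
    static_buf = torch.ones(1024, device="cuda")

    # Warm up the NCCL communicator on a side stream before capture,
    # as recommended for capturing collectives in CUDA Graphs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            dist.all_reduce(static_buf)
    torch.cuda.current_stream().wait_stream(s)

    # Capture a single all-reduce into a CUDA Graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        dist.all_reduce(static_buf)

    # Replay: copy new data into the static buffer, then replay the graph.
    for step in range(10):
        static_buf.copy_(torch.full((1024,), float(step), device="cuda"))
        graph.replay()
        torch.cuda.synchronize()
        if rank == 0:
            print(f"step {step}: value after all-reduce = {static_buf[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In my actual setup the captured region is the whole decode step of the model, and the all-reduce inside it is vLLM's custom all-reduce rather than the `torch.distributed` call shown here; the hang only shows up with the TP=PP=DP=2 layout described above.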