Hang and cuBLASMp error in pmatmul.cu #241
Comments
Dear cuBLASMp experts, a customer reports two problems with their setup. Issue 1 sometimes hangs. Issue 2 crashes with an illegal memory access.
Hi @GuoxiaWang, Thanks for reaching out! I couldn't reproduce the first issue, so I suspect it is related to the environment. Can you please try the following:
As for the second issue, I could reproduce it. I still need to debug it, but I have a workaround to unblock you. To achieve better performance with AG+GEMM, B can be allocated using nvshmem_malloc. Doing so resolves the issue, i.e. with the following changes:
Please let me know if the above solves both issues for you. Thanks,
Hi @GuoxiaWang, I've debugged the issue and found that the problem is in the sample: the C matrix allocation was wrong and insufficient. I've fixed the sample. FYI, the performance improvement suggestion in my previous comment is still valid. Please give it a try and keep me posted. Thanks,
@almogsegal However, I have found an issue with the PaddlePaddle weight layout: cuBLASMp only supports the forward computation and lacks APIs for the backward gradients. Implementing the backward pass ourselves is complicated by layout constraints. For example, the backward computation of ReduceScatter requires Transpose operations, and calculating Dw and Dx needs either two AllGatherMatmuls, or one AllGather(Dy) plus a Matmul and one AllGatherMatmul.
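To make the gradient shapes in the comment above concrete, here is a minimal host-only sketch. It makes no cuBLASMp, NVSHMEM, or CUDA calls. It assumes the forward pass is a row-parallel Matmul + ReduceScatter, Y = ReduceScatter(sum over ranks of X_r * W_r), where rank r holds a column block X_r of X, a row block W_r of W, and a row block of Y and Dy. That layout and every name in the code (`Mat`, `all_gather_rows`, `backward_shard_error`, and so on) are assumptions for illustration, not taken from the thread or the library. Under those assumptions it checks the "one AllGather(Dy)" variant: Dx_r = AllGather(Dy) * W_r^T and Dw_r = X_r^T * AllGather(Dy), both of which need a transposed operand.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense row-major matrix, used only for this host-side illustration.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> v;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return v[i * cols + j]; }
};

// Deterministic small-integer test data, so all arithmetic below is exact.
Mat fill(std::size_t r, std::size_t c, std::size_t seed) {
    Mat m(r, c);
    for (std::size_t i = 0; i < r; ++i)
        for (std::size_t j = 0; j < c; ++j)
            m.at(i, j) = static_cast<double>(((i * c + j) * seed) % 7) - 3.0;
    return m;
}

Mat matmul(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
    return c;
}

Mat transpose(const Mat& a) {
    Mat t(a.cols, a.rows);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j) t.at(j, i) = a.at(i, j);
    return t;
}

// Copy the nr x nc sub-block of `a` that starts at (r0, c0).
Mat block(const Mat& a, std::size_t r0, std::size_t c0, std::size_t nr, std::size_t nc) {
    assert(r0 + nr <= a.rows && c0 + nc <= a.cols);
    Mat b(nr, nc);
    for (std::size_t i = 0; i < nr; ++i)
        for (std::size_t j = 0; j < nc; ++j) b.at(i, j) = a.at(r0 + i, c0 + j);
    return b;
}

// Simulated AllGather: stack the per-rank row blocks in rank order.
Mat all_gather_rows(const std::vector<Mat>& shards) {
    std::size_t rows = 0;
    for (const Mat& s : shards) rows += s.rows;
    Mat out(rows, shards.front().cols);
    std::size_t r0 = 0;
    for (const Mat& s : shards) {
        for (std::size_t i = 0; i < s.rows; ++i)
            for (std::size_t j = 0; j < s.cols; ++j) out.at(r0 + i, j) = s.at(i, j);
        r0 += s.rows;
    }
    return out;
}

double max_abs_diff(const Mat& a, const Mat& b) {
    assert(a.rows == b.rows && a.cols == b.cols);
    double d = 0.0;
    for (std::size_t i = 0; i < a.v.size(); ++i) d = std::max(d, std::fabs(a.v[i] - b.v[i]));
    return d;
}

// Compare the per-rank gradients against the unsharded reference for
// Y = X * W with X: M x K, W: K x N, split over P simulated ranks.
double backward_shard_error(std::size_t P, std::size_t M, std::size_t K, std::size_t N) {
    assert(P > 0 && M % P == 0 && K % P == 0);
    const std::size_t mb = M / P, kb = K / P;
    const Mat X = fill(M, K, 3), W = fill(K, N, 5), dY = fill(M, N, 2);
    const Mat dX_ref = matmul(dY, transpose(W));  // Dx = Dy * W^T
    const Mat dW_ref = matmul(transpose(X), dY);  // Dw = X^T * Dy

    std::vector<Mat> dY_shards;                   // what each rank owns after ReduceScatter
    for (std::size_t r = 0; r < P; ++r) dY_shards.push_back(block(dY, r * mb, 0, mb, N));
    const Mat dY_full = all_gather_rows(dY_shards);  // one AllGather(Dy), reused twice

    double err = 0.0;
    for (std::size_t r = 0; r < P; ++r) {
        const Mat X_r = block(X, 0, r * kb, M, kb);          // local column block of X
        const Mat W_r = block(W, r * kb, 0, kb, N);          // local row block of W
        const Mat dX_r = matmul(dY_full, transpose(W_r));    // M x kb
        const Mat dW_r = matmul(transpose(X_r), dY_full);    // kb x N
        err = std::max(err, max_abs_diff(dX_r, block(dX_ref, 0, r * kb, M, kb)));
        err = std::max(err, max_abs_diff(dW_r, block(dW_ref, r * kb, 0, kb, N)));
    }
    return err;
}
```

Calling `backward_shard_error(2, 4, 4, 2)` from a `main` returns the largest deviation between the sharded and unsharded gradients. In a fused implementation, each of the two products after the gather would correspond to an AllGather + Matmul step, which is where the "two AllGatherMatmuls" alternative in the comment comes from.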
Reproduction Environment:
Build and Execution Commands:
Issue 1: Hang
AG + Matmul
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASMp/pmatmul.cu#L151-L153
Issue 2: cuBLASMp error
AG + Matmul
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASMp/pmatmul.cu#L151-L153