OpenMPI+UCX with multiple GPUs error: "named symbol not found" #10304
Comments
This error could be asynchronous, coming from a previous failure. Can you please provide more details on the test case, and the UCX/CUDA versions?
@yosefe - We (@pascal-boeschoten-hapteon and I) are using UCX 1.17.0 (built from source using the tagged release) alongside CUDA 12.1.105. We encounter the above issue when using
If instead, for a given rank, we only use one device at any given time, then the CUDA error disappears and everything works correctly. I.e., the previous pseudo-code would be changed to:
Yes, the reason for the error is that the CUDA device was changed.
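For readers hitting the same problem, one way to keep the current device stable around communication calls is a small scope guard that switches to a device and restores the previous one on exit. The following is only a sketch under stated assumptions: `ScopedDevice` is a hypothetical helper, and the `fake_cuda` functions are stand-ins for `cudaGetDevice` / `cudaSetDevice` so the example compiles without the CUDA toolkit. Whether restoring the device this way is enough to avoid the error on an affected UCX version is an assumption, not something confirmed in this thread.

```cpp
#include <cassert>

// Stand-ins for cudaGetDevice / cudaSetDevice (hypothetical, for illustration).
// In real code, replace these calls with the CUDA runtime functions and
// check their cudaError_t return values.
namespace fake_cuda {
static int current_device = 0;
static int get_device(int* dev) { *dev = current_device; return 0; }
static int set_device(int dev) { current_device = dev; return 0; }
}  // namespace fake_cuda

// Hypothetical RAII guard: makes `dev` current for the lifetime of the
// object, then restores whichever device was current before.
class ScopedDevice {
public:
    explicit ScopedDevice(int dev) {
        fake_cuda::get_device(&previous_);  // remember the caller's device
        fake_cuda::set_device(dev);         // switch to the requested device
    }
    ~ScopedDevice() { fake_cuda::set_device(previous_); }  // restore on exit

    // Non-copyable: two guards restoring the same state would be confusing.
    ScopedDevice(const ScopedDevice&) = delete;
    ScopedDevice& operator=(const ScopedDevice&) = delete;

private:
    int previous_ = 0;
};
```

With such a guard, per-device work can be wrapped in a block so that the device that was current when a transfer was posted is current again by the time the code returns to progressing that transfer.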
@judicaelclair FYI, the fix was merged into the UCX main branch.
I'm trying to use OpenMPI+UCX with multiple CUDA devices within the same rank but quickly ran into a "named symbol not found" error:
This was with OpenMPI 5.0.5 and UCX 1.17.
Could this be because, during the progression of a transfer, the associated CUDA device must be the current one, set with cudaSetDevice()? And if so, is there any way to make this work with multiple devices doing transfers in parallel?
I also came across a PR that looks like it may fix the issue I'm having: #9645