perf: fix cuda-aware mpi in v3 #4977
Conversation
Pull Request Overview
This PR simplifies CUDA-aware MPI detection logic in the Border autograd function by removing version checks and explicit GPU synchronization calls, resulting in cleaner and more direct CUDA support handling.
Key changes:
- Simplified CUDA-aware MPI detection by removing MPI version checks and directly querying `MPIX_Query_cuda_support()`
- Changed the default `cuda_aware` initialization from 1 to 0 for safer fallback behavior
- Removed explicit `gpuDeviceSynchronize()` calls, delegating synchronization to PyTorch's internal mechanisms
📝 Walkthrough

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant F as forward_t
    participant B as backward_t
    participant MPI as MPI Runtime
    participant GPU as GPU Device
    rect rgb(250,250,255)
        note over F,B: Initialization & CUDA-aware detection (simplified)
        F->>MPI: (if NO_CUDA_AWARE not defined) MPIX_Query_cuda_support()
        MPI-->>F: cuda_aware = {0|1}
        B->>MPI: (if NO_CUDA_AWARE not defined) MPIX_Query_cuda_support()
        MPI-->>B: cuda_aware = {0|1}
    end
    rect rgb(245,255,245)
        note over F,GPU: Forward path
        alt cuda_aware == 1
            F->>GPU: Device-to-device send/recv
        else
            F->>F: CPU recv/copy path
        end
    end
    rect rgb(245,255,245)
        note over B,GPU: Backward path (explicit device sync removed)
        alt cuda_aware == 1
            B->>GPU: Device-to-device send/recv
        else
            B->>B: CPU recv/copy path
        end
    end
```
Actionable comments posted: 1
🧹 Nitpick comments (3)
source/op/pt/comm.cc (3)
Lines 141-159: Avoid `memcpy`/`gpuMemcpy` with an uninitialized pointer when `nsend == 0`. In the local (same-rank) branch, `memcpy`/`gpuMemcpy` is always called with byte count `nsend * tensor_size`. When `nsend == 0` the size is 0, but `send_g1` may be uninitialized (it is only set when `nsend != 0` above). Guard the copy.
Apply this diff:
```diff
 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
 #ifdef USE_MPI
-  if (cuda_aware == 0) {
-    memcpy(recv_g1, send_g1,
-           (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
-  } else {
-    gpuMemcpy(recv_g1, send_g1,
-              (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
-              gpuMemcpyDeviceToDevice);
-  }
+  if (nsend) {
+    if (cuda_aware == 0) {
+      memcpy(recv_g1, send_g1,
+             (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+    } else {
+      gpuMemcpy(recv_g1, send_g1,
+                (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
+                gpuMemcpyDeviceToDevice);
+    }
+  }
 #else
-  gpuMemcpy(recv_g1, send_g1,
-            (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
-            gpuMemcpyDeviceToDevice);
+  if (nsend) {
+    gpuMemcpy(recv_g1, send_g1,
+              (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
+              gpuMemcpyDeviceToDevice);
+  }
 #endif
 #else
-  memcpy(recv_g1, send_g1,
-         (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+  if (nsend) {
+    memcpy(recv_g1, send_g1,
+           (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+  }
 #endif
```
Lines 145-149: Prefer `size_t` for byte counts (minor). Casting to `unsigned long` for byte sizes is non-portable; use `size_t` to avoid truncation on platforms where `unsigned long != size_t`.
Example change:
```diff
-           (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+           static_cast<size_t>(nsend) * static_cast<size_t>(tensor_size) *
+               sizeof(FPTYPE));
```

Repeat similarly for the other memcpy/gpuMemcpy sites.
Also applies to: 297-303
Lines 100-111: DRY: factor CUDA-aware detection into a helper. The `cuda_aware` detection logic is duplicated in `forward_t` and `backward_t`. Extract it into a small inline helper to keep behavior identical and prevent drift.
Example:
```diff
+static inline int query_cuda_aware(int mpi_init, int world_size) {
+  int cuda_aware = 0;
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+  if (mpi_init && world_size >= 1) {
+#if !defined(NO_CUDA_AWARE) && (defined(OMPI_MPI_H) || defined(MPIX_CUDA_AWARE_SUPPORT))
+    cuda_aware = MPIX_Query_cuda_support();
+#endif
+  }
+#endif
+  return cuda_aware;
+}
 ...
   int cuda_aware = 0;
 ...
-  if (world_size >= 1) {
-#if !defined(NO_CUDA_AWARE) && (defined(OMPI_MPI_H) || defined(MPIX_CUDA_AWARE_SUPPORT))
-    cuda_aware = MPIX_Query_cuda_support();
-#endif
-    if (cuda_aware == 0) {
+  cuda_aware = query_cuda_aware(mpi_init, world_size);
+  if (world_size >= 1) {
+    if (cuda_aware == 0) {
 ...
```

Also applies to: lines 214-226.
📜 Review details: Configuration used: CodeRabbit UI · Review profile: CHILL · Plan: Pro
📒 Files selected for processing (1): `source/op/pt/comm.cc` (4 hunks)
🔇 Additional comments (1)
source/op/pt/comm.cc (1)
Lines 126-139: Synchronize the CUDA stream before `MPI_Send`/`MPI_Irecv` when `cuda_aware != 0`. MPI may access device pointers before CUDA kernels finish; synchronize the current CUDA stream immediately before `MPI_Irecv`/`MPI_Send` when `cuda_aware != 0` (or gate the sync behind a build flag for A/B testing).
Apply to: source/op/pt/comm.cc — lines 126-139 and 270-282.
Suggested change (example diff):
```diff
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+#include <ATen/cuda/CUDAContext.h>
+#endif
 ...
   if (nrecv) {
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+    if (cuda_aware != 0) {
+      cudaStreamSynchronize(at::cuda::getCurrentCUDAStream().stream());
+    }
+#endif
     MPI_Irecv(recv_g1, nrecv * tensor_size, mpi_type, recvproc[iswap], 0,
               world, &request);
   }
   if (nsend) {
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+    if (cuda_aware != 0) {
+      cudaStreamSynchronize(at::cuda::getCurrentCUDAStream().stream());
+    }
+#endif
     MPI_Send(send_g1, nsend * tensor_size, mpi_type, sendproc[iswap], 0,
              world);
   }
```

Sandbox verification failed: `mpirun` not found (`/bin/bash: line 6: mpirun: command not found`). Run a multi-rank repro on a machine with `mpirun` for both CUDA-aware and non-CUDA-aware MPI to confirm deterministic results, no stalls, and no data corruption.
Codecov Report: ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:

```
@@            Coverage Diff             @@
##            devel    #4977      +/-   ##
==========================================
- Coverage   84.29%   84.20%   -0.10%
==========================================
  Files         704      705       +1
  Lines       68907    69229     +322
  Branches     3572     3576       +4
==========================================
+ Hits        58085    58292     +207
- Misses       9681     9797     +116
+ Partials     1141     1140       -1
```

☔ View full report in Codecov by Sentry.
This pull request updates the MPI CUDA-awareness detection and handling logic in the `Border` autograd function, simplifying how CUDA support is determined and removing some legacy checks. The changes ensure that CUDA-aware MPI support is queried more directly, and some unnecessary device synchronization calls are removed.

The logic for checking CUDA-aware MPI support has been simplified: version checks and redundant branches have been removed, and the code now directly queries `MPIX_Query_cuda_support()` unless `NO_CUDA_AWARE` is defined. [1] [2]

Explicit `gpuDeviceSynchronize()` calls have been removed from both the forward and backward paths, relying on PyTorch's internal synchronization mechanisms instead. [1] [2]