perf: fix cuda-aware mpi in v3 #4977
Conversation
Pull Request Overview
This PR simplifies CUDA-aware MPI detection logic in the Border autograd function by removing version checks and explicit GPU synchronization calls, resulting in cleaner and more direct CUDA support handling.
Key changes:
- Simplified CUDA-aware MPI detection by removing MPI version checks and directly querying `MPIX_Query_cuda_support()`
- Changed the default `cuda_aware` initialization from 1 to 0 for safer fallback behavior
- Removed explicit `gpuDeviceSynchronize()` calls, delegating synchronization to PyTorch's internal mechanisms
📝 Walkthrough

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant F as forward_t
    participant B as backward_t
    participant MPI as MPI Runtime
    participant GPU as GPU Device
    rect rgb(250,250,255)
        note over F,B: Initialization & CUDA-aware detection (simplified)
        F->>MPI: (if NO_CUDA_AWARE not defined) MPIX_Query_cuda_support()
        MPI-->>F: cuda_aware = {0|1}
        B->>MPI: (if NO_CUDA_AWARE not defined) MPIX_Query_cuda_support()
        MPI-->>B: cuda_aware = {0|1}
    end
    rect rgb(245,255,245)
        note over F,GPU: Forward path
        alt cuda_aware == 1
            F->>GPU: Device-to-device send/recv
        else
            F->>F: CPU recv/copy path
        end
    end
    rect rgb(245,255,245)
        note over B,GPU: Backward path (explicit device sync removed)
        alt cuda_aware == 1
            B->>GPU: Device-to-device send/recv
        else
            B->>B: CPU recv/copy path
        end
    end
```
Actionable comments posted: 1
🧹 Nitpick comments (3)
source/op/pt/comm.cc (3)
Lines 141-159: Avoid `memcpy`/`gpuMemcpy` with an uninitialized pointer when `nsend == 0`. In the local (same-rank) branch, `memcpy`/`gpuMemcpy` is always called with byte count `nsend * tensor_size`. When `nsend == 0` the size is 0, but `send_g1` may be uninitialized (it is only set when `nsend != 0` above). Guard the copy.
Apply this diff:
```diff
 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
 #ifdef USE_MPI
-  if (cuda_aware == 0) {
-    memcpy(recv_g1, send_g1,
-           (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
-  } else {
-    gpuMemcpy(recv_g1, send_g1,
-              (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
-              gpuMemcpyDeviceToDevice);
-  }
+  if (nsend) {
+    if (cuda_aware == 0) {
+      memcpy(recv_g1, send_g1,
+             (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+    } else {
+      gpuMemcpy(recv_g1, send_g1,
+                (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
+                gpuMemcpyDeviceToDevice);
+    }
+  }
 #else
-  gpuMemcpy(recv_g1, send_g1,
-            (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
-            gpuMemcpyDeviceToDevice);
+  if (nsend) {
+    gpuMemcpy(recv_g1, send_g1,
+              (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
+              gpuMemcpyDeviceToDevice);
+  }
 #endif
 #else
-  memcpy(recv_g1, send_g1,
-         (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+  if (nsend) {
+    memcpy(recv_g1, send_g1,
+           (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+  }
 #endif
```
Lines 145-149: Prefer `size_t` for byte counts (minor). Casting to `unsigned long` for byte sizes is non-portable; use `size_t` to avoid truncation on platforms where `unsigned long != size_t`.
Example change:
```diff
-           (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+           static_cast<size_t>(nsend) * static_cast<size_t>(tensor_size) *
+               sizeof(FPTYPE));
```

Repeat similarly for the other memcpy/gpuMemcpy sites.
Also applies to: 297-303
Lines 100-111: DRY: factor CUDA-aware detection into a helper. The `cuda_aware` detection logic is duplicated in `forward_t` and `backward_t`. Extract it into a small inline helper to keep behavior identical and prevent drift.
Example:
```diff
+static inline int query_cuda_aware(int mpi_init, int world_size) {
+  int cuda_aware = 0;
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+  if (mpi_init && world_size >= 1) {
+#if !defined(NO_CUDA_AWARE) && (defined(OMPI_MPI_H) || defined(MPIX_CUDA_AWARE_SUPPORT))
+    cuda_aware = MPIX_Query_cuda_support();
+#endif
+  }
+#endif
+  return cuda_aware;
+}
 ...
   int cuda_aware = 0;
 ...
-  if (world_size >= 1) {
-#if !defined(NO_CUDA_AWARE) && (defined(OMPI_MPI_H) || defined(MPIX_CUDA_AWARE_SUPPORT))
-    cuda_aware = MPIX_Query_cuda_support();
-#endif
-    if (cuda_aware == 0) {
+  cuda_aware = query_cuda_aware(mpi_init, world_size);
+  if (world_size >= 1) {
+    if (cuda_aware == 0) {
 ...
```

Also applies to: lines 214-226.
📜 Review details: Configuration used: CodeRabbit UI · Review profile: CHILL · Plan: Pro
📒 Files selected for processing (1): `source/op/pt/comm.cc` (4 hunks)
🔇 Additional comments (1)
source/op/pt/comm.cc (1)
Lines 126-139: Synchronize the CUDA stream before `MPI_Send`/`MPI_Irecv` when `cuda_aware != 0`. MPI may access device pointers before CUDA kernels finish; synchronize the current CUDA stream immediately before `MPI_Irecv`/`MPI_Send` when `cuda_aware != 0` (or gate the sync behind a build flag for A/B testing).
Apply to: source/op/pt/comm.cc — lines 126-139 and 270-282.
Suggested change (example diff):
```diff
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+#include <ATen/cuda/CUDAContext.h>
+#endif
 ...
   if (nrecv) {
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+    if (cuda_aware != 0) {
+      cudaStreamSynchronize(at::cuda::getCurrentCUDAStream().stream());
+    }
+#endif
     MPI_Irecv(recv_g1, nrecv * tensor_size, mpi_type, recvproc[iswap], 0,
               world, &request);
   }
   if (nsend) {
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+    if (cuda_aware != 0) {
+      cudaStreamSynchronize(at::cuda::getCurrentCUDAStream().stream());
+    }
+#endif
     MPI_Send(send_g1, nsend * tensor_size, mpi_type, sendproc[iswap], 0,
              world);
   }
```

Sandbox verification failed: `mpirun` not found (`/bin/bash: line 6: mpirun: command not found`). Run a multi-rank repro on a machine with `mpirun` for both CUDA-aware and non-CUDA-aware MPI to confirm deterministic results, no stalls, and no data corruption.
Codecov Report: ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:

```
@@            Coverage Diff             @@
##            devel    #4977      +/-   ##
==========================================
- Coverage   84.29%   84.20%   -0.10%
==========================================
  Files         704      705       +1
  Lines       68907    69229     +322
  Branches     3572     3576       +4
==========================================
+ Hits        58085    58292     +207
- Misses       9681     9797     +116
+ Partials     1141     1140       -1
```

☔ View full report in Codecov by Sentry.
This pull request updates the MPI CUDA-awareness detection and handling logic in the `Border` autograd function, simplifying how CUDA support is determined and removing some legacy checks. The changes ensure that CUDA-aware MPI support is queried more directly, and some unnecessary device synchronization calls are removed.

The logic for checking CUDA-aware MPI support has been simplified: version checks and redundant branches have been removed, and the code now directly queries `MPIX_Query_cuda_support()` unless `NO_CUDA_AWARE` is defined. [1] [2]

Explicit `gpuDeviceSynchronize()` calls have been removed from both the forward and backward paths, relying on PyTorch's internal synchronization mechanisms instead. [1] [2]