
Conversation

caic99
Member

@caic99 caic99 commented Sep 16, 2025

This pull request updates the MPI CUDA-awareness detection and handling logic in the Border autograd function, simplifying how CUDA support is determined and removing some legacy checks. The changes ensure that CUDA-aware MPI support is queried more directly, and some unnecessary device synchronization calls are removed.

  • The logic for checking CUDA-aware MPI support has been simplified: version checks and redundant branches have been removed, and the code now directly queries MPIX_Query_cuda_support() unless NO_CUDA_AWARE is defined. [1] [2]

  • Removed explicit gpuDeviceSynchronize() calls from both the forward and backward paths, relying on PyTorch's internal synchronization mechanisms instead. [1] [2]
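The detection change described above can be sketched as follows. This is a minimal sketch, not the actual `comm.cc` code: `mock_query_cuda_support()` is a hypothetical stand-in for `MPIX_Query_cuda_support()`, which in the real code is provided by Open MPI's `mpi-ext.h`.

```cpp
#include <cassert>

// Hypothetical stand-in for MPIX_Query_cuda_support(); the real symbol
// comes from Open MPI's <mpi-ext.h> and reports whether the MPI library
// was built with CUDA support.
static int mock_query_cuda_support() { return 1; }

// Sketch of the simplified detection: default to 0 (safe CPU fallback)
// and only flip to 1 when the query is compiled in and reports support.
int detect_cuda_aware() {
  int cuda_aware = 0;
#ifndef NO_CUDA_AWARE
  cuda_aware = mock_query_cuda_support();
#endif
  return cuda_aware;
}
```

Defaulting to 0 means a build where the query is compiled out (or returns 0) silently falls back to host-staged transfers instead of passing device pointers to an MPI that cannot handle them.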

Summary by CodeRabbit

  • Performance
    • Reduced explicit GPU synchronization, potentially improving throughput during distributed forward/backward operations.
  • Compatibility
    • Safer default when CUDA-aware MPI isn’t present: automatically falls back to CPU-based transfers unless support is detected, improving stability across varied clusters.
  • Reliability
    • Simplified CUDA-aware detection reduces edge-case misconfigurations in mixed MPI environments.
  • No API Changes
    • Public interfaces remain unchanged; existing workflows continue to work.

@Copilot Copilot AI review requested due to automatic review settings September 16, 2025 03:22
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR simplifies CUDA-aware MPI detection logic in the Border autograd function by removing version checks and explicit GPU synchronization calls, resulting in cleaner and more direct CUDA support handling.

Key changes:

  • Simplified CUDA-aware MPI detection by removing MPI version checks and directly querying MPIX_Query_cuda_support()
  • Changed default cuda_aware initialization from 1 to 0 for safer fallback behavior
  • Removed explicit gpuDeviceSynchronize() calls, delegating synchronization to PyTorch's internal mechanisms

@github-actions github-actions bot added the OP label Sep 16, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


Codex Review: Here are some suggestions.


Contributor

coderabbitai bot commented Sep 16, 2025

📝 Walkthrough

Walkthrough

Initializes cuda_aware to 0 in forward_t and backward_t, simplifies CUDA-aware MPI detection to use MPIX_Query_cuda_support() guarded by NO_CUDA_AWARE, removes a gpuDeviceSynchronize() call from backward_t, and preserves CPU vs device copy branching; no public API changes.

Changes

Cohort / File(s): MPI CUDA-aware detection & sync change (source/op/pt/comm.cc)

  • Set the cuda_aware default to 0 in forward_t and backward_t.
  • Simplify detection: if NO_CUDA_AWARE is not defined, call MPIX_Query_cuda_support() to set cuda_aware (version checks removed).
  • Remove the gpuDeviceSynchronize() call from backward_t.
  • Preserve branching: CPU copy path when cuda_aware == 0, device-to-device copy when nonzero.
  • No changes to public/exported APIs or signatures.
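As a rough illustration of the preserved branching, the copy step can be sketched like this. This is a hedged sketch only: `copy_halo` is a hypothetical helper, and the device path in the real comm.cc uses gpuMemcpy with gpuMemcpyDeviceToDevice, which is modeled here with plain memcpy so the snippet compiles without CUDA.

```cpp
#include <cstddef>
#include <cstring>

// cuda_aware == 0: stage the exchange through host memory (memcpy).
// cuda_aware != 0: the real code issues a direct device-to-device copy
// via gpuMemcpy; modeled with memcpy here so this compiles without CUDA.
void copy_halo(float* recv, const float* send,
               std::size_t count, int cuda_aware) {
  if (count == 0) return;  // avoid dereferencing possibly-unset pointers
  const std::size_t bytes = count * sizeof(float);
  if (cuda_aware == 0) {
    std::memcpy(recv, send, bytes);  // host-staged path
  } else {
    std::memcpy(recv, send, bytes);  // stand-in for gpuMemcpy(..., DeviceToDevice)
  }
}
```

The early return on `count == 0` mirrors the guard suggested in the review comments on this PR.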

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant F as forward_t
  participant B as backward_t
  participant MPI as MPI Runtime
  participant GPU as GPU Device

  rect rgb(250,250,255)
  note over F,B: Initialization & CUDA-aware detection (simplified)
  F->>MPI: (if NO_CUDA_AWARE not defined) MPIX_Query_cuda_support()
  MPI-->>F: cuda_aware = {0|1}
  B->>MPI: (if NO_CUDA_AWARE not defined) MPIX_Query_cuda_support()
  MPI-->>B: cuda_aware = {0|1}
  end

  rect rgb(245,255,245)
  note over F,GPU: Forward path
  alt cuda_aware == 1
    F->>GPU: Device-to-device send/recv
  else
    F->>F: CPU recv/copy path
  end
  end

  rect rgb(245,255,245)
  note over B,GPU: Backward path (explicit device sync removed)
  alt cuda_aware == 1
    B->>GPU: Device-to-device send/recv
  else
    B->>B: CPU recv/copy path
  end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested reviewers

  • njzjz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title "perf: fix cuda-aware mpi in v3" is concise and directly reflects the PR's primary intent (fixing CUDA-aware MPI behavior/performance), which matches the changes to cuda_aware detection and the removal of explicit synchronization in the diff.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c13018 and 8f46f9f.

📒 Files selected for processing (1)
  • source/op/pt/comm.cc (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/op/pt/comm.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: build_docker (12)
  • GitHub Check: build_docker (_cu11, 11)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Analyze (c-cpp)


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
source/op/pt/comm.cc (3)

141-159: Avoid memcpy/gpuMemcpy with an uninitialized pointer when nsend==0.

In the local (same‑rank) branch you always call memcpy/gpuMemcpy with byte count = nsend * tensor_size. When nsend==0, the size is 0 but send_g1 may be uninitialized (only set when nsend!=0 above). Guard the copy.

Apply this diff:

 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
 #ifdef USE_MPI
-        if (cuda_aware == 0) {
-          memcpy(recv_g1, send_g1,
-                 (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
-        } else {
-          gpuMemcpy(recv_g1, send_g1,
-                    (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
-                    gpuMemcpyDeviceToDevice);
-        }
+        if (nsend) {
+          if (cuda_aware == 0) {
+            memcpy(recv_g1, send_g1,
+                   (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+          } else {
+            gpuMemcpy(recv_g1, send_g1,
+                      (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
+                      gpuMemcpyDeviceToDevice);
+          }
+        }
 #else
-        gpuMemcpy(recv_g1, send_g1,
-                  (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
-                  gpuMemcpyDeviceToDevice);
+        if (nsend) {
+          gpuMemcpy(recv_g1, send_g1,
+                    (unsigned long)nsend * tensor_size * sizeof(FPTYPE),
+                    gpuMemcpyDeviceToDevice);
+        }
 #endif
 #else
-      memcpy(recv_g1, send_g1,
-             (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+      if (nsend) {
+        memcpy(recv_g1, send_g1,
+               (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+      }
 #endif

145-149: Prefer size_t for byte counts (minor).

Casting to unsigned long for byte sizes is non‑portable; use size_t to avoid truncation on platforms where unsigned long != size_t.

Example change:

-                 (unsigned long)nsend * tensor_size * sizeof(FPTYPE));
+                 static_cast<size_t>(nsend) * static_cast<size_t>(tensor_size) * sizeof(FPTYPE));

Repeat similarly for the other memcpy/gpuMemcpy sites.

Also applies to: 297-303
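To make the portability point concrete, here is a hedged sketch of the suggested arithmetic (byte_count is a hypothetical helper, not a function in comm.cc). On LLP64 platforms such as 64-bit Windows, unsigned long is 32 bits, so the original cast can truncate large byte counts; size_t is always wide enough to describe any object size.

```cpp
#include <cstddef>

// Hypothetical helper illustrating the review suggestion: widen each
// operand to size_t before multiplying so the product cannot truncate
// on platforms where unsigned long is narrower than size_t.
std::size_t byte_count(int nsend, int tensor_size, std::size_t elem_size) {
  return static_cast<std::size_t>(nsend) *
         static_cast<std::size_t>(tensor_size) * elem_size;
}
```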


100-111: DRY: factor CUDA‑aware detection into a helper.

The cuda_aware detection logic is duplicated in forward_t and backward_t. Extract to a tiny inline helper to keep behavior identical and prevent drift.

Example:

+static inline int query_cuda_aware(int mpi_init, int world_size) {
+  int cuda_aware = 0;
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+  if (mpi_init && world_size >= 1) {
+#if !defined(NO_CUDA_AWARE) && (defined(OMPI_MPI_H) || defined(MPIX_CUDA_AWARE_SUPPORT))
+    cuda_aware = MPIX_Query_cuda_support();
+#endif
+  }
+#endif
+  return cuda_aware;
+}
...
-    int cuda_aware = 0;
+    int cuda_aware = 0;
...
-    if (world_size >= 1) {
-#if !defined(NO_CUDA_AWARE) && (defined(OMPI_MPI_H) || defined(MPIX_CUDA_AWARE_SUPPORT))
-      cuda_aware = MPIX_Query_cuda_support();
-#endif
-      if (cuda_aware == 0) {
+    cuda_aware = query_cuda_aware(mpi_init, world_size);
+    if (world_size >= 1) {
+      if (cuda_aware == 0) {
         ...

Also applies to: 214-226

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34df2b4 and 4c13018.

📒 Files selected for processing (1)
  • source/op/pt/comm.cc (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Analyze (python)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
🔇 Additional comments (1)
source/op/pt/comm.cc (1)

126-139: Synchronize CUDA stream before MPI_Send / MPI_Irecv when cuda_aware != 0

MPI may access device pointers before CUDA kernels finish; synchronize the current CUDA stream immediately before MPI_Irecv/MPI_Send when cuda_aware != 0 (or gate the sync behind a build flag for A/B testing).

Apply to: source/op/pt/comm.cc — lines 126-139 and 270-282.

Suggested change (example diff):

+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+#include <ATen/cuda/CUDAContext.h>
+#endif
...
         if (nrecv) {
+ #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+          if (cuda_aware != 0) { cudaStreamSynchronize(at::cuda::getCurrentCUDAStream().stream()); }
+ #endif
           MPI_Irecv(recv_g1, nrecv * tensor_size, mpi_type, recvproc[iswap], 0,
                     world, &request);
         }
         if (nsend) {
+ #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
+          if (cuda_aware != 0) { cudaStreamSynchronize(at::cuda::getCurrentCUDAStream().stream()); }
+ #endif
           MPI_Send(send_g1, nsend * tensor_size, mpi_type, sendproc[iswap], 0,
                    world);
         }

Sandbox verification failed: mpirun not found (/bin/bash: line 6: mpirun: command not found). Run a multi-rank repro on a machine with mpirun for both CUDA-aware and non-CUDA-aware MPI to confirm deterministic results, no stalls, and no data corruption.


codecov bot commented Sep 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.20%. Comparing base (db22802) to head (8f46f9f).
⚠️ Report is 16 commits behind head on devel.

Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4977      +/-   ##
==========================================
- Coverage   84.29%   84.20%   -0.10%     
==========================================
  Files         704      705       +1     
  Lines       68907    69229     +322     
  Branches     3572     3576       +4     
==========================================
+ Hits        58085    58292     +207     
- Misses       9681     9797     +116     
+ Partials     1141     1140       -1     

☔ View full report in Codecov by Sentry.

@caic99 caic99 requested a review from njzjz September 16, 2025 05:13
@caic99 caic99 added this pull request to the merge queue Sep 16, 2025
Merged via the queue into deepmodeling:devel with commit dd55dff Sep 16, 2025
60 checks passed
@caic99 caic99 deleted the comm branch September 16, 2025 08:41