
Conversation

hexinw-nvidia
Contributor

This pull request enhances the NVIDIA Resiliency Extension (NVRx) InProcess by introducing auto-restart functionality. Previously, NVRx InProcess did not support process restarts, making it unable to recover training jobs from issues like CUDA context corruption or other failures requiring a process restart. Falling back to torchrun or infrastructure-level restarts was costly. This enhancement allows seamless recovery by enabling auto-restart with minimal code changes to the training program.

Changes

Auto-Restart Implementation: Added fork_and_monitor functionality in the nvidia_resiliency_ext.shared_utils.auto_restart module, enabling automatic process restarts for NVRx InProcess.

Unique Iterations Across Restarts: Ensured iterations remain unique across process restarts to prevent key conflicts.

Clear Initial Barrier Keys: Implemented clearing of initial barrier keys to allow reuse across restarts.

Integration Simplicity: Training programs (e.g., pretrain_mamba.py) can enable auto-restart by adding the following four lines of code:

```python
import os
if os.getenv('NVRX_ENABLE_FORK_AND_MONITOR', '1') == '1':
    from nvidia_resiliency_ext.shared_utils.auto_restart import fork_and_monitor
    fork_and_monitor()
```

@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Aug 6, 2025
@rhewett-nv rhewett-nv marked this pull request as draft August 6, 2025 16:53
hexinw-nvidia and others added 16 commits August 7, 2025 14:33
2) Record job_restart_counter. Fixed the RetryAbort exception by
   using job_restart_counter.
- If CUDA is installed and available in $PATH, it would
  be nice if the build system could use it rather than throwing
  an error.
- Currently we only check `/usr/local/cuda` and `CUDA_PATH`
  and then throw an error.
- In this PR, we check whether `nvcc` is present in $PATH and
  try to determine the CUDA path from it, as is typically done in
  build systems like CMake: https://github.com/Kitware/CMake/blob/master/Modules/FindCUDA.cmake#L862
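The lookup described above can be sketched as follows. This is a hedged illustration, not the PR's actual `cupti_build.py` code; the function name is an assumption. It locates `nvcc` on $PATH and strips the trailing `bin` component to recover a candidate toolkit root, mirroring the CMake approach:

```python
import os
import shutil


def find_cuda_root_from_path():
    """Return a CUDA root derived from `nvcc` on $PATH, or None.

    Illustrative sketch only -- not the PR's build-script code.
    """
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    # nvcc normally lives in <cuda_root>/bin/nvcc; resolve symlinks and
    # strip the trailing `bin` component to recover the toolkit root.
    bin_dir = os.path.dirname(os.path.realpath(nvcc))
    return os.path.dirname(bin_dir)
```

Note that layouts vary (for example, Conda environments may place the real toolkit under `targets/<arch>-linux`, as the example below shows), so a build script would still want to sanity-check the derived directory for CUDA headers or libraries.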

As an example, on DLCluster with CUDA set up in a Conda environment:

```bash
$ which nvcc
/home/scratch.prkumbhar_wwfo/software/x86_64/nvrx-july-2025/conda_envs/nvrx202507/bin/nvcc

$ pip install .
...
      File "/home/prkumbhar/workspace/repos/nvidia/nvidia-resiliency-ext/cupti_build.py", line 59, in build
          raise FileNotFoundError("cuda installation not found in /usr/local/cuda or $CUDA_PATH")
      FileNotFoundError: cuda installation not found in /usr/local/cuda or $CUDA_PATH
```

With this PR, it finds the correct CUDA directory:

```bash
  ...
  A setup.py file already exists. Using it.
  CUDA root found: /home/scratch.prkumbhar_wwfo/software/x86_64/nvrx-july-2025/conda_envs/nvrx202507/bin/../targets/x86_64-linux
```

This is a minor change, but it should improve the user experience.
1. ≤16 ranks: Show all individual ranks.
Example: EXCEPTION affecting ranks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

2. 17-32 ranks: Show first 3 and last 3 ranks.
Example: EXCEPTION affecting ranks [0, 1, 2]...[29, 30, 31] (total: 32)

3. >32 ranks: Show first 5 and last 5 ranks with total count.
Example: EXCEPTION affecting ranks [0, 1, 2, 3, 4]...[59, 60, 61, 62, 63] (total: 64)