Auto restart #139

hexinw-nvidia · 2025-08-06T08:53:21Z

This pull request enhances the NVIDIA Resiliency Extension (NVRx) InProcess by introducing auto-restart functionality. Previously, NVRx InProcess did not support process restarts, making it unable to recover training jobs from issues like CUDA context corruption or other failures requiring a process restart. Falling back to torchrun or infrastructure-level restarts was costly. This enhancement allows seamless recovery by enabling auto-restart with minimal code changes to the training program.

Changes

Auto-Restart Implementation: Added fork_and_monitor functionality in the nvidia_resiliency_ext.shared_utils.auto_restart module, enabling automatic process restarts for NVRx InProcess.

Unique Iterations Across Restarts: Ensured iterations remain unique across process restarts to prevent key conflicts.

Clear Initial Barrier Keys: Implemented clearing of initial barrier keys to allow reuse across restarts.

Integration Simplicity: Training programs (e.g., pretrain_mamba.py) can enable auto-restart by adding the following four lines of code:

import os
if os.getenv('NVRX_ENABLE_FORK_AND_MONITOR', '1') == '1':
    from nvidia_resiliency_ext.shared_utils.auto_restart import fork_and_monitor
    fork_and_monitor()
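
To illustrate the idea behind `fork_and_monitor`, here is a minimal sketch of a monitor loop that relaunches a workload when it exits abnormally. It is not the NVRx implementation: `monitor`, `launch_subprocess`, `max_restarts`, and the `RESTART_COUNT` environment variable are hypothetical names chosen for this example, and the sketch uses `subprocess` instead of `os.fork()` to keep it short and easy to test.

```python
import os
import subprocess
import sys


def monitor(launch, max_restarts=3):
    """Call `launch(restart_count)` until it returns 0 or restarts run out.

    Returns (last_exit_code, restarts_used). Hypothetical helper, not NVRx API.
    """
    restarts = 0
    while True:
        code = launch(restarts)
        if code == 0 or restarts >= max_restarts:
            return code, restarts
        restarts += 1


def launch_subprocess(argv):
    """Build a launcher that runs `argv` in a fresh process on every call."""
    def _launch(restart_count):
        # Assumption: the workload reads RESTART_COUNT to keep its keys unique.
        env = dict(os.environ, RESTART_COUNT=str(restart_count))
        return subprocess.run(argv, env=env).returncode
    return _launch


if __name__ == "__main__":
    # Toy workload: exit 1 on the first two launches, exit 0 on the third.
    workload = (
        "import os, sys; "
        "sys.exit(0 if int(os.environ['RESTART_COUNT']) >= 2 else 1)"
    )
    print(monitor(launch_subprocess([sys.executable, "-c", workload])))
```

Passing the restart count to the workload mirrors the "unique iterations across restarts" item above: a fresh process needs some way to know that it is not the first incarnation.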


- If CUDA is installed and available in $PATH, it would be nice if the build system could use it rather than throwing an error.
- Currently we just check for `/usr/local/cuda` and `CUDA_PATH` and then throw an error.
- In this PR, we check whether `nvcc` is present in $PATH and try to determine the CUDA path from it, as is typically done in build systems like CMake: https://github.com/Kitware/CMake/blob/master/Modules/FindCUDA.cmake#L862

As an example, on DLCluster with CUDA set up in a Conda environment:

```bash
$ which nvcc
/home/scratch.prkumbhar_wwfo/software/x86_64/nvrx-july-2025/conda_envs/nvrx202507/bin/nvcc

$ pip install .
...
  File "/home/prkumbhar/workspace/repos/nvidia/nvidia-resiliency-ext/cupti_build.py", line 59, in build
    raise FileNotFoundError("cuda installation not found in /usr/local/cuda or $CUDA_PATH")
FileNotFoundError: cuda installation not found in /usr/local/cuda or $CUDA_PATH
```

With this PR, it finds the correct CUDA dir:

```bash
...
A setup.py file already exists. Using it.
CUDA root found: /home/scratch.prkumbhar_wwfo/software/x86_64/nvrx-july-2025/conda_envs/nvrx202507/bin/../targets/x86_64-linux
```

This is a minor change, but it should improve the user experience.
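
The lookup described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code in `cupti_build.py`: the function name `find_cuda_root`, the order of the checks, and the rule "CUDA root is the parent of nvcc's `bin` directory" are all choices made for this example. The PR output above resolves to a `targets/x86_64-linux` subdirectory, which this sketch does not attempt to reproduce.

```python
import os
import shutil
from pathlib import Path


def find_cuda_root(default="/usr/local/cuda"):
    """Return a likely CUDA root directory, or raise FileNotFoundError.

    Hypothetical helper for illustration only.
    """
    # 1) Explicit override through the CUDA_PATH environment variable.
    env_path = os.environ.get("CUDA_PATH")
    if env_path and Path(env_path).is_dir():
        return Path(env_path)

    # 2) Conventional system-wide install location.
    if Path(default).is_dir():
        return Path(default)

    # 3) Fall back to nvcc on PATH: <root>/bin/nvcc -> <root>.
    nvcc = shutil.which("nvcc")
    if nvcc:
        return Path(nvcc).resolve().parent.parent

    raise FileNotFoundError(
        "cuda installation not found in /usr/local/cuda, $CUDA_PATH, "
        "or via nvcc in $PATH"
    )
```

Calling `find_cuda_root()` in a Conda environment that ships `nvcc` would return the environment prefix under these assumptions, instead of raising the error shown in the traceback.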


1. ≤16 ranks: Show all individual ranks. Example: EXCEPTION affecting ranks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
2. 17-32 ranks: Show first 3 and last 3 ranks. Example: EXCEPTION affecting ranks [0, 1, 2]...[29, 30, 31] (total: 32)
3. >32 ranks: Show first 5 and last 5 ranks with total count. Example: EXCEPTION affecting ranks [0, 1, 2, 3, 4]...[27, 28, 29, 30, 31] (total: 32)
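
The three rules above can be sketched as a small helper. The commit list names a `format_rank_set` function, but its real signature and thresholds are not shown here (a later commit changes `max_show` to 8), so `format_ranks` below is a hypothetical stand-in that follows only the rules as written.

```python
def format_ranks(ranks):
    """Format a collection of ranks, eliding the middle of large sets.

    Hypothetical helper following the three rules above; not the NVRx code.
    """
    ordered = sorted(set(ranks))  # sort so the head and tail are meaningful
    total = len(ordered)
    if total <= 16:
        return str(ordered)
    edge = 3 if total <= 32 else 5  # ranks shown at each end
    return f"{ordered[:edge]}...{ordered[-edge:]} (total: {total})"


print(f"EXCEPTION affecting ranks {format_ranks(range(32))}")
```

Sorting first matters for the same reason the GroupBy commit in this PR gives: grouping or slicing an unsorted rank list would show an arbitrary head and tail.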


hexinw-nvidia · 2025-08-21T08:13:11Z

Added the Auto-Restart design doc:

https://docs.google.com/document/d/12JHNWaXlj8nEUF3HhaMtP5CwSG05IYdtMcyq8-hrJ-0/edit?usp=sharing

hexinw-nvidia added 7 commits July 30, 2025 15:48

Added auto_restart module. It can be used to monitor and restart the workload process.

e527f0c

Made iteration unique across process restart.

a0b8e41

Cleared the initial barrier keys so they can be re-used across process restart.

Support external TCPStore service.

6acc17b

.

f5b2b2b

Removed duplicate logging config.

2f3875a

Used different timeout value on initial barrier.

6de3b35

Ensure MonitorProcess will host the TCPStore when it is deployed internally.

dbdace7

hexinw-nvidia added the ci-approved Approved to run CI label Aug 6, 2025

hexinw-nvidia requested review from rhewett-nv, apaithankar, sbak5, namitdhameja and szmigacz August 6, 2025 08:55

rhewett-nv marked this pull request as draft August 6, 2025 16:53

hexinw-nvidia and others added 16 commits August 7, 2025 14:33

Slight enhancement to the indefinite initial barrier timeout.

cdb5273

1) Made fork_and_monitor support clean abort.

13cfedc

2) Record job_restart_counter. Fixed RetryAbort exception by using job_restart_counter.

Add logging to in-process rank assignment

60c8c91

Address review: remove redundant dir check

2b8cb97

Default inprocess examples to CUDA

d8bce87

Truncate chain of exception logging.

854a322

Do not preserve the exception chain

22a058a

Groupby groups only consecutive items that have the same key. Sort the list before using GroupBy

90765d9

.

fe84e2e

Used format_rank_set to show partial ranks and total count for large sets.

341e920

.

fd902ce

.

aaff45e

.

85f7b5c

Changed max_show to 8.

a74e20e

hexinw-nvidia and others added 28 commits August 10, 2025 23:35

Fixed next_rendezvous in v2.3.1 access.

0c0310c

.

f6f86e1

Another compatibility issue.

da5ec9f

Added auto_restart module. It can be used to monitor and restart the workload process.

9d6ce62

Revert "Added auto_restart module. It can be used to monitor and restart the workload process."

df3748f

This reverts commit 671ece0.

Removed now.

ff22ed1

Add a single GPU health check

48b85ba

Add a trace collector to collect FR traces

d5675b5

Add FR trace collection at AbortTorchDistributed

0fc04c4

Clean up duplicate utility functions Add missing `__init__.py`

Apply Linting

1c9bcb3

Refine the doc string for TraceCollector

8deef6c

Update for more logging in fr_collection

3761922

Env var value conversion

84c0f13

Remove unnecessary attribute in __init__.py

388672e

Disable stacktrace in FR dump which holds GIL and changes the env var to trigger FR dump at abort

0c6f575

Add a comment to avoid security check on import pickle

168710a

Add trace_path argument in AbortTorchDistributed

d06a02b

Add Docstring for AbortTorchDistributed

d174974

Fix linting issue

7eaf199

Merge branch 'main' into auto_restart

555b7ec

Fixed assertion in test_wrap.py

6176ed5

Added _handle_restart_abort to pass the 130 clean exit code.

ad3ccf8

.

3783b9e

.

91df093

.

55bf3f4

.

9a5a6e2

Handled RestartAbort in the initial barrier stage.

44c0d2c

.

da12a24

Merge remote-tracking branch 'origin/main' into auto_restart

8639c8d
