Auto restart #139
Draft
hexinw-nvidia wants to merge 64 commits into NVIDIA:main from hexinw-nvidia:auto_restart
…workload process.
Cleared the initial barrier keys so they can be reused across process restarts.
2) Record job_restart_counter. Fixed RetryAbort exception by using job_restart_counter.
- If CUDA is installed and available in `$PATH`, the build system should use it rather than throwing an error.
- Currently we only check `/usr/local/cuda` and `CUDA_PATH`, and throw an error if neither is found.
- In this PR, we check whether `nvcc` is present in `$PATH` and derive the CUDA path from it, as build systems like CMake typically do: https://github.com/Kitware/CMake/blob/master/Modules/FindCUDA.cmake#L862

As an example, on DLCluster with CUDA set up in a Conda environment:

```bash
$ which nvcc
/home/scratch.prkumbhar_wwfo/software/x86_64/nvrx-july-2025/conda_envs/nvrx202507/bin/nvcc
$ pip install .
...
  File "/home/prkumbhar/workspace/repos/nvidia/nvidia-resiliency-ext/cupti_build.py", line 59, in build
    raise FileNotFoundError("cuda installation not found in /usr/local/cuda or $CUDA_PATH")
FileNotFoundError: cuda installation not found in /usr/local/cuda or $CUDA_PATH
```

With this PR, it finds the correct CUDA directory:

```bash
...
A setup.py file already exists. Using it.
CUDA root found: /home/scratch.prkumbhar_wwfo/software/x86_64/nvrx-july-2025/conda_envs/nvrx202507/bin/../targets/x86_64-linux
```

This is a minor change, but it should improve the user experience.
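The lookup order described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `cupti_build.py`: the function name `find_cuda_root`, its injectable parameters, and the order of the checks are assumptions made for the example. It also returns only the install prefix above `bin/`, whereas the output above shows the PR resolving to a `targets/x86_64-linux` subdirectory.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def find_cuda_root(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    isdir: Callable[[str], bool] = os.path.isdir,
) -> str:
    """Hypothetical helper: locate a CUDA install prefix.

    The lookups are injectable so the logic can be exercised without CUDA.
    """
    # 1) Honor an explicit CUDA_PATH if it points at a real directory.
    cuda_path = env.get("CUDA_PATH")
    if cuda_path and isdir(cuda_path):
        return cuda_path

    # 2) Fall back to the conventional default location.
    if isdir("/usr/local/cuda"):
        return "/usr/local/cuda"

    # 3) Otherwise look for nvcc on PATH and treat <root>/bin/nvcc as a hint.
    nvcc = which("nvcc")
    if nvcc:
        root = os.path.dirname(os.path.dirname(os.path.abspath(nvcc)))
        if isdir(root):
            return root

    raise FileNotFoundError(
        "cuda installation not found in /usr/local/cuda, $CUDA_PATH, or via nvcc in $PATH"
    )
```

Called with no arguments, the helper consults the real environment; passing fake `env`, `which`, and `isdir` values makes each branch easy to check in isolation.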
… list before using GroupBy
1. ≤16 ranks: Show all individual ranks. Example: `EXCEPTION affecting ranks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]`
2. 17-32 ranks: Show first 3 and last 3 ranks. Example: `EXCEPTION affecting ranks [0, 1, 2]...[29, 30, 31] (total: 32)`
3. >32 ranks: Show first 5 and last 5 ranks with total count. Example: `EXCEPTION affecting ranks [0, 1, 2, 3, 4]...[27, 28, 29, 30, 31] (total: 32)`
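The three tiers above can be sketched as a small formatter. This is an illustration only: the name `summarize_ranks` is hypothetical, and sorting and de-duplicating the input before applying the thresholds is an assumption, not something the commit message states.

```python
from typing import Iterable


def summarize_ranks(ranks: Iterable[int]) -> str:
    """Hypothetical helper: compact rendering of a list of affected ranks."""
    ordered = sorted(set(ranks))  # assumption: ranks are de-duplicated and sorted
    total = len(ordered)

    # Tier 1: small groups are shown in full.
    if total <= 16:
        return str(ordered)

    # Tier 2 (17-32 ranks) keeps 3 ranks per side; tier 3 (>32) keeps 5.
    edge = 3 if total <= 32 else 5
    return f"{ordered[:edge]}...{ordered[-edge:]} (total: {total})"


print("EXCEPTION affecting ranks", summarize_ranks(range(32)))
```

With these thresholds, 32 ranks fall into the second tier, so the call above renders three ranks on each side.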
…workload process.
…art the workload process." This reverts commit 671ece0.
Clean up duplicate utility functions. Add missing `__init__.py`.
… to trigger FR dump at abort
Added the Auto-Restart design doc: https://docs.google.com/document/d/12JHNWaXlj8nEUF3HhaMtP5CwSG05IYdtMcyq8-hrJ-0/edit?usp=sharing
This pull request enhances the NVIDIA Resiliency Extension (NVRx) InProcess by introducing auto-restart functionality. Previously, NVRx InProcess did not support process restarts, making it unable to recover training jobs from issues like CUDA context corruption or other failures requiring a process restart. Falling back to torchrun or infrastructure-level restarts was costly. This enhancement allows seamless recovery by enabling auto-restart with minimal code changes to the training program.
Changes
- Auto-Restart Implementation: Added `fork_and_monitor` functionality in the `nvidia_resiliency_ext.shared_utils.auto_restart` module, enabling automatic process restarts for NVRx InProcess.
- Unique Iterations Across Restarts: Ensured iterations remain unique across process restarts to prevent key conflicts.
- Clear Initial Barrier Keys: Implemented clearing of initial barrier keys to allow reuse across restarts.
- Integration Simplicity: Training programs (e.g., `pretrain_mamba.py`) can enable auto-restart by adding the following four lines of code:
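For orientation only, and not the four-line integration snippet referred to above: the general fork-and-monitor pattern behind the Auto-Restart item can be sketched as a parent process that forks a child to run the workload and restarts it on a non-zero exit. The names `run_in_child`, `monitor`, and `max_restarts` are hypothetical and do not reflect the actual signature of `fork_and_monitor`. The restart count passed to the workload only loosely mirrors the `job_restart_counter` idea mentioned in the commits.

```python
import os
from typing import Callable


def run_in_child(workload: Callable[[int], None], restart_count: int) -> int:
    """Fork, run `workload` in the child, and return its exit code (POSIX only)."""
    pid = os.fork()
    if pid == 0:
        # Child: run the workload, then exit without returning to the caller.
        code = 0
        try:
            workload(restart_count)
        except BaseException:
            code = 1
        finally:
            os._exit(code)
    # Parent: block until the child terminates.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)  # requires Python 3.9+


def monitor(run_once: Callable[[int], int], max_restarts: int = 3) -> int:
    """Call `run_once` until it returns 0 or the restart budget is spent."""
    restart_count = 0
    while True:
        exit_code = run_once(restart_count)
        if exit_code == 0 or restart_count >= max_restarts:
            return exit_code
        restart_count += 1  # a fresh process gets a new, unique counter value
```

A caller could combine the two as `monitor(lambda n: run_in_child(train, n))`, where `train` is the user's own entry point. Running each attempt in a fresh child means state such as a corrupted CUDA context is discarded with the failed process.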