- **Framework Integration**
  - Facilitating seamless [fault tolerance](https://nvidia.github.io/nvidia-resiliency-ext/fault_tolerance/integration/ptl.html) and [straggler detection](https://nvidia.github.io/nvidia-resiliency-ext/straggler_det/usage_guide.html#integration-guide) integration with PyTorch Lightning based workloads.
  - Providing integration with NVIDIA [NeMo](https://docs.nvidia.com/nemo-framework/user-guide/latest/resiliency.html) framework, a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g., Automatic Speech Recognition and Text-to-Speech).

is an instantiation of the core utilities to make `torch.save` run asynchronously.

:py:func:`nvidia_resiliency_ext.checkpointing.async_ckpt.state_dict_saver.save_state_dict_async_plan` is an instantiation of the core utilities to make `torch.distributed.checkpoint.save_state_dict` run asynchronously.

The implementation assumes that all training ranks create :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncCallsQueue` and synchronize with :py:meth:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncCallsQueue.maybe_finalize_async_calls` by default.
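
Below is a minimal, illustrative sketch of this pattern, assuming one `AsyncCallsQueue` per training rank. Only `AsyncCallsQueue`, `maybe_finalize_async_calls`, and `save_state_dict_async_plan` are taken from this guide; the `schedule_async_request` call and the exact arguments and return value of `save_state_dict_async_plan` are assumptions made for illustration.

.. code-block:: python

    # Illustrative sketch only: each training rank owns one AsyncCallsQueue and
    # periodically drains finished checkpoint requests.
    # NOTE: `schedule_async_request` and the exact signature/return value of
    # `save_state_dict_async_plan` are assumptions, not the documented API.
    from nvidia_resiliency_ext.checkpointing.async_ckpt.core import AsyncCallsQueue
    from nvidia_resiliency_ext.checkpointing.async_ckpt.filesystem_async import FileSystemWriterAsync
    from nvidia_resiliency_ext.checkpointing.async_ckpt.state_dict_saver import save_state_dict_async_plan


    def training_loop(model, train_step, num_steps=1000, save_interval=100):
        async_queue = AsyncCallsQueue()  # created on every training rank

        for step in range(1, num_steps + 1):
            train_step(model)

            # Non-blocking: finalize any checkpoint requests that already finished.
            async_queue.maybe_finalize_async_calls(blocking=False)

            if step % save_interval == 0:
                writer = FileSystemWriterAsync(f"/checkpoints/step_{step}")
                # Plan the distributed save; the background worker performs the
                # actual write once the request is scheduled on the queue.
                async_request = save_state_dict_async_plan(model.state_dict(), writer)
                async_queue.schedule_async_request(async_request)

        # Block until every outstanding checkpoint write has been finalized.
        async_queue.maybe_finalize_async_calls(blocking=True)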
Requirements
------------

:py:mod:`nvidia_resiliency_ext.checkpointing.utils` includes a couple of routines used by :py:mod:`nvidia_resiliency_ext.checkpointing.async_ckpt.core`. (These requirements apply to the fork-based caller; see the next section for the evolution toward the persistent, spawn-based implementation.)

:py:func:`nvidia_resiliency_ext.checkpointing.utils.wrap_for_async` disables garbage collection in the forked process that runs the user's checkpoint routine, to prevent failures caused by the GC trying to deallocate CUDA tensors in a forked process. This routine requires that the first argument of the passed user function be a state dictionary containing the tensors or objects to checkpoint.

The fork-based implementation uses a forked process that operates on tensors pre-staged to host memory via pinned-memory copies. The routine should therefore call :py:func:`nvidia_resiliency_ext.checkpointing.utils.preload_tensors` to stage the GPU tensors of a state dictionary in host memory before the dictionary is passed to `AsyncCallsQueue`, as sketched below.
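
The sketch below illustrates, under stated assumptions, how these requirements fit together for the fork-based caller: the user's save routine takes the state dictionary as its first argument and is wrapped with `wrap_for_async`, and `preload_tensors` stages GPU tensors to host memory before the request is scheduled. The `AsyncRequest` field names and the `schedule_async_request` method are assumptions made for illustration.

.. code-block:: python

    # Illustrative sketch of the fork-based requirements described above.
    # NOTE: the AsyncRequest field names and `schedule_async_request` are
    # assumptions; only wrap_for_async and preload_tensors come from this guide.
    import torch

    from nvidia_resiliency_ext.checkpointing.async_ckpt.core import AsyncCallsQueue, AsyncRequest
    from nvidia_resiliency_ext.checkpointing.utils import preload_tensors, wrap_for_async


    def save_routine(state_dict, path):
        # The state dictionary must be the first argument, as wrap_for_async requires.
        torch.save(state_dict, path)


    def schedule_fork_based_save(async_queue: AsyncCallsQueue, state_dict: dict, path: str):
        # Stage GPU tensors to (pinned) host memory so the forked process never
        # touches CUDA tensors directly.
        host_state_dict = preload_tensors(state_dict)

        # Keep garbage collection disabled in the forked worker so it does not
        # try to deallocate CUDA tensors there.
        safe_save = wrap_for_async(save_routine)

        request = AsyncRequest(            # field names assumed for illustration
            async_fn=safe_save,
            async_fn_args=(host_state_dict, path),
            finalize_fns=[],
        )
        async_queue.schedule_async_request(request)  # method name assumed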
Implementation Changes and Evolution
------------------------------------

* We have deprecated our initial implementation of async checkpointing, :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.TemporalAsyncCaller`, which used a forked process to run checkpointing in the background.

* :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncCallsQueue` is now initialized by default to use :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.PersistentAsyncCaller` instead of :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.TemporalAsyncCaller`.

* :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.PersistentAsyncCaller` spawns a persistent process that runs in a separate CUDA context and optionally forks processes for intra-node parallelism.

* :py:func:`~nvidia_resiliency_ext.checkpointing.utils.wrap_for_async` is no longer needed, because garbage collection is safe to run in the process spawned by :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.PersistentAsyncCaller`.

* :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.PersistentAsyncCaller` runs :py:func:`~nvidia_resiliency_ext.checkpointing.async_ckpt.filesystem_async.FileSystemWriterAsync.preload_tensors` in the spawned process. A new field, :py:attr:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncRequest.preload_fn`, has therefore been added to pass the preload function (``preload_fn``) to the spawned process.

* ``preload_fn`` should be self-contained, with its arguments bound via :py:class:`functools.partial` (see the sketch after this list).

* ``preload_fn`` should be a function that takes a state dictionary and returns a state dictionary.

* :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.PersistentAsyncCaller` receives GPU tensor IPC handles and prestages them to host memory through ``preload_fn``, so GPU tensors should be dereferenced promptly inside ``preload_fn`` whenever possible.

* Proper termination of the persistent process is required for a graceful shutdown.

* Job schedulers and launchers (e.g., Slurm, torchrun) should clean up the persistent process and its child workers when the job step is terminated.

* The following changes will be made to the implementation of :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.core.PersistentAsyncCaller` in the next release:

  * The persistent process will be terminated when the main process is terminated.

  * Optional child workers created by :py:class:`~nvidia_resiliency_ext.checkpointing.async_ckpt.filesystem_async.FileSystemWriterAsync` will be terminated when the persistent process is terminated.
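
Below is a minimal sketch of a self-contained ``preload_fn`` built with :py:class:`functools.partial`, as referenced in the list above. The staging function shown here is hand-written purely for illustration; in practice ``FileSystemWriterAsync.preload_tensors`` plays this role, and the callable is attached to the request via ``AsyncRequest.preload_fn``. The commented request construction shows assumed field names.

.. code-block:: python

    # Illustrative preload_fn: takes a state dict, returns a state dict with CUDA
    # tensors copied to pinned host memory, and drops GPU tensor references promptly.
    # In practice FileSystemWriterAsync.preload_tensors is used instead of this.
    from functools import partial

    import torch


    def stage_to_host(state_dict, non_blocking=True):
        def _to_host(obj):
            if torch.is_tensor(obj) and obj.is_cuda:
                host = torch.empty(obj.shape, dtype=obj.dtype, device="cpu", pin_memory=True)
                host.copy_(obj, non_blocking=non_blocking)
                return host  # the GPU tensor reference is dropped here
            if isinstance(obj, dict):
                return {k: _to_host(v) for k, v in obj.items()}
            if isinstance(obj, (list, tuple)):
                return type(obj)(_to_host(v) for v in obj)
            return obj

        return _to_host(state_dict)


    # Bind every argument except the state dictionary so the callable is
    # self-contained and can be shipped to the spawned persistent process.
    preload_fn = partial(stage_to_host, non_blocking=True)

    # The callable is then passed to the spawned process via the request, e.g.
    # (fields other than preload_fn are assumptions):
    # request = AsyncRequest(async_fn=..., async_fn_args=(...), finalize_fns=[...],
    #                        preload_fn=preload_fn)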
Synchronization of Asynchronous Checkpoint Requests
----------------------------------------------------

docs/source/release-notes.md

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

## NVIDIA Resiliency Extension v0.4.1

### Highlights

This hotfix release includes important bug fixes, performance improvements, and minor updates to enhance stability.

- Checkpointing
  - [PR 104](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/104), [PR 106](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/106), [PR 108](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/108), [PR 111](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/111), and [PR 116](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/116) fix the asynchronous checkpointing module to switch from the temporal worker to a persistent worker that uses `spawn` instead of `fork`.
  - The fix in this release works toward an intermediate milestone of deprecating the use of `fork` in favor of `spawn` for asynchronous checkpointing. The complete transition to `spawn` has the following dependencies on `fork`, which will be eliminated in an upcoming release:
    - Local checkpointing must continue to use the `fork`-based asynchronous checkpointing, as clarified in the usage guide.
    - File I/O operations with multiprocessing can still trigger a `fork`.
- In-process restart
  - [PR 103](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/103) fixes a case where extra CUDA contexts were created on local rank 0 after restart, consuming extra GPU memory on local rank 0.
  - [PR 112](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/112) fixes workload state leaks across the restart boundary. The fix addresses a case where objects created in the wrapped function could not be garbage collected after a restart, manifesting as a memory leak.

### Known Issues & Limitations

- In a future release, we will add changes to automatically terminate the persistent process when the main process terminates.
- Until this change is implemented, job schedulers must ensure proper termination of the persistent process and its child workers for a graceful shutdown.

## NVIDIA Resiliency Extension v0.4.0

### Highlights

- Checkpointing
  - [PR 29](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/29) - Support for storing checkpoints to cloud object stores
    - Leverage cloud storage providers' multithreaded SDKs for rapid loading and saving of checkpoints to object stores such as AWS S3, Azure Blob Storage, Google Cloud Storage, and more, using NVIDIA Multi-storage Client
    - Provide a scalable, reliable, and cheaper single source of truth across clouds/regions
    - Provide an opt-out configuration when creating a FileSystemWriterAsync class instance to allow users to pass through to the filesystem
  - [PR 36](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/36) - Critical bug fix to enable async checkpoint loading without errors