---
tags: asynchronous checkpointing for minimizing I/O overheads.
---

This feature is not enabled by default. To enable it, set the options described below in `ds_config.json` and install the [DataStates-LLM checkpointing library](https://github.com/DataStates/datastates-llm/). A detailed tutorial is available [here](../../docs/_tutorials/datastates-async-checkpointing.md).
This tutorial shows how to use [DataStates-LLM](https://github.com/DataStates/datastates-llm) for asynchronous checkpointing with DeepSpeed. DataStates-LLM introduces a lazy asynchronous checkpointing mechanism tailored for LLMs that minimizes I/O overhead and improves training efficiency.
## Overview of DataStates-LLM
DataStates-LLM is designed to address the challenges of frequent checkpointing in LLM training through a lazy asynchronous multi-level approach. It leverages the immutability of model parameters and optimizer states during the forward and backward passes to perform non-blocking data transfers, thereby reducing interference with training. As reported in [DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models](https://arxiv.org/abs/2406.10707), this approach achieves up to 48x faster checkpointing and 2.2x faster end-to-end training compared to traditional approaches.
## Prerequisites
Before integrating DataStates-LLM with DeepSpeed, ensure the following:
- **DeepSpeed Installation**: DeepSpeed should be installed in your environment. If not, refer to the [DeepSpeed Getting Started Guide](https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/getting-started.md) for installation instructions.
- **DataStates-LLM Repository**: Access the DataStates-LLM source code from its [GitHub repository](https://github.com/DataStates/datastates-llm) and follow the installation instructions provided therein.
## Configuring DeepSpeed for DataStates-LLM
To enable DataStates-LLM's asynchronous checkpointing within DeepSpeed, modify the `deepspeed_config.json` file to include a `datastates_ckpt` section. Below is an example configuration:
```json
{
    // ... other DeepSpeed configuration options
    "datastates_ckpt": {
        "host_cache_size": 16
    }
}
```
### Configuration Parameters
- **`host_cache_size`**: Specifies the amount of pinned host memory (in gigabytes) reserved for asynchronous checkpoint flushing. Adjust this value based on your system's memory capacity and the size of your model checkpoints.
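
For illustration, here is a minimal sketch of passing the same configuration to `deepspeed.initialize` as a Python dict instead of a JSON file. The model, batch size, and optimizer settings are placeholders for this example; only the `datastates_ckpt` section is specific to DataStates-LLM, and ZeRO stage 1 is chosen because it is the only stage DataStates-LLM has been tested with (see the limitations below).

```python
import torch
import deepspeed

# Placeholder model; substitute your own LLM here.
model = torch.nn.Linear(1024, 1024)

# Illustrative DeepSpeed configuration. Only "datastates_ckpt" is specific
# to DataStates-LLM; the remaining values are generic placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},  # the only stage tested with DataStates-LLM
    "datastates_ckpt": {
        "host_cache_size": 16  # pinned host memory (GB) for async checkpoint flushes
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```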
## Implementing DataStates-LLM in Your Training Script
After enabling DataStates checkpointing in `deepspeed_config.json`, the checkpointing frequency can be configured by specifying the number of iterations between checkpoints with the command-line parameter `--save-interval`.
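
The `--save-interval` flag applies when training through a launcher that exposes it (such as Megatron-DeepSpeed); in a hand-written training loop, the equivalent behavior is to call the engine's `save_checkpoint` method at the desired interval. The following sketch assumes the `model_engine` from the configuration example above, plus a hypothetical `data_loader` and `loss_fn`:

```python
# Minimal sketch of a checkpointing cadence; `data_loader` and `loss_fn`
# are placeholders for your own data pipeline and loss computation.
save_interval = 100            # mirrors the --save-interval flag
checkpoint_dir = "./checkpoints"

for step, (inputs, labels) in enumerate(data_loader, start=1):
    outputs = model_engine(inputs)
    loss = loss_fn(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()

    if step % save_interval == 0:
        # With datastates_ckpt enabled, this call returns quickly and the
        # checkpoint is flushed to storage asynchronously in the background.
        model_engine.save_checkpoint(checkpoint_dir, tag=f"step_{step}")
```

To resume training, `model_engine.load_checkpoint(checkpoint_dir)` restores the most recent checkpoint written to that directory.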
## Limitations and Ongoing Work
1. DataStates-LLM currently supports only the CUDA runtime on NVIDIA GPUs.
2. DataStates-LLM has only been tested with ZeRO stage 1, without offloading to other tiers (e.g., CPU or NVMe).
3. While the checkpoint layout of DataStates matches Hugging Face's [safetensors](https://huggingface.co/docs/safetensors/) format, it is not yet fully compatible with the safetensors library because DeepSpeed requires pickled objects during restart.
4. DataStates-LLM does not yet support universal or elastic checkpointing.
## Questions and Support
Please use the [DataStates-LLM GitHub repository](https://github.com/DataStates/datastates-llm) for any questions, issues, or feature requests.