Merge pull request #271 from Tencent/develop
upgrade to v0.4.4
feifeibear authored Dec 8, 2021
2 parents 5e96f2e + 497c61b commit f5fee95
Showing 8 changed files with 71 additions and 460 deletions.
19 changes: 19 additions & 0 deletions CHANGE_LOG.md
@@ -0,0 +1,19 @@
### v0.4.4 Dec. 2021
The system is successfully evaluated on a multi-node cluster.
The benchmark scripts are integrated with memory-centric tiling borrowed from DeepSpeed.
It trains an 18B model on WeChat Yard.


### v0.4.3 Nov. 2021
The system is evaluated on A100 SuperPod.
Some optimizations are developed to further improve model scale and efficiency, including memory-saving communication (MSC) and allocation cache (CACHE).
A severe bug caused by async chunk copy using streams is identified and fixed.
It trains a 50B model on an 8xA100 SuperPod node.


### v0.4.0 Nov. 2021
The system is upgraded with a better memory tracer.
We further improve the max model scale over v0.3.0 (15B vs. 12B) on the WeChat Yard platform.

### v0.3.0 Oct. 2021
Our initial version significantly surpasses DeepSpeed in both model scale and computing efficiency.
11 changes: 5 additions & 6 deletions README.md
@@ -1,10 +1,9 @@
## PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management

![logo](./logo.png)
### News
1. Nov. 2021, v0.4.3 released. PatrickStar is evaluated on A100 SuperPod. Some execution options are provided, including the memory saving communication technique and the memory allocation cache. It trains a 40B model on a SuperPod node.
2. Nov. 2021, v0.4.0 released. With a better memory tracer, PatrickStar further improves the max model scale over v0.3.0 (15B vs 12B).
3. Oct. 2021, v0.3.0 released. Our initial version significantly surpasses DeepSpeed.

### Recent Progress
See [CHANGE_LOG.md](./CHANGE_LOG.md).

### Meeting PatrickStar
Pre-Trained Models (PTM) are becoming the hotspot of both NLP research and industry application. However, the training of PTMs requires enormous hardware resources, which makes it accessible to only a small portion of people in the AI community. Now, **PatrickStar will make PTM training available to everyone!**
@@ -13,10 +12,10 @@ Out of memory error (OOM) is the nightmare of every engineer training PTMs. To p

### System Design
The idea of PatrickStar is as follows. The non-model data (mainly activations) varies during training, but the current heterogeneous training solutions **statically** split the model data between CPU and GPU. To make better use of the GPU, PatrickStar proposes **dynamic** memory scheduling with the help of a chunk-based memory management module. The memory management of PatrickStar supports offloading everything but the currently computing part of the model to the CPU to save GPU memory. In addition, chunk-based memory management is efficient for collective communication when scaling to multiple GPUs.
See [this doc](./INSIDE.md) for the idea behind PatrickStar.
See the paper and [this doc](./INSIDE.md) for the idea behind PatrickStar.

### Results
In experiment, Patrickstar v0.4.3 is able to train a **15 Billion**(15B) param model with 8xTesla V100 GPU and 240GB GPU memory, which is twice as large as the state of art. And the performance of PatrickStar is better for models of the same size as well. The pstar is PatrickStar v0.4.3. The deeps indicates performance of DeepSpeed v0.4.3 using the official example [DeepSpeed example](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) zero3 stage with activation optimzations openning by default.
In our experiments, PatrickStar v0.4.3 is able to train an **18 billion** (18B) parameter model with 8x Tesla V100 GPUs and 240GB of GPU memory, which is over twice as large as the state of the art. The performance of PatrickStar is also better for models of the same size. pstar is PatrickStar v0.4.3; deeps is DeepSpeed v0.4.3 using the official [DeepSpeed example](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) with ZeRO stage 3 and activation optimizations enabled by default.

![alt perf](./doc/mgpu_scalability.png "performance testing result")

Binary file modified doc/mgpu_scalability.png
41 changes: 34 additions & 7 deletions doc/optimization_options.md
@@ -1,11 +1,11 @@
This page explains the optimization options for benchmarking.
Optimizations is divided into PatrickStar-related ones and general ones.
General Optimizations can be applied to any PyTorch-based frameworks.
Optimizations are divided into PatrickStar-related ones and general ones.
General Optimizations can be applied to any PyTorch-based framework.

## General Optimizations
1. Activation Checkpointing (a.k.a. gradient checkpointing in [PyTorch](https://pytorch.org/docs/stable/checkpoint.html))
`--use_ckp`
Make sure this option is open for large model training. It can largely save activation memory footprint at cost of recomputing.
Make sure this option is enabled for large model training. It greatly reduces the activation memory footprint at the cost of recomputation. A minimal PyTorch sketch is given at the end of this section.

2. Activation Offloading
`--with_activation_offload`
@@ -14,21 +14,23 @@ Note you have to use activation checkpoing first.

3. CPU Embedding
`--use_cpu_embedding`
nn.Embedding is conducted on CPU, save GPU memory. More importantly, it shrinks the chunk size. For some small model, the biggest layer is Embedding. Therefore, the chunk size has to larger than the embedding numel.
nn.Embedding is executed on the CPU, saving GPU memory. More importantly, it shrinks the chunk size: for some small models the largest layer is the Embedding, and the chunk size has to be larger than the embedding numel.


4. Tiling Linear (a.k.a Memory-centric tiling in [DeepSpeed](https://deepspeed.readthedocs.io/en/stable/zero3.html#memory-centric-tiling))
`--with_tiling_linear`
Memory-centric tiling (MCT) is able to split a param tensor of linear into pieces, and they do not need to be stored in contiguous memory space. This will help reduce chunk size. To achieve the best performance you have to tune the in_splits/out_splits of the parameters of the function.
Memory-centric tiling (MCT) can split the parameter tensor of a linear layer into pieces that do not need to be stored in contiguous memory, which helps reduce the chunk size. However, to achieve the best performance, you have to tune the in_splits/out_splits parameters of the function. A concept-only sketch in plain PyTorch follows below.
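
The snippet below is a concept-only sketch of what memory-centric tiling does, written in plain PyTorch rather than the DeepSpeed implementation linked above: one large linear layer is replaced by several smaller linears whose weights live in separate, non-contiguous tensors and whose outputs are concatenated. `TiledLinearSketch` and its parameters are illustrative assumptions, not code from this repository.

```python
import torch
from torch import nn


class TiledLinearSketch(nn.Module):
    """Concept sketch: split one Linear(in_features, out_features) along the
    output dimension into `out_splits` smaller Linear modules, so no single
    parameter tensor has to be as large (or as contiguous) as the original."""

    def __init__(self, in_features, out_features, out_splits=4):
        super().__init__()
        assert out_features % out_splits == 0
        self.tiles = nn.ModuleList(
            [nn.Linear(in_features, out_features // out_splits) for _ in range(out_splits)]
        )

    def forward(self, x):
        # Each tile computes a slice of the output; concatenating the slices
        # restores the original output shape.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)


# Mathematically equivalent (up to initialization) to nn.Linear(4096, 16384),
# but the weight is held as four separate 4096x4096 pieces.
layer = TiledLinearSketch(4096, 16384, out_splits=4)
y = layer(torch.randn(2, 4096))
```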

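To make the recomputation tradeoff of option 1 (`--use_ckp`) concrete, here is a minimal, self-contained PyTorch sketch using the standard `torch.utils.checkpoint` API, which is the mechanism gradient checkpointing is based on. The `ToyBlock`/`ToyModel` classes are illustrative assumptions and not part of this repository.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    """A feed-forward block standing in for a transformer layer."""

    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x):
        return self.ff(x)


class ToyModel(nn.Module):
    def __init__(self, num_layers=4, hidden_dim=1024, use_ckp=True):
        super().__init__()
        self.use_ckp = use_ckp
        self.blocks = nn.ModuleList([ToyBlock(hidden_dim) for _ in range(num_layers)])

    def forward(self, x):
        for block in self.blocks:
            if self.use_ckp and self.training:
                # Activations inside `block` are not stored; they are recomputed
                # during the backward pass, trading extra compute for memory.
                x = checkpoint(block, x)
            else:
                x = block(x)
        return x


if __name__ == "__main__":
    model = ToyModel().train()
    out = model(torch.randn(2, 16, 1024, requires_grad=True))
    out.sum().backward()
```
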
## PatrickStar-related Optimizations

1. Memory Saving Communication.
`--with_mem_saving_com`
Use one-to-all communication to replace the original collective communication. More specifically, reduce scatter is replaced with Nx reduce. all gather is replaced with Nx bcast. In this way, we do not need to keep a Nx chunk buffer for distributed training, therefore saving the GPU memory. This method also changes the CPU-GPU and intra-GPU communication volume. In general, it reduces CPU-GPU comm volume at a cost of increasing intra-GPU bcast comm volume and also lower the intra-GPU bcast bandwidth. However, for some cases, it can improve the overall performance of the system from such tradeoff. It is suitable for training an extremely large model with a computing cluster with high-quality intra-GPU communication bandwidth, i.e. 50B model on a node of SuperPod. Details in Merge Request #250.
Use one-to-all communication to replace the original collective communication: reduce-scatter is replaced with N reduce operations, and all-gather is replaced with N broadcast operations. In this way, we do not need to keep an Nx chunk buffer for distributed training, which saves GPU memory. This method also changes the CPU-GPU and intra-GPU communication volume. In general, it reduces the CPU-GPU communication volume at the cost of increasing the intra-GPU broadcast volume and lowering the intra-GPU broadcast bandwidth. In some cases, however, this tradeoff improves the overall performance of the system. It is suitable for training an extremely large model on a cluster with high-quality intra-GPU communication bandwidth, e.g. a 50B model on a SuperPod node. Details in Merge Request #250. A schematic sketch of the communication pattern is given at the end of this section, after the memory tracer config.

2. Memory Allocation Caching.
`--with_mem_cache`
Use a cache to allocate and release chunk memory. The cache is a size-limited queue, whose capacity is default as 2. It is helpful for Memory Saving Communication in distributed training. It avoid frequent release and allocate memory for remote chunks. See detail in #241.
Use a cache to allocate and release chunk memory. The cache is a size-limited queue whose capacity defaults to 2. It is helpful for Memory Saving Communication in distributed training, since it avoids frequently releasing and re-allocating memory for remote chunks. See details in #241.


2. Hybrid ADAM:
`--use_hybrid_adam`
@@ -51,3 +53,28 @@ PatirckStar is famous for dynamic partition model data. With help of this flag y
6. Release Remote Chunk After Initialization.
`release_after_init`
This is an option for distributed training that is irrelevant to computing efficiency. It allocates memory for remote chunks but releases it immediately, so that the model parameters are randomly initialized exactly as in a serial version, which solves the random seed problem. It is used in combination with the `--res_check` option to check the correctness of distributed training.

7. Adjusting the CPU and GPU memory quota of the memory tracer.
We provide ways to adjust the CPU and GPU memory usage quota of the memory tracer. This optimization is not exposed as command-line parameters. As shown in pretrain_bert_demo.py, there is a JSON config for the memory tracer setting; you can adjust the four values with the ratio suffix.

`warmup_gpu_chunk_mem_ratio`: the fraction of a GPU's memory that can be used for chunks during the warmup iteration.

`overall_gpu_mem_ratio`: the available GPU memory size / the real GPU memory capacity. Turn up the value if you encounter CPU or GPU OOM during iterations.

`overall_cpu_mem_ratio`: the available CPU memory size / the real CPU memory capacity. Turn up the value if you encounter CPU or GPU OOM during iterations.

`margin_use_ratio`: the fraction of the remaining GPU space (the space left after excluding the peak chunk-used space measured during the warmup FWD+BWD) that is used to host optimizer states on the GPU.

`use_fake_dist`: a debug flag that simulates multiple GPUs on a single GPU. It was useful when we had limited hardware; now that multi-GPU machines are available, this flag is deprecated.

```
"mem_tracer": {
"use_async_mem_monitor": args.with_async_mem_monitor,
"warmup_gpu_chunk_mem_ratio": 0.1,
"overall_gpu_mem_ratio": 0.8,
"overall_cpu_mem_ratio": 0.8,
"margin_use_ratio": 0.8,
"use_fake_dist": False,
"with_static_partition": args.with_static_partition,
},
```
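
To make the communication-pattern change of `--with_mem_saving_com` easier to picture, here is a schematic `torch.distributed` sketch of "one collective replaced by N smaller operations", as described in the Memory Saving Communication item above. This is not PatrickStar's actual implementation; the function names, the `use_fn` callback, and the chunk layout are illustrative assumptions.

```python
import torch
import torch.distributed as dist

# All functions assume the process group has been initialized, e.g. via
# dist.init_process_group(backend="nccl").


def gather_chunks_all_gather(local_chunk):
    """Baseline pattern: a single all_gather needs a buffer that is
    world_size times the chunk size, held all at once."""
    bufs = [torch.empty_like(local_chunk) for _ in range(dist.get_world_size())]
    dist.all_gather(bufs, local_chunk)
    return bufs  # Nx chunk memory alive at the same time


def gather_chunks_n_bcast(local_chunk, use_fn):
    """Memory-saving pattern: N broadcasts instead of one all_gather. Each
    remote chunk is broadcast from its owner, consumed by `use_fn`, then
    dropped, so only one extra chunk-sized buffer is alive at a time."""
    rank = dist.get_rank()
    for src in range(dist.get_world_size()):
        buf = local_chunk if src == rank else torch.empty_like(local_chunk)
        dist.broadcast(buf, src=src)  # replaces the single all_gather
        use_fn(src, buf)              # e.g. run the compute that needs this chunk
        del buf                       # drop the reference so remote buffers can be freed


def reduce_grad_chunks_n_reduce(grad_chunks):
    """Gradient side: one reduce_scatter becomes N reduce ops; chunk i is
    reduced only onto its owner rank i, so no Nx scatter buffer is needed."""
    for owner, grad_chunk in enumerate(grad_chunks):
        dist.reduce(grad_chunk, dst=owner, op=dist.ReduceOp.SUM)
```

Whether the N-operation variant wins overall depends on the tradeoff described above: less CPU-GPU traffic and no Nx buffer, in exchange for more broadcast traffic at lower bandwidth.
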
12 changes: 11 additions & 1 deletion examples/pretrain_bert_demo.py
@@ -592,11 +592,16 @@ def visit_and_register_hooks(module):
SEQ_LEN = 1024
NUM_LAYER = 65
NUM_HEAD = 16
elif MODEL_NAME == "GPT3_15B":
elif MODEL_NAME == "GPT3_18B":
HIDDEN_DIM = 4096
SEQ_LEN = 1024
NUM_LAYER = 78
NUM_HEAD = 16
elif MODEL_NAME == "GPT3_17B":
HIDDEN_DIM = 4096
SEQ_LEN = 1024
NUM_LAYER = 90
NUM_HEAD = 16
# The following configs come from the paper
# Efficient Large-Scale Language Model Training on GPU Clusters
# NV model is wider in hidden-size
@@ -627,6 +632,11 @@ def visit_and_register_hooks(module):
SEQ_LEN = 1024
NUM_LAYER = 50
NUM_HEAD = 16
elif MODEL_NAME == "GPT_DS_50B":
HIDDEN_DIM = 8192
SEQ_LEN = 1024
NUM_LAYER = 62
NUM_HEAD = 16
elif MODEL_NAME == "GPT_DS_60B":
HIDDEN_DIM = 8192
SEQ_LEN = 1024
