
Commit 0731c6e

Merge pull request #298 from Tencent/develop
Add SamplesPerSec as a performance indicator
2 parents eab1bf3 + 82a6071 commit 0731c6e

File tree: 5 files changed (+13 -10 lines)

- CHANGE_LOG.md
- doc/optimization_options.md
- examples/README.md
- examples/pretrain_bert_demo.py → examples/pretrain_demo.py
- examples/run_transformers.sh

CHANGE_LOG.md (+1 -3)

```diff
@@ -1,9 +1,7 @@
 ## v0.4.5 Dec. 2021
+Refactor the files in example and add chunk size searching.
 Evaluate on 8 nodes of SuperPod. Fix bugs in multi-GPU mem tracer.
 
-## v0.4.5 Dec. 2021
-Refactor the files in example and add chunk size searching.
-
 
 ### v0.4.4 Dec. 2021
 The system is successfully evaluated on a multi-node system.
```

doc/optimization_options.md (+1 -1)

```diff
@@ -55,7 +55,7 @@ PatirckStar is famous for dynamic partition model data. With help of this flag y
 The is a computing efficient irrelevant option used for distributed training. It allocates memory for remote chunks but release it immediately. In this way, we can make sure the model parameter is randomly initialized the same as a serial version. Solve the problem with random seed. It is used in combination with the `--res_check` option to check the correctness of distributed training.
 
 7. Adjusting the quota of CPU and GPU memory of memory tracer.
-We provide ways to adjust the CPU and GPU memory usage quota for the memory tracer. We did not expose this optimization as parameters passed through the command line. As shown in the pretrain_bert_demo.py, there is a JSON config for the memory tracer setting. You can adjust the four ratio suffix values.
+We provide ways to adjust the CPU and GPU memory usage quota for the memory tracer. We did not expose this optimization as parameters passed through the command line. As shown in the pretrain_demo.py, there is a JSON config for the memory tracer setting. You can adjust the four ratio suffix values.
 
 `warmup_gpu_chunk_mem_ratio`: the max gpu memory of a GPU can be used for chunks during the warmup iteration.
 
```
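
The four "ratio suffix" knobs mentioned in this file end up in the config dict that `pretrain_demo.py` passes to `initialize_engine`. Below is a minimal, hedged sketch of what that memory-tracer section can look like: only `warmup_gpu_chunk_mem_ratio` is named in the documentation text above; the `mem_tracer` section name and the other three `_ratio` keys are assumptions used for illustration, so check `pretrain_demo.py` for the exact spelling.

```python
# Hedged sketch of the memory-tracer part of the PatrickStar config
# (see doc/optimization_options.md). Only warmup_gpu_chunk_mem_ratio is
# documented above; every other key name here is an assumption.
config = {
    "mem_tracer": {  # assumed section name
        # Max share of a GPU's memory that chunks may occupy during the
        # warmup iteration (documented above).
        "warmup_gpu_chunk_mem_ratio": 0.1,
        # Assumed names: overall share of GPU / CPU memory the tracer may
        # plan with, and how much of the spare (margin) memory to use.
        "overall_gpu_mem_ratio": 0.8,
        "overall_cpu_mem_ratio": 0.8,
        "margin_use_ratio": 0.8,
    },
}
```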

examples/README.md (+2 -2)

````diff
@@ -19,15 +19,15 @@ python huggingface_bert.py
 
 ### Use PatrickStar to train large model
 
-`run_transformers.sh` and `pretrain_bert_demo.py` is an example to train large PTMs with PatrickStar. You could run different size of model by adding config to`run_transformers.sh`.
+`run_transformers.sh` and `pretrain_demo.py` is an example to train large PTMs with PatrickStar. You could run different size of model by adding config to`run_transformers.sh`.
 
 The following command will run a model with 4B params:
 
 ```bash
 env MODEL_NAME=GPT2_4B RES_CHECK=0 DIST_PLAN="patrickstar" bash run_transformers.sh
 ```
 
-For the available `MODEL_NAME`, please check `pretrain_bert_demo.py`.
+For the available `MODEL_NAME`, please check `pretrain_demo.py`.
 
 Check the accuracy of PatrickStar with Bert:
 
````

examples/pretrain_bert_demo.py renamed to examples/pretrain_demo.py (+8 -3)

```diff
@@ -38,7 +38,7 @@
 from data_loader import get_bert_data_loader
 from patrickstar.profiler import profiler
 from patrickstar.runtime import initialize_engine
-from patrickstar.utils import see_memory_usage
+from patrickstar.utils import see_memory_usage, get_world_size
 from patrickstar.utils.logging import log_dist, logger
 from patrickstar.utils.model_size_calculator import get_ps_model_size
 from model_builder import build_transformer_model
@@ -180,10 +180,13 @@ def test_transformer_model_helper(
             f"After step {n}. using {dist_plan}, gradient checkpoint: {is_ckp}, fp16 {is_fp16}",
             force=True,
         )
+        world_size = get_world_size()
         if dist_plan == "patrickstar":
             print(
                 f'{"[WARM UP] " if n == 0 else ""}'
-                f"Step {n} elaspe {step_elapse} s, {total_macs / 1e12 / step_elapse} Tflops"
+                f"Step {n} elaspe {step_elapse} s, "
+                f"{total_macs / 1e12 / step_elapse} Tflops Per GPU "
+                f"{args.batch_size * world_size/step_elapse} SamplesPerSec"
             )
             if n == num_steps - 1:
                 global_timer.my_timer.print()
@@ -193,7 +196,9 @@ def test_transformer_model_helper(
                 global_timer.data_move_cnter.reset()
         else:
             print(
-                f"Step {n} elaspe {step_elapse} s, {total_macs / 1e12 / step_elapse} Tflops"
+                f"Step {n} elaspe {step_elapse} s, "
+                f"{total_macs / 1e12 / step_elapse} Tflops Per GPU "
+                f"{args.batch_size * world_size/step_elapse} SamplesPerSec"
             )
 
         log_dist(f"End Step {n} with {dist_plan}.\n")
```
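
The metric this commit adds is plain global throughput: every rank processes `args.batch_size` samples per step, so a step that takes `step_elapse` seconds handles `batch_size * world_size` samples across all GPUs. Below is a self-contained sketch of the same arithmetic with made-up numbers for illustration; in the script the values come from its timers, its MAC counter, and `get_world_size()`.

```python
# Standalone illustration of the two per-step numbers printed above.
# All values below are made up; pretrain_demo.py takes them from its
# timers, its MAC counter, and get_world_size().

batch_size = 8        # per-GPU batch size (args.batch_size)
world_size = 4        # number of ranks reported by get_world_size()
step_elapse = 2.5     # wall-clock seconds for this training step
total_macs = 5.0e13   # operations counted for the step on one GPU

tflops_per_gpu = total_macs / 1e12 / step_elapse         # -> 20.0
samples_per_sec = batch_size * world_size / step_elapse  # -> 12.8

print(f"{tflops_per_gpu} Tflops Per GPU {samples_per_sec} SamplesPerSec")
```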

examples/run_transformers.sh (+1 -1)

```diff
@@ -202,7 +202,7 @@ done
 else
 env OMP_NUM_THREADS=${TNUM} timeout -s SIGKILL 30m python -m torch.distributed.launch --nproc_per_node=${GPU_NUM} \
 --nnodes=${NNODES} --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
-pretrain_bert_demo.py \
+pretrain_demo.py \
 --default_chunk_size=${CHUNK_SIZE} \
 ${cmd_opts} \
 2>&1 | tee ${LOG_DIR}/${LOG_FILE}
```
