Does deepspeed partition the model to multi GPUs? #15
Comments
Your model is small enough to fit on a single GPU. DeepSpeed then does data parallelism and trains a copy of the model on each GPU, so you should see a faster time to train rather than lower per-GPU memory usage.
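For context (a sketch not taken from this repo): DeepSpeed ships a memory estimator that shows how much of the ZeRO-3 saving applies to a given model. It only covers model states (parameters, gradients, optimizer states), not activations or CUDA buffers, which is largely why the nvidia-smi reading barely changes for a small model such as flan-t5-base. A minimal Python sketch, assuming the estimator in deepspeed.runtime.zero.stage3:

# Sketch only: compare the estimated ZeRO-3 model-state footprint for 1 vs. 4 GPUs.
# The estimate covers params/grads/optimizer states, NOT activations or CUDA buffers.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)

Running both calls makes it visible that the model states shrink with more GPUs while the unchanged activation memory dominates what nvidia-smi reports.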
@philschmid Could you help me solve the above problem? Thanks a lot.
Hi Vikki, did you figure out this issue? I have 4 V100s and have observed a similar situation with GPU memory management. Even with CPU offload, I wasn't able to fine-tune Flan-T5-XL (fp32) on my hardware, although it is also mysterious that I could tune it when loading the model with
Thanks! Attaching some dependency info:
I am trying to run your code with the flan-t5-base model. I have one machine with multiple V100s, each GPU with 16 GB of memory, and I am seeing the following:
when I use a single GPU, the GPU memory usage is 11.5 GB;
when I use 4 GPUs, the memory usage on each GPU is 11.7 GB.
DeepSpeed says it can partition the model, so I don't understand why the GPU memory usage is almost the same.
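One way to pin down the numbers (a hypothetical helper, not part of run_seq2seq_deepspeed.py) is to log per-rank memory from PyTorch itself; nvidia-smi also counts the CUDA context and the caching allocator, so its readings are higher and can mask ZeRO-3 savings:

import torch
import torch.distributed as dist

def log_gpu_memory(tag: str) -> None:
    # Hypothetical helper: print currently allocated and peak tensor memory per rank.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")

# Example: call log_gpu_memory("after step 1") inside the training loop.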
single GPU:
deepspeed --num_gpus=1 run_seq2seq_deepspeed.py \
  --model_id model \
  --dataset_path dataset \
  --epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --generation_max_length 111 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_offload.json
multi GPU:
deepspeed --num_gpus=4 run_seq2seq_deepspeed.py \
  --model_id model \
  --dataset_path dataset \
  --epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --generation_max_length 111 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_offload.json
ds_config:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
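For completeness, a sketch of how a config like this is typically consumed (not a copy of run_seq2seq_deepspeed.py): when the JSON file is passed to the Hugging Face Trainer via the deepspeed argument, the "auto" values are filled in from the TrainingArguments, e.g. learning rate and batch size.

from transformers import Seq2SeqTrainingArguments

# Sketch only; output_dir is a placeholder. The "auto" fields in the JSON above
# are resolved from these arguments by the Trainer's DeepSpeed integration.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-output",
    num_train_epochs=3,
    per_device_train_batch_size=16,   # -> train_micro_batch_size_per_gpu
    per_device_eval_batch_size=16,
    learning_rate=1e-4,               # -> optimizer lr / warmup_max_lr
    deepspeed="configs/ds_flan_t5_z3_offload.json",
)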