Does deepspeed partition the model to multi GPUs? #15

Open
vikki7777 opened this issue Apr 26, 2023 · 4 comments

@vikki7777

I am trying to run your code with the flan-t5-base model on a single machine with multiple V100s (16 GB each), and I ran into the following question:
with a single GPU, GPU memory usage is 11.5 GB;
with 4 GPUs, memory usage on each GPU is 11.7 GB.
DeepSpeed is supposed to partition the model, so I don't understand why the per-GPU memory usage is almost the same.

single GPU:
deepspeed --num_gpus=1 run_seq2seq_deepspeed.py \
    --model_id model \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json

multiple GPUs:
deepspeed --num_gpus=4 run_seq2seq_deepspeed.py \
    --model_id model \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json

ds_config:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
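
For reference, DeepSpeed ships a helper that estimates the per-GPU memory ZeRO-3 needs for model states (parameters, gradients, optimizer states); activations, communication buckets, and generation buffers are not included, and those do not shrink when GPUs are added. A minimal sketch, assuming the helper is exposed at this path in the installed DeepSpeed version and using the public flan-t5-base checkpoint as a stand-in for the local model path:

# Estimate ZeRO-3 model-state memory per GPU (activations not included).
# "google/flan-t5-base" is a stand-in for the local --model_id path.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Compare a single GPU against 4 GPUs on one node.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)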

@philschmid
Owner

Your model is small enough to fit on a single GPU. DeepSpeed then does data parallelism and runs a model replica per GPU, so you should see a faster time to train rather than a lower per-GPU memory footprint.
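
One way to see the effect is DeepSpeed's batch-size relation: the effective train_batch_size is the per-GPU micro batch times gradient accumulation times the number of GPUs, so adding GPUs grows throughput rather than shrinking per-GPU activation memory. A minimal sketch of that arithmetic (a gradient accumulation of 1 is an assumption here):

# Effective batch size under data parallelism (illustrative arithmetic only).
per_device_train_batch_size = 16  # --per_device_train_batch_size
gradient_accumulation_steps = 1   # assumed default
world_size = 4                    # --num_gpus=4

train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size
print(train_batch_size)  # 64 with 4 GPUs vs. 16 with a single GPU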

@vikki7777
Author

I also tried the other two models: flan-t5-xl and flan-t5-xxl.

flan-t5-xl with a single GPU:
deepspeed --include localhost:1 run_seq2seq_deepspeed.py \
    --model_id model_xl \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json

flan-t5-xl with six GPUs:
deepspeed --include localhost:1,2,3,4,5,6 run_seq2seq_deepspeed.py \
    --model_id model_xl \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json

flan-t5-xxl with a single GPU: CUDA out of memory
flan-t5-xxl with six GPUs: also CUDA out of memory

For flan-t5-xxl, both runs go out of memory, and I cannot observe any model partitioning.
For flan-t5-xl, I expected that if the model occupies N GB on a single GPU, then with multiple GPUs each GPU should occupy roughly N divided by the number of GPUs. Is this correct?
However, the results show that with multiple GPUs the memory usage of each GPU actually increases (from 8 GB to 12 GB). Why?
Looking forward to your reply :)
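
A possible way to narrow this down is to measure GPU memory right after the ZeRO-3 engine is built (mostly the partitioned model states plus buckets) and again after one training step (which adds activations, temporaries, and communication buffers that do not shrink with more ranks). A rough sketch; engine and batch are hypothetical placeholders for the objects created inside the training script, not names from run_seq2seq_deepspeed.py:

# Separate model-state memory from activation/buffer memory on one rank.
# `engine` and `batch` stand in for the DeepSpeed engine and one tokenized
# batch built in the training script.
import torch

torch.cuda.reset_peak_memory_stats()
print(f"after engine init: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")

loss = engine(**batch).loss  # forward pass through the wrapped model
engine.backward(loss)
engine.step()

print(f"peak during step:  {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")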

@vikki7777
Author

@philschmid Could you help me solve the above problem? Thanks a lot.

@yulinliu101

yulinliu101 commented Dec 15, 2023

Hi Vikki, did you figure out this issue? I have 4 V100s and have observed similar GPU memory behavior. Even with CPU offload, I wasn't able to fine-tune Flan-T5-XL (fp32) on my hardware. It is also mysterious that I could tune it when loading the model with torch_dtype=torch.bfloat16, even though the V100 reportedly does not support the bf16 dtype. Any insights or recommendations would be super helpful!

Thanks!

Attaching some dependencies info:

CUDA driver==12.1
transformers==4.35.2
torch==2.1.1+cu121
accelerate==0.25.0
peft==0.7.0
deepspeed==0.12.4
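
For reference, the bf16 observation above likely comes down to storage: loading the weights in bfloat16 halves parameter memory compared to fp32 regardless of whether the GPU has native bf16 compute, though on a V100 speed and numerics may differ from Ampere-class cards. A minimal sketch, using the public Hub id as a stand-in for the local model path:

# Load flan-t5-xl with bf16 weights (half the parameter memory of fp32).
# "google/flan-t5-xl" is a stand-in for the local --model_id path; V100s lack
# native bf16 compute, so this mainly saves storage (assumption, not verified here).
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    torch_dtype=torch.bfloat16,
)
print(next(model.parameters()).dtype)  # torch.bfloat16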
