Does deepspeed partition the model to multi GPUs? #15
Comments
Your model is small enough to fit on a single GPU. DeepSpeed then does data parallelism and trains a copy of the model on each GPU, so you should see a faster time to train rather than lower per-GPU memory usage.
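For context (a sketch not taken from this repo): DeepSpeed ships a memory estimator that shows how much of the ZeRO-3 saving applies to a given model. It only covers model states (parameters, gradients, optimizer states), not activations or CUDA buffers, which is largely why the nvidia-smi reading barely changes for a small model such as flan-t5-base. A minimal Python sketch, assuming the estimator in deepspeed.runtime.zero.stage3:

# Sketch only: compare the estimated ZeRO-3 model-state footprint for 1 vs. 4 GPUs.
# The estimate covers params/grads/optimizer states, NOT activations or CUDA buffers.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)

Running both calls makes it visible that the model states shrink with more GPUs while the unchanged activation memory dominates what nvidia-smi reports.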
@philschmid Could you help me solve the above problem? Thanks a lot.
Hi Vikki, did you figure out this issue? I have 4 V100s and have observed a similar situation with GPU memory management. Even with CPU offload, I wasn't able to fine-tune Flan-T5-XL (fp32) on my hardware, although it is also mysterious that I could tune it when loading the model with
Thanks! Attaching some dependency info:
I am trying to run your code with the flan-t5-base model. I have one machine with multiple V100s, each GPU with 16 GB of memory, and I am seeing the following:
when I use a single GPU, the GPU memory usage is 11.5 GB;
when I use 4 GPUs, the memory usage on each GPU is 11.7 GB.
DeepSpeed says it can partition the model, so I don't understand why the GPU memory usage is almost the same.
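One way to pin down the numbers (a hypothetical helper, not part of run_seq2seq_deepspeed.py) is to log per-rank memory from PyTorch itself; nvidia-smi also counts the CUDA context and the caching allocator, so its readings are higher and can mask ZeRO-3 savings:

import torch
import torch.distributed as dist

def log_gpu_memory(tag: str) -> None:
    # Hypothetical helper: print currently allocated and peak tensor memory per rank.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")

# Example: call log_gpu_memory("after step 1") inside the training loop.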
single GPU:
deepspeed --num_gpus=1 run_seq2seq_deepspeed.py \
  --model_id model \
  --dataset_path dataset \
  --epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --generation_max_length 111 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_offload.json
multi GPU:
deepspeed --num_gpus=4 run_seq2seq_deepspeed.py \
  --model_id model \
  --dataset_path dataset \
  --epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --generation_max_length 111 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_offload.json
ds_config:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
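For completeness, a sketch of how a config like this is typically consumed (not a copy of run_seq2seq_deepspeed.py): when the JSON file is passed to the Hugging Face Trainer via the deepspeed argument, the "auto" values are filled in from the TrainingArguments, e.g. learning rate and batch size.

from transformers import Seq2SeqTrainingArguments

# Sketch only; output_dir is a placeholder. The "auto" fields in the JSON above
# are resolved from these arguments by the Trainer's DeepSpeed integration.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-output",
    num_train_epochs=3,
    per_device_train_batch_size=16,   # -> train_micro_batch_size_per_gpu
    per_device_eval_batch_size=16,
    learning_rate=1e-4,               # -> optimizer lr / warmup_max_lr
    deepspeed="configs/ds_flan_t5_z3_offload.json",
)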