CPU offload when not using offload deepspeed config file #19
Comments
Can you share the code you use? Do you only want to do inference? What hardware do you have available?
Hi Philipp, many thanks for getting back. For the code, I modified your script; I have included it below. I am essentially calling trainer.predict(). At this point I am just running inference with the pre-trained Flan T5 XXL model.

For the hardware, I used a g5.48xlarge instance with bf16 enabled. There is a related issue when I try the same thing (inference only, with the pre-trained Flan T5 XXL) on a p3dn.24xlarge instance, this time with fp32. When I use your config file (ds_flan_t5_z3_config.json) I get OOM even with a batch size of 1. Here as well I call trainer.predict() and pass the whole CNN/DailyMail test set as an HF Dataset. I did check Flan T5 XL on the p3dn instance with fp32 and it works without OOM. Let me know if you need other details.

Code:
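(The script itself is not included above; for context, a rough sketch of the call pattern described in this comment might look like the following. The model id, config path, prompt/preprocessing, and generation settings here are assumptions for illustration, not the poster's actual code.)

```python
# Hypothetical sketch of the inference-only setup described above -- not the
# actual script. Model name, config path, dataset columns, and generation
# settings are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "google/flan-t5-xxl"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Zero-shot summarization: tokenize only the CNN/DailyMail test split.
raw_test = load_dataset("cnn_dailymail", "3.0.0", split="test")

def preprocess(batch):
    inputs = ["summarize: " + article for article in batch["article"]]
    return tokenizer(inputs, max_length=512, truncation=True)

test_dataset = raw_test.map(
    preprocess, batched=True, remove_columns=raw_test.column_names
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-xxl-inference",
    per_device_eval_batch_size=1,          # the batch size that still OOMs in fp32
    predict_with_generate=True,
    generation_max_length=64,
    bf16=True,                             # bf16 on g5 (A10G); fp32 on p3dn (V100)
    deepspeed="ds_flan_t5_z3_config_bf16.json",  # config from the blog post
)

trainer = Seq2SeqTrainer(model=model, args=training_args, tokenizer=tokenizer)

# Inference only: no training, just predictions over the whole test set.
predictions = trainer.predict(test_dataset)
```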
On a similar note: what prevents p3/p3dn-type instances with V100 GPUs from training Flan-T5-XXL? I tried Flan-T5-XL on 4 p3.16xlarge instances and debugged with SageMaker (the CloudWatch metrics for GPU memory utilization are a red herring and always show ~100% usage on all nodes). The following graphs show no more than 30% GPU memory usage on all 4 p3.16xlarge nodes at any point during training. By that logic, shouldn't Flan-T5-XXL also fit on the same cluster, or at least on a bigger one? (The 14-node p3.16xlarge cluster I tried failed with OOM almost instantaneously.) I'd like to understand the limitation of DeepSpeed's sharding abilities. Is the most granular piece being sharded still too big to fit on a single 16 GB V100 GPU? I'd be surprised if that were the case. Or is it because each GPU gathers the weights it needs for computation from the other GPUs and OOMs due to the accumulated parameter size?
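For reference on that last point: ZeRO-3 shards the persistent copy of every parameter across GPUs, but each layer is still all-gathered in full on every GPU for its own forward/backward pass, so the transient gathered parameters (plus activations) can exhaust a 16 GB V100 even when the sharded state fits. A hedged sketch of the DeepSpeed knobs that bound this transient memory is below; the values are illustrative, not taken from the repo's config.

```python
# Illustrative ZeRO-3 settings (not from the repo's config) that bound how many
# full, all-gathered parameters can coexist on a GPU at once.
zero3_tuning = {
    "zero_optimization": {
        "stage": 3,
        # Upper bound on the number of full parameters kept "live" (gathered)
        # at any moment; lower values trade speed for peak memory.
        "stage3_max_live_parameters": 5e8,
        # How far ahead parameters are prefetched; smaller = less peak memory.
        "stage3_prefetch_bucket_size": 5e7,
        # Released parameters reused within this distance are kept instead of re-fetched.
        "stage3_max_reuse_distance": 5e8,
    }
}
# This dict could be merged into the DeepSpeed config passed to the HF Trainer
# (TrainingArguments(deepspeed=<dict or path>)) when debugging OOM on 16 GB V100s.
```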
Hi Philipp,
thanks for your awesome blog on training Flan T5 XXL. I am playing around with it and doing just zero-shot inference using the ds_flan_t5_z3_config_bf16.json DeepSpeed config file. I believe this should not do any offload; however, I see the following in the DeepSpeed logs:
I am also seeing logs mentioning trace cache. Is this related to CPU offload?
Thanks again and looking forward to your reply.
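(As background, here is a minimal sketch of what a ZeRO-3 bf16 config with no CPU offload looks like, assuming ds_flan_t5_z3_config_bf16.json is along these lines; the actual file may differ. Offload only happens when offload_optimizer / offload_param blocks with a "cpu" or "nvme" device are present.)

```python
# Minimal sketch of a ZeRO-3 bf16 config *without* CPU offload -- assumed to
# resemble ds_flan_t5_z3_config_bf16.json, but not copied from it.
ds_config_no_offload = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # No "offload_optimizer" or "offload_param" entries here -> no CPU offload.
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```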