In section 4.2 (on the VPT Foundation Model Training), the paper states that:

> Preliminary experiments suggested that our model could benefit from 30 epochs of training and that a 0.5 billion parameter model was required to stay in the efficient learning regime [63] for that training duration (Appendix H), which took ~9 days on 720 V100 GPUs.
Could you give some insight as to what required using this many GPUs? Did it have to do with data parallel, model parallel, or yet other reasons?
Thank you.
Hey! You could try poking the authors directly with an email. I am not one of the authors, but my understanding is that they did it purely for data parallelism; even the biggest VPT model fits on a single 32 GB V100. With more GPUs they could shorten the training wall-clock time, so I guess they just used as many as they had available :D
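
For context, a pure data-parallel setup in PyTorch looks roughly like the sketch below: every GPU holds a full replica of the model and only the batch is sharded across ranks, so adding GPUs mainly cuts wall-clock time rather than enabling a bigger model. The model, dataset, and hyperparameters here are placeholders I made up for illustration, not the actual VPT training code.

```python
# Minimal data-parallel training sketch with DistributedDataParallel (DDP).
# Placeholder model/data only -- not the VPT codebase.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process (one per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in pure data parallelism each rank holds a complete
    # copy of the weights (the VPT policy fits on a single 32 GB V100).
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Dummy dataset; DistributedSampler gives each rank a disjoint shard,
    # which is what shrinks wall-clock time as the GPU count grows.
    data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

You would launch one process per GPU with something like `torchrun --nproc_per_node=8 train_ddp.py`; at the paper's scale that just becomes many nodes' worth of identical replicas, each chewing through its own slice of the ~70k hours of video.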