In section 4.2 (on the VPT Foundation Model Training), the paper states that:

> Preliminary experiments suggested that our model could benefit from 30 epochs of training and that a 0.5 billion parameter model was required to stay in the efficient learning regime [63] for that training duration (Appendix H), which took ~9 days on 720 V100 GPUs.
Could you give some insight as to what required using this many GPUs? Did it have to do with data parallel, model parallel, or yet other reasons?
Thank you.
Hey! You could try poking the authors directly with an email. I am not one of the authors, but my understanding is that they did it purely for data parallelism; even the biggest VPT model fits on a single 32 GB V100. With more GPUs they could shorten the training wall-clock time, so I guess they just used as many as they had available :D
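
For context, a pure data-parallel setup in PyTorch looks roughly like the sketch below: every GPU holds a full replica of the model and only the batch is sharded across ranks, so adding GPUs mainly cuts wall-clock time rather than enabling a bigger model. The model, dataset, and hyperparameters here are placeholders I made up for illustration, not the actual VPT training code.

```python
# Minimal data-parallel training sketch with DistributedDataParallel (DDP).
# Placeholder model/data only -- not the VPT codebase.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process (one per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in pure data parallelism each rank holds a complete
    # copy of the weights (the VPT policy fits on a single 32 GB V100).
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Dummy dataset; DistributedSampler gives each rank a disjoint shard,
    # which is what shrinks wall-clock time as the GPU count grows.
    data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

You would launch one process per GPU with something like `torchrun --nproc_per_node=8 train_ddp.py`; at the paper's scale that just becomes many nodes' worth of identical replicas, each chewing through its own slice of the ~70k hours of video.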