Training is much slower than you described in the paper. #8
Comments
@zhaone Hi, have you found the reason? Here's my environment: 4 * Titan RTX, batch size 128 (4 * 32), distributed training using Horovod. Btw, one more thing I noticed is that my log shows one epoch takes over 2440, while the provided log file shows ~900, and in #2 they report ~1200 (4 * RTX 2080 Ti). But the evaluation results are similar. Here's my training log:
Provided log file:
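For reference, here is a minimal sketch of the kind of Horovod setup described above (4 GPUs, per-GPU batch size 32 for an effective batch size of 128). The model, dataset, learning rate, and epoch count are placeholders, not this repository's actual training code.

```python
# Minimal Horovod + PyTorch training sketch (placeholder model/dataset,
# not the repository's actual training script).
# Launch with: horovodrun -np 4 python train_hvd.py
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(128, 30).cuda()      # placeholder model
dataset = torch.utils.data.TensorDataset(    # placeholder dataset
    torch.randn(20000, 128), torch.randn(20000, 30))

# Each of the 4 workers gets its own shard; batch_size=32 per GPU -> 128 total.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(
    dataset, batch_size=32, sampler=sampler, num_workers=4, pin_memory=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder lr
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Keep all workers consistent at the start of training.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

num_epochs = 36                              # placeholder epoch count
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                 # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
```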
No, I have not solved this problem yet, but your speed is not so ridiculously slow compared with mine (which is 3 times slower than yours). Have you checked where the speed bottleneck is, for example IO?
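One rough way to check this (a sketch, not code from this repository): time the data-loading and the forward/backward parts of each iteration separately. If the loader time dominates, the bottleneck is IO/preprocessing rather than optimization. The `(x, y)` batch format and `loss_fn` here are placeholders for whatever the training script actually uses.

```python
# Rough per-iteration timing that separates DataLoader wait time from compute.
# model / loader / optimizer / loss_fn are whatever the training script already
# builds; the (x, y) batch layout is a placeholder.
import time
import torch

def profile_one_epoch(model, loader, optimizer, loss_fn, log_every=100):
    data_time, compute_time = 0.0, 0.0
    end = time.time()
    for step, (x, y) in enumerate(loader):
        data_time += time.time() - end      # time spent waiting on the DataLoader

        start = time.time()
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()            # make GPU work show up in wall-clock time
        compute_time += time.time() - start

        if step % log_every == 0:
            print(f"step {step}: data {data_time:.1f}s, compute {compute_time:.1f}s")
        end = time.time()
    return data_time, compute_time
```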
@MasterIzumi I have the same question. And when I use …
Hi, I recently wanted to reproduce your results. I can get the metrics you described in the paper, but the training takes much longer (almost 3 days) than you described in the paper (less than 12 hours).
Environment:
I changed horovod to PyTorch DDP, since the horovod framework is really hard to set up (even with the official horovod docker I still got some errors I couldn't resolve). Did I do something wrong? I'm sure that I use DDP correctly, and I'm also sure that the bottleneck of the training speed is the optimization (not IO or something else). Have others met the same problem?
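For comparison, a minimal sketch of the kind of PyTorch DDP setup being described, with one process per GPU; the model, dataset, learning rate, and epoch count are placeholders, not this repository's code. When porting from Horovod, the main points are sharding the data with a DistributedSampler and keeping the per-GPU batch size the same as in the Horovod run so the effective (global) batch size is unchanged.

```python
# Minimal PyTorch DistributedDataParallel sketch (placeholder model/dataset).
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 30).cuda()          # placeholder model
model = DDP(model, device_ids=[local_rank])

dataset = torch.utils.data.TensorDataset(        # placeholder dataset
    torch.randn(20000, 128), torch.randn(20000, 30))
# Shard the data across processes; keep the per-GPU batch size the same as in
# the Horovod run so the effective (global) batch size is unchanged.
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = torch.utils.data.DataLoader(
    dataset, batch_size=32, sampler=sampler, num_workers=4, pin_memory=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder lr

num_epochs = 36                                  # placeholder epoch count
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                     # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
```

A sanity check under this setup is that exactly one process is launched per GPU (here, 4): if only one process runs, DDP quietly trains on a single GPU and the wall-clock time per epoch grows accordingly.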