training speed problem #1071
| I tried to train ResNeXt-50 32x4d on 8x A100 GPUs using the provided scripts. From the training log, I observed that training gets slower from batch to batch within each epoch. Do you know where the issue is? Could other hardware, such as the CPU, be the bottleneck?
Train: 0 [  50/834 (  6%)]  Loss: 6.940 (6.94)  Time: 0.334s, 4598.09/s  (0.898s, 1709.76/s)  LR: 1.000e-04  Data: 0.048 (0.231)
Train: 0 [ 100/834 ( 12%)]  Loss: 6.939 (6.94)  Time: 0.262s, 5861.05/s  (2.308s,  665.63/s)  LR: 1.000e-04  Data: 0.038 (1.417)
Train: 0 [ 150/834 ( 18%)]  Loss: 6.940 (6.94)  Time: 0.256s, 6009.04/s  (3.347s,  458.94/s)  LR: 1.000e-04  Data: 0.036 (1.927)
Train: 0 [ 200/834 ( 24%)]  Loss: 6.926 (6.94)  Time: 11.068s,  138.78/s  (3.993s,  384.71/s)  LR: 1.000e-04  Data: 0.038 (1.584)
Train: 0 [ 250/834 ( 30%)]  Loss: 6.934 (6.94)  Time: 0.257s, 5982.28/s  (4.286s,  358.35/s)  LR: 1.000e-04  Data: 0.031 (1.459)
Train: 0 [ 300/834 ( 36%)]  Loss: 6.925 (6.94)  Time: 0.258s, 5958.37/s  (4.548s,  337.74/s)  LR: 1.000e-04  Data: 0.037 (1.223)
Train: 0 [ 350/834 ( 42%)]  Loss: 6.926 (6.93)  Time: 22.307s,   68.86/s  (4.716s,  325.72/s)  LR: 1.000e-04  Data: 0.030 (1.055)
Train: 0 [ 400/834 ( 48%)]  Loss: 6.923 (6.93)  Time: 0.260s, 5903.79/s  (4.831s,  317.92/s)  LR: 1.000e-04  Data: 0.033 (1.003)
Train: 0 [ 450/834 ( 54%)]  Loss: 6.930 (6.93)  Time: 0.267s, 5750.61/s  (4.895s,  313.80/s)  LR: 1.000e-04  Data: 0.042 (0.940)
Train: 0 [ 500/834 ( 60%)]  Loss: 6.924 (6.93)  Time: 0.253s, 6072.91/s  (4.987s,  308.01/s)  LR: 1.000e-04  Data: 0.038 (0.878)
Train: 0 [ 550/834 ( 66%)]  Loss: 6.927 (6.93)  Time: 0.292s, 5269.00/s  (5.098s,  301.29/s)  LR: 1.000e-04  Data: 0.033 (0.859)
Train: 0 [ 600/834 ( 72%)]  Loss: 6.923 (6.93)  Time: 0.253s, 6066.72/s  (5.119s,  300.08/s)  LR: 1.000e-04  Data: 0.034 (0.909)
Train: 0 [ 650/834 ( 78%)]  Loss: 6.927 (6.93)  Time: 0.259s, 5922.92/s  (5.138s,  298.97/s)  LR: 1.000e-04  Data: 0.041 (1.092)
Train: 0 [ 700/834 ( 84%)]  Loss: 6.927 (6.93)  Time: 14.519s,  105.79/s  (5.202s,  295.25/s)  LR: 1.000e-04  Data: 14.272 (1.326)
Train: 0 [ 750/834 ( 90%)]  Loss: 6.920 (6.93)  Time: 5.617s,  273.48/s  (5.242s,  292.99/s)  LR: 1.000e-04  Data: 1.984 (1.522)
Train: 0 [ 800/834 ( 96%)]  Loss: 6.921 (6.93)  Time: 0.260s, 5896.42/s  (5.285s,  290.62/s)  LR: 1.000e-04  Data: 0.043 (1.484) | 
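One rough way to narrow this down from the log alone is to pick out the logged batches whose step time spikes and check whether the `Data:` wait accounts for the spike. The sketch below is not part of the training scripts. The function name `find_stalls`, the 1-second threshold, and the 50% rule are assumptions made for illustration, and the regular expression is based only on the line format pasted above. Since the log prints every 50th batch, it only samples the stalls.

```python
import re

# Matches the log lines pasted above: batch index, current step time,
# and current data-loading wait. Format inferred from this log only.
LINE_RE = re.compile(
    r"Train:\s*(?P<epoch>\d+)\s*\[\s*(?P<batch>\d+)/\s*(?P<total>\d+).*?"
    r"Time:\s*(?P<time>[\d.]+)s.*?"
    r"Data:\s*(?P<data>[\d.]+)"
)


def find_stalls(log_text, threshold_s=1.0):
    """Return (batch, step_time, data_time, cause) for slow logged batches.

    threshold_s and the 50% rule below are arbitrary illustrative choices.
    """
    stalls = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a training log line
        step_time = float(m.group("time"))
        data_time = float(m.group("data"))
        if step_time < threshold_s:
            continue  # normal-speed batch
        # Heuristic: if over half the step was spent waiting on data,
        # attribute the stall to the input pipeline.
        if data_time > 0.5 * step_time:
            cause = "data loading"
        else:
            cause = "other (not data wait)"
        stalls.append((int(m.group("batch")), step_time, data_time, cause))
    return stalls


# A few lines copied from the log above.
SAMPLE_LOG = """\
Train: 0 [ 200/834 ( 24%)]  Loss: 6.926 (6.94)  Time: 11.068s,  138.78/s  (3.993s,  384.71/s)  LR: 1.000e-04  Data: 0.038 (1.584)
Train: 0 [ 250/834 ( 30%)]  Loss: 6.934 (6.94)  Time: 0.257s, 5982.28/s  (4.286s,  358.35/s)  LR: 1.000e-04  Data: 0.031 (1.459)
Train: 0 [ 700/834 ( 84%)]  Loss: 6.927 (6.93)  Time: 14.519s,  105.79/s  (5.202s,  295.25/s)  LR: 1.000e-04  Data: 14.272 (1.326)
Train: 0 [ 750/834 ( 90%)]  Loss: 6.920 (6.93)  Time: 5.617s,  273.48/s  (5.242s,  292.99/s)  LR: 1.000e-04  Data: 1.984 (1.522)
"""

if __name__ == "__main__":
    for batch, step_time, data_time, cause in find_stalls(SAMPLE_LOG):
        print(f"batch {batch}: step {step_time:.3f}s, data {data_time:.3f}s -> {cause}")
```

Under this heuristic, the stall at batch 700 is dominated by the data wait, while the one at batch 200 is not, so the slow steps in this log may not all share a single cause.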
Replies: 1 comment 2 replies
@rwightman