While trying to train with resnext_101_64x4d taken from the fastai/fastai/models repository, using FP32, there is a huge spike in memory usage at the start of training. I observed that if I reduce the batch size from 512 to something like 64 or 32 (depending on the network being trained), the training goes through, but it is a lot slower than it would be with a batch size of 512.
Shortly after the start of training, probably after the first batch, the memory usage drops drastically. For example, here is a capture of memory usage over time for one of the GPUs in the machine, with the batch size set to 32 and everything else unchanged, for resnext_101_64x4d.
Memory usage in MiB, sampled over time: 73, 73, 81, 122, 508, 680, 1084, 1094, 1094, 1094, 1094, 3048, 14024, 13976, 4776, 9356, 4696, 1706, 3110, 5170, 5596, 5596
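(The exact way this trace was captured isn't stated above; as a rough sketch, a similar per-batch log can be produced from inside the training loop with PyTorch's allocator statistics. Note this reports the allocator's peak, while nvidia-smi-style numbers also include the CUDA context overhead.)

```python
# Sketch only: log the PyTorch allocator's peak usage per batch to reproduce a trace
# like the one above. The function name and its placement in the loop are illustrative.
import torch

def log_peak_memory(step, device=0):
    peak_mib = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print(f"step {step}: peak allocated {peak_mib:.0f} MiB")
    torch.cuda.reset_max_memory_allocated(device)  # start fresh for the next batch
```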
Note that the same pattern repeats on all 8 GPUs, except for small variations in the values.
As can be seen above, with a batch size of 32 the memory occupancy is only 5596 MiB for the rest of the training (up to 14 epochs; after that the size changes and the memory usage changes with it). The rest of the GPU memory sits unused.
If it were possible to reduce this initial spike in memory usage, or if there were some way to switch to a bigger batch size once training reaches a stable state, it would make the training a lot faster. I tried setting up a bigger batch size from epoch 2 by adding an additional phase with sz: 128 and bs: 512, but it doesn't seem to work for some reason (see the sketch below).
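For reference, assuming the phases are given as a list of dicts with ep/sz/bs keys (the exact schedule format here is an assumption; only sz: 128, bs: 512, starting at epoch 2, and the bs=32 fallback are from the description above), the attempted schedule would look roughly like this:

```python
# Hypothetical sketch of the attempted schedule, not the exact config used.
phases = [
    {'ep': 0, 'sz': 128, 'bs': 32},   # epochs 0-1: small batch to get past the initial spike
    {'ep': 2, 'sz': 128, 'bs': 512},  # from epoch 2: larger batch once memory has settled
]
```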
Thanks for this amazing work.