Huge spike in memory usage during initialization #13

Open
manjunaths opened this issue Sep 18, 2018 · 1 comment
@manjunaths

While trying to train resnext_101_64x4d, taken from the fastai/fastai/models repository, using FP32, there is a huge spike in memory usage during initialization. I observed that if I reduce the batch size from 512 to something like 64 or 32 (the exact value depends on the network being trained), training goes through, but it is a lot slower than it would be with a batch size of 512.

At the start of training, probably after the first batch, the memory usage goes down drastically. For example, here is a capture of memory usage for one of the GPUs in the machine, with the batch size set to 32 and everything else unchanged, for resnext_101_64x4d (a sketch of how such a trace can be collected follows the numbers).

Memory usage in MiB
73
73
81
122
508
680
1084
1094
1094
1094
1094
3048
14024
13976
4776
9356
4696
1706
3110
5170
5596
5596
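(For reproducibility: the trace above is per-GPU used memory in MiB sampled over time. A minimal sketch of one way to collect such a trace by polling nvidia-smi is below; the exact capture method used here may have been different.)

```python
import subprocess
import time

def sample_gpu_memory(gpu_index=0, interval_s=1.0, duration_s=60):
    """Poll nvidia-smi and print used memory (MiB) for one GPU."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        out = subprocess.check_output([
            "nvidia-smi",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
            "-i", str(gpu_index),
        ])
        used_mib = int(out.decode().strip())
        samples.append(used_mib)
        print(used_mib)
        time.sleep(interval_s)
    return samples

if __name__ == "__main__":
    sample_gpu_memory(gpu_index=0, interval_s=1.0, duration_s=30)
```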

Note that the same thing repeats on all 8 GPUs, with only small variations in the values.
As can be seen above, with a batch size of 32 the memory occupancy is only 5596 MiB for the rest of the training (up to 14 epochs; after that it changes due to the size). The rest of the memory sits unused.
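(Side note on the "rest of the memory is unused" point: the MiB figures above are what the GPU reports; PyTorch's own counters can show how much the framework itself has allocated at and after the peak. A minimal check, assuming the training here runs on PyTorch:)

```python
import torch

def report_cuda_memory(device=0):
    """Print PyTorch's view of allocated GPU memory on one device, in MiB."""
    mib = 1024 ** 2
    print("allocated now :", torch.cuda.memory_allocated(device) // mib, "MiB")
    print("allocated peak:", torch.cuda.max_memory_allocated(device) // mib, "MiB")

# Call this e.g. right after model/optimizer setup and again after the first
# batch, to see whether the spike is PyTorch allocations or something else.
```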

If it were possible to reduce this initial spike in memory usage, or if there were some way to switch to a bigger batch size once the training reaches a stable state, it would make the training a lot faster. I tried setting a bigger batch size from epoch 2 by adding an additional phase with sz : 128 and bs : 512, but it doesn't seem to work for some reason.
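For clarity, the schedule I tried looks roughly like this (a sketch only; the keys ep / sz / bs follow the phase format mentioned above, and the small-batch starting values are illustrative):

```python
# Illustrative phase schedule: small batch at the start to survive the
# initialization spike, then a bigger batch from epoch 2 onward.
phases = [
    {'ep': 0, 'sz': 128, 'bs': 32},   # epochs 0-1: small batch size
    {'ep': 2, 'sz': 128, 'bs': 512},  # from epoch 2: larger batch size
]
```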

Thanks for this amazing work.

@yaroslavvb
Collaborator

Hm, could the spike be somehow connected to data preloading? @bearpelican
