I am trying to reproduce the FLAN-T5-XXL (11B) results from this blog post.
I have an 8xA10G instance. Since the blog shows that you can run FLAN-T5-XXL (11B) training on a 4xA10G setup, I was surprised to see that I get a CUDA OOM error as soon as the first training epoch starts:
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacity of 21.99 GiB of which 723.06 MiB is free. Including non-PyTorch memory, this process has 21.27 GiB memory in use. Of the allocated memory 17.87 GiB is allocated by PyTorch, and 2.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
I have even tried running at batch size 1, but that didn't help, and I have double-checked that bf16 is enabled.
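As a next step I plan to try the allocator hint from the error message and dump per-GPU memory right before training starts, to see how much is already reserved. A minimal sketch (the helper name and the fallback guards are my own, not from the blog post):

```python
import os

# Must be set before torch is imported (i.e. before the first CUDA allocation),
# as suggested by the OOM error message.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def report_gpu_memory():
    """Hypothetical debugging helper: allocated vs. reserved memory per GPU."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device"
    lines = []
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        lines.append(f"GPU {i}: {allocated:.2f} GiB allocated, "
                     f"{reserved:.2f} GiB reserved")
    return "\n".join(lines)

print(report_gpu_memory())
```

Calling this right after model preparation should show whether the memory is already reserved before the first forward pass.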
Additionally, I have attempted to run the default T5-11B and T5-3B models using Accelerate + DeepSpeed (ZeRO stage 3, bf16) following the instructions from this tutorial, and I also get a CUDA OOM. The only case in which I do not get an OOM is when I run the default T5-Large at batch size 1.
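In case it helps with reproducing, the relevant part of my DeepSpeed config looks roughly like this (expressed as a Python dict; the offload choices are illustrative assumptions, not necessarily the exact config the tutorial ships):

```python
# Sketch of the ZeRO stage 3 fragment of a DeepSpeed config
# (illustrative values, not the exact tutorial config).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}
```

With `offload_optimizer` and `offload_param` both on CPU, an 11B model should not need to hold full parameter and optimizer copies on every GPU.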
I'm not sure where I am going wrong. The error message suggests that something is reserving almost all of the GPU memory before the "real" allocations start (only 723 MiB is free).
Update: Running google/flan-t5-xl (3B parameters) with Accelerate + DeepSpeed (ZeRO stage 3, bf16) seems to work at batch size 1, but uses about 88% of the memory on each GPU, which still seems far too high.
Another update: It seems that ZeRO stage 2 and stage 3 take up the same amount of GPU memory on my setup. Perhaps stage 3 is not fully activating?
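One way I'm thinking of checking whether stage 3 partitioning is actually in effect: as I understand it, ZeRO-3 partitions the parameters and attaches `ds_*` attributes (such as `ds_id`) to each one. That is an implementation detail rather than a documented API, so treat this as a rough debugging heuristic only:

```python
def zero3_partitioning_active(model):
    """Heuristic: ZeRO stage 3 attaches ds_* attributes (e.g. ds_id) to
    partitioned parameters. This relies on a DeepSpeed implementation
    detail, so treat it as a debugging aid, not an API guarantee."""
    return any(hasattr(p, "ds_id") for p in model.parameters())
```

If this returns False right after the model is prepared, the parameters were never partitioned and the stage 3 setting is probably not being picked up from the config.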