
Out of Memory: Cannot reproduce T5-XXL run on 8xA10G. #49

Open
slai-natanijel opened this issue Mar 22, 2024 · 3 comments

@slai-natanijel commented Mar 22, 2024

I am trying to reproduce the FLAN-T5-XXL (11B) results from this blog post.

I have an 8xA10G instance. Since the blog shows that you can run FLAN-T5-XXL (11B) training on a 4xA10G setup, I was surprised to see that I get a CUDA OOM error as soon as the first training epoch starts:

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacity of 21.99 GiB of which 723.06 MiB is free. Including non-PyTorch memory, this process has 21.27 GiB memory in use. Of the allocated memory 17.87 GiB is allocated by PyTorch, and 2.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
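
For reference, a minimal sketch of applying the allocator hint from the error message (it has to take effect before the first CUDA allocation, so the safest place is before torch is imported, or as an exported environment variable for the launcher); I have not verified that it resolves the OOM:

```python
# Sketch: apply the allocator setting suggested by the OOM message.
# The variable is read when the CUDA caching allocator is configured,
# so set it before importing torch (or export it in the shell that
# launches the training script).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var is set, on purpose
```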

I have even tried running at batch=1, but that didn't help, and I have double-checked that bf16 is enabled.
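
The relevant parts of the training setup look roughly like this (a sketch of my settings, assuming the Trainer-based setup from the blog post; the output directory and the DeepSpeed config filename are placeholders):

```python
# Rough sketch of the training arguments in my run (paths/filenames are placeholders).
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-out",          # placeholder output directory
    per_device_train_batch_size=1,         # reduced to 1, still OOMs
    bf16=True,                             # double-checked this is enabled
    deepspeed="ds_config_zero3.json",      # placeholder name for the ZeRO stage 3 config
)
```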

Additionally, I have attempted to run the default T5-11B and T5-3B models with Accelerate + DeepSpeed (ZeRO stage 3, bf16), following the instructions from this tutorial, and I also get a CUDA OOM. The only case in which I do not get an OOM is when I run the default T5-Large at batch=1.
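
For the Accelerate + DeepSpeed attempts, the configuration is equivalent to the following (a sketch; in the actual runs these settings come from `accelerate config` rather than being constructed in code):

```python
# Sketch of the ZeRO stage 3 + bf16 setup used for the T5-11B / T5-3B attempts.
# In the actual runs the equivalent settings come from `accelerate config`.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,                   # ZeRO stage 3: shard optimizer states, gradients, parameters
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

# model, optimizer and dataloaders are then wrapped as in the tutorial:
# model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)
```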

I'm not sure where I am going wrong. The error message suggests that something is reserving almost all of the GPU memory before the "real" allocations start (as only 723 MiB is free).

@philschmid (Owner)

Which versions of the libraries are you using, and what sequence length?

@slai-natanijel (Author) commented Mar 22, 2024

python 3.10.9
accelerate 0.28.0
deepspeed 0.14.0
transformers 4.39.0

The output max sequence length is the default value of 128.

Also, are the blog post's DeepSpeed setup and the Hugging Face tutorial's Accelerate + DeepSpeed setup essentially similar in principle?

@slai-natanijel (Author) commented Mar 22, 2024

Update: Running google/flan-t5-xl (3B parameters) with Accelerate + DeepSpeed (ZeRO stage 3, bf16) seems to work at batch=1, but it uses about 88% of the memory on each GPU, which still seems far too high.

Another update: DeepSpeed ZeRO stage 2 and stage 3 appear to take up the same amount of GPU memory on my setup. Perhaps stage 3 is not actually being applied?
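
For comparison, DeepSpeed ships memory estimators for the model states (parameters, gradients, optimizer states; activations are not included), which should show whether stage 2 and stage 3 are even expected to differ on 8 GPUs. A sketch, with import paths as documented for recent DeepSpeed releases:

```python
# Sketch: print DeepSpeed's own per-GPU memory estimates for model states
# under ZeRO-2 vs ZeRO-3 on 8 GPUs of one node (activations not included).
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```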
