
Out of Memory: Cannot reproduce T5-XXL run on 8xA10G. #49

Open
slai-natanijel opened this issue Mar 22, 2024 · 3 comments

@slai-natanijel commented Mar 22, 2024

I am trying to reproduce the FLAN-T5-XXL (11B) results from this blog post.

I have an 8xA10G instance. Since the blog shows that you can run FLAN-T5-XXL (11B) training on a 4xA10G setup, I was surprised to see that I get a CUDA OOM error as soon as the first training epoch starts:

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacity of 21.99 GiB of which 723.06 MiB is free. Including non-PyTorch memory, this process has 21.27 GiB memory in use. Of the allocated memory 17.87 GiB is allocated by PyTorch, and 2.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
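
For reference, a minimal sketch of applying the allocator hint from the error message (it has to take effect before the first CUDA allocation, so the safest place is before torch is imported, or as an exported environment variable for the launcher); I have not verified that it resolves the OOM:

```python
# Sketch: apply the allocator setting suggested by the OOM message.
# The variable is read when the CUDA caching allocator is configured,
# so set it before importing torch (or export it in the shell that
# launches the training script).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var is set, on purpose
```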

I have even tried running at batch=1, but that didn't help, and I have double-checked that bf16 is enabled.
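
The relevant parts of the training setup look roughly like this (a sketch of my settings, assuming the Trainer-based setup from the blog post; the output directory and the DeepSpeed config filename are placeholders):

```python
# Rough sketch of the training arguments in my run (paths/filenames are placeholders).
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-out",          # placeholder output directory
    per_device_train_batch_size=1,         # reduced to 1, still OOMs
    bf16=True,                             # double-checked this is enabled
    deepspeed="ds_config_zero3.json",      # placeholder name for the ZeRO stage 3 config
)
```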

Additionally, I have attempted to run the default T5-11B and T5-3B models with Accelerate + DeepSpeed (ZeRO stage 3, bf16), following the instructions from this tutorial, and I also get a CUDA OOM. The only case in which I do not get an OOM is when I run the default T5-Large at batch=1.
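
For the Accelerate + DeepSpeed attempts, the configuration is equivalent to the following (a sketch; in the actual runs these settings come from `accelerate config` rather than being constructed in code):

```python
# Sketch of the ZeRO stage 3 + bf16 setup used for the T5-11B / T5-3B attempts.
# In the actual runs the equivalent settings come from `accelerate config`.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,                   # ZeRO stage 3: shard optimizer states, gradients, parameters
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

# model, optimizer and dataloaders are then wrapped as in the tutorial:
# model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)
```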

I'm not sure where I am going wrong. The error message suggests that something is reserving almost all of the GPU memory before the "real" allocations start (as only 723 MiB is free).

@philschmid (Owner)

Which versions of the libraries are you using, and what sequence length?

@slai-natanijel (Author) commented Mar 22, 2024

python 3.10.9
accelerate 0.28.0
deepspeed 0.14.0
transformers 4.39.0

The output max sequence length is the default value of 128.

Also, are the blog post's DeepSpeed setup and the Hugging Face tutorial's Accelerate + DeepSpeed setup essentially similar in principle?

@slai-natanijel (Author) commented Mar 22, 2024

Update: Running google/flan-t5-xl (3B parameters) with Accelerate + DeepSpeed (ZeRO stage 3, bf16) seems to work at batch=1, but it uses about 88% of the memory on each GPU, which still seems far too high.

Another update: DeepSpeed ZeRO stage 2 and stage 3 appear to take up the same amount of GPU memory on my setup. Perhaps stage 3 is not actually being applied?
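
For comparison, DeepSpeed ships memory estimators for the model states (parameters, gradients, optimizer states; activations are not included), which should show whether stage 2 and stage 3 are even expected to differ on 8 GPUs. A sketch, with import paths as documented for recent DeepSpeed releases:

```python
# Sketch: print DeepSpeed's own per-GPU memory estimates for model states
# under ZeRO-2 vs ZeRO-3 on 8 GPUs of one node (activations not included).
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```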
