Instruction tuning of LLama2 is significantly slower compared to documented 3 hours fine-tuning time on A10G. #35

Open
mlscientist2 opened this issue Sep 20, 2023 · 1 comment

Comments

@mlscientist2

mlscientist2 commented Sep 20, 2023

Hi,

First of all, thanks for putting together the nicely formatted code for fine-tuning LLaMA 2 in 4-bit.
I was able to follow all the steps and set up training of the model as shown in your tutorial / IPython notebook: https://www.philschmid.de/instruction-tune-llama-2

Your tutorial mentions that the training time on a g5.2xlarge without flash attention is around 3 hours. However, running your code shows an estimated training time of 40 hours! Can you help narrow down the difference/issue?

I am attaching some screenshots. At a high level, I suspect there is a bottleneck in the data loader, since the code is only using one CPU core. I tried adding the `num_workers` flag in `TrainingArguments`, but that did not help. GPU utilization seems decent.
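For reference, a minimal sketch of how dataloader workers can be set through `TrainingArguments` (assuming the standard `transformers` API; the other values below are illustrative, not copied from the tutorial or my actual script):

```python
from transformers import TrainingArguments

# Minimal sketch: the setting relevant to the suspected bottleneck is
# dataloader_num_workers, which controls how many CPU worker processes
# prepare batches for the GPU. Other values here are illustrative only.
training_args = TrainingArguments(
    output_dir="llama2-7b-int4-instruct",  # hypothetical output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    dataloader_num_workers=4,      # >0 enables multiple CPU data-loading workers
    dataloader_pin_memory=True,    # pin host memory for faster host-to-GPU copies
)
```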

[3 screenshots attached]

Any thoughts @philschmid ?

@mlscientist2 mlscientist2 changed the title Instruction tuning of LLama2 is significantly slower compared to claimed 3 hours fine-tuning time on A10G. Instruction tuning of LLama2 is significantly slower compared to documented 3 hours fine-tuning time on A10G. Sep 21, 2023
@hassantsyed

hassantsyed commented Nov 30, 2023

Any ideas here?
I'm seeing 11 hours for a 1024 context length and 22 hours for 2048. Would love to get this down to 3!
