Training low perf #58
I didn't time each component specifically. For pre-training/main training, each iteration took around 0.28/0.66 seconds.
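A minimal sketch of how per-iteration timings like these can be reproduced, assuming a standard PyTorch training loop where the model's forward returns the loss; the names are illustrative, not the repo's actual API:

```python
import time
import torch

def timed_step(model, batch, optimizer):
    """Run one training step and return its wall-clock duration in seconds."""
    # CUDA kernels launch asynchronously, so synchronize before and after
    # to measure real per-iteration time rather than just launch overhead.
    torch.cuda.synchronize()
    start = time.perf_counter()

    optimizer.zero_grad(set_to_none=True)
    loss = model(batch)   # assumption: the forward pass returns the loss
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()
    return time.perf_counter() - start
```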
Thanks, it is important to have a reference. In my tests we are barely under 20% GPU occupancy... I am investigating.
I've really tested the original and compiled code with the latest stable PyTorch, with PyTorch nightly, with A100 and H100, with different numbers of workers, different numbers of GPUs, with a larger batch size, with a local SSD, using larger images like DAVIS full-res, and with larger crops to fill the memory. In any of these configurations I've achieved a decent GPU load with the
You have (all good then)? Or you haven't...?
No, in the best combo the load is always around 20%.
I see. I think there is a typo in your previous comment.
It is quite high... of course it depends on the number of workers. E.g. the H100 instance has 207 cores; with 98 workers and batch size 32 we get an average CPU load of 50-55%.
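A rough sketch of one way to log GPU and CPU utilization side by side while training runs, assuming the `pynvml` and `psutil` packages are installed; the polling interval and device index are arbitrary:

```python
import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

while True:
    cpu_util = psutil.cpu_percent(interval=1.0)  # blocks ~1 s, averaged over all cores
    gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    print(f"GPU {gpu_util:3d}%  CPU {cpu_util:5.1f}%")
```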
I just tried with the latest code and PyTorch (small model). This is on a different machine, and I had to increase the number of workers in the pre-training stage to 32. I couldn't get it to 90%+ utilization on average, but it is a lot better than 20%. With this utilization the avg_time is similar: 0.283/0.801 for pre-training/main training after warm-up. The pre-training stage is more CPU-intensive and has lower GPU utilization. For reference, below are the screenshots during pre-training and main training respectively. It is likely that with better GPUs like the H100, the CPUs would need to work extra hard to keep the GPUs fed, but in any case, they should not be slower than the 0.283/0.801 avg_time.
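For context, these are the DataLoader settings that usually decide whether the GPU stays fed; a sketch with placeholder values, not the configuration actually used in the repo:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # placeholder for the training dataset
    batch_size=16,
    shuffle=True,
    num_workers=32,           # raise until CPU-side decoding/augmentation keeps up
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # avoid respawning workers every epoch
    prefetch_factor=4,        # batches prefetched per worker
)
```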
Are you getting "good" avg_time?
Currently I am testing only the ... With ...
Can I ask you for some details about your performance numbers for base model training? How much time does a forward and backward pass take? How much time does the dataloader take?
I find it very hard to get even a minimally decent GPU load, even when using a local SSD for the data. I've also tested with a setup similar to the one in your paper, with A100 GPUs.
Have you tested it with PyTorch 2.x?
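One way to answer the first question empirically is to time the data-loading wait and the forward/backward pass separately inside the training loop; a sketch, with `loader`, `model`, and `optimizer` standing in for the actual training objects:

```python
import time
import torch

data_time, compute_time = 0.0, 0.0
t_prev = time.perf_counter()

for batch in loader:
    # Time spent waiting on the DataLoader; if this dominates, the
    # bottleneck is on the CPU/data side rather than the GPU.
    t_data = time.perf_counter()
    data_time += t_data - t_prev

    optimizer.zero_grad(set_to_none=True)
    loss = model(batch)        # assumption: the forward pass returns the loss
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()   # include the asynchronous GPU work

    t_prev = time.perf_counter()
    compute_time += t_prev - t_data

print(f"data loading: {data_time:.1f}s, forward/backward: {compute_time:.1f}s")
```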