FSDP broken on newer versions of pytorch #450
I think it's worth mentioning that while this patch did work for me for medium models, deadlocks were still so prevalent when trying to train large models that I had to rewrite the trainer in my fork with PyTorch Lightning to make them go away. So I think there may be some other issues at play here.
@nateraw Hmm, interesting - I've done a couple of trainings on the large models (using dora) and didn't have any issues with deadlocks.
This may have been hardware-specific - I had issues on H100s, but not A100s.
I'm training on A40s with no issues, so it may be hardware-specific; I did see some other people post issues about H100s.
This commit, pytorch/pytorch@a832967, released in torch 2.1.0, breaks https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/optim/fsdp.py#L126
This ticket (#358) has an implementation that works on torch 2.1.0, but a final implementation should probably include some backward compatibility for older torch versions.
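One way to provide that backward compatibility is to branch on the installed torch version at import time. This is only a sketch of the idea, not the actual fix; the helper name is illustrative, and parsing the version string by hand avoids pulling in an extra dependency (torch versions can carry local suffixes like `2.1.0+cu118` or `2.1.0.dev20230801`):

```python
def torch_is_at_least(version_str: str, minimum: tuple) -> bool:
    """Return True if a torch version string is >= the given (major, minor, micro) tuple.

    Handles local/dev suffixes such as "2.1.0+cu118" or "2.1.0.dev20230801"
    by keeping only the leading numeric components.
    """
    # Drop any local version suffix introduced by "+".
    base = version_str.split("+", 1)[0]
    parts = []
    for piece in base.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            # Stop at the first non-numeric component (e.g. "dev20230801").
            break
        parts.append(int(digits))
    # Pad so that "2.1" compares correctly against (2, 1, 0).
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= minimum


# In the real fsdp.py one would then do something like (illustrative only):
#
#   import torch
#   if torch_is_at_least(torch.__version__, (2, 1, 0)):
#       ...  # new code path from #358
#   else:
#       ...  # existing pre-2.1 code path
```

The design choice here is to gate once on the version rather than catching `AttributeError` at call time, which keeps both code paths explicit and makes it obvious which torch releases each branch targets.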