Maybe need accelerator.reduce? Loss scale mismatch with DiT official code #16
I used a single node with 4 GPUs for training, and the loss is normal, around 0.15. Is this 'reduce' function only needed for multi-node, multi-GPU setups? Additionally, the model I trained has a very high FID: in the 256x256 setting, after training for 400k steps, the FID is 70. I'm not sure where the problem is.
I use a single node with 8 GPUs.
I use DiT-XL/2 at 256x256 resolution. After 400k steps, my FID is 70... I don't know why. Maybe it's caused by bf16 training?
Can I get your WeChat to discuss?
@wangyanhui666
@ShenZhang-Shin
@dlsrbgg33 yes, I did |
DiT official code loss log [screenshot]
The Fast-DiT loss is much smaller than the official loss.
I think the Fast-DiT code may be missing a step that gathers the loss across all GPUs before logging. After adding accelerator.reduce:
avg_loss = accelerator.reduce(avg_loss, reduction="sum")
the logged loss matches the official code.
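For reference, a minimal sketch of how the logging block could gather the loss across processes with Accelerate. The variable names (running_loss, log_steps) and the placement inside the training loop are illustrative assumptions based on this discussion, not the repository's exact code; summing across processes and then dividing by accelerator.num_processes is what makes the logged number comparable to the official DiT log.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Illustrative per-process stats accumulated between log points
# (in a real training loop these come from the loss of each step).
running_loss = 0.15 * 100   # sum of per-step losses on this process
log_steps = 100             # number of steps since the last log

# Per-process average loss as a tensor on this process's device
avg_loss = torch.tensor(running_loss / log_steps, device=accelerator.device)

# Sum the per-process averages across all GPUs, then divide by the number
# of processes so the logged value is the global mean. This mirrors the
# official DiT logging, which uses dist.all_reduce(..., ReduceOp.SUM)
# followed by a division by the world size.
avg_loss = accelerator.reduce(avg_loss, reduction="sum")
avg_loss = avg_loss.item() / accelerator.num_processes

if accelerator.is_main_process:
    print(f"average train loss: {avg_loss:.4f}")
```

Note that this only changes what is logged, not the gradients: reduction="sum" followed by the division by the number of processes reports the mean loss across all GPUs rather than a single process's value.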