
Maybe need accelerator.reduce? Loss scale mismatch DiT official code #16

Open
ShenZhang-Shin opened this issue Sep 14, 2024 · 7 comments

Comments

@ShenZhang-Shin

DiT official code loss log

while Fast DiT loss is much smaller than official loss

I think the Fast-DiT code may be missing a step that gathers the loss across all GPUs. After adding `accelerator.reduce`:
avg_loss = accelerator.reduce(avg_loss, reduction="sum")
the loss matches the official code's result.
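For context, here is a minimal single-process sketch of what that reduce step computes (plain Python, no `accelerate` dependency; the `reduce_sum` helper and the per-rank loss values are illustrative, not taken from either repo):

```python
# Each rank (GPU process) holds its own running-average loss. Without a
# cross-rank reduction, each process logs only its local value; the
# reduce(..., reduction="sum") call adds up the values from every rank.

def reduce_sum(per_rank_values):
    # Stand-in for accelerator.reduce(tensor, reduction="sum"),
    # which returns the sum of the tensor across all processes.
    return sum(per_rank_values)

world_size = 8  # e.g. a single node with 8 GPUs, as in this thread
local_avg_losses = [0.16, 0.15, 0.17, 0.14, 0.16, 0.15, 0.16, 0.15]

summed = reduce_sum(local_avg_losses)   # value every rank sees after the reduce
global_avg = summed / world_size        # divide by world size for the true mean
print(round(global_avg, 4))
```

Whether the logged number should additionally be divided by `world_size` depends on how the rest of the logging code normalizes it; the official DiT training loop, as I recall, all-reduces the loss with a sum and then divides by the world size, which is the normalization being matched here.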

@wangyanhui666

I used a single node with 4 GPUs for training, and the loss is normal, around 0.15. Is this 'reduce' function used for multi-node multi-GPU situations?

Additionally, the model I trained has a very high FID; in the 256x256 setting, after training for 400k steps, the FID is 70. I'm not sure where the problem is.

@ShenZhang-Shin
Author

I use a single node with 8 GPUs.
DiT-S/2 with 256x256 resolution? After 400k steps, my FID is 69.76, which is 1.3 higher than the DiT paper's result.
Maybe FP16, or the VAE pre-extraction? @chuanyangjin

@wangyanhui666

> I use a single node with 8 GPUs. DiT-S/2 with 256x256 resolution? After 400k steps, my FID is 69.76, 1.3 higher than the DiT paper's result. Maybe FP16, or the VAE pre-extraction? @chuanyangjin

I use DiT-XL/2 with 256x256 resolution. After 400k steps, my FID is 70... I do not know why. Maybe it is caused by bf16 training?

@wangyanhui666

Can I get your WeChat to discuss?

> I use a single node with 8 GPUs. DiT-S/2 with 256x256 resolution? After 400k steps, my FID is 69.76, 1.3 higher than the DiT paper's result. Maybe FP16, or the VAE pre-extraction? @chuanyangjin

@ShenZhang-Shin
Author

@wangyanhui666
Well, an FID of 70 is too high for DiT-XL/2; you should check your code.
OK, you can send your WeChat to [email protected] and I will add you.

@dlsrbgg33

@ShenZhang-Shin
Did you fix the performance gap issue?

@ShenZhang-Shin
Author

@dlsrbgg33 Yes, I did.
I used the Fast-DiT tricks and achieved the same performance as DiT without cfg.


3 participants