
Increasing the batch size causes an error #20

Open
LewisLeiyongsheng opened this issue Jul 10, 2023 · 4 comments

Comments

@LewisLeiyongsheng

[screenshot: error traceback]
As shown in the screenshot, training runs normally with a batch size of 8, but raising it to 16 or above triggers an error. I'm using a Tesla V100 with 32 GB of memory, so in theory even a batch size of 80 should fit.

@LewisLeiyongsheng
Author

After debugging I located the error: with a larger batch size, the data becomes NaN. I traced it to the 3D convolution in ShuffleNetV2; the error appears right after the convolution operation.
[screenshot: NaN values appearing after the ShuffleNetV2 3D convolution]
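One way to pin down where NaNs first appear, as described above, is to register forward hooks on every leaf module and record which layers produce NaN outputs. This is a minimal sketch assuming a PyTorch model; the tiny `nn.Sequential` model and layer names below are illustrative, not from this repo.

```python
import torch
import torch.nn as nn

def attach_nan_hooks(model, bad_layers):
    """Record the name of every leaf module whose output contains NaN."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                bad_layers.append(name)
        return hook
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            module.register_forward_hook(make_hook(name))

# Usage sketch: a stand-in model, with one NaN injected into the input
model = nn.Sequential(nn.Conv3d(1, 4, 3, padding=1), nn.ReLU())
bad = []
attach_nan_hooks(model, bad)
x = torch.randn(2, 1, 8, 8, 8)
x[0, 0, 0, 0, 0] = float('nan')  # simulate corrupted input data
model(x)
print(bad)  # the first name listed is where NaN first appeared
```

Because the hooks fire in forward order, the first entry in `bad` is the earliest layer whose output went NaN; if that is the very first layer, the NaN was already in the input.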

@LewisLeiyongsheng
Author

Now the error also occurs with a batch size of 8. Further debugging shows that some input samples contain a very large number of zeros. Could there be a problem with my data loading?
[screenshot: input sample containing many zeros]
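A quick sanity check for the data-loading question above is to flag samples whose fraction of exactly-zero values is suspiciously high before they reach the model. This is a hypothetical sketch; the 0.9 threshold and function names are assumptions, not from the repo.

```python
def zero_fraction(sample):
    """Fraction of values in a flat sample that are exactly zero."""
    return sum(1 for v in sample if v == 0) / len(sample)

def suspicious_samples(batch, threshold=0.9):
    """Indices of samples in the batch that are mostly zeros."""
    return [i for i, s in enumerate(batch) if zero_fraction(s) >= threshold]

# Usage sketch with two toy samples
batch = [
    [0.0] * 95 + [1.0] * 5,   # 95% zeros: likely a loading problem
    [0.5] * 100,              # no zeros
]
print(suspicious_samples(batch))  # [0]
```

Running this check on a few batches straight out of the DataLoader can distinguish a preprocessing bug (zeros already in the loaded samples) from a problem inside the network.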

@T-wow

T-wow commented Jul 1, 2024

Hello, may I ask why the loss is abnormally large when training with DDP, while single-GPU training does not show this? Thanks for any help.
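One common cause of the symptom described above (not confirmed for this repo) is a logging mismatch: DDP averages gradients across ranks, but the loss value each rank computes is local to its own shard of the batch, so summing losses across ranks, or summing instead of averaging over samples, makes the logged loss look inflated. A minimal sketch of averaging the scalar loss over all ranks for logging, with a single-process fallback:

```python
import torch
import torch.distributed as dist

def global_mean_loss(local_loss: torch.Tensor) -> torch.Tensor:
    """Average a scalar loss over all DDP ranks for logging purposes.

    Falls back to the local value when no process group is initialized
    (e.g. single-GPU runs), so the same logging code works in both modes.
    """
    if dist.is_available() and dist.is_initialized():
        loss = local_loss.clone()
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        return loss / dist.get_world_size()
    return local_loss

# Single-process usage: returns the local loss unchanged
print(global_mean_loss(torch.tensor(2.5)))
```

It is also worth checking that the loss criterion uses mean reduction (`reduction='mean'`) rather than `'sum'`, since a summed loss scales with the per-rank batch size.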

@LewisLeiyongsheng
Author

I'm not sure either. Just train on a single GPU if you can; the training time isn't that long anyway.
