How long does training usually take? #20

Open
Sherlock-hh opened this issue Sep 30, 2020 · 6 comments

Comments

@Sherlock-hh

I'm training on my own dataset: 321 images in total, epoch=500, batch_size=8 (with 10 it reports out of memory).
2020-09-30 09:26:59.030397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9484 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:07:00.0, compute capability: 6.1)
2020-09-30 09:26:59.033584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9484 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-09-30 09:26:59.036737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 9484 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:89:00.0, compute capability: 6.1)
2020-09-30 09:26:59.039634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 9484 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:8a:00.0, compute capability: 6.1)
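This log should mean that all 4 GPUs are being used, right? So why does it still take me about 10 hours to finish training?
Also, the loss rises and falls periodically during training.
[image]
What causes this? Looking forward to your answer, thanks!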
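
The log lines above only confirm that TensorFlow created a device on each of the four TITAN Xp cards; on their own they do not show that all four are doing training work. By default TensorFlow 1.x also reserves most of the memory on every GPU it can see, which may be why other programs hit out of memory while training runs. A minimal sketch, assuming the process should be limited to a single card: `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable, while the helper name `restrict_visible_gpus` is hypothetical and not part of this repository.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA for this process.

    Must run before TensorFlow (or any other CUDA library) is imported,
    because the variable is read when CUDA initializes.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Hypothetical usage: keep this process on GPU 0 only,
# leaving GPUs 1-3 free for other users of the workstation.
visible = restrict_visible_gpus([0])
print("CUDA_VISIBLE_DEVICES =", visible)
```

With only one GPU visible, the startup log should list a single "Created TensorFlow device" line instead of four.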

@bubbliiiing
Owner

That's 500 epochs... about 50 epochs an hour, so roughly a minute per epoch... is that really a long time?
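
The arithmetic behind this reply can be checked from the numbers given in the issue (about 10 hours, 500 epochs, 321 images, batch size 8). This is only a rough sketch: it assumes every image is used for training and that the last partial batch is kept, neither of which is confirmed in this thread.

```python
import math

total_hours = 10   # "about 10 hours", as reported by the issue author
epochs = 500
num_images = 321
batch_size = 8

epochs_per_hour = epochs / total_hours
seconds_per_epoch = total_hours * 3600 / epochs

# Assumption: all 321 images are training images and the final
# partial batch is not dropped.
steps_per_epoch = math.ceil(num_images / batch_size)
seconds_per_step = seconds_per_epoch / steps_per_epoch

print(f"{epochs_per_hour:.0f} epochs per hour")
print(f"{seconds_per_epoch:.0f} seconds per epoch")
print(f"{steps_per_epoch} steps per epoch")
print(f"{seconds_per_step:.2f} seconds per step")
```

Under these assumptions an epoch takes 72 seconds, spread over 41 batches.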

@Sherlock-hh
Author

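Mainly, I saw someone else train on a single GPU, also for 500 epochs, and they finished in 5 hours, which made me quite anxious. Also, while this is running I can't open any other software; as soon as I do, I get out of memory, and the loss keeps going up and down. Is there a way to save the model while training, so that I can stop at a point where the loss is relatively low?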

@bubbliiiing
Owner

1. More GPUs are not necessarily faster than fewer GPUs.
2. Doesn't it already save the model during training anyway?
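
On keeping the weights from a low-loss epoch: this thread does not show how the repository's training script writes its checkpoints, so the following is a framework-independent sketch rather than the project's code. It assumes a loss value is available after each epoch and that a `save_fn` callback writes the weights to disk; both `save_fn` and the class name `BestLossTracker` are hypothetical, and the loss values are made up only to exercise the logic.

```python
class BestLossTracker:
    """Call save_fn only when the monitored loss improves."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best_loss = float("inf")
        self.best_epoch = None

    def update(self, epoch, loss):
        """Record one epoch; return True if a checkpoint was saved."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_epoch = epoch
            self.save_fn(epoch, loss)
            return True
        return False


saved = []  # stands in for checkpoint files on disk

tracker = BestLossTracker(lambda epoch, loss: saved.append((epoch, loss)))

# Made-up loss values that rise and fall, as described in the issue.
for epoch, loss in enumerate([5.0, 3.2, 3.9, 2.8, 3.1], start=1):
    tracker.update(epoch, loss)

print("saved checkpoints:", saved)
print("best epoch:", tracker.best_epoch)
```

With a loss that fluctuates, this keeps the weights from the lowest point seen so far, so training does not have to be stopped by hand at a good moment. Validation loss is usually a better quantity to monitor than training loss.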

@Sherlock-hh
Author

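Thanks for the reply. Training has finished and the model was saved (previously the problem was that the workstation is shared; whenever someone else ran a job, my code would report insufficient GPU memory and stop, so I kept having to start over). But at test time not a single bounding box is output, orz. What is usually the cause of this? (I have already updated the paths.)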

@bubbliiiing
Owner

@Sherlock-hh
Author

Okay, thanks, I'll go check.
