Distributed training #8
Comments
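@taqta The launch command is python -m torch.distributed.run --nproc_per_node=8 train.py --cuda -dist -d ava_v2.2. Without -dist, single-machine multi-GPU DDP training will not start. I have not tried DDP training myself yet because I do not have the resources, so I am not sure whether it contains any bugs.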
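To illustrate why the flag matters, here is a minimal sketch of how a script could gate its distributed setup on `-dist`. The long option name `--distributed` and the helper names `build_parser` and `process_layout` are assumptions made for illustration and may not match the real train.py. The environment variables `RANK`, `LOCAL_RANK` and `WORLD_SIZE` are the ones torch.distributed.run exports to each worker.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical definition of the flag; the real train.py may differ.
    parser.add_argument("-dist", "--distributed", action="store_true",
                        help="enable DDP training")
    return parser


def process_layout(args, environ=None):
    """Return (rank, local_rank, world_size) for the current process."""
    environ = os.environ if environ is None else environ
    if not args.distributed:
        # Flag absent: behave as a single process, even under a launcher.
        return 0, 0, 1
    # torch.distributed.run exports these variables for every worker.
    return (int(environ["RANK"]),
            int(environ["LOCAL_RANK"]),
            int(environ["WORLD_SIZE"]))
```

Under this sketch, each of the 8 workers started without `-dist` would see a world size of 1 and train on its own, which is consistent with the statement that DDP does not start without the flag.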
OK, thanks. I will try again.
Multi-GPU training works, but an error occurs when validation runs at each epoch:
CHILD PROCESS FAILED WITH NO ERROR_FILE
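@zhangyunming This problem does occur when training on the UCF101 dataset, possibly because the same path is handled differently on different devices. It has not been fully fixed yet. For now, I suggest dropping the --eval argument so that no evaluation runs during training. The AVA dataset does not have this bug.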
Hello, I ran into the same error during multi-GPU training. How did you solve it?
Hello! I tried running the following command:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py --cuda -d ava_v2.2 --root /data/ztq/data/ava/ -v yowo_v2_nano --num_workers 4 --eval_epoch 1 --max_epoch 10 --lr_epoch 3 4 5 6 -lr 0.0001 -ldr 0.5 -bs 8 -accu 16 -K 16 --eval
But I got this error:
train.py: error: unrecognized arguments: --local_rank=0
Killing subprocess 2890
Killing subprocess 2891
Traceback (most recent call last):
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dsphaoyang/anaconda3/envs/yowo/bin/python', '-u', 'train.py', '--local_rank=1', '--cuda', '-d', 'ava_v2.2', '--root', '/data/ztq/data/ava/', '-v', 'yowo_v2_nano', '--num_workers', '4', '--eval_epoch', '1', '--max_epoch', '10', '--lr_epoch', '3', '4', '5', '6', '-lr', '0.0001', '-ldr', '0.5', '-bs', '8', '-accu', '16', '-K', '16', '--eval']' returned non-zero exit status 2.
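This is presumably because train.py does not define a local_rank argument. But if the script takes no local_rank argument, how can I run single-machine multi-GPU training? Without it, torch.distributed.launch cannot be used.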
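One possible way around the `unrecognized arguments: --local_rank=0` error is sketched below. It assumes the script is free to accept the argument: torch.distributed.launch passes `--local_rank=N` on the command line, while torch.distributed.run sets the `LOCAL_RANK` environment variable instead. The helper name `parse_local_rank` is hypothetical and is not part of train.py.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the local rank from the CLI, falling back to the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank", "--local-rank",  # newer launchers use the dashed form
        dest="local_rank",
        type=int,
        # torch.distributed.run exports LOCAL_RANK rather than passing a flag.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    # parse_known_args skips the training options this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

In a real script the `add_argument` call would more likely go on the existing parser. `parse_known_args` is used here only to keep the sketch self-contained.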