Distributed training #8

Open
taqta opened this issue May 25, 2023 · 6 comments


taqta commented May 25, 2023

Hello! I tried running the following command:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py --cuda -d ava_v2.2 --root /data/ztq/data/ava/ -v yowo_v2_nano --num_workers 4 --eval_epoch 1 --max_epoch 10 --lr_epoch 3 4 5 6 -lr 0.0001 -ldr 0.5 -bs 8 -accu 16 -K 16 --eval

But I ran into this error:

train.py: error: unrecognized arguments: --local_rank=0
Killing subprocess 2890
Killing subprocess 2891
Traceback (most recent call last):
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/dsphaoyang/anaconda3/envs/yowo/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dsphaoyang/anaconda3/envs/yowo/bin/python', '-u', 'train.py', '--local_rank=1', '--cuda', '-d', 'ava_v2.2', '--root', '/data/ztq/data/ava/', '-v', 'yowo_v2_nano', '--num_workers', '4', '--eval_epoch', '1', '--max_epoch', '10', '--lr_epoch', '3', '4', '5', '6', '-lr', '0.0001', '-ldr', '0.5', '-bs', '8', '-accu', '16', '-K', '16', '--eval']' returned non-zero exit status 2.

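This is presumably because train.py does not accept a local_rank argument. But if the script has no local_rank argument, how can I run single-machine multi-GPU training? As it stands, torch.distributed.launch cannot be used.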
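
For context, the error comes from the launcher itself: torch.distributed.launch appends --local_rank=<n> to each worker's command line (unless it is started with --use_env), and argparse rejects any flag the script has not declared. A minimal sketch of one way a script could accept the flag is below. The build_parser helper and the fallback to the LOCAL_RANK environment variable are illustrative assumptions, not code taken from this repository's train.py.

```python
import argparse
import os


def build_parser():
    # Hypothetical parser; only --local_rank matters for this sketch.
    parser = argparse.ArgumentParser(description="training script sketch")
    parser.add_argument("--cuda", action="store_true", help="use the GPU")
    # torch.distributed.launch passes --local_rank=<n> to every worker.
    # Launchers that rely on environment variables set LOCAL_RANK instead,
    # so that value is used as the default when the flag is absent.
    parser.add_argument(
        "--local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
        help="index of this process on the current machine",
    )
    return parser


# Parse the kind of argument list the launcher would hand to a worker.
args = build_parser().parse_args(["--local_rank=1", "--cuda"])
print("local rank:", args.local_rank)
```

With the flag declared, the script no longer exits with "unrecognized arguments", whichever of the two mechanisms the launcher uses.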

Owner

yjh0410 commented May 25, 2023

@taqta I have updated the DDP code in train.py. To launch DDP training, you need to add the -dist argument to your training command, for example:

python -m torch.distributed.run --nproc_per_node=8 train.py --cuda -dist -d ava_v2.2

Without -dist, single-machine multi-GPU DDP training will not start.

I have not tried DDP training myself yet because I don't have enough resources, so I'm not sure whether there are bugs in it.
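
As a rough illustration of what a flag like -dist typically gates, the sketch below reads the RANK, WORLD_SIZE and LOCAL_RANK environment variables that torch.distributed.run exports to each worker, then initialises the process group and wraps the model. The function names are hypothetical and this is not the code in this repository's train.py. It assumes PyTorch with CUDA and the NCCL backend.

```python
import os


def read_dist_env(environ=os.environ):
    # torch.distributed.run exports these variables for every worker process.
    # The defaults describe a plain single-process run.
    rank = int(environ.get("RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank


def init_distributed():
    # Imported here so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    rank, world_size, local_rank = read_dist_env()
    # Bind this process to its own GPU before creating the process group.
    torch.cuda.set_device(local_rank)
    # "env://" reads MASTER_ADDR and MASTER_PORT, which the launcher also sets.
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=world_size, rank=rank)
    return rank, world_size, local_rank


def wrap_model(model, local_rank):
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Move the model to this process's GPU, then wrap it for gradient sync.
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```

In a script along these lines, init_distributed() and wrap_model() would be called only when the distributed flag is set, so a plain single-GPU run stays unchanged.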

Author

taqta commented May 25, 2023

OK, thanks. I'll try again.

@zhangyunming

Multi-GPU training works, but an error occurs during validation at each epoch:
calculating Frame mAP ...
/home/zhangyunming/YOWOv2/
Traceback (most recent call last):
File "train.py", line 330, in <module>
train()
File "train.py", line 274, in train
eval_one_epoch(args, model_without_ddp, evaluator, epoch, path_to_save)
File "train.py", line 290, in eval_one_epoch
evaluator.evaluate_frame_map(model_eval, epoch + 1)
File "/home/zhangyunming/YOWOv2/evaluator/ucf_jhmdb_evaluator.py", line 141, in evaluate_frame_map
metric_list = evaluate_frameAP(self.gt_folder, current_dir, self.iou_thresh,
File "/home/zhangyunming/YOWOv2/evaluator/cal_frame_mAP.py", line 946, in evaluate_frameAP
allBoundingBoxes, allClasses = getBoundingBoxes(
File "/home/zhangyunming/YOWOv2/evaluator/cal_frame_mAP.py", line 860, in getBoundingBoxes
x = float(splitLine[1])
IndexError: list index out of range
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2569361) of binary: /usr/bin/python3
/home/zhangyunming/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 2569361 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1

Owner

yjh0410 commented Jun 9, 2023

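@zhangyunming Yes, this problem does occur when training on the UCF101 dataset. It may be because the same path is handled differently on different devices. It has not been fully resolved yet. For now, I recommend dropping the --eval argument so that no evaluation runs during training. The AVA dataset does not have this bug.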
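
The IndexError in the traceback is raised by splitLine[1], which means at least one line in a file being parsed split into fewer than two fields. A minimal diagnostic sketch for locating such lines is below. It assumes whitespace-separated text files in a single folder. The helper name, the folder layout and the minimum field count are assumptions, not details confirmed in this thread.

```python
import os
import tempfile


def find_short_lines(folder, min_fields=2):
    """Return (path, line_number, text) for lines with too few fields."""
    problems = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r") as f:
            for number, line in enumerate(f, start=1):
                # split() with no argument ignores repeated spaces and newlines.
                if len(line.split()) < min_fields:
                    problems.append((path, number, line.rstrip("\n")))
    return problems


# Small demonstration on a temporary folder containing one empty line.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "demo.txt"), "w") as f:
        f.write("1 0.9 10 20 30 40\n\n2 0.8 5 5 15 15\n")
    for path, number, text in find_short_lines(demo_dir):
        print(os.path.basename(path), number, repr(text))
```

Pointing a check like this at the ground-truth and detection folders used by the evaluator may show whether a file is empty or was only partially written when evaluation started.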

@liuxiao0309

Multi-GPU training works, but an error occurs during validation at each epoch:
calculating Frame mAP ...
/home/zhangyunming/YOWOv2/
Traceback (most recent call last):
File "train.py", line 330, in <module>
train()
File "train.py", line 274, in train
eval_one_epoch(args, model_without_ddp, evaluator, epoch, path_to_save)
File "train.py", line 290, in eval_one_epoch
evaluator.evaluate_frame_map(model_eval, epoch + 1)
File "/home/zhangyunming/YOWOv2/evaluator/ucf_jhmdb_evaluator.py", line 141, in evaluate_frame_map
metric_list = evaluate_frameAP(self.gt_folder, current_dir, self.iou_thresh,
File "/home/zhangyunming/YOWOv2/evaluator/cal_frame_mAP.py", line 946, in evaluate_frameAP
allBoundingBoxes, allClasses = getBoundingBoxes(
File "/home/zhangyunming/YOWOv2/evaluator/cal_frame_mAP.py", line 860, in getBoundingBoxes
x = float(splitLine[1])
IndexError: list index out of range
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2569361) of binary: /usr/bin/python3
/home/zhangyunming/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 2569361 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1

Hello, I ran into the same error during multi-GPU training. How did you solve it?


T-wow commented Jul 1, 2024

Hello, why is the loss abnormally large with DDP training, when single-GPU training does not have this problem? Thanks for your help.
