All child processes get stuck when training with multiple GPUs using dist_train.sh
With CUDA 11.3 and PyTorch 1.10.
After some debugging, I found it was stuck at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171
I modified the code from

```python
dist.init_process_group(
    backend=backend,
    init_method='tcp://127.0.0.1:%d' % tcp_port,
    rank=local_rank,
    world_size=num_gpus
)
```

to

```python
dist.init_process_group(
    backend=backend
)
```

and it worked.
I'm curious why this is the case; if anyone else is hitting the same problem, you can try the same change.
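For context on why this might work: when `init_method`, `rank`, and `world_size` are all omitted, `init_process_group` falls back to the `env://` initialization method and reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from environment variables, which `torch.distributed.launch` / `torchrun` (invoked by `dist_train.sh`) export for every spawned process. A plausible explanation for the hang is that the hard-coded `tcp://127.0.0.1:%d` rendezvous and `rank=local_rank` disagree with what the launcher already set up. Below is a minimal sketch of the equivalent explicit call; the environment-variable names are standard PyTorch ones, nothing OpenPCDet-specific:

```python
import os

import torch.distributed as dist

# dist.init_process_group(backend=backend) with no other arguments defaults
# to init_method='env://': PyTorch reads the rendezvous information from
# environment variables that torch.distributed.launch / torchrun set for
# every spawned process (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
dist.init_process_group(
    backend='nccl',
    init_method='env://',  # same effect as omitting init_method entirely
)

# RANK is the *global* rank across all nodes; LOCAL_RANK (also exported by
# the launcher) is the per-node rank typically used to select the GPU.
print(
    f"rank {dist.get_rank()} of {dist.get_world_size()}, "
    f"local_rank {os.environ.get('LOCAL_RANK')}"
)
```

If you do need an explicit TCP rendezvous (e.g. when not using a launcher), make sure `rank` is the global rank and that every process uses the same address and port that is reachable from all of them.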