All child processes get stuck when training with multiple GPUs using dist_train.sh
With CUDA 11.3 and PyTorch 1.10.
After some debugging, I found it was stuck at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171
I modified the code from

```python
dist.init_process_group(
    backend=backend,
    init_method='tcp://127.0.0.1:%d' % tcp_port,
    rank=local_rank,
    world_size=num_gpus
)
```

to

```python
dist.init_process_group(
    backend=backend
)
```

and it worked.
I'm curious why this is the case; if anyone else is hitting the same problem, you can try the same change.
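For context on why this might work: when `init_method`, `rank`, and `world_size` are all omitted, `init_process_group` falls back to the `env://` initialization method and reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from environment variables, which `torch.distributed.launch` / `torchrun` (invoked by `dist_train.sh`) export for every spawned process. A plausible explanation for the hang is that the hard-coded `tcp://127.0.0.1:%d` rendezvous and `rank=local_rank` disagree with what the launcher already set up. Below is a minimal sketch of the equivalent explicit call; the environment-variable names are standard PyTorch ones, nothing OpenPCDet-specific:

```python
import os

import torch.distributed as dist

# dist.init_process_group(backend=backend) with no other arguments defaults
# to init_method='env://': PyTorch reads the rendezvous information from
# environment variables that torch.distributed.launch / torchrun set for
# every spawned process (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
dist.init_process_group(
    backend='nccl',
    init_method='env://',  # same effect as omitting init_method entirely
)

# RANK is the *global* rank across all nodes; LOCAL_RANK (also exported by
# the launcher) is the per-node rank typically used to select the GPU.
print(
    f"rank {dist.get_rank()} of {dist.get_world_size()}, "
    f"local_rank {os.environ.get('LOCAL_RANK')}"
)
```

If you do need an explicit TCP rendezvous (e.g. when not using a launcher), make sure `rank` is the global rank and that every process uses the same address and port that is reachable from all of them.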