Problem Description
Dear ROCm developers, it seems that this plugin is not compatible with the latest version of libfabric present on Setonix (Pawsey Supercomputing Centre). I am trying to run a simple distributed training job on 2 nodes and the execution hangs right at the start. After some time, I get the following error messages:
[rank14]: Traceback (most recent call last):
[rank14]: File "/software/projects/pawsey0001/cdipietrantonio/machine-learning-examples/pytorch/mnist_ddp/mnist.py", line 100, in <module>
[rank14]: main()
[rank14]: File "/software/projects/pawsey0001/cdipietrantonio/machine-learning-examples/pytorch/mnist_ddp/mnist.py", line 97, in main
[rank14]: train(args.epochs)
[rank14]: File "/software/projects/pawsey0001/cdipietrantonio/machine-learning-examples/pytorch/mnist_ddp/mnist.py", line 55, in train
[rank14]: model = DistributedDataParallel(model, device_ids=[local_rank])
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 835, in __init__
[rank14]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank14]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/utils.py", line 282, in _verify_param_shape_across_processes
[rank14]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: RuntimeError: DDP expects same model across all ranks, but Rank 14 has 10 params, while rank 2 has inconsistent 0 params.
[rank12]:[E916 13:33:59.240756795 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 12] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank12]:[E916 13:33:59.240861047 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 12] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank12]:[E916 13:33:59.241590278 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_12
[rank8]:[E916 13:33:59.247535519 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 8] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank8]:[E916 13:33:59.247653928 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 8] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank8]:[E916 13:33:59.248455149 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_8
[rank11]:[E916 13:33:59.252456777 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 11] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank11]:[E916 13:33:59.252550609 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 11] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank10]:[E916 13:33:59.252966254 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 10] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank10]:[E916 13:33:59.253081066 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 10] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank11]:[E916 13:33:59.253232429 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_11
[rank10]:[E916 13:33:59.253833763 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_10
I am using PyTorch 2.7.1 and ROCm 6.3.3 with libfabric 1.22.
When I use libfabric 1.15 instead, everything works well.
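For reference, the failing pattern reduces to the sketch below, reconstructed from the traceback above; the placeholder model and the torchrun-style environment variables are assumptions, and the actual mnist.py may differ in details:

```python
# Minimal sketch of the failing pattern (assumes a torchrun-style launch;
# the real mnist.py uses a proper MNIST model and training loop).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")  # RCCL on ROCm
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(784, 10).to(local_rank)  # placeholder model

    # Execution hangs here: the collective that verifies parameter shapes
    # across ranks never completes, and eventually times out with the
    # ProcessGroupNCCL errors shown above.
    model = DistributedDataParallel(model, device_ids=[local_rank])

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The job is launched across the 2 nodes with something like `torchrun --nnodes=2 --nproc_per_node=8 mnist.py` under Slurm (the exact invocation is an assumption); the hang occurs inside the DistributedDataParallel constructor, before any training step runs.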
Operating System
SLES15-SP6
CPU
AMD Trento
GPU
AMD MI250X
ROCm Version
6.4.1
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response