Problem Description
Dear ROCm developers, it seems that this plugin is not compatible with the latest version of libfabric present on Setonix (Pawsey Supercomputing Centre). I am trying to run a simple distributed training job on 2 nodes and the execution hangs right at the start. After some time, I get the following error messages:
[rank14]: Traceback (most recent call last):
[rank14]: File "/software/projects/pawsey0001/cdipietrantonio/machine-learning-examples/pytorch/mnist_ddp/mnist.py", line 100, in <module>
[rank14]: main()
[rank14]: File "/software/projects/pawsey0001/cdipietrantonio/machine-learning-examples/pytorch/mnist_ddp/mnist.py", line 97, in main
[rank14]: train(args.epochs)
[rank14]: File "/software/projects/pawsey0001/cdipietrantonio/machine-learning-examples/pytorch/mnist_ddp/mnist.py", line 55, in train
[rank14]: model = DistributedDataParallel(model, device_ids=[local_rank])
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 835, in __init__
[rank14]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank14]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/utils.py", line 282, in _verify_param_shape_across_processes
[rank14]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: RuntimeError: DDP expects same model across all ranks, but Rank 14 has 10 params, while rank 2 has inconsistent 0 params.
[rank12]:[E916 13:33:59.240756795 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 12] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank12]:[E916 13:33:59.240861047 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 12] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank12]:[E916 13:33:59.241590278 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_12
[rank8]:[E916 13:33:59.247535519 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 8] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank8]:[E916 13:33:59.247653928 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 8] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank8]:[E916 13:33:59.248455149 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_8
[rank11]:[E916 13:33:59.252456777 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 11] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank11]:[E916 13:33:59.252550609 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 11] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank10]:[E916 13:33:59.252966254 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 10] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank10]:[E916 13:33:59.253081066 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 10] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank11]:[E916 13:33:59.253232429 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_11
[rank10]:[E916 13:33:59.253833763 FlightRecorder.cpp:162] Error opening file for writing Flight Recorder debug info: /home/cdipietrantonio/.cache/torch/nccl_trace_rank_10
I am using PyTorch 2.7.1 and ROCm 6.3.3 with libfabric 1.22.
When I use libfabric 1.15 instead, everything works well.
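For reference, the failing pattern reduces to the sketch below, reconstructed from the traceback above; the placeholder model and the torchrun-style environment variables are assumptions, and the actual mnist.py may differ in details:

```python
# Minimal sketch of the failing pattern (assumes a torchrun-style launch;
# the real mnist.py uses a proper MNIST model and training loop).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")  # RCCL on ROCm
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(784, 10).to(local_rank)  # placeholder model

    # Execution hangs here: the collective that verifies parameter shapes
    # across ranks never completes, and eventually times out with the
    # ProcessGroupNCCL errors shown above.
    model = DistributedDataParallel(model, device_ids=[local_rank])

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The job is launched across the 2 nodes with something like `torchrun --nnodes=2 --nproc_per_node=8 mnist.py` under Slurm (the exact invocation is an assumption); the hang occurs inside the DistributedDataParallel constructor, before any training step runs.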
Operating System
SLES15-SP6
CPU
AMD Trento
GPU
AMD MI250X
ROCm Version
6.4.1
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response