
PP >=2 Still Hangs in Multinode Despite Previous PR Fix #1681

@chuhan-ouyang

Bug description

After reading this PR, I upgraded my environment to a PyTorch version (built from source) that includes the fix: pytorch/pytorch@e63c2b2.

However, when training DeepSeek-V3 on TorchTitan with PP = 2, DP shard = 2, TP = 4 (2 × 2 × 4 = 16 ranks, matching the 4-node × 4-GPU environment below), I still get a hang. NCCL emits a 'message truncated' warning about mismatched receive sizes, and the hang occurs whenever PP >= 2.

The hang occurs after this NCCL warning:

```
[2025-09-01 21:12:17] nid001097:5838:6020 [0] transport/net_socket.cc:561 NCCL WARN NET/Socket : peer 10.249.23.66<41138> message truncated : receiving 8 bytes instead of 4. If you believe your socket network is in a healthy state, there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
nid001097:5838:6020 [0] NCCL INFO transport/net.cc:1393 -> 5
```
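
For context, this warning means two peers disagreed on the byte count of a message: the receiver posted a smaller buffer than the sender transmitted. A minimal sketch of that failure class (illustrative only; the script name, launch command, and sizes are hypothetical, not from the TorchTitan code path):

```python
# mismatch_demo.py -- hypothetical repro of the warning's failure class.
# Rank 0 sends two int32 values (8 bytes) while rank 1 posts a receive
# for one int32 (4 bytes), mirroring "receiving 8 bytes instead of 4".
# Launch with, e.g.: torchrun --nproc_per_node=2 mismatch_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

if rank == 0:
    dist.send(torch.ones(2, dtype=torch.int32, device="cuda"), dst=1)
elif rank == 1:
    buf = torch.zeros(1, dtype=torch.int32, device="cuda")
    dist.recv(buf, src=0)  # expects 4 bytes; 8 arrive

dist.destroy_process_group()
```

Depending on the transport, this kind of mismatch can surface as the warning above or as a silent hang rather than a hard error.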

I tried enforcing the same NCCL_PROTO and NCCL_ALGO across all ranks, but it did not fix the hang.
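
For reference, a minimal sketch of pinning these uniformly on every rank (the values here are placeholders; they must be set before the process group, and hence any NCCL communicator, is created):

```python
import os

# Force identical NCCL tuning on every rank *before*
# torch.distributed.init_process_group() runs; NCCL reads these
# variables when the communicator is created. Values are placeholders.
os.environ["NCCL_PROTO"] = "Simple"
os.environ["NCCL_ALGO"] = "Ring"
```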

Attached are the NCCL and training logs:

PP_hang_NCCL.log
PP_hang_output.log

Versions

Environment: 4 nodes, each with 4 A100 GPUs. Python 3.11.7; PyTorch 2.9.0a0+git82d2d23.

DeepSeek Config: https://github.com/chuhan-ouyang/torchtitan/blob/deepseek/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml

Thank you.
