
PP >=2 Still Hangs in Multinode Despite Previous PR Fix #1681

@chuhan-ouyang

Bug description

After reading this PR, I upgraded my environment to a PyTorch version (built from source) that includes the fix: pytorch/pytorch@e63c2b2.

However, when training DeepSeek-V3 on TorchTitan with PP = 2, DP shard = 2, TP = 4 (2 × 2 × 4 = 16 ranks, matching the 4-node × 4-GPU environment below), I still get a hang. NCCL emits a 'message truncated' warning about mismatched receive sizes, and the hang occurs whenever PP >= 2.

The hang occurs after this NCCL warning:

```
[2025-09-01 21:12:17] nid001097:5838:6020 [0] transport/net_socket.cc:561 NCCL WARN NET/Socket : peer 10.249.23.66<41138> message truncated : receiving 8 bytes instead of 4. If you believe your socket network is in a healthy state, there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
nid001097:5838:6020 [0] NCCL INFO transport/net.cc:1393 -> 5
```
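
For context, this warning means two peers disagreed on the byte count of a message: the receiver posted a smaller buffer than the sender transmitted. A minimal sketch of that failure class (illustrative only; the script name, launch command, and sizes are hypothetical, not from the TorchTitan code path):

```python
# mismatch_demo.py -- hypothetical repro of the warning's failure class.
# Rank 0 sends two int32 values (8 bytes) while rank 1 posts a receive
# for one int32 (4 bytes), mirroring "receiving 8 bytes instead of 4".
# Launch with, e.g.: torchrun --nproc_per_node=2 mismatch_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

if rank == 0:
    dist.send(torch.ones(2, dtype=torch.int32, device="cuda"), dst=1)
elif rank == 1:
    buf = torch.zeros(1, dtype=torch.int32, device="cuda")
    dist.recv(buf, src=0)  # expects 4 bytes; 8 arrive

dist.destroy_process_group()
```

Depending on the transport, this kind of mismatch can surface as the warning above or as a silent hang rather than a hard error.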

I tried enforcing the same NCCL_PROTO and NCCL_ALGO across all ranks, but it did not fix the hang.
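
For reference, a minimal sketch of pinning these uniformly on every rank (the values here are placeholders; they must be set before the process group, and hence any NCCL communicator, is created):

```python
import os

# Force identical NCCL tuning on every rank *before*
# torch.distributed.init_process_group() runs; NCCL reads these
# variables when the communicator is created. Values are placeholders.
os.environ["NCCL_PROTO"] = "Simple"
os.environ["NCCL_ALGO"] = "Ring"
```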

Attached are the NCCL and training logs:

PP_hang_NCCL.log
PP_hang_output.log

Versions

Environment: 4 nodes, each with 4 A100 GPUs. Python 3.11.7; PyTorch 2.9.0a0+git82d2d23.

DeepSeek Config: https://github.com/chuhan-ouyang/torchtitan/blob/deepseek/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml

Thank you.
