Describe the bug
When running a simple model that includes torch.nn.LayerNorm with DeepSpeed ZeRO-3, torch.compile, and compiled autograd, an error occurs:
site-packages/torch/_subclasses/fake_tensor.py:2017] RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [100, 120]
We first found this error in a BERT model with DeepSpeed ZeRO-3 plus torch.compile and compiled autograd.
It works with DeepSpeed ZeRO-1/2 with torch.compile and compiled autograd.
It also works with DeepSpeed ZeRO-3 with torch.compile but without compiled autograd.
There are many graph breaks and recompiles with DeepSpeed ZeRO-3 under torch.compile.
To simplify the issue, I made a small reproducer that isolates the failing op (torch.nn.LayerNorm); a sketch of it follows.
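For context, a minimal sketch of what such a reproducer looks like. This is an illustrative reconstruction, not the exact deepspeed_reproducer_cpu.py: the model, the ds_config values, and the compile wiring are assumptions.

import torch
import torch.nn as nn
import deepspeed

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(120, 120)
        # elementwise_affine=True gives LayerNorm a weight parameter;
        # ZeRO-3 partitions it, and the partitioned view is what later
        # shows up with torch.Size([0]).
        self.norm = nn.LayerNorm(120, eps=1e-12, elementwise_affine=True)

    def forward(self, x):
        return self.norm(self.linear(x))

ds_config = {
    "train_batch_size": 100,            # assumed; matches the [100, 120] input below
    "zero_optimization": {"stage": 3},  # the failing stage
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = TinyModel()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
engine.module.forward = torch.compile(engine.module.forward)

def compiler_fn(gm):
    # compiled autograd hands the traced backward graph to this callback
    return torch.compile(gm)

x = torch.randn(100, 120)
with torch._dynamo.compiled_autograd.enable(compiler_fn):
    loss = engine(x).sum()
    engine.backward(loss)  # the broadcast error surfaces while tracing this backward
    engine.step()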
Expected behavior
The model runs with DeepSpeed ZeRO-3 without error.
Investigation
The error: "RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [128, 128, 1600]"
It occurs when compiled autograd tries to trace the backward graph.
It appears in LayerNorm backward decompositions. It tries to broadcast weight_cast(torch.Size([0]) to grad_out_cast' shape([128,128,1600]) and fails.
If we bypass the LayerNorm weight by setting nn.LayerNorm(120, eps=1e-12, elementwise_affine=False) instead of elementwise_affine=True in deepspeed_reproducer_cpu.py, the run completes without error.
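The shape mismatch can be illustrated in isolation: under ZeRO-3 the partitioned LayerNorm weight can appear with zero elements, and multiplying it against the incoming gradient, as the decomposition effectively does, cannot broadcast. A hypothetical standalone illustration, not the actual decomposition code:

import torch

grad_out_cast = torch.randn(128, 128, 1600)
weight_cast = torch.empty(0)  # stand-in for a ZeRO-3 partitioned (released) LayerNorm weight
# In eager mode this raises a size-mismatch error; under fake tensors the same
# mismatch is reported as "Attempting to broadcast a dimension of length 0 at -1!"
grad_x_hat = grad_out_cast * weight_cast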
System info:
OS: Ubuntu 22.04
GPU: none (the issue is device-independent, so we reproduce on CPU)
Python version: 3.10.12
PyTorch version: 2.5.1
DeepSpeed version: 0.15.3
To Reproduce
Steps to reproduce the behavior:
1. Set an environment variable for more verbose logs: TORCH_LOGS="+dynamo,graph,graph_code,graph_breaks,recompiles,aot_graphs,aot_joint_graph,compiled_autograd_verbose"
2. Run: deepspeed --num_nodes 1 --num_gpus 1 deepspeed_reproducer_cpu.py
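If it is more convenient to keep everything in the script, the same logging can be requested from Python. An alternative sketch; the environment variable must be set before torch is imported:

import os
# Set before importing torch so the logging setup picks it up.
os.environ["TORCH_LOGS"] = (
    "+dynamo,graph,graph_code,graph_breaks,"
    "recompiles,aot_graphs,aot_joint_graph,compiled_autograd_verbose"
)
import torch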
Hi @tohtana, I have tried setting stage3_param_persistence_threshold to zero, but it doesn't seem to help; the error still occurs (the config change is sketched below).
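For reference, the change tried was along these lines (a sketch; the surrounding config and the ds_config name are assumptions):

ds_config = {
    "zero_optimization": {
        "stage": 3,
        # 0 means no parameter is kept persistent; everything is partitioned
        "stage3_param_persistence_threshold": 0,
    },
    # ... rest of the config unchanged
}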
I also opened an issue in PyTorch.