Describe the bug
When training a DeepSeek 16B model with GRPO + SGLang + Megatron in verl, I found that the time taken per training step steadily increases; specifically, the time spent in update_policy keeps growing. Is there a bug here?
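For reference, here is roughly how I log the per-step time. This is a minimal sketch around a generic callable; the actor/batch names in the usage comment are placeholders, not the actual verl API.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds), syncing CUDA so the
    measurement includes all queued GPU work."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Hypothetical usage inside the training loop; `actor` and `batch` stand in
# for the real verl objects.
# _, dt = timed(actor.update_policy, batch)
# print(f"step {step}: update_policy took {dt:.2f}s")
```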
Stack trace/logs
I found that the time spent in two megatron-core functions, backward_step and get_grad_norm, keeps increasing. The relevant call stacks are below.
stack trace 1:
forward_backward_pipelining_without_interleaving (core/pipeline_parallel/schedules.py:1959)
backward_step (core/pipeline_parallel/schedules.py:400)
backward (torch/autograd/__init__.py:347)
_engine_run_backward (torch/autograd/graph.py:823)
stack trace 2:
get_grad_norm (core/optimizer/optimizer.py:192)
get_grad_norm_fp32 (core/optimizer/clip_grads.py:137)
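I can also collect torch.profiler traces if that helps. Below is a sketch of what I would run; run_one_step is a placeholder for whatever executes a single training step (not a real verl or Megatron function), and the idea is to compare a trace from an early step against one from a later, slower step to see which ops are growing.

```python
import os
import torch
from torch.profiler import profile, ProfilerActivity

def profile_step(run_one_step, step: int, out_dir: str = "./traces") -> None:
    """Profile one training step and dump a Chrome trace so an early step
    can be compared against a later (slower) step."""
    os.makedirs(out_dir, exist_ok=True)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        run_one_step()
    prof.export_chrome_trace(os.path.join(out_dir, f"step_{step}.json"))
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```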
Environment:
- Megatron-LM: megatron-core==0.12.0
- PyTorch: 2.6.0
- CUDA version: 12.4
- NCCL version: 2.21.5