
[BUG] The time consumption of these two functions, backward_step and get_grad_norm, has increased #1691

Description

@zwc163

Describe the bug
When I trained a DeepSeek 16B model using GRPO + sglang + megatron in verl, I found that the time taken per step increases steadily; specifically, the time spent in update_policy keeps growing. Is this a bug? A rough sketch of how I observe the per-step timing is given below.
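
For reference, this is a minimal sketch of how the per-step growth can be measured around the policy update. The `actor` object and `update_policy` call here stand in for the verl GRPO training loop and are illustrative only, not the exact verl API:

```python
import time
import torch

def timed_update_policy(actor, batch):
    """Wall-clock timing of one policy update, with CUDA sync so the
    measurement includes all backward/optimizer kernels."""
    torch.cuda.synchronize()           # flush pending kernels before timing
    start = time.perf_counter()
    metrics = actor.update_policy(batch)   # placeholder for the verl update call
    torch.cuda.synchronize()           # wait until the step's kernels finish
    elapsed = time.perf_counter() - start
    return metrics, elapsed

# Logging `elapsed` every step shows it growing steadily instead of staying flat.
```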

Stack trace/logs
I found that the time consumption of two functions in megatron-core, backward_step and get_grad_norm, has increased.
Stack trace 1:
forward_backward_pipelining_without_interleaving (core/pipeline_parallel/schedules.py:1959)
backward_step (core/pipeline_parallel/schedules.py:400)
backward (torch/autograd/__init__.py:347)
_engine_run_backward (torch/autograd/graph.py:823)
Stack trace 2:
get_grad_norm (core/optimizer/optimizer.py:192)
get_grad_norm_fp32 (core/optimizer/clip_grads.py:137)
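
To attribute the extra time to these call sites, a profiler pass around a single step can help. This is only a sketch; `single_step` is a hypothetical callable wrapping one update_policy iteration:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_one_step(single_step):
    """Profile one training step and report where CPU/CUDA time goes."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=False,
    ) as prof:
        single_step()  # placeholder for one update_policy iteration
    # Sorting by CUDA time surfaces backward_step / get_grad_norm_fp32
    # if they dominate the step.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

Comparing these tables from an early step and a late step shows which of the two call paths accounts for the growth.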

Environment:

  • Megatron-LM: megatron-core==0.12.0
  • PyTorch: 2.6.0
  • CUDA version: 12.4
  • NCCL version: 2.21.5

