Hello. I am training a model using GRPO with `steps_per_generation = 2 * gradient_accumulation_steps`, and my logs look something like the following:
0%| | 0/1562 [00:00<?, ?it/s]INFO 09-15 14:29:21 [executor_base.py:204] It took 0.292539 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:29:21 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:05 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:07 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 49.25 GiB memory is still in use.
INFO 09-15 14:31:07 [executor_base.py:188] It took 1.295267 seconds to fall asleep.
INFO 09-15 14:31:16 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:17 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 48.55 GiB memory is still in use.
INFO 09-15 14:31:17 [executor_base.py:188] It took 1.595915 seconds to fall asleep.
INFO 09-15 14:31:28 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:29 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 49.11 GiB memory is still in use.
INFO 09-15 14:31:29 [executor_base.py:188] It took 1.461282 seconds to fall asleep.
INFO 09-15 14:31:45 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:47 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 49.25 GiB memory is still in use.
INFO 09-15 14:31:47 [executor_base.py:188] It took 1.713201 seconds to fall asleep.
[2025-09-15 14:37:19,498] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 33.42 | optimizer_gradients: 14.04 | optimizer_step: 58.16
[2025-09-15 14:37:19,501] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 121398.40 | bwd_microstep: 122186.67 | bwd_inner_microstep: 121895.88 | bwd_allreduce_microstep: 286.20 | step_microstep: 135.19
[2025-09-15 14:37:19,503] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 121396.43 | bwd: 122186.46 | bwd_inner: 121896.55 | bwd_allreduce: 285.83 | step: 135.20
0%| | 1/1562 [08:01<208:41:00, 481.27s/it]
{'loss': 0.0389, 'grad_norm': 0.06330768764019012, 'learning_rate': 0.0, 'num_tokens': 6265709.0, 'completions/mean_length': 3095.2470703125, 'completions/min_length': 378.0, 'completions/max_length': 8192.0, 'completions/clipped_ratio': 0.060546875, 'completions/mean_terminated_length': 2766.76611328125, 'completions/min_terminated_length': 378.0, 'completions/max_terminated_length': 8192.0, 'rewards/list_reward/mean': 0.0029296875, 'rewards/list_reward/std': 0.054073650389909744, 'rewards/format_reward/mean': -0.0400390625, 'rewards/format_reward/std': 0.1961466521024704, 'rewards/soft_overlong_punishment/mean': -0.1396808624267578, 'rewards/soft_overlong_punishment/std': 0.33162686228752136, 'reward': -0.1767902374267578, 'reward_std': 0.47051888704299927, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.04104730114340782, 'sampling/sampling_logp_difference/max': 3.437645196914673, 'sampling/importance_sampling_ratio/min': 0.032140281051397324, 'sampling/importance_sampling_ratio/mean': 0.9999222755432129, 'sampling/importance_sampling_ratio/max': 2.0, 'kl': 0.0, 'entropy': 0.8257587123662233, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
0%| | 1/1562 [08:01<208:41:00, 481.27s/it][2025-09-15 14:37:19,728] [INFO] [profiler.py:82:start_profile] Flops profiler started
[2025-09-15 14:40:07,671] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 32.00 | optimizer_gradients: 14.01 | optimizer_step: 45.43
[2025-09-15 14:40:08,113] [INFO] [profiler.py:230:end_profile] Flops profiler finished
[2025-09-15 14:40:08,115] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 41583.50 | bwd_microstep: 123261.98 | bwd_inner_microstep: 122977.34 | bwd_allreduce_microstep: 279.95 | step_microstep: 104.55
[2025-09-15 14:40:08,117] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 0.00 | bwd: 0.00 | bwd_inner: 122977.77 | bwd_allreduce: 278.81 | step: 0.00
0%| | 2/1562 [10:49<128:51:13, 297.36s/it]
{'loss': -0.0409, 'grad_norm': 0.05034307762980461, 'learning_rate': 1.2658227848101265e-08, 'kl': 0.0, 'entropy': 0.825460085645318, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
0%| | 2/1562 [10:49<128:51:13, 297.36s/it]INFO 09-15 14:40:10 [executor_base.py:204] It took 0.988489 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:40:10 [executor_base.py:204] It took 1.057249 seconds to wake up tags {'kv_cache', 'weights'}.
INFO 09-15 14:40:10 [executor_base.py:204] It took 1.080800 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:40:10 [executor_base.py:204] It took 1.007129 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:08 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:09 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 67.18 GiB memory is still in use.
INFO 09-15 14:42:09 [executor_base.py:188] It took 1.463854 seconds to fall asleep.
INFO 09-15 14:42:20 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:21 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 67.74 GiB memory is still in use.
INFO 09-15 14:42:21 [executor_base.py:188] It took 1.460260 seconds to fall asleep.
INFO 09-15 14:42:35 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:36 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 67.88 GiB memory is still in use.
INFO 09-15 14:42:36 [executor_base.py:188] It took 1.503160 seconds to fall asleep.
INFO 09-15 14:42:47 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:48 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 77.17 GiB memory is still in use.
INFO 09-15 14:42:48 [executor_base.py:188] It took 1.600564 seconds to fall asleep.
[2025-09-15 14:48:06,446] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 32.02 | optimizer_gradients: 14.23 | optimizer_step: 45.45
[2025-09-15 14:48:06,449] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 114578.15 | bwd_microstep: 117430.47 | bwd_inner_microstep: 117137.45 | bwd_allreduce_microstep: 286.56 | step_microstep: 105.41
[2025-09-15 14:48:06,452] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 114577.10 | bwd: 117430.12 | bwd_inner: 117138.34 | bwd_allreduce: 286.16 | step: 105.42
0%| | 3/1562 [18:48<164:33:29, 379.99s/it]
{'loss': -0.0684, 'grad_norm': 0.043870024383068085, 'learning_rate': 2.531645569620253e-08, 'num_tokens': 12020746.0, 'completions/mean_length': 3445.5126953125, 'completions/min_length': 324.0, 'completions/max_length': 8192.0, 'completions/clipped_ratio': 0.1181640625, 'completions/mean_terminated_length': 2809.493896484375, 'completions/min_terminated_length': 324.0, 'completions/max_terminated_length': 8192.0, 'rewards/list_reward/mean': 0.0078125, 'rewards/list_reward/std': 0.08808542788028717, 'rewards/format_reward/mean': -0.091796875, 'rewards/format_reward/std': 0.2888800799846649, 'rewards/soft_overlong_punishment/mean': -0.1961212158203125, 'rewards/soft_overlong_punishment/std': 0.3845544755458832, 'reward': -0.2801055908203125, 'reward_std': 0.6255616545677185, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.040210433304309845, 'sampling/sampling_logp_difference/max': 4.728130340576172, 'sampling/importance_sampling_ratio/min': 0.00884298887103796, 'sampling/importance_sampling_ratio/mean': 0.9999879598617554, 'sampling/importance_sampling_ratio/max': 2.0, 'kl': 0.000720990683475975, 'entropy': 0.8006362821906805, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
0%| | 3/1562 [18:48<164:33:29, 379.99s/it][2025-09-15 14:50:48,366] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 32.14 | optimizer_gradients: 14.02 | optimizer_step: 45.35
[2025-09-15 14:50:48,368] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 37983.79 | bwd_microstep: 119344.26 | bwd_inner_microstep: 119051.60 | bwd_allreduce_microstep: 286.91 | step_microstep: 105.03
[2025-09-15 14:50:48,370] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 37983.53 | bwd: 119343.62 | bwd_inner: 119052.35 | bwd_allreduce: 286.32 | step: 105.34
0%| | 4/1562 [21:30<127:11:36, 293.90s/it]
{'loss': -0.0506, 'grad_norm': 0.04704352095723152, 'learning_rate': 3.79746835443038e-08, 'kl': 0.0007260022939590272, 'entropy': 0.8083300068974495, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
The completion and reward stats are printed every two optimizer steps (i.e., once per vLLM generation step) rather than on every optimizer step. Is this behaviour intended?
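For what it's worth, the cadence I'm seeing is consistent with the stats being tied to generation rather than to the optimizer step. Here is a minimal standalone sketch (a hypothetical helper, not TRL's actual code) of how I understand the relationship: each generation batch is reused for `steps_per_generation / gradient_accumulation_steps` optimizer steps, so completion/reward stats only refresh when a new generation happens.

```python
# Illustrative sketch, NOT TRL's implementation: model when fresh
# completion/reward stats would be available, assuming they are computed
# at vLLM generation time and a generation batch is reused across
# steps_per_generation micro-steps.

def stats_refresh_schedule(num_optimizer_steps: int,
                           gradient_accumulation_steps: int,
                           steps_per_generation: int) -> list[bool]:
    """For each optimizer step, return True if a new generation (and hence
    fresh completion/reward stats) occurs at that step."""
    schedule = []
    for step in range(num_optimizer_steps):
        # Index of the first micro-step consumed by this optimizer step.
        first_micro_step = step * gradient_accumulation_steps
        # A new generation is triggered when that micro-step lands on a
        # generation boundary.
        schedule.append(first_micro_step % steps_per_generation == 0)
    return schedule

# With steps_per_generation = 2 * gradient_accumulation_steps, stats
# refresh on every other optimizer step:
print(stats_refresh_schedule(4, gradient_accumulation_steps=4,
                             steps_per_generation=8))
# → [True, False, True, False]
```

Under this model, the default `steps_per_generation = gradient_accumulation_steps` would refresh stats every step, which matches why the logs above only show completion/reward metrics on odd steps.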