Hello. I am training a model using GRPO with `steps_per_generation = 2 * gradient_accumulation_steps`, and my logs look something like the following:
0%| | 0/1562 [00:00<?, ?it/s]INFO 09-15 14:29:21 [executor_base.py:204] It took 0.292539 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:29:21 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:05 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:07 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 49.25 GiB memory is still in use.
INFO 09-15 14:31:07 [executor_base.py:188] It took 1.295267 seconds to fall asleep.
INFO 09-15 14:31:16 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:17 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 48.55 GiB memory is still in use.
INFO 09-15 14:31:17 [executor_base.py:188] It took 1.595915 seconds to fall asleep.
INFO 09-15 14:31:28 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:29 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 49.11 GiB memory is still in use.
INFO 09-15 14:31:29 [executor_base.py:188] It took 1.461282 seconds to fall asleep.
INFO 09-15 14:31:45 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:31:47 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 49.25 GiB memory is still in use.
INFO 09-15 14:31:47 [executor_base.py:188] It took 1.713201 seconds to fall asleep.
[2025-09-15 14:37:19,498] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 33.42 | optimizer_gradients: 14.04 | optimizer_step: 58.16
[2025-09-15 14:37:19,501] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 121398.40 | bwd_microstep: 122186.67 | bwd_inner_microstep: 121895.88 | bwd_allreduce_microstep: 286.20 | step_microstep: 135.19
[2025-09-15 14:37:19,503] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 121396.43 | bwd: 122186.46 | bwd_inner: 121896.55 | bwd_allreduce: 285.83 | step: 135.20
0%| | 1/1562 [08:01<208:41:00, 481.27s/it]
{'loss': 0.0389, 'grad_norm': 0.06330768764019012, 'learning_rate': 0.0, 'num_tokens': 6265709.0, 'completions/mean_length': 3095.2470703125, 'completions/min_length': 378.0, 'completions/max_length': 8192.0, 'completions/clipped_ratio': 0.060546875, 'completions/mean_terminated_length': 2766.76611328125, 'completions/min_terminated_length': 378.0, 'completions/max_terminated_length': 8192.0, 'rewards/list_reward/mean': 0.0029296875, 'rewards/list_reward/std': 0.054073650389909744, 'rewards/format_reward/mean': -0.0400390625, 'rewards/format_reward/std': 0.1961466521024704, 'rewards/soft_overlong_punishment/mean': -0.1396808624267578, 'rewards/soft_overlong_punishment/std': 0.33162686228752136, 'reward': -0.1767902374267578, 'reward_std': 0.47051888704299927, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.04104730114340782, 'sampling/sampling_logp_difference/max': 3.437645196914673, 'sampling/importance_sampling_ratio/min': 0.032140281051397324, 'sampling/importance_sampling_ratio/mean': 0.9999222755432129, 'sampling/importance_sampling_ratio/max': 2.0, 'kl': 0.0, 'entropy': 0.8257587123662233, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
0%| | 1/1562 [08:01<208:41:00, 481.27s/it][2025-09-15 14:37:19,728] [INFO] [profiler.py:82:start_profile] Flops profiler started
[2025-09-15 14:40:07,671] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 32.00 | optimizer_gradients: 14.01 | optimizer_step: 45.43
[2025-09-15 14:40:08,113] [INFO] [profiler.py:230:end_profile] Flops profiler finished
[2025-09-15 14:40:08,115] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 41583.50 | bwd_microstep: 123261.98 | bwd_inner_microstep: 122977.34 | bwd_allreduce_microstep: 279.95 | step_microstep: 104.55
[2025-09-15 14:40:08,117] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 0.00 | bwd: 0.00 | bwd_inner: 122977.77 | bwd_allreduce: 278.81 | step: 0.00
0%| | 2/1562 [10:49<128:51:13, 297.36s/it]
{'loss': -0.0409, 'grad_norm': 0.05034307762980461, 'learning_rate': 1.2658227848101265e-08, 'kl': 0.0, 'entropy': 0.825460085645318, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
0%| | 2/1562 [10:49<128:51:13, 297.36s/it]INFO 09-15 14:40:10 [executor_base.py:204] It took 0.988489 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:40:10 [executor_base.py:204] It took 1.057249 seconds to wake up tags {'kv_cache', 'weights'}.
INFO 09-15 14:40:10 [executor_base.py:204] It took 1.080800 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:40:10 [executor_base.py:204] It took 1.007129 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:40:10 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:08 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:09 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 67.18 GiB memory is still in use.
INFO 09-15 14:42:09 [executor_base.py:188] It took 1.463854 seconds to fall asleep.
INFO 09-15 14:42:20 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:21 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 67.74 GiB memory is still in use.
INFO 09-15 14:42:21 [executor_base.py:188] It took 1.460260 seconds to fall asleep.
INFO 09-15 14:42:35 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:36 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 67.88 GiB memory is still in use.
INFO 09-15 14:42:36 [executor_base.py:188] It took 1.503160 seconds to fall asleep.
INFO 09-15 14:42:47 [block_pool.py:280] Successfully reset prefix cache
INFO 09-15 14:42:48 [gpu_worker.py:115] Sleep mode freed 51.57 GiB memory, 77.17 GiB memory is still in use.
INFO 09-15 14:42:48 [executor_base.py:188] It took 1.600564 seconds to fall asleep.
[2025-09-15 14:48:06,446] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 32.02 | optimizer_gradients: 14.23 | optimizer_step: 45.45
[2025-09-15 14:48:06,449] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 114578.15 | bwd_microstep: 117430.47 | bwd_inner_microstep: 117137.45 | bwd_allreduce_microstep: 286.56 | step_microstep: 105.41
[2025-09-15 14:48:06,452] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 114577.10 | bwd: 117430.12 | bwd_inner: 117138.34 | bwd_allreduce: 286.16 | step: 105.42
0%| | 3/1562 [18:48<164:33:29, 379.99s/it]
{'loss': -0.0684, 'grad_norm': 0.043870024383068085, 'learning_rate': 2.531645569620253e-08, 'num_tokens': 12020746.0, 'completions/mean_length': 3445.5126953125, 'completions/min_length': 324.0, 'completions/max_length': 8192.0, 'completions/clipped_ratio': 0.1181640625, 'completions/mean_terminated_length': 2809.493896484375, 'completions/min_terminated_length': 324.0, 'completions/max_terminated_length': 8192.0, 'rewards/list_reward/mean': 0.0078125, 'rewards/list_reward/std': 0.08808542788028717, 'rewards/format_reward/mean': -0.091796875, 'rewards/format_reward/std': 0.2888800799846649, 'rewards/soft_overlong_punishment/mean': -0.1961212158203125, 'rewards/soft_overlong_punishment/std': 0.3845544755458832, 'reward': -0.2801055908203125, 'reward_std': 0.6255616545677185, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.040210433304309845, 'sampling/sampling_logp_difference/max': 4.728130340576172, 'sampling/importance_sampling_ratio/min': 0.00884298887103796, 'sampling/importance_sampling_ratio/mean': 0.9999879598617554, 'sampling/importance_sampling_ratio/max': 2.0, 'kl': 0.000720990683475975, 'entropy': 0.8006362821906805, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
0%| | 3/1562 [18:48<164:33:29, 379.99s/it][2025-09-15 14:50:48,366] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 32.14 | optimizer_gradients: 14.02 | optimizer_step: 45.35
[2025-09-15 14:50:48,368] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 37983.79 | bwd_microstep: 119344.26 | bwd_inner_microstep: 119051.60 | bwd_allreduce_microstep: 286.91 | step_microstep: 105.03
[2025-09-15 14:50:48,370] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 37983.53 | bwd: 119343.62 | bwd_inner: 119052.35 | bwd_allreduce: 286.32 | step: 105.34
0%| | 4/1562 [21:30<127:11:36, 293.90s/it]
{'loss': -0.0506, 'grad_norm': 0.04704352095723152, 'learning_rate': 3.79746835443038e-08, 'kl': 0.0007260022939590272, 'entropy': 0.8083300068974495, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0}
The completion and reward stats are printed every two optimizer steps (i.e., once per vLLM generation step) rather than on every optimizer step. Is this behaviour intended?
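For what it's worth, the cadence I'm seeing is consistent with the stats being tied to generation rather than to the optimizer step. Here is a minimal standalone sketch (a hypothetical helper, not TRL's actual code) of how I understand the relationship: each generation batch is reused for `steps_per_generation / gradient_accumulation_steps` optimizer steps, so completion/reward stats only refresh when a new generation happens.

```python
# Illustrative sketch, NOT TRL's implementation: model when fresh
# completion/reward stats would be available, assuming they are computed
# at vLLM generation time and a generation batch is reused across
# steps_per_generation micro-steps.

def stats_refresh_schedule(num_optimizer_steps: int,
                           gradient_accumulation_steps: int,
                           steps_per_generation: int) -> list[bool]:
    """For each optimizer step, return True if a new generation (and hence
    fresh completion/reward stats) occurs at that step."""
    schedule = []
    for step in range(num_optimizer_steps):
        # Index of the first micro-step consumed by this optimizer step.
        first_micro_step = step * gradient_accumulation_steps
        # A new generation is triggered when that micro-step lands on a
        # generation boundary.
        schedule.append(first_micro_step % steps_per_generation == 0)
    return schedule

# With steps_per_generation = 2 * gradient_accumulation_steps, stats
# refresh on every other optimizer step:
print(stats_refresh_schedule(4, gradient_accumulation_steps=4,
                             steps_per_generation=8))
# → [True, False, True, False]
```

Under this model, the default `steps_per_generation = gradient_accumulation_steps` would refresh stats every step, which matches why the logs above only show completion/reward metrics on odd steps.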