Conversation
Code Review
This pull request adds support for the CFPO policy loss. The implementation is a good start, but it includes some unused code, likely from a copy-paste, which should be removed for clarity. I've provided specific suggestions for this. A more critical issue is that the new 'cfpo' loss mode is not added to the configuration files (PolicyLossConfig in verl/workers/config/actor.py and verl/trainer/config/actor/actor.yaml), which will prevent users from selecting it. This should be addressed to make the feature usable.
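Registering the new mode might look roughly like the following. This is a hypothetical sketch: the field name `loss_mode` and the `policy_loss` nesting are assumptions about the verl config layout, not verified against the actual `PolicyLossConfig` or `actor.yaml`.

```yaml
# verl/trainer/config/actor/actor.yaml (hypothetical fragment)
policy_loss:
  # "cfpo" would need to be accepted here alongside the existing modes
  loss_mode: cfpo
```

The corresponding `PolicyLossConfig` in `verl/workers/config/actor.py` would need to accept `"cfpo"` as a valid value as well, so that config validation does not reject it.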
verl/trainer/ppo/core_algos.py
Outdated
```python
clip_ratio_c = config.get(  # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
    "clip_ratio_c", 3.0
)

cliprange = clip_ratio
cliprange_low = clip_ratio_low
cliprange_high = clip_ratio_high

assert clip_ratio_c > 1.0, (
    "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
    + f" but get the value: {clip_ratio_c}."
)
```
The clip_ratio_c parameter and its associated assertion are specific to dual-clip PPO and are not used in the CFPO loss calculation or its metrics. This code should be removed to avoid confusion and keep the implementation clean.
Suggested change:

```diff
-clip_ratio_c = config.get(  # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
-    "clip_ratio_c", 3.0
-)
 cliprange = clip_ratio
 cliprange_low = clip_ratio_low
 cliprange_high = clip_ratio_high
-assert clip_ratio_c > 1.0, (
-    "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
-    + f" but get the value: {clip_ratio_c}."
-)
```
verl/trainer/ppo/core_algos.py
Outdated
```python
### This code is for logging purposes
pg_losses1 = -advantages * ratio
if cliprange_low is None:
    cliprange_low = cliprange
if cliprange_high is None:
    cliprange_high = cliprange
pg_losses2 = -advantages * torch.clamp(
    ratio, 1 - cliprange_low, 1 + cliprange_high
)  # - clip(ratio, 1-cliprange, 1+cliprange) * A
clip_pg_losses1 = torch.maximum(
    pg_losses1, pg_losses2
)  # max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

pg_losses3 = -advantages * clip_ratio_c

pg_clipfrac_lower = verl_F.masked_mean(
    torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
)
### This code is for logging purposes
```
This block of code, marked for "logging purposes", appears to be dead code from a copy-paste. The variables pg_clipfrac and pg_clipfrac_lower are calculated here but are immediately overwritten by a new calculation later in the function. Other variables defined in this block are not used outside of it. This block should be removed to improve clarity and avoid unnecessary computations.
CFPO in VERL
Overview
Adds support for CFPO (Clipping-Free Policy Optimization), an alternative to PPO-style clipped objectives. CFPO replaces ratio clipping with a smooth quadratic penalty, removing zero-gradient regions while maintaining stable updates.
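The contrast between the two objectives can be sketched per token as below. This is an illustration of the clipping-free idea, not the actual CFPO loss from the paper or this PR: the quadratic penalty form and the `lam` coefficient are assumptions chosen to show why a smooth penalty keeps a nonzero gradient where PPO's clipped objective goes flat.

```python
def ppo_loss(ratio: float, adv: float, eps: float = 0.2) -> float:
    # Standard PPO surrogate: once the ratio leaves [1 - eps, 1 + eps],
    # the clipped term dominates and the gradient w.r.t. ratio is zero.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * adv, clipped * adv)

def quadratic_penalty_loss(ratio: float, adv: float, lam: float = 0.5) -> float:
    # Hypothetical clipping-free surrogate: a linear policy-gradient term
    # plus a smooth quadratic penalty pulling the ratio toward 1.
    # (lam is an illustrative coefficient, not a value from the paper.)
    return -ratio * adv + lam * (ratio - 1.0) ** 2
```

With a positive advantage and a ratio well above the clip range, `ppo_loss` is constant in the ratio (zero gradient), while `quadratic_penalty_loss` still responds to changes in the ratio, which is the zero-gradient region the overview refers to.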
Features
Reference
https://arxiv.org/abs/2601.22801