
Conversation

@faresobeid (Contributor) commented Dec 2, 2025

Note

Vectorizes advantage computation and overhauls RL loss with new masking/sequence controls, removing loss_scale from compute_loss and updating training/tests accordingly.

  • Advantage:
    • Vectorize compute_advantages using tensor reshaping; support length-weighted baseline via completion_lengths; remove per-group helper and return flat list.
  • Loss/Training:
    • Introduce new masking config in LossConfig: mask_low/high, seq_mask_low/high, seq_mask_neg_adv/pos_adv, seq_clip, and constant_norm.
    • Refactor compute_loss:
      • Remove loss_scale arg; add sequence- and token-level masking with new thresholds and adv-aware sequence masks; sequence ratio clipping and optional per-sequence normalization.
      • Return expanded metrics (token_masked*, seq_masked_*, and KL splits).
    • Update train.py to compute loss_scale (batch-size for sequence or constant_norm) and divide loss after compute_loss.
  • Tests:
    • Update unit tests to call compute_loss without loss_scale.

Written by Cursor Bugbot for commit 5324fd0. This will update automatically on new commits.
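
For orientation, here is a minimal sketch of what the vectorized, group-wise advantage computation described in the summary could look like. The function name matches the summary, but the argument names (rewards, group_size, completion_lengths) and the exact length-weighting scheme are illustrative assumptions, not the PR's actual signature.

# Hypothetical sketch of a vectorized, group-wise advantage computation.
# Argument names and the length-weighting scheme are assumptions for illustration.
import torch
from torch import Tensor


def compute_advantages(
    rewards: Tensor,                            # flat, shape (num_groups * group_size,)
    group_size: int,
    completion_lengths: Tensor | None = None,   # optional, same flat shape
) -> list[float]:
    # Reshape the flat reward vector into (num_groups, group_size) so the
    # per-group baseline is one reduction instead of a per-group Python loop.
    grouped = rewards.view(-1, group_size)

    if completion_lengths is None:
        baseline = grouped.mean(dim=1, keepdim=True)
    else:
        # Length-weighted baseline: longer completions contribute more weight.
        lengths = completion_lengths.view(-1, group_size).float()
        baseline = (grouped * lengths).sum(dim=1, keepdim=True) / lengths.sum(dim=1, keepdim=True)

    # Broadcast-subtract the baseline and return a flat list, matching the
    # "remove per-group helper and return flat list" item in the summary.
    return (grouped - baseline).flatten().tolist()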



def compute_loss(
    trainer_logprobs: Any,  # list of Float[Tensor, "seq_i"] with potentially different seq_i lengths
Member

lol why did we have any here before

def shift_logits(logits: Float[Tensor, "batch seq vocab"]) -> Float[Tensor, "batch seq vocab"]:
def shift_logits(logits: Tensor) -> Tensor:
    """Removes final token logits and adds a zero logit for the first token."""
    # We drop the last logit because it corresponds to the next token that will be sampled but is not here yet
Member

comments are nice no?


@jaxtyped(typechecker=typechecker)
@torch.compile(dynamic=True)
def selective_log_softmax(
Member

why remove all the jaxtyping?

Member

i think it's nice for ppl not so familiar with the code to know the input/output tensor shapes. also good to catch at runtime if we violate these

Contributor Author


Ya fair, just cus it's very set in place what the shapes would be. Will put them back

Member

it should be very set in place no?
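
For readers following the jaxtyping thread, here is a minimal sketch of the shape-annotated style being argued for, assuming the usual jaxtyping + beartype wiring; the repo's actual typechecker import, decorators, and function bodies may differ.

# Sketch of shape-annotated helpers in the style the review asks to keep.
# Assumes beartype as the runtime typechecker; the repo may wire this differently.
import torch
from torch import Tensor
from jaxtyping import Float, Int, jaxtyped
from beartype import beartype as typechecker


@jaxtyped(typechecker=typechecker)
def shift_logits(logits: Float[Tensor, "batch seq vocab"]) -> Float[Tensor, "batch seq vocab"]:
    """Removes final token logits and adds a zero logit for the first token."""
    batch, _, vocab = logits.shape
    zeros = logits.new_zeros(batch, 1, vocab)
    # Drop the last position (it predicts a token that has not been sampled yet)
    # and prepend zeros so logits line up with their input tokens.
    return torch.cat([zeros, logits[:, :-1, :]], dim=1)


@jaxtyped(typechecker=typechecker)
def selective_log_softmax(
    logits: Float[Tensor, "batch seq vocab"],
    token_ids: Int[Tensor, "batch seq"],
) -> Float[Tensor, "batch seq"]:
    # Log-probability of each observed token under the model's distribution;
    # the annotations document the shapes and catch violations at runtime.
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1).to(torch.int64)).squeeze(-1)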

seq_mask_pos_adv = (seq_ratio > cfg.seq_mask_pos_adv) & (seq_adv > 0)

if cfg.ratio_type == "sequence":
    log_ratio = trainer_logprobs - trainer_logprobs.detach() + torch.clamp(seq_log_ratio, max=cfg.seq_clip).detach()
Cursor Bugbot

Bug: Sequence mode gradient clipping behavior changed

In sequence mode, the old code applied torch.clamp after computing trainer_logprobs - trainer_logprobs.detach() + seq_log_ratio.detach(), which would zero out gradients when the sequence ratio exceeded the clip value. The new code applies clamp before .detach() on seq_log_ratio, meaning gradients always flow through trainer_logprobs - trainer_logprobs.detach() regardless of ratio magnitude. This removes the implicit gradient blocking behavior for extreme importance ratios, potentially affecting training stability.
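
To make the ordering difference concrete, here is a small self-contained toy. Names mirror the quoted snippet; the values, shapes, and inference_logprobs are made-up stand-ins rather than the PR's actual code.

# Toy illustration of the two clamp/detach orderings discussed above.
import torch

seq_clip = 0.5  # stand-in for cfg.seq_clip

trainer_logprobs = torch.tensor([-1.0, -2.0], requires_grad=True)
inference_logprobs = torch.tensor([-3.0, -4.0])
seq_log_ratio = (trainer_logprobs - inference_logprobs).sum()  # 4.0, well above seq_clip

# New ordering: the clamp is applied to a value that is then detached, so it is a
# constant either way and gradients always flow through trainer_logprobs.
log_ratio_new = trainer_logprobs - trainer_logprobs.detach() + torch.clamp(seq_log_ratio, max=seq_clip).detach()

# Old ordering: the clamp wraps the whole expression, so once the ratio exceeds
# seq_clip the output is the constant seq_clip and the gradient is blocked.
log_ratio_old = torch.clamp(trainer_logprobs - trainer_logprobs.detach() + seq_log_ratio.detach(), max=seq_clip)

log_ratio_new.sum().backward()
grad_new = trainer_logprobs.grad.clone()   # tensor([1., 1.]): gradient flows
trainer_logprobs.grad = None
log_ratio_old.sum().backward()
grad_old = trainer_logprobs.grad.clone()   # tensor([0., 0.]): gradient blocked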


Member

these look like they should be in a separate pr?

Contributor Author

ya gonna revamp this pr as there's lots of algorithm options we will want to add

@faresobeid faresobeid closed this Dec 8, 2025
