
Conversation


AlienKevin (Contributor) commented on Jan 18, 2026

This PR fixes a regression in the DAPO loss computation by switching from per-example normalization (/ n_i) back to global token normalization (/ N). Per-example normalization gives tokens in shorter responses disproportionately more gradient weight, which hurts math reasoning tasks where correct answers often require detailed, longer derivations. Global normalization weights every token equally, so longer responses are not down-weighted.

Check out #2039 (comment) for full context and experimental validation.
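
For illustration, here is a minimal, self-contained sketch of the two normalizations being compared. This is not the code in rl_losses.py; the array shapes, function names, and masking scheme are assumptions made up for this example.

```python
import numpy as np

def per_example_normalized_loss(token_losses, mask):
    """Sum each example's token losses, divide by its own length (/ n_i),
    then average over examples. Each example gets equal total weight, so
    every token in a short response carries more gradient weight."""
    n_i = np.maximum(mask.sum(axis=1), 1.0)          # tokens per example
    per_example = (token_losses * mask).sum(axis=1) / n_i
    return per_example.mean()

def global_token_normalized_loss(token_losses, mask):
    """Sum all token losses and divide by the total token count (/ N).
    Every token in the batch carries the same gradient weight."""
    N = np.maximum(mask.sum(), 1.0)                  # total tokens in batch
    return (token_losses * mask).sum() / N

# Toy batch: one 2-token response and one 10-token response.
token_losses = np.ones((2, 10))
mask = np.zeros((2, 10))
mask[0, :2] = 1.0    # short response
mask[1, :10] = 1.0   # long response

# Per-example: each short-response token is weighted 1/(2*2) = 0.25 and each
# long-response token 1/(2*10) = 0.05 in the gradient. Global: every token is
# weighted 1/12, so long derivations are not down-weighted.
print(per_example_normalized_loss(token_losses, mask))
print(global_token_normalized_loss(token_losses, mask))
```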

Copilot AI review requested due to automatic review settings January 18, 2026 04:43

Copilot AI left a comment


Pull request overview

This PR fixes a regression in the DAPO loss computation by switching from per-example normalization to global token normalization. Per-example normalization gives tokens in shorter responses disproportionately more gradient weight, which hurts math reasoning tasks where correct answers often require longer, detailed derivations. The change ensures every token is weighted equally regardless of response length.

Changes:

  • Modified DAPO loss computation to use global token normalization instead of per-example normalization
  • Improved rollout worker logging to use weight_step consistently for metric tracking
  • Implemented deterministic seeding for rollout workers using the worker index instead of the process ID (see the sketch after this list)
  • Pinned triton dependency to version 3.5.0 for consistency
  • Increased weight transfer timeout from 300s to 600s for better stability
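
As a rough illustration of the seeding change above, the sketch below derives each rollout worker's seed from a base seed and its worker index rather than from the process ID. The names (make_worker_seed, base_seed, worker_index) are assumptions for illustration, not the actual rl_job.py code.

```python
def make_worker_seed(base_seed: int, worker_index: int) -> int:
    # The same (base_seed, worker_index) pair always yields the same seed,
    # so rollouts are reproducible across runs, while distinct workers
    # still receive distinct seeds.
    return base_seed + worker_index

seeds = [make_worker_seed(base_seed=42, worker_index=i) for i in range(4)]
# [42, 43, 44, 45] -- unlike os.getpid()-based seeding, identical on every run.
```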

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Summary per file:

  • lib/marin/src/marin/rl/rl_losses.py: Changed the DAPO loss to divide by the total tokens across the batch (global) instead of per example, preventing shorter sequences from getting disproportionate gradient weight
  • lib/marin/src/marin/rl/rollout_worker.py: Added a worker_index field and updated logging to consistently use _current_weight_step instead of a local step counter for metric tracking
  • lib/marin/src/marin/rl/rl_job.py: Refactored worker task creation to use deterministic seeding based on the worker index and removed an unused os import
  • lib/marin/src/marin/rl/rl_experiment_utils.py: Increased max_weight_transfer_wait_time from 300s to 600s for improved stability
  • lib/marin/pyproject.toml: Pinned triton==3.5.0 for the vllm extra on Linux x86_64 platforms
  • uv.lock: Consolidated the triton dependency to version 3.5.0, removing the 3.5.1 entries

Per-example normalization (/ n_i) gives tokens in shorter responses more gradient
weight, which hurts math reasoning where detailed derivations are longer.
Global normalization (/ N) weights every token equally.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
AlienKevin force-pushed the kevin/rl-fix-regression branch from deb3ebd to 5254938 on January 18, 2026 16:25
AlienKevin enabled auto-merge (squash) on January 19, 2026 06:09
AlienKevin merged commit 5ff6295 into main on Jan 19, 2026
7 of 8 checks passed
AlienKevin deleted the kevin/rl-fix-regression branch on January 19, 2026 06:10