[RL] Fix loss: use global token normalization instead of per-example #2376

AlienKevin · 2026-01-18T04:43:39Z

This PR fixes a regression in the DAPO loss computation by switching from per-example normalization (/ n_i, the token count of response i) back to global token normalization (/ N, the total token count across the batch). Per-example normalization gives each token in a shorter response disproportionately more gradient weight, which hurts math reasoning tasks where correct answers often require detailed, longer derivations. Global normalization weights every token equally, regardless of the length of the response it belongs to.

Check out #2039 (comment) for full context and experimental validation.
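The difference between the two normalizations can be illustrated with a minimal plain-Python sketch. This is not the code in `rl_losses.py`: the function names and the toy batch are hypothetical, and the per-example variant is assumed to average the per-response means over the batch.

```python
from typing import List


def per_example_normalized_loss(token_losses: List[List[float]]) -> float:
    """Average each response over its own n_i tokens, then average over responses."""
    per_example_means = [sum(ex) / len(ex) for ex in token_losses]
    return sum(per_example_means) / len(per_example_means)


def global_token_normalized_loss(token_losses: List[List[float]]) -> float:
    """Sum every token loss in the batch and divide by the total token count N."""
    total_tokens = sum(len(ex) for ex in token_losses)
    return sum(sum(ex) for ex in token_losses) / total_tokens


# Toy batch (hypothetical values): a 2-token response with loss 1.0 per token
# and an 8-token response with loss 2.0 per token.
batch = [[1.0] * 2, [2.0] * 8]

per_example = per_example_normalized_loss(batch)  # (1.0 + 2.0) / 2
global_tok = global_token_normalized_loss(batch)  # (2 * 1.0 + 8 * 2.0) / 10

print(per_example, global_tok)
```

In this toy batch the long response supplies 8 of the 10 tokens, so the global value sits closer to its per-token loss, whereas the per-example value treats both responses as equal halves.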

Copilot

Pull request overview

This PR fixes a regression in the DAPO loss computation by switching from per-example normalization to global token normalization. Per-example normalization gives each token in a shorter response disproportionately more gradient weight, which hurts math reasoning tasks where correct answers often require longer, detailed derivations. The change ensures all tokens are weighted equally regardless of the length of the response they belong to.
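One way to write the two normalizations, as a sketch rather than the exact form used in the code: here $\ell_{i,t}$ (the loss of token $t$ in response $i$), $B$ (the number of responses in the batch), and the averaging over $B$ in the per-example form are assumptions introduced for illustration, while $n_i$ and $N$ follow the PR description.

$$
\mathcal{L}_{\text{per-example}} = \frac{1}{B}\sum_{i=1}^{B}\frac{1}{n_i}\sum_{t=1}^{n_i}\ell_{i,t},
\qquad
\mathcal{L}_{\text{global}} = \frac{1}{N}\sum_{i=1}^{B}\sum_{t=1}^{n_i}\ell_{i,t},
\qquad
N = \sum_{i=1}^{B} n_i
$$

Under these forms, each token of response $i$ carries weight $1/(B\,n_i)$ in the first expression and $1/N$ in the second, which is why tokens in short responses are up-weighted only under per-example normalization.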

Changes:

- Modified DAPO loss computation to use global token normalization instead of per-example normalization
- Improved rollout worker logging to use weight_step consistently for metric tracking
- Implemented deterministic seeding for rollout workers using worker index instead of process ID
- Pinned triton dependency to version 3.5.0 for consistency
- Increased weight transfer timeout from 300s to 600s for better stability
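The seeding change in the list above can be sketched as follows. This is only an illustration of why a worker index is preferable to a process ID: `worker_seed`, `sample_rollout_noise`, the base seed, and the `base_seed + worker_index` derivation are all hypothetical and are not taken from `rl_job.py`.

```python
import random
from typing import List


def worker_seed(base_seed: int, worker_index: int) -> int:
    """Hypothetical helper: derive a per-worker seed from a stable worker index.

    A process ID changes on every launch, so a seed derived from it cannot be
    reproduced; a worker index stays the same across launches.
    """
    return base_seed + worker_index


def sample_rollout_noise(seed: int, n: int = 3) -> List[float]:
    """Stand-in for any seeded randomness a rollout worker might use."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two "launches" of the same two workers produce identical streams,
# while the workers still differ from each other.
first_run = [sample_rollout_noise(worker_seed(1234, i)) for i in range(2)]
second_run = [sample_rollout_noise(worker_seed(1234, i)) for i in range(2)]
```

With an index-based seed, rerunning a job with the same base seed reproduces each worker's random stream, which would not hold if the seed depended on the operating system's process ID.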

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| `lib/marin/src/marin/rl/rl_losses.py` | Changed DAPO loss to divide by total tokens across batch (global) instead of per-example, preventing shorter sequences from getting disproportionate gradient weight |
| `lib/marin/src/marin/rl/rollout_worker.py` | Added worker_index field and updated logging to consistently use _current_weight_step instead of local step counter for metric tracking |
| `lib/marin/src/marin/rl/rl_job.py` | Refactored worker task creation to use deterministic seeding based on worker index and removed unused os import |
| `lib/marin/src/marin/rl/rl_experiment_utils.py` | Increased max_weight_transfer_wait_time from 300s to 600s for improved stability |
| `lib/marin/pyproject.toml` | Pinned triton==3.5.0 for vllm extra on Linux x86_64 platforms |
| `uv.lock` | Consolidated triton dependency to version 3.5.0, removing 3.5.1 entries |

Per-example normalization (/ n_i) gives each token in a shorter response more gradient weight, which hurts math reasoning where detailed derivations are longer. Global normalization (/ N) weights all tokens equally.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Copilot AI review requested due to automatic review settings January 18, 2026 04:43

Copilot started reviewing on behalf of AlienKevin January 18, 2026 04:44

AlienKevin requested a review from ahmeda14960 January 18, 2026 04:45

Copilot AI reviewed Jan 18, 2026

AlienKevin force-pushed the kevin/rl-fix-regression branch from deb3ebd to 5254938 Compare January 18, 2026 16:25

ahmeda14960 approved these changes Jan 19, 2026

AlienKevin added 2 commits January 18, 2026 21:58

Merge branch 'main' into kevin/rl-fix-regression (0669d3c)

Merge branch 'main' into kevin/rl-fix-regression (f707541)

AlienKevin enabled auto-merge (squash) January 19, 2026 06:09

AlienKevin merged commit 5ff6295 into main Jan 19, 2026
7 of 8 checks passed

AlienKevin deleted the kevin/rl-fix-regression branch January 19, 2026 06:10
