
Conversation

@BabyChouSr (Collaborator) commented Nov 18, 2025

Description

  • In-flight weight updates
  • DAPO
      • Loss function
      • Zero-variance prompt filtering
      • Length penalty


Resolves multiple unit test failures introduced by recent API changes and fixes a threading/execution-context issue in the weight transfer tests. Changes:
* `tests/rl/test_weight_transfer.py`: Fix Arrow Flight tests by adding a shared `job_context` fixture so Server and Client actors run in the same execution context.
* `tests/rl/test_curriculum.py`: Update `CurriculumConfig` initialization to include `max_seq_len`; update `RolloutStats` initialization to include `temperature` and `top_k`.
* `tests/rl/test_replay_buffer.py`: Update `Rollout` initialization to include `top_k=None`; fix batch shape assertions.
* `tests/rl/test_train_batch.py`: Add missing `pad_to` argument to `convert_rollout_to_training_format` and `create_training_batch_from_rollouts`; update expected dictionary keys accordingly.
* `tests/rl/environments/*.py`: Update `DummyInferenceContext.batch_completions` signature to accept the `top_k` argument.
@@ -243,7 +324,7 @@ def rl_train(name: str, experiment_config: ExperimentConfig) -> ExecutorStep:
     mode=WeightTransferMode.ARROW_FLIGHT,
     sync_interval_steps=1,
     # We are running on-policy, so wait for new weights from the trainer after each episode.
-    max_weight_transfer_wait_time=120,
+    max_weight_transfer_wait_time=300,
Contributor: Could we make an rl_macro.py file or something explaining where all these magic numbers come from? Or just define the macros up top with a one- or two-sentence explainer.

Contributor: Good point, addressed with a refactoring.
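
For reference, a minimal sketch of what such a constants module could look like (the file name, constant names, and the stated rationales are assumptions for illustration; only the values come from the diff above):

# rl_constants.py (hypothetical module)
# Named constants for the RL experiment wiring, each with a short explanation of
# where the number comes from.

# We run on-policy, so the rollout worker syncs weights from the trainer after every step.
WEIGHT_SYNC_INTERVAL_STEPS = 1

# Maximum time (seconds) a rollout worker waits for fresh weights before proceeding.
# Set generously because a full checkpoint transfer over Arrow Flight can take minutes
# (this rationale is an assumption, not taken from the PR).
MAX_WEIGHT_TRANSFER_WAIT_TIME = 300

Call sites would then read max_weight_transfer_wait_time=MAX_WEIGHT_TRANSFER_WAIT_TIME, so the explanation lives next to the number.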


try:
    from vllm import LLM, SamplingParams
    from vllm.outputs import RequestOutput
Contributor: +1

@AlienKevin (Contributor) commented:

RL accuracy maintained after recent merges and refactorings

As expected, MATH-500 run started at 0.29 and quickly reached close to 0.5 (0.488) in 18 steps.

(Screenshot: 2026-01-10 at 9:21:46 AM)

actor wandb, trainer wandb

test commit
test command:

uv run lib/marin/src/marin/run/ray_run.py --env_vars WANDB_API_KEY ${WANDB_API_KEY} --env_vars WANDB_ENTITY marin-community --env_vars WANDB_PROJECT marin --env_vars TPU_CI true --env_vars HF_TOKEN ${HF_TOKEN} --cluster us-central1 --extra vllm,math --no_wait -- python experiments/exp2039_rl_math500.py --force_run_failed True

"""Parameters for sampling rollouts from an environment."""

temperature: float = 1.0
top_k: int | None = None
Contributor: Given that we had a top-k bug, do we want to add a warning or something here? I don't know when you'd ever want greedy decoding for RL training.

n_generations: Number of generations per example
temperature: Sampling temperature for generation
prng_key: JAX random key for sampling
mode: "train" or "eval" - which dataset to sample from
Contributor: Maybe the base env is a good place to add a warning.
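
A minimal sketch of the kind of check suggested in the two comments above (the class and method names are assumptions; only the temperature and top_k fields are taken from the snippet above):

import warnings

class BaseEnv:  # hypothetical base environment class
    def _validate_sampling_params(self, temperature: float, top_k: int | None) -> None:
        """Warn when the sampling configuration makes rollouts (nearly) deterministic."""
        if temperature == 0.0 or top_k == 1:
            warnings.warn(
                "Rollout sampling is effectively greedy (temperature=0 or top_k=1); "
                "RL training generally needs stochastic rollouts for exploration.",
                stacklevel=2,
            )

Putting the check in the base env means every environment gets it without re-implementing the guard.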


import equinox as eqx
import jax
import jax.numpy as jnp
Contributor (general comment): We should factorize this more. Instead of having commented-out lines in PPO, just make different methods for DAPO, PPO, GRPO, etc., so we can more easily test them against one another. Normally code reuse is good, but for RL it can bite us, so a bit of repetition is fine here.

Contributor: Should be addressed in #2327.
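
As a rough illustration of that factoring, a sketch with one function per variant and no commented-out lines (compute_dapo_loss is the function named later in this thread; compute_per_example_loss is a hypothetical name; the two formulas are the normalizations debated below, not the repo's final implementations):

import jax.numpy as jnp

def compute_dapo_loss(loss_objective, loss_masks):
    """DAPO-style loss: normalize by the total number of unmasked tokens in the batch."""
    return -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks))

def compute_per_example_loss(loss_objective, loss_masks):
    """Per-example variant: each example is normalized by its own token count."""
    return -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks, axis=1))

Testing DAPO against PPO/GRPO then just means swapping which function is called.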

# loss = -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / max_output_tokens)

# more like DAPO loss
loss = -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks))
Contributor: I think the PPO loss is normalized incorrectly.

We divide each example's objective by the total number of unmasked tokens across the whole batch, and then we also average across the batch. That effectively divides by batch size twice, so the loss/gradients get smaller as you increase batch size.

We should either normalize per-example by that example's token count and then average, or do a single global token-average over the batch.

Suggested change:
-loss = -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks))
+loss = -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks, axis=1))

Contributor: Thanks for spotting this! Could you check if 7d8f7fa addresses this issue?

Reverts changes to lib/fray/src/fray/job/context.py to match main.
Refactors downstream usage of .call() in rollout_worker.py, train_worker.py, and curriculum.py to use .remote() and get_default_job_ctx().get() instead.
@AlienKevin (Contributor) commented Jan 13, 2026

Recommended Review & Merge Order

To facilitate a smooth review and merge process, I recommend reviewing these sub-PRs in the following order (foundation first, then logic, then specific experiments/features):

  1. Upgrade tpu-inference in alignment with jax==0.8.0 (#2330): foundation & dependencies
  2. Refactor Inference Context & Fix vLLM TopK (#2329): inference layer updates
  3. Support inflight weight updates (#2325): core RL infrastructure
  4. RL Loss Improvements (#2327): loss functions & sampling warnings
  5. MATH-500 RL Environment and Experiment (#2326): main experiment environment
  6. Add GSM8K RL Environment (#2324): secondary environment
  7. Classification Processing (#2328): feature-specific processing
  8. Update MockEnv logic (#2334): environment utilities
  9. Remove old RL scripts and update architecture docs (#2333): final cleanup
  10. Ray Auth workaround to be able to submit jobs (#2335)

All changes from chris/exp-rl are now strictly accounted for across these 10 disjoint PRs. instruction_datasets.py is already merged in main and is thus excluded from the diffs.

@AlienKevin (Contributor) commented Jan 16, 2026

Rerunning MATH-500 after sub-PR merges:

actor wandb, trainer wandb
test commit: 9d3fc65

The above run got stuck after 60 steps because the rollout worker kept dying.

Rerunning again with some QoL updates:
actor wandb, trainer wandb
test commit: 2c56194

@AlienKevin (Contributor) commented Jan 17, 2026

MATH-500 RL Training Regression Analysis

Summary

A change to the DAPO loss normalization in rl_losses.py switched from global token normalization to per-example normalization. While both are valid approaches, the change caused a regression in MATH-500 training performance, likely because global normalization better suits tasks requiring detailed reasoning.

Context: Original Review Comment

This change was suggested in a PR review comment:

I think the PPO loss is normalized incorrectly. We divide each example's objective by the total number of unmasked tokens across the whole batch, and then we also average across the batch. That effectively divides by batch size twice, so the loss/gradients get smaller as you increase batch size.

The concern about "dividing by batch size twice" is valid mathematically, but under AdamW this doesn't matter — AdamW normalizes gradients per-parameter, so constant scaling factors cancel out. What does matter is the relative weighting between examples, which the change altered.

Evidence from WandB

Comparing runs:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| train/loss | -0.00049 | -0.040 | 82x larger |
| grad/norm/total | 0.003 | 0.407 | 130x larger |
| pass_at_1 (final) | 0.444 | 0.386 | -13% accuracy |
| train_correct_accuracy | 0.602 | 0.530 | -12% accuracy |

Root Cause

The change was in compute_dapo_loss():

# Per-example normalization (introduced regression)
loss = -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks, axis=1))
#                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^
#                                                                    Per-example token count

# Global normalization (original behavior)
loss = -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks))
#                                                                    ^^^^^^^^^^^^^^^^^
#                                                                    Total batch tokens

Why This Matters

The two formulas represent different objectives:

| Normalization | Formula | What it rewards |
| --- | --- | --- |
| Global (/ N) | -mean(L_i / N) | Each token in the reasoning chain (process) |
| Per-example (/ n_i) | -mean(L_i / n_i) | Getting the right answer (outcome) |

Setup:

  • L_i = sum of loss_objective for example i
  • n_i = number of tokens in example i
  • N = total tokens across batch = Σn_i
  • B = batch size

Global normalization: loss = -mean(L_i / N)

∂loss/∂L_i = -1/(B × N)   # Same constant for all examples

→ Longer responses get proportionally more gradient signal

Per-example normalization: loss = -mean(L_i / n_i)

∂loss/∂L_i = -1/(B × n_i)   # Varies by example length

→ All responses contribute equally regardless of length

Concrete example:

| Example | Length (n_i) | Global weight | Per-example weight |
| --- | --- | --- | --- |
| A | 100 | 1/600 | 1/100 |
| B | 500 | 1/600 | 1/500 |

With global normalization (N=600), example B has 5x more gradient influence because it's 5x longer. With per-example normalization, both contribute equally.
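
A toy check of these weights (a sketch, not code from the repo; the per-token objective is set to 1 so the example weights are directly visible):

import jax
import jax.numpy as jnp

# Two examples of length 100 and 500, padded to 500 tokens.
loss_masks = jnp.stack([
    jnp.concatenate([jnp.ones(100), jnp.zeros(400)]),
    jnp.ones(500),
])
loss_objective = jnp.ones_like(loss_masks)  # pretend every unmasked token contributes 1

def global_norm_loss(obj, masks):
    return -1 * jnp.mean(jnp.sum(obj * masks, axis=1) / jnp.sum(masks))

def per_example_loss(obj, masks):
    return -1 * jnp.mean(jnp.sum(obj * masks, axis=1) / jnp.sum(masks, axis=1))

# Total gradient magnitude each example receives under the two normalizations:
g_global = jax.grad(lambda o: global_norm_loss(o, loss_masks))(loss_objective)
g_per_ex = jax.grad(lambda o: per_example_loss(o, loss_masks))(loss_objective)
print(jnp.sum(jnp.abs(g_global), axis=1))  # [1/12, 5/12]: example B gets 5x the weight
print(jnp.sum(jnp.abs(g_per_ex), axis=1))  # [0.5, 0.5]: both examples weighted equally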

Impact on Math Reasoning

For math reasoning, global normalization works better empirically because:

  • Correct solutions often require detailed step-by-step derivations → longer responses
  • We want to reinforce the entire reasoning process, not just the final answer
  • Per-example normalization treats a terse correct answer the same as a detailed proof

The WandB evidence shows this matters: -13% accuracy when switching to per-example normalization.

Fix

Restored original DAPO loss normalization in rl_losses.py:234-242:

def compute_dapo_loss(loss_objective, loss_masks):
    """Compute DAPO-like loss (global token normalization)."""
    return -1 * jnp.mean(jnp.sum(loss_objective * loss_masks, axis=1) / jnp.sum(loss_masks))

Why Global Normalization Works Better Here

The original reviewer's concern was that we "divide by batch size twice." Let's trace through:

loss = -1 * mean_i(L_i / N)          # N = total tokens across batch
     = -1 * (1/B) * Σ_i (L_i / N)    # B = batch size

This does produce smaller gradients with larger batches. However:

  1. AdamW cancels constant factors — The optimizer normalizes by running variance, so grad * c produces the same update as grad after a few steps.

  2. The key difference is example weighting — Global normalization weights all examples by the same factor (1/N). Per-example normalization weights each example by 1/n_i, giving shorter responses more influence.

  3. Batch size sensitivity is already handled — Learning rate schedules and hyperparameter tuning account for batch size effects. Changing the loss formula requires re-tuning hyperparameters.

Per-example normalization may be preferable for other tasks where response length shouldn't affect gradient weight.
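
To illustrate point 1 above (a constant scaling of the gradient washes out under Adam-style optimizers), a small optax check; this is a toy sketch, not code from the repo:

import jax.numpy as jnp
import optax

params = {"w": jnp.array([1.0, -2.0, 3.0])}
grads = {"w": jnp.array([0.1, -0.2, 0.3])}
scaled_grads = {"w": grads["w"] * 100.0}  # same gradient direction, scaled by a constant

opt = optax.adamw(learning_rate=1e-3, eps=1e-12)
state = opt.init(params)

update_a, _ = opt.update(grads, state, params)
update_b, _ = opt.update(scaled_grads, state, params)

# The two updates match up to eps effects: Adam divides the first moment by the
# square root of the second moment, so the constant factor cancels.
print(update_a["w"])
print(update_b["w"])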

Reference: How Tinker Handles This

For comparison, Tinker's importance_sampling loss uses sum reduction without explicit normalization:

loss = -(prob_ratio * advantages).sum()

From their documentation:

"If you would like to explore different aggregation schemes, you can include that in the advantage tensor computation."

Tinker delegates normalization to the advantage computation, which uses simple mean-centering within groups:

# tinker-cookbook/tinker_cookbook/rl/data_processing.py
def compute_advantages(trajectory_groups_P):
    for traj_group in trajectory_groups_P:
        rewards_G = torch.tensor(traj_group.get_total_rewards())
        advantages_G = rewards_G - rewards_G.mean()  # Center within group

This approach gives longer responses more gradient signal (similar to our global normalization), since the sum is not divided by sequence length.
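
A tiny JAX paraphrase of that behavior (toy numbers, not Tinker's code; it only shows how sum reduction scales signal with length, treating the per-token ratio as roughly 1):

import jax.numpy as jnp

# One prompt group with 4 rollouts; center rewards within the group as in the quoted snippet.
rewards = jnp.array([1.0, 0.0, 1.0, 0.0])
advantages = rewards - rewards.mean()               # [0.5, -0.5, 0.5, -0.5]

# With a sum-reduced per-token loss, a rollout's contribution scales with its length.
token_counts = jnp.array([100, 100, 500, 100])
per_rollout_signal = advantages * token_counts      # [50., -50., 250., -50.]
print(per_rollout_signal)                           # the 500-token rollout carries 5x the signal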

Verified to fix regression

Before the regression fix, pass@1 maxed out at 0.45 and was less stable (dipped to 0.35 at step 17):
(Screenshot: 2026-01-17 at 6:03:55 PM)

After applying the fix, which reverts the loss back to global token normalization, we reached 0.486 pass@1 on MATH-500 in 23 steps and stabilized around 0.47 for the next ~100 steps, closely matching the behavior observed before.

(Screenshot: 2026-01-17 at 5:58:46 PM)

actor wandb, trainer wandb
test commit: deb3ebd

AlienKevin added a commit that referenced this pull request Jan 19, 2026
…2376)

This PR fixes a regression in the DAPO loss computation by switching
from per-example normalization (/ n_i) back to global token
normalization (/ N). Per-example normalization gives shorter responses
disproportionately more gradient weight per token, which hurts math
reasoning tasks where correct answers often require detailed, longer
derivations. Global normalization weights every token equally, so longer,
more detailed responses contribute proportionally more gradient signal.

Check out
#2039 (comment)
for full context and experimental validation.

Co-authored-by: Claude Opus 4.5 <[email protected]>