fix: Fix OOM in validation during colocated training #1159
Conversation
Signed-off-by: Jarno Seppänen <[email protected]>
📝 Walkthrough

Adds guarded calls to policy.offload_after_refit() before policy_generation.prepare_for_generation() in grpo_train when colocated inference is enabled, in two code paths. No other logic, signatures, or error handling changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant T as GRPO Trainer
    participant P as Policy
    participant G as PolicyGeneration
    rect rgba(230,240,255,0.5)
        note over T,P: Training step (colocated inference enabled)
        T->>P: refit/update (optimizer in memory)
        alt Colocated inference
            T->>P: offload_after_refit()
            note right of P: Unload optimizer to free memory
        end
        T->>G: prepare_for_generation()
        G-->>T: ready
        T->>G: generate()
        G-->>T: samples
    end
```
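In plain Python, the guarded step in the diagram amounts to the following minimal sketch. The stub classes and the example prompt are placeholders, not the real NeMo RL types; only the names colocated_inference, offload_after_refit, prepare_for_generation, and generate come from this PR.

```python
# Minimal sketch of the guard this PR adds (simplified; stub objects
# stand in for the real Policy / PolicyGeneration classes).

class Policy:
    def offload_after_refit(self) -> None:
        print("optimizer state moved off the GPU")

class PolicyGeneration:
    def prepare_for_generation(self) -> None:
        print("generation engine allocated GPU memory")

    def generate(self, prompts: list[str]) -> list[str]:
        return [p + " ..." for p in prompts]

policy = Policy()
policy_generation = PolicyGeneration()
colocated_inference = True  # taken from the run configuration in practice

# The fix: when training and generation share GPUs, free the optimizer's
# memory before the generation engine claims space.
if colocated_inference:
    policy.offload_after_refit()

policy_generation.prepare_for_generation()
samples = policy_generation.generate(["example prompt"])
```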
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches: ✅ Passed checks (4 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_rl/algorithms/grpo.py (2)
621-623: Gate offload to vLLM backend; clarify comment

Unconditionally offloading on any colocated setup may cause unnecessary CPU↔GPU churn for Megatron-backed generation. Restrict to vLLM, where the OOM was observed, and tighten the comment.

```diff
-        if colocated_inference:
-            policy.offload_after_refit()  # unload optimizer to make space for generation
+        if colocated_inference and master_config["policy"]["generation"]["backend"] == "vllm":
+            policy.offload_after_refit()  # offload model/optimizer buffers to CPU to free GPU memory for vLLM generation
```
775-777: Same gating for validation path

Apply the same vLLM-only guard here to avoid device thrash on Megatron.

```diff
-        if colocated_inference:
-            policy.offload_after_refit()  # unload optimizer to make space for generation
+        if colocated_inference and master_config["policy"]["generation"]["backend"] == "vllm":
+            policy.offload_after_refit()  # offload model/optimizer buffers to CPU to free GPU memory for vLLM generation
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/algorithms/grpo.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_rl/algorithms/grpo.py (4)
- nemo_rl/models/policy/lm_policy.py (1): offload_after_refit (581-584)
- nemo_rl/models/policy/megatron_policy_worker.py (1): offload_after_refit (1759-1780)
- nemo_rl/models/policy/dtensor_policy_worker.py (1): offload_after_refit (1437-1460)
- nemo_rl/models/policy/interfaces.py (1): offload_after_refit (119-120)
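For orientation: offload_after_refit is declared once on the policy interface and implemented by each worker listed above. A hedged sketch of what such an interface method plausibly looks like follows; the class name is hypothetical and the real declaration in nemo_rl/models/policy/interfaces.py may differ.

```python
# Hypothetical sketch of the shared interface method; the actual
# declaration in nemo_rl/models/policy/interfaces.py may differ.
from abc import ABC, abstractmethod

class PolicyInterface(ABC):  # hypothetical name, for illustration only
    @abstractmethod
    def offload_after_refit(self) -> None:
        """Move optimizer (and other refit-only) state off the GPU so a
        colocated generation backend can allocate the freed memory."""
```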
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
What does this PR do?
Fix OOM after validation when training with colocation enabled.
Currently, optimizer state is not offloaded from GPU memory when refit is not triggered (e.g. after validation), which causes vLLM to run out of memory in the next generation step after validation.
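The following hedged sketch contrasts the two behaviors; refit_needed and the refit step are hypothetical stand-ins (the offload may happen inside the refit routine in the real code), while the other names come from this PR.

```python
# Hedged illustration of the failure mode described above.

def generation_step_before_fix(policy, policy_generation, refit_needed: bool):
    # Offload was coupled to refit, so the post-validation path
    # (refit_needed=False) left optimizer state on the GPU and the
    # next vLLM generation could OOM.
    if refit_needed:
        policy.offload_after_refit()
    policy_generation.prepare_for_generation()

def generation_step_after_fix(policy, policy_generation, colocated_inference: bool):
    # Offload is keyed on colocation instead, so memory is always
    # freed before the colocated engine allocates.
    if colocated_inference:
        policy.offload_after_refit()
    policy_generation.prepare_for_generation()
```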
Summary by CodeRabbit
- Performance
- Chores