Conversation

Contributor

@jseppanen jseppanen commented Sep 18, 2025

What does this PR do ?

Fix OOM after validation when training with colocation enabled.

Currently, the optimizer state is not offloaded from GPU memory when a refit is not triggered (e.g., after validation), which causes vLLM to run out of memory in the next generation step after validation.

Summary by CodeRabbit

  • Performance

    • Improved memory management during colocated inference, reducing peak memory usage during generation.
    • Reduced risk of out-of-memory errors; can improve training stability and throughput in affected workflows.
  • Chores

    • No changes to public APIs or configuration required.

@jseppanen jseppanen requested a review from a team as a code owner September 18, 2025 12:48
Contributor

coderabbitai bot commented Sep 18, 2025

📝 Walkthrough

Walkthrough

Adds guarded calls to policy.offload_after_refit() before policy_generation.prepare_for_generation() in grpo_train when colocated inference is enabled, in two code paths. No other logic, signatures, or error handling changed.

Changes

Cohort: GRPO training memory offload before generation
File(s): nemo_rl/algorithms/grpo.py
Summary: In two branches of grpo_train, when colocated inference is enabled, insert policy.offload_after_refit() before policy_generation.prepare_for_generation() to free optimizer memory prior to generation. No API/signature changes.
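The guarded call described above can be illustrated with a minimal, self-contained sketch. The toy classes below stand in for nemo_rl's Policy and PolicyGeneration; only the method names (offload_after_refit, prepare_for_generation) mirror the real API, and the memory bookkeeping is simulated with a flag:

```python
class ToyPolicy:
    """Stand-in for nemo_rl's Policy; only offload_after_refit mirrors the real API."""

    def __init__(self):
        self.optimizer_on_gpu = True  # optimizer state resident on GPU after refit/update

    def offload_after_refit(self):
        # The real policy workers move optimizer buffers off the GPU here.
        self.optimizer_on_gpu = False


class ToyGeneration:
    """Stand-in for PolicyGeneration backed by a colocated inference engine."""

    def __init__(self, policy):
        self.policy = policy

    def prepare_for_generation(self):
        # Colocated generation needs the GPU memory the optimizer was holding.
        if self.policy.optimizer_on_gpu:
            raise MemoryError("would OOM: optimizer state still on GPU")
        return True


def generation_step(policy, policy_generation, colocated_inference):
    # The guarded call this PR adds before generation in both code paths:
    if colocated_inference:
        policy.offload_after_refit()  # unload optimizer to make space for generation
    return policy_generation.prepare_for_generation()


policy = ToyPolicy()
ready = generation_step(policy, ToyGeneration(policy), colocated_inference=True)
```

Without the guarded offload, prepare_for_generation() in this toy raises MemoryError while the optimizer flag is still set, mirroring the OOM the PR fixes.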

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as GRPO Trainer
  participant P as Policy
  participant G as PolicyGeneration

  rect rgba(230,240,255,0.5)
    note over T,P: Training step (colocated inference enabled)
    T->>P: refit/update (optimizer in memory)
    alt Colocated inference
      T->>P: offload_after_refit()
      note right of P: Unload optimizer to free memory
    end
    T->>G: prepare_for_generation()
    G-->>T: ready
    T->>G: generate()
    G-->>T: samples
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the primary change: fixing an out-of-memory (OOM) during validation when training with colocation enabled. It is concise, specific to the reported issue in the PR objectives, and clearly communicates the intent to reviewers.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed Based on the provided summary and objectives, this PR introduces a small, targeted change in grpo_train to offload optimizer state before generation when colocation is enabled, with no public API/signature changes. This is a memory-management fix to prevent OOM after validation and does not alter training numerics or core algorithm logic. While offloading could have minor performance implications, it is scoped and conditional, not a major feature or refactor. Therefore, in the absence of explicit test results in the description, the change qualifies as minor and the check passes.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_rl/algorithms/grpo.py (2)

621-623: Gate offload to vLLM backend; clarify comment

Unconditionally offloading on any colocated setup may cause unnecessary CPU↔GPU churn for Megatron-backed generation. Restrict to vLLM where the OOM was observed, and tighten the comment.

-                        if colocated_inference:
-                            policy.offload_after_refit()  # unload optimizer to make space for generation
+                        if colocated_inference and master_config["policy"]["generation"]["backend"] == "vllm":
+                            policy.offload_after_refit()  # offload model/optimizer buffers to CPU to free GPU memory for vLLM generation

775-777: Same gating for validation path

Apply the same vLLM-only guard here to avoid device thrash on Megatron.

-                        if colocated_inference:
-                            policy.offload_after_refit()  # unload optimizer to make space for generation
+                        if colocated_inference and master_config["policy"]["generation"]["backend"] == "vllm":
+                            policy.offload_after_refit()  # offload model/optimizer buffers to CPU to free GPU memory for vLLM generation
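The backend gate suggested in both nitpicks can be sketched as a small predicate. The master_config keys ("policy" → "generation" → "backend") are taken from the review diff above; the function name and dict-based config access are illustrative assumptions, not the actual nemo_rl implementation:

```python
def should_offload_before_generation(colocated_inference: bool, master_config: dict) -> bool:
    """Offload only when training and generation share GPUs AND the backend is vLLM.

    Restricting to vLLM avoids unnecessary CPU<->GPU transfers for
    Megatron-backed generation, where the OOM was not observed.
    """
    backend = (
        master_config.get("policy", {})
        .get("generation", {})
        .get("backend")
    )
    return colocated_inference and backend == "vllm"


# Illustrative configs exercising both backends mentioned in the review:
vllm_cfg = {"policy": {"generation": {"backend": "vllm"}}}
megatron_cfg = {"policy": {"generation": {"backend": "megatron"}}}
```

With this predicate, the offload fires only for colocated vLLM generation and is skipped for Megatron or non-colocated setups.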
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee8f5aa and 92e0e07.

📒 Files selected for processing (1)
  • nemo_rl/algorithms/grpo.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_rl/algorithms/grpo.py (4)
nemo_rl/models/policy/lm_policy.py (1)
  • offload_after_refit (581-584)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • offload_after_refit (1759-1780)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • offload_after_refit (1437-1460)
nemo_rl/models/policy/interfaces.py (1)
  • offload_after_refit (119-120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Sep 23, 2025
@terrykong terrykong enabled auto-merge (squash) September 23, 2025 18:56
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 30, 2025
@terrykong terrykong merged commit cc8a93e into main Oct 1, 2025
40 of 41 checks passed
@terrykong terrykong deleted the jseppanen/fix-validation-oom branch October 1, 2025 01:16
Labels
CI:L1 Run doctests, unit tests, and functional tests r0.4.0
5 participants