
Conversation


@joyang-nv joyang-nv commented Sep 18, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added example configs for GRPO and SFT with Moonlight-16B, DTensor v2, sequence packing, and vLLM-based generation.
    • Auto-patching of runtime environments to locate Transformers modules across workers.
    • Expanded optional dependencies to enable advanced GEMM/EP kernels and Transformer Engine support.
  • Refactor

    • Reworked DTensor policy worker initialization and checkpoint loading for more reliable, scalable multi-parallel training.
  • Chores

    • Updated Automodel submodule branch and commit references.

@joyang-nv joyang-nv requested review from a team as code owners September 18, 2025 15:20
@joyang-nv joyang-nv marked this pull request as draft September 18, 2025 15:21

⚠️ File Consistency Check

Check based on commit: 17aa295 (PR #1160 from user/joyang/dsv3_v2)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


❌ Submodule Fast-Forward Check Failed

Check based on commit: 17aa295 (PR #1160 from user/joyang/dsv3_v2)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/71162c284d315193cbb4011081228da2ba943c27/
CURRENT (PR #1160 from user/joyang/dsv3_v2): https://github.com/NVIDIA-NeMo/Automodel/commits/c84b9c401b7633340846c77773b48a7b9be90d1f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


coderabbitai bot commented Sep 18, 2025

📝 Walkthrough

Walkthrough

Submodule config and pointer updated. Two new YAML configs added. Runtime worker environment now patches Transformers module directory. DTensor policy worker v2 refactored for new distributed/parallel setup and base-checkpoint loading. New utility added for PYTHONPATH patching. Optional dependencies extended in pyproject and build settings updated.

Changes

Cohort / File(s) Summary of Changes
Git/Submodules
\.gitmodules, 3rdparty/Automodel-workspace/Automodel
Submodule branch changed to hemil/dsv3-updates; shallow clone disabled. Submodule pointer advanced to commit c84b9c4.
Training Configurations (new)
examples/configs/grpo_math_moonlight_te_deepep.yaml, examples/configs/sft_moonlight_16b_te_deepep.yaml
Added full GRPO and SFT YAML configs for Moonlight-16B with DTensor v2, vLLM generation (for GRPO), checkpointing, logging, and resource specs.
Runtime Env Patching
nemo_rl/utils/venvs.py, nemo_rl/distributed/worker_groups.py
Added patch_transformers_module_dir to set PYTHONPATH for Transformers modules and invoked it during per-worker env setup.
DTensor Policy Worker Refactor
nemo_rl/models/policy/dtensor_policy_worker_v2.py
Reworked initialization: explicit model/model_config, new distributed meshes (dp/tp/cp/pp/ep), optional parallelize_fn, target-based model instantiation, base-checkpoint broadcast loading, and helper methods _get_distributed/_setup_parallel.
Dependencies
pyproject.toml
Added deep_gemm, deep_ep, grouped_gemm, transformer-engine[pytorch]==2.5.0 to automodel extras; added grouped_gemm to no-build-isolation-package.
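
The walkthrough mentions target-based model instantiation via dtensor_cfg.model._target_. As a rough sketch of that general pattern (the worker's actual resolver and its keyword handling may differ; the dotted path below is a stdlib stand-in so the snippet runs without extra dependencies):

import importlib

def resolve_target(dotted_path: str):
    # Split "pkg.module.ClassName" into module path and attribute, then import.
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# A config could carry e.g. model._target_ = "transformers.AutoModelForCausalLM";
# here a stdlib class is resolved purely for the demonstration.
cls = resolve_target("collections.OrderedDict")
print(cls())   # OrderedDict()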

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Orchestrator as Orchestrator
  participant Worker as DTensorPolicyWorkerV2
  participant Dist as Distributed Setup
  participant Model as Model Factory
  participant Ckpt as Base Checkpoint Loader

  Orchestrator->>Worker: initialize(policy_cfg, dtensor_cfg, ...)
  Worker->>Dist: _get_distributed(world_size, dp,tp,cp,pp,ep, seq_parallel)
  Dist-->>Worker: distributed handles (meshes, ranks)
  Worker->>Worker: _setup_parallel(dp,tp,cp,pp,ep, seq_parallel)

  alt target provided in dtensor_cfg.model._target_
    Worker->>Model: resolve target (+optional backend), build model_kwargs
    Model-->>Worker: self.model (empty/skeleton from_config)
  else no explicit target
    Worker->>Model: create skeleton model (init_empty_weights/from_config)
    Model-->>Worker: self.model
  end

  Worker->>Ckpt: load_model_from_base_checkpoint(self.model, model_config, meshes)
  Ckpt-->>Worker: weights loaded/broadcast across ranks

  Worker->>Worker: tie weights, finalize setup (parallelize_fn or fsdp2_strategy_parallelize)
  Worker-->>Orchestrator: ready
sequenceDiagram
  autonumber
  participant Manager as WorkerGroups
  participant Env as Runtime Env Vars
  participant Venv as create_local_venv_on_each_node
  participant Patch as patch_transformers_module_dir
  participant Ray as Worker Creation

  Manager->>Venv: prepare venv per node
  Venv-->>Manager: paths (VIRTUAL_ENV, UV_PROJECT_ENVIRONMENT)
  Manager->>Env: set env vars for worker
  Manager->>Patch: patch_transformers_module_dir(env_vars)
  Patch-->>Manager: env_vars with PYTHONPATH modules dir
  Manager->>Ray: create worker actors with env_vars
  Ray-->>Manager: workers ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • yfw
  • terrykong
  • yaoyu-33

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 37.50%, which is insufficient. The required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes (⚠️ Warning): This PR includes major changes: a substantial refactor of DTensorPolicyWorkerV2’s initialization and distributed topology, new environment handling, new optional heavy dependencies, and new training configs that can affect numerics, convergence, and performance. The PR description is a template with no testing details, no validation of numerical parity or convergence, and no before/after performance numbers or run context. Given the scope and potential impact, testing evidence is required but absent, so the check fails. Resolution: please update the PR description with concrete testing results: (1) unit/integration tests passing, (2) numerical parity/convergence checks versus a baseline for DTensorPolicyWorkerV2 (loss curves and final metrics), and (3) performance benchmarks before vs. after with exact configs, hardware, software versions, and seeds; include reproducible commands and links to logs/artifacts. If available, add a short smoke-test CI job that exercises the new configs to ensure launchability and basic correctness.
  • Title Check (❓ Inconclusive): The provided title "DSV3 feat branch" is short but too vague to convey the primary change(s) in this PR; while "DSV3" hints at DeepSeek/DSV3 work that appears in the diffs, the title does not summarize the main technical changes (DTensorPolicyWorkerV2 refactor, DeepSeek/DTensor integration, venv patching, new example configs, and dependency updates) or the intent for a reviewer scanning history. Because the title is generic and non-descriptive, the check is inconclusive. Resolution: please rename the PR to a concise, single-sentence title that highlights the main change (for example: "Integrate DeepSeek v3 DTensor backend and refactor DTensorPolicyWorkerV2" or "Add DeepSeek v3 support, DTensorPolicyWorkerV2 refactor, example configs, and automodel dependency updates") and avoid generic words like "branch" or "feat"; this will make the purpose immediately clear to reviewers.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch user/joyang/dsv3_v2

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)

898-925: Fix potential UnboundLocalError for num_valid_samples with dummy microbatches.

When mb_idx >= iterator_len, num_valid_samples is never set but is referenced later.

-                        # skip the update for dummy batches
-                        if mb_idx < iterator_len:
+                        # skip the update for dummy batches
+                        num_valid_samples = 0
+                        if mb_idx < iterator_len:
                             ## scale by the number of global batches so we get the correct
                             ## value when summing metrics across all microbatches
                             for k in loss_metrics.keys():
                                 loss_metrics[k] /= num_global_batches
                             num_valid_samples = loss_metrics["num_valid_samples"]
                             loss_metrics["lr"] = self.optimizer.param_groups[0]["lr"]
                             loss_metrics["global_valid_seqs"] = global_valid_seqs.item()
                             loss_metrics["global_valid_toks"] = global_valid_toks.item()
                         else:
                             loss *= 0
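
For readers skimming the diff, a standalone sketch of the failure mode being guarded against; the function and flag names below are invented for the illustration and are not the worker code:

def buggy(is_valid_batch: bool) -> int:
    if is_valid_batch:
        num_valid_samples = 4
    # On the dummy-batch path the name was never bound.
    return num_valid_samples          # UnboundLocalError when is_valid_batch is False

def fixed(is_valid_batch: bool) -> int:
    num_valid_samples = 0             # pre-initialize, as in the diff above
    if is_valid_batch:
        num_valid_samples = 4
    return num_valid_samples

print(fixed(False))   # 0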
🧹 Nitpick comments (8)
nemo_rl/distributed/worker_groups.py (1)

511-516: Use the function’s return to avoid relying on in‑place mutation; keeps intent explicit.

Minor clarity tweak; behavior unchanged.

-                patch_transformers_module_dir(runtime_env["env_vars"])
+                runtime_env["env_vars"] = patch_transformers_module_dir(runtime_env["env_vars"])
pyproject.toml (1)

185-185: no-build-isolation-package expansion: watch for build‑time tool leakage.

Keeping deep_gemm/deep_ep/grouped_gemm outside PEP‑517 isolation can surface host toolchain issues. If CI becomes flaky, consider per‑package overrides or pinning wheel artifacts.

examples/configs/sft_moonlight_16b_te_deepep.yaml (2)

36-61: Sanity on parallel dims vs cluster size.

ep=2, tp=1, cp=1, pp=1 is fine for the default 8‑GPU single node. Just note that examples should either document required GPUs or set ep=1 for portability.
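
A rough sanity sketch of the divisibility rule these comments rely on; check_parallel_dims is a hypothetical helper, and the real mesh construction in dtensor_policy_worker_v2 may partition ranks differently (for example, EP may be carved out of the DP dimension):

def check_parallel_dims(world_size: int, tp: int, cp: int, pp: int, ep: int) -> int:
    # The launch only works when world_size is divisible by the product of the
    # model-parallel dims; data parallelism takes whatever remains.
    denom = tp * cp * pp * ep
    if world_size % denom != 0:
        raise ValueError(f"world_size={world_size} not divisible by tp*cp*pp*ep={denom}")
    return world_size // denom   # data-parallel size

# SFT example from this config on a single 8-GPU node: ep=2 leaves dp=4.
print(check_parallel_dims(world_size=8, tp=1, cp=1, pp=1, ep=2))   # 4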


166-175: W&B defaults in example.

wandb_enabled: true is great, but consider setting it to false by default in public examples to avoid accidental uploads.

nemo_rl/models/policy/dtensor_policy_worker_v2.py (4)

308-309: Use self.tokenizer for clarity.

Minor readability; avoids shadowing and future refactor hazards.

-        if self.model.config.pad_token_id is None:
-            self.model.config.pad_token_id = tokenizer.pad_token_id
+        if self.model.config.pad_token_id is None:
+            self.model.config.pad_token_id = self.tokenizer.pad_token_id

620-625: Ensure GBS divisible by DP size.

Avoid silent truncation/misbatching.

         if gbs is None:
             gbs = self.cfg["train_global_batch_size"]
         if mbs is None:
             mbs = self.cfg["train_micro_batch_size"]
-        local_gbs = gbs // self.dp_size
+        if gbs % self.dp_size != 0:
+            raise ValueError(f"train_global_batch_size ({gbs}) must be divisible by dp_size ({self.dp_size})")
+        local_gbs = gbs // self.dp_size

394-399: Reduce log spam: gate verbose param dumps to rank 0 or debug mode.

Printing every param on all ranks is noisy and slows startup.

-        print(f"[Rank {self.rank}] Loading state dict done.")
-        for k, v in self.model.named_parameters():
-            print(
-                f"Model param {k}: {v.shape}, {v.dtype}, {v.placements if isinstance(v, DTensor) else None}"
-            )
+        if self.rank == 0 and os.environ.get("NRL_DEBUG_MODEL_DUMP") == "1":
+            print(f"[Rank {self.rank}] Loading state dict done. Dumping parameter summaries...")
+            for k, v in self.model.named_parameters():
+                print(f"Model param {k}: {v.shape}, {v.dtype}, {v.placements if isinstance(v, DTensor) else None}")

1337-1344: Return a dict as documented or update docstring.

Current return_model_config returns the config object, not a dict.

-    def return_model_config(self) -> dict[str, Any]:
+    def return_model_config(self) -> dict[str, Any]:
@@
-        return self.model.config
+        return dict(self.model.config.to_dict()) if hasattr(self.model.config, "to_dict") else dict(self.model.config)
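
A quick, hedged illustration of the to_dict() behaviour relied on above; GPT2Config is used only as a convenient stand-in, and any PretrainedConfig subclass behaves the same way:

from transformers import GPT2Config

cfg = GPT2Config()
cfg_dict = cfg.to_dict()              # plain dict, safe to serialize or return over RPC
print(type(cfg), type(cfg_dict))      # <class '...GPT2Config'> <class 'dict'>
print(cfg_dict["n_layer"])            # 12 by default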
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee8f5aa and 17aa295.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .gitmodules (1 hunks)
  • 3rdparty/Automodel-workspace/Automodel (1 hunks)
  • examples/configs/grpo_math_moonlight_te_deepep.yaml (1 hunks)
  • examples/configs/sft_moonlight_16b_te_deepep.yaml (1 hunks)
  • nemo_rl/distributed/worker_groups.py (2 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (8 hunks)
  • nemo_rl/utils/venvs.py (1 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_rl/distributed/worker_groups.py (1)
nemo_rl/utils/venvs.py (2)
  • create_local_venv_on_each_node (152-189)
  • patch_transformers_module_dir (193-203)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (2)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • create_context_parallel_ctx (444-468)
nemo_rl/models/policy/utils.py (1)
  • resolve_model_class (73-77)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel

[error] 1-1: Step 'Checking submodules are fast-forwarded' failed. Submodule not fast-forwarded: Automodel. Current commit: c84b9c401b7633340846c77773b48a7b9be90d1f; Target commit: 71162c284d315193cbb4011081228da2ba943c27; Common ancestor: ce8b4b55ca9045fa8c7ad29028f3380d88552e76. This indicates parallel development; manual merge may be required.

🪛 Ruff (0.12.2)
nemo_rl/models/policy/dtensor_policy_worker_v2.py

217-217: Avoid specifying long messages outside the exception class

(TRY003)


543-545: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
nemo_rl/distributed/worker_groups.py (1)

30-30: Import change looks good.

Importing patch_transformers_module_dir alongside create_local_venv_on_each_node is appropriate. No issues.

examples/configs/grpo_math_moonlight_te_deepep.yaml (1)

172-207: vLLM settings present while “use_deep_gemm: False”.

If DeepGEMM is desired given new deps, flip this when ready; otherwise leave as False to reduce variables in examples.

nemo_rl/models/policy/dtensor_policy_worker_v2.py (2)

342-357: Parallelizer call: honor PP enablement.

pp_size is read but pp_enabled=False is hardcoded. If PP>1 is configured, either enable here or assert pp_size==1.

[suggest_minor_issue]

-            parallelize_func(
+            parallelize_func(
                 self.model,
                 world_mesh=self.distributed.device_mesh,
                 moe_mesh=self.distributed.moe_mesh,
-                pp_enabled=False,
+                pp_enabled=(pp_size > 1),

67-68: Guard Gemma3 import with fallback sentinel (avoid import-time crash)

Gemma3 modeling was added in the special release tag v4.49.0-Gemma-3; wrap the import and provide a placeholder to avoid import failures on older/missing Transformers.

File: nemo_rl/models/policy/dtensor_policy_worker_v2.py
Lines: 67-68

-from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM
+try:
+    from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM  # type: ignore[attr-defined]
+except Exception:
+    class Gemma3ForCausalLM:  # fallback sentinel
+        pass

Likely an incorrect or invalid review comment.

pyproject.toml (1)

67-71: Pin git deps for reproducibility and CI stability; verify build compatibility in CUDA 12.8

  • Use immutable commit SHAs for all git URLs (grouped_gemm currently uses a tag).
  • These git deps are listed under no-build-isolation; confirm their build systems tolerate that.

Automation attempt failed: 'uv' CLI missing in the runner — cannot verify. Run the following in a fresh CUDA 12.8 image and paste the output:

#!/bin/bash
set -euo pipefail
python -V || true
pip --version || true
python -m venv .venv
# shellcheck disable=SC1091
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install ".[automodel]" --no-cache-dir
python - <<'PY'
import importlib
for m in ("deep_gemm","deep_ep","grouped_gemm"):
    print(f"{m}: {importlib.util.find_spec(m) is not None}")
PY

Comment on lines +14 to +15
branch = hemil/dsv3-updates
shallow = false

⚠️ Potential issue

🧩 Analysis chain

Verify the branch exists and assess shallow clone impact.

The branch changed from nemo-rl-submodule to hemil/dsv3-updates, and disabling shallow clones will increase repository size and clone time. Ensure the new branch exists and contains the required DSv3 updates.


🏁 Script executed:

#!/bin/bash
# Description: Verify the Automodel branch exists and assess the impact of disabling shallow clones

echo "=== Checking if the new branch exists ==="
git ls-remote --heads https://github.com/NVIDIA-NeMo/Automodel.git hemil/dsv3-updates

echo -e "\n=== Checking if the old branch exists for comparison ==="
git ls-remote --heads https://github.com/NVIDIA-NeMo/Automodel.git nemo-rl-submodule

echo -e "\n=== Getting commit information for the updated submodule pointer ==="
git ls-remote https://github.com/NVIDIA-NeMo/Automodel.git | grep c84b9c401b7633340846c77773b48a7b9be90d1f || echo "Commit not found in remote refs"

echo -e "\n=== Checking repository size to understand shallow clone impact ==="
curl -s https://api.github.com/repos/NVIDIA-NeMo/Automodel | jq '.size'

Length of output: 516


Action: ensure the submodule commit is reachable from the configured branch; shallow=false is OK.

  • hemil/dsv3-updates exists → refs/heads/hemil/dsv3-updates (20698c75f7cca86c60f426147dd50579fb00e18c).
  • The submodule commit c84b9c401b7633340846c77773b48a7b9be90d1f is NOT on that branch; it’s present at refs/heads/user/joyang/dsv3_v2 and refs/pull/476/head — update the submodule pointer to a commit on hemil/dsv3-updates or change the branch to the one containing this commit.
  • GitHub repo size: 5315 KB (~5.3 MB) — disabling shallow clones will have negligible impact.

.gitmodules (lines 14–15) still show:
branch = hemil/dsv3-updates
shallow = false

@@ -1 +1 @@
-Subproject commit 71162c284d315193cbb4011081228da2ba943c27
+Subproject commit c84b9c401b7633340846c77773b48a7b9be90d1f

⚠️ Potential issue

🧩 Analysis chain

Submodule not fast‑forward to main’s pointer; fix by rebasing/merging Automodel and updating gitlink.

GitHub Action failed: current gitlink c84b9c4 is not a descendant of main’s 71162c2. Requirement: the new submodule commit must fast‑forward from main’s pointer. Rebase or merge in the Automodel repo, push, then update this gitlink to the resulting descendant commit. If .gitmodules was switched to a feature branch, confirm this is intentional or revert to a stable branch/tag.

Run to verify ancestry and config:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

SUB_PATH="3rdparty/Automodel-workspace/Automodel"

# Ensure we have the target branch pointer
git fetch origin main:origin-main >/dev/null

NEW_SHA=$(git ls-tree -z HEAD "$SUB_PATH" | awk -F'\t' '{print $1}' | awk '{print $3}')
BASE_SHA=$(git ls-tree -z origin-main "$SUB_PATH" | awk -F'\t' '{print $1}' | awk '{print $3}')

URL=$(git config -f .gitmodules "submodule.$SUB_PATH.url" || true)
BRANCH=$(git config -f .gitmodules "submodule.$SUB_PATH.branch" || true)

echo "Submodule: $SUB_PATH"
echo "URL: ${URL:-<unset>}"
echo "Branch: ${BRANCH:-<unset>}"
echo "HEAD gitlink: $NEW_SHA"
echo "main gitlink: $BASE_SHA"

# Fetch the submodule’s objects to check ancestry
tmpdir="$(mktemp -d)"
git clone --no-checkout --filter=blob:none "${URL:?Missing .gitmodules url}" "$tmpdir/auto" >/dev/null
git -C "$tmpdir/auto" fetch --no-tags --depth=1 origin "$NEW_SHA" >/dev/null || git -C "$tmpdir/auto" fetch origin "$NEW_SHA"
git -C "$tmpdir/auto" fetch --no-tags --depth=1 origin "$BASE_SHA" >/dev/null || git -C "$tmpdir/auto" fetch origin "$BASE_SHA"

if git -C "$tmpdir/auto" merge-base --is-ancestor "$BASE_SHA" "$NEW_SHA"; then
  echo "OK: New submodule commit fast-forwards from main."
  exit 0
else
  echo "FAIL: Not fast-forward. Rebase/merge in Automodel so NEW descends from BASE, then update gitlink."
  exit 1
fi

Length of output: 769


Submodule not fast‑forward to main — rebase/merge Automodel and update gitlink

HEAD gitlink c84b9c401b7633340846c77773b48a7b9be90d1f is NOT a descendant of main’s 71162c284d315193cbb4011081228da2ba943c27 (verification: FAIL).

  • Action: In the Automodel repo, rebase or merge hemil/dsv3-updates onto main (or otherwise create a commit that descends from 71162c2), push it, then update the submodule gitlink at 3rdparty/Automodel-workspace/Automodel to that descendant commit.
  • If .gitmodules’ branch=hemil/dsv3-updates was unintentional, revert to a stable branch/tag (e.g., main) and set the gitlink accordingly.

Submodule: 3rdparty/Automodel-workspace/Automodel
URL: https://github.com/NVIDIA-NeMo/Automodel.git
Branch: hemil/dsv3-updates
HEAD gitlink: c84b9c401b7633340846c77773b48a7b9be90d1f
main gitlink: 71162c284d315193cbb4011081228da2ba943c27

🧰 Tools
🪛 GitHub Actions: Automodel Integration and Submodule Checks

[error] 1-1: Step 'Checking submodules are fast-forwarded' failed. Submodule not fast-forwarded: Automodel. Current commit: c84b9c401b7633340846c77773b48a7b9be90d1f; Target commit: 71162c284d315193cbb4011081228da2ba943c27; Common ancestor: ce8b4b55ca9045fa8c7ad29028f3380d88552e76. This indicates parallel development; manual merge may be required.

🤖 Prompt for AI Agents
In 3rdparty/Automodel-workspace/Automodel around lines 1 to 1: the submodule
gitlink (c84b9c4) is not a descendant of Automodel main (71162c2), so rebase or
merge the hemil/dsv3-updates branch in the Automodel repository onto main (or
create a new commit that descends from 71162c2), push that descendant commit to
the remote, then update the submodule gitlink at
3rdparty/Automodel-workspace/Automodel to point to the new descendant commit;
alternatively, if branch=hemil/dsv3-updates in .gitmodules was unintended,
change the submodule to a stable branch/tag (e.g., main), push changes, and
update the gitlink accordingly.

Comment on lines +49 to +60
  dtensor_cfg:
    _v2: true
    enabled: true
    cpu_offload: False
    sequence_parallel: false
    activation_checkpointing: false
    tensor_parallel_size: 1
    context_parallel_size: 1
    pipeline_parallel_size: 1
    expert_parallel_size: 8
    custom_parallel_plan: null


⚠️ Potential issue

EP=8 with a 1‑GPU cluster will not launch.

dtensor_cfg.expert_parallel_size: 8 with cluster.gpus_per_node: 1, num_nodes: 1 makes world_size (1) incompatible with dp*tp*cp*pp*ep. Set ep=1 for a single‑GPU example, or bump cluster resources accordingly.

-    expert_parallel_size: 8
+    expert_parallel_size: 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
   dtensor_cfg:
     _v2: true
     enabled: true
     cpu_offload: False
     sequence_parallel: false
     activation_checkpointing: false
     tensor_parallel_size: 1
     context_parallel_size: 1
     pipeline_parallel_size: 1
-    expert_parallel_size: 8
+    expert_parallel_size: 1
     custom_parallel_plan: null
🤖 Prompt for AI Agents
In examples/configs/grpo_math_moonlight_te_deepep.yaml around lines 49 to 60,
dtensor_cfg.expert_parallel_size is set to 8 while the example cluster has
cluster.gpus_per_node: 1 and num_nodes: 1, making the computed world_size
incompatible with dp*tp*cp*pp*ep; change expert_parallel_size to 1 for this
single‑GPU example (or alternatively update the example cluster resources by
increasing gpus_per_node and/or num_nodes so that the total world_size is
divisible by the product of data_parallel * tensor_parallel * context_parallel *
pipeline_parallel * expert_parallel).

Comment on lines +191 to +203
# Need to set PYTHONPATH to include transformers downloaded modules.
# Assuming the cache directory is the same cross venvs.
def patch_transformers_module_dir(env_vars: dict[str, str]):
    from transformers.utils.hub import TRANSFORMERS_CACHE

    module_dir = os.path.join(TRANSFORMERS_CACHE, ".." , "modules")
    assert os.path.exists(module_dir), f"TRANSFORMERS_CACHE is not set or does not exist: {module_dir}"
    if "PYTHONPATH" not in env_vars:
        env_vars["PYTHONPATH"] = module_dir
    else:
        env_vars["PYTHONPATH"] = f"{module_dir}:{env_vars['PYTHONPATH']}"

    return env_vars

🛠️ Refactor suggestion

Make PYTHONPATH patch robust and cross‑platform; avoid brittle TRANSFORMERS_CACHE path math.

  • Use the official HF modules cache (HF_MODULES_CACHE) when available.
  • Fall back gracefully (warn/no‑op) instead of asserting and crashing.
  • Use os.pathsep (not hardcoded “:”) and dedupe entries.
  • Don’t mutate the input dict in place; return a new mapping.

Apply:

-def patch_transformers_module_dir(env_vars: dict[str, str]):
-    from transformers.utils.hub import TRANSFORMERS_CACHE
-
-    module_dir =  os.path.join(TRANSFORMERS_CACHE, ".." , "modules")
-    assert os.path.exists(module_dir), f"TRANSFORMERS_CACHE is not set or does not exist: {module_dir}"
-    if "PYTHONPATH" not in env_vars:
-        env_vars["PYTHONPATH"] = module_dir
-    else:
-        env_vars["PYTHONPATH"] = f"{module_dir}:{env_vars['PYTHONPATH']}"
-
-    return env_vars
+def patch_transformers_module_dir(env_vars: dict[str, str]) -> dict[str, str]:
+    # Prefer the HF modules cache; fall back to TRANSFORMERS_CACHE/../modules.
+    modules_dir: str | None = None
+    try:
+        from huggingface_hub.constants import HF_MODULES_CACHE  # type: ignore
+        modules_dir = str(HF_MODULES_CACHE)
+    except Exception:
+        try:
+            from transformers.utils.hub import TRANSFORMERS_CACHE  # type: ignore
+            modules_dir = os.path.normpath(os.path.join(str(TRANSFORMERS_CACHE), "..", "modules"))
+        except Exception:
+            modules_dir = None
+
+    if not modules_dir or not os.path.isdir(modules_dir):
+        logger.warning("HF modules cache not found; skipping PYTHONPATH patch (looked at %s).", modules_dir)
+        return dict(env_vars)
+
+    new_env = dict(env_vars)
+    current = new_env.get("PYTHONPATH", "")
+    parts = [modules_dir] + ([current] if current else [])
+    # Deduplicate while preserving order
+    seen = set()
+    dedup = []
+    for p in parts:
+        if p and p not in seen:
+            dedup.append(p)
+            seen.add(p)
+    new_env["PYTHONPATH"] = os.pathsep.join(dedup)
+    return new_env

Committable suggestion skipped: line range outside the PR's diff.
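
For reference, a self-contained sketch of the portable prepend-and-dedupe behaviour suggested above; prepend_to_pythonpath is a hypothetical helper name (not part of nemo_rl), and the modules path in the example call is made up:

import os

def prepend_to_pythonpath(env_vars: dict[str, str], directory: str) -> dict[str, str]:
    # Return a new mapping instead of mutating the caller's dict.
    new_env = dict(env_vars)
    parts = [directory] + [p for p in new_env.get("PYTHONPATH", "").split(os.pathsep) if p]
    deduped: list[str] = []
    for p in parts:
        if p not in deduped:
            deduped.append(p)
    new_env["PYTHONPATH"] = os.pathsep.join(deduped)   # os.pathsep, not a hardcoded ":"
    return new_env

patched = prepend_to_pythonpath({"PYTHONPATH": "/opt/extra"}, "/cache/huggingface/modules")
print(patched["PYTHONPATH"])   # "/cache/huggingface/modules:/opt/extra" on POSIX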

Signed-off-by: Jonas Yang <[email protected]>

⚠️ File Consistency Check

Check based on commit: 95be6ad (PR #1160 from user/joyang/dsv3_v2)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


❌ Submodule Fast-Forward Check Failed

Check based on commit: 95be6ad (PR #1160 from user/joyang/dsv3_v2)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/71162c284d315193cbb4011081228da2ba943c27/
CURRENT (PR #1160 from user/joyang/dsv3_v2): https://github.com/NVIDIA-NeMo/Automodel/commits/c84b9c401b7633340846c77773b48a7b9be90d1f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
