
Conversation


@joyang-nv joyang-nv commented Sep 18, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added example configs for GRPO and SFT with Moonlight-16B, DTensor v2, sequence packing, and vLLM-based generation.
    • Auto-patching of runtime environments to locate Transformers modules across workers.
    • Expanded optional dependencies to enable advanced GEMM/EP kernels and Transformer Engine support.
  • Refactor

    • Reworked DTensor policy worker initialization and checkpoint loading for more reliable, scalable multi-parallel training.
  • Chores

    • Updated Automodel submodule branch and commit references.

@joyang-nv joyang-nv requested review from a team as code owners September 18, 2025 15:20
@joyang-nv joyang-nv marked this pull request as draft September 18, 2025 15:21

⚠️ File Consistency Check

Check based on commit: 17aa295 (PR #1160 from user/joyang/dsv3_v2)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


❌ Submodule Fast-Forward Check Failed

Check based on commit: 17aa295 (PR #1160 from user/joyang/dsv3_v2)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/71162c284d315193cbb4011081228da2ba943c27/
CURRENT (PR #1160 from user/joyang/dsv3_v2): https://github.com/NVIDIA-NeMo/Automodel/commits/c84b9c401b7633340846c77773b48a7b9be90d1f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


coderabbitai bot commented Sep 18, 2025

📝 Walkthrough

Walkthrough

Submodule config and pointer updated. Two new YAML configs added. Runtime worker environment now patches Transformers module directory. DTensor policy worker v2 refactored for new distributed/parallel setup and base-checkpoint loading. New utility added for PYTHONPATH patching. Optional dependencies extended in pyproject and build settings updated.

Changes

Cohort / File(s) Summary of Changes
Git/Submodules
\.gitmodules, 3rdparty/Automodel-workspace/Automodel
Submodule branch changed to hemil/dsv3-updates; shallow clone disabled. Submodule pointer advanced to commit c84b9c4.
Training Configurations (new)
examples/configs/grpo_math_moonlight_te_deepep.yaml, examples/configs/sft_moonlight_16b_te_deepep.yaml
Added full GRPO and SFT YAML configs for Moonlight-16B with DTensor v2, vLLM generation (for GRPO), checkpointing, logging, and resource specs.
Runtime Env Patching
nemo_rl/utils/venvs.py, nemo_rl/distributed/worker_groups.py
Added patch_transformers_module_dir to set PYTHONPATH for Transformers modules and invoked it during per-worker env setup.
DTensor Policy Worker Refactor
nemo_rl/models/policy/dtensor_policy_worker_v2.py
Reworked initialization: explicit model/model_config, new distributed meshes (dp/tp/cp/pp/ep), optional parallelize_fn, target-based model instantiation, base-checkpoint broadcast loading, and helper methods _get_distributed/_setup_parallel.
Dependencies
pyproject.toml
Added deep_gemm, deep_ep, grouped_gemm, transformer-engine[pytorch]==2.5.0 to automodel extras; added grouped_gemm to no-build-isolation-package.
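
The walkthrough mentions target-based model instantiation via dtensor_cfg.model._target_. As a rough sketch of that general pattern (the worker's actual resolver and its keyword handling may differ; the dotted path below is a stdlib stand-in so the snippet runs without extra dependencies):

import importlib

def resolve_target(dotted_path: str):
    # Split "pkg.module.ClassName" into module path and attribute, then import.
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# A config could carry e.g. model._target_ = "transformers.AutoModelForCausalLM";
# here a stdlib class is resolved purely for the demonstration.
cls = resolve_target("collections.OrderedDict")
print(cls())   # OrderedDict()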

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Orchestrator as Orchestrator
  participant Worker as DTensorPolicyWorkerV2
  participant Dist as Distributed Setup
  participant Model as Model Factory
  participant Ckpt as Base Checkpoint Loader

  Orchestrator->>Worker: initialize(policy_cfg, dtensor_cfg, ...)
  Worker->>Dist: _get_distributed(world_size, dp,tp,cp,pp,ep, seq_parallel)
  Dist-->>Worker: distributed handles (meshes, ranks)
  Worker->>Worker: _setup_parallel(dp,tp,cp,pp,ep, seq_parallel)

  alt target provided in dtensor_cfg.model._target_
    Worker->>Model: resolve target (+optional backend), build model_kwargs
    Model-->>Worker: self.model (empty/skeleton from_config)
  else no explicit target
    Worker->>Model: create skeleton model (init_empty_weights/from_config)
    Model-->>Worker: self.model
  end

  Worker->>Ckpt: load_model_from_base_checkpoint(self.model, model_config, meshes)
  Ckpt-->>Worker: weights loaded/broadcast across ranks

  Worker->>Worker: tie weights, finalize setup (parallelize_fn or fsdp2_strategy_parallelize)
  Worker-->>Orchestrator: ready
sequenceDiagram
  autonumber
  participant Manager as WorkerGroups
  participant Env as Runtime Env Vars
  participant Venv as create_local_venv_on_each_node
  participant Patch as patch_transformers_module_dir
  participant Ray as Worker Creation

  Manager->>Venv: prepare venv per node
  Venv-->>Manager: paths (VIRTUAL_ENV, UV_PROJECT_ENVIRONMENT)
  Manager->>Env: set env vars for worker
  Manager->>Patch: patch_transformers_module_dir(env_vars)
  Patch-->>Manager: env_vars with PYTHONPATH modules dir
  Manager->>Ray: create worker actors with env_vars
  Ray-->>Manager: workers ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • yfw
  • terrykong
  • yaoyu-33

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 37.50%, which is insufficient. The required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes (⚠️ Warning): This PR includes major changes: a substantial refactor of DTensorPolicyWorkerV2’s initialization and distributed topology, new environment handling, new optional heavy dependencies, and new training configs that can affect numerics, convergence, and performance. The PR description is a template with no testing details, no validation of numerical parity or convergence, and no before/after performance numbers or run context. Given the scope and potential impact, testing evidence is required but absent, so the check fails. Resolution: please update the PR description with concrete testing results: (1) unit/integration tests passing, (2) numerical parity/convergence checks versus a baseline for DTensorPolicyWorkerV2 (loss curves and final metrics), and (3) performance benchmarks before vs. after with exact configs, hardware, software versions, and seeds; include reproducible commands and links to logs/artifacts. If available, add a short smoke-test CI job that exercises the new configs to ensure launchability and basic correctness.
  • Title Check (❓ Inconclusive): The provided title "DSV3 feat branch" is short but too vague to convey the primary change(s) in this PR; while "DSV3" hints at DeepSeek/DSV3 work that appears in the diffs, the title does not summarize the main technical changes (DTensorPolicyWorkerV2 refactor, DeepSeek/DTensor integration, venv patching, new example configs, and dependency updates) or the intent for a reviewer scanning history. Because the title is generic and non-descriptive, the check is inconclusive. Resolution: please rename the PR to a concise, single-sentence title that highlights the main change (for example: "Integrate DeepSeek v3 DTensor backend and refactor DTensorPolicyWorkerV2" or "Add DeepSeek v3 support, DTensorPolicyWorkerV2 refactor, example configs, and automodel dependency updates") and avoid generic words like "branch" or "feat"; this will make the purpose immediately clear to reviewers.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch user/joyang/dsv3_v2

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)

898-925: Fix potential UnboundLocalError for num_valid_samples with dummy microbatches.

When mb_idx >= iterator_len, num_valid_samples is never set but is referenced later.

-                        # skip the update for dummy batches
-                        if mb_idx < iterator_len:
+                        # skip the update for dummy batches
+                        num_valid_samples = 0
+                        if mb_idx < iterator_len:
                             ## scale by the number of global batches so we get the correct
                             ## value when summing metrics across all microbatches
                             for k in loss_metrics.keys():
                                 loss_metrics[k] /= num_global_batches
                             num_valid_samples = loss_metrics["num_valid_samples"]
                             loss_metrics["lr"] = self.optimizer.param_groups[0]["lr"]
                             loss_metrics["global_valid_seqs"] = global_valid_seqs.item()
                             loss_metrics["global_valid_toks"] = global_valid_toks.item()
                         else:
                             loss *= 0
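
For readers skimming the diff, a standalone sketch of the failure mode being guarded against; the function and flag names below are invented for the illustration and are not the worker code:

def buggy(is_valid_batch: bool) -> int:
    if is_valid_batch:
        num_valid_samples = 4
    # On the dummy-batch path the name was never bound.
    return num_valid_samples          # UnboundLocalError when is_valid_batch is False

def fixed(is_valid_batch: bool) -> int:
    num_valid_samples = 0             # pre-initialize, as in the diff above
    if is_valid_batch:
        num_valid_samples = 4
    return num_valid_samples

print(fixed(False))   # 0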
🧹 Nitpick comments (8)
nemo_rl/distributed/worker_groups.py (1)

511-516: Use the function’s return to avoid relying on in‑place mutation; keeps intent explicit.

Minor clarity tweak; behavior unchanged.

-                patch_transformers_module_dir(runtime_env["env_vars"])
+                runtime_env["env_vars"] = patch_transformers_module_dir(runtime_env["env_vars"])
pyproject.toml (1)

185-185: no-build-isolation-package expansion: watch for build‑time tool leakage.

Keeping deep_gemm/deep_ep/grouped_gemm outside PEP‑517 isolation can surface host toolchain issues. If CI becomes flaky, consider per‑package overrides or pinning wheel artifacts.

examples/configs/sft_moonlight_16b_te_deepep.yaml (2)

36-61: Sanity on parallel dims vs cluster size.

ep=2, tp=1, cp=1, pp=1 is fine for the default 8‑GPU single node. Just note that examples should either document required GPUs or set ep=1 for portability.
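
A rough sanity sketch of the divisibility rule these comments rely on; check_parallel_dims is a hypothetical helper, and the real mesh construction in dtensor_policy_worker_v2 may partition ranks differently (for example, EP may be carved out of the DP dimension):

def check_parallel_dims(world_size: int, tp: int, cp: int, pp: int, ep: int) -> int:
    # The launch only works when world_size is divisible by the product of the
    # model-parallel dims; data parallelism takes whatever remains.
    denom = tp * cp * pp * ep
    if world_size % denom != 0:
        raise ValueError(f"world_size={world_size} not divisible by tp*cp*pp*ep={denom}")
    return world_size // denom   # data-parallel size

# SFT example from this config on a single 8-GPU node: ep=2 leaves dp=4.
print(check_parallel_dims(world_size=8, tp=1, cp=1, pp=1, ep=2))   # 4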


166-175: W&B defaults in example.

wandb_enabled: true is great, but consider setting it to false by default in public examples to avoid accidental uploads.

nemo_rl/models/policy/dtensor_policy_worker_v2.py (4)

308-309: Use self.tokenizer for clarity.

Minor readability; avoids shadowing and future refactor hazards.

-        if self.model.config.pad_token_id is None:
-            self.model.config.pad_token_id = tokenizer.pad_token_id
+        if self.model.config.pad_token_id is None:
+            self.model.config.pad_token_id = self.tokenizer.pad_token_id

620-625: Ensure GBS divisible by DP size.

Avoid silent truncation/misbatching.

         if gbs is None:
             gbs = self.cfg["train_global_batch_size"]
         if mbs is None:
             mbs = self.cfg["train_micro_batch_size"]
-        local_gbs = gbs // self.dp_size
+        if gbs % self.dp_size != 0:
+            raise ValueError(f"train_global_batch_size ({gbs}) must be divisible by dp_size ({self.dp_size})")
+        local_gbs = gbs // self.dp_size

394-399: Reduce log spam: gate verbose param dumps to rank 0 or debug mode.

Printing every param on all ranks is noisy and slows startup.

-        print(f"[Rank {self.rank}] Loading state dict done.")
-        for k, v in self.model.named_parameters():
-            print(
-                f"Model param {k}: {v.shape}, {v.dtype}, {v.placements if isinstance(v, DTensor) else None}"
-            )
+        if self.rank == 0 and os.environ.get("NRL_DEBUG_MODEL_DUMP") == "1":
+            print(f"[Rank {self.rank}] Loading state dict done. Dumping parameter summaries...")
+            for k, v in self.model.named_parameters():
+                print(f"Model param {k}: {v.shape}, {v.dtype}, {v.placements if isinstance(v, DTensor) else None}")

1337-1344: Return a dict as documented or update docstring.

Current return_model_config returns the config object, not a dict.

-    def return_model_config(self) -> dict[str, Any]:
+    def return_model_config(self) -> dict[str, Any]:
@@
-        return self.model.config
+        return dict(self.model.config.to_dict()) if hasattr(self.model.config, "to_dict") else dict(self.model.config)
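
A quick, hedged illustration of the to_dict() behaviour relied on above; GPT2Config is used only as a convenient stand-in, and any PretrainedConfig subclass behaves the same way:

from transformers import GPT2Config

cfg = GPT2Config()
cfg_dict = cfg.to_dict()              # plain dict, safe to serialize or return over RPC
print(type(cfg), type(cfg_dict))      # <class '...GPT2Config'> <class 'dict'>
print(cfg_dict["n_layer"])            # 12 by default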
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee8f5aa and 17aa295.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .gitmodules (1 hunks)
  • 3rdparty/Automodel-workspace/Automodel (1 hunks)
  • examples/configs/grpo_math_moonlight_te_deepep.yaml (1 hunks)
  • examples/configs/sft_moonlight_16b_te_deepep.yaml (1 hunks)
  • nemo_rl/distributed/worker_groups.py (2 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (8 hunks)
  • nemo_rl/utils/venvs.py (1 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_rl/distributed/worker_groups.py (1)
nemo_rl/utils/venvs.py (2)
  • create_local_venv_on_each_node (152-189)
  • patch_transformers_module_dir (193-203)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (2)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • create_context_parallel_ctx (444-468)
nemo_rl/models/policy/utils.py (1)
  • resolve_model_class (73-77)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel

[error] 1-1: Step 'Checking submodules are fast-forwarded' failed. Submodule not fast-forwarded: Automodel. Current commit: c84b9c401b7633340846c77773b48a7b9be90d1f; Target commit: 71162c284d315193cbb4011081228da2ba943c27; Common ancestor: ce8b4b55ca9045fa8c7ad29028f3380d88552e76. This indicates parallel development; manual merge may be required.

🪛 Ruff (0.12.2)
nemo_rl/models/policy/dtensor_policy_worker_v2.py

217-217: Avoid specifying long messages outside the exception class

(TRY003)


543-545: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
nemo_rl/distributed/worker_groups.py (1)

30-30: Import change looks good.

Importing patch_transformers_module_dir alongside create_local_venv_on_each_node is appropriate. No issues.

examples/configs/grpo_math_moonlight_te_deepep.yaml (1)

172-207: vLLM settings present while “use_deep_gemm: False”.

If DeepGEMM is desired given new deps, flip this when ready; otherwise leave as False to reduce variables in examples.

nemo_rl/models/policy/dtensor_policy_worker_v2.py (2)

342-357: Parallelizer call: honor PP enablement.

pp_size is read but pp_enabled=False is hardcoded. If PP>1 is configured, either enable here or assert pp_size==1.

[suggest_minor_issue]

-            parallelize_func(
+            parallelize_func(
                 self.model,
                 world_mesh=self.distributed.device_mesh,
                 moe_mesh=self.distributed.moe_mesh,
-                pp_enabled=False,
+                pp_enabled=(pp_size > 1),

67-68: Guard Gemma3 import with fallback sentinel (avoid import-time crash)

Gemma3 modeling was added in the special release tag v4.49.0-Gemma-3; wrap the import and provide a placeholder to avoid import failures on older/missing Transformers.

File: nemo_rl/models/policy/dtensor_policy_worker_v2.py
Lines: 67-68

-from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM
+try:
+    from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM  # type: ignore[attr-defined]
+except Exception:
+    class Gemma3ForCausalLM:  # fallback sentinel
+        pass

Likely an incorrect or invalid review comment.

pyproject.toml (1)

67-71: Pin git deps for reproducibility and CI stability; verify build compatibility in CUDA 12.8

  • Use immutable commit SHAs for all git URLs (grouped_gemm currently uses a tag).
  • These git deps are listed under no-build-isolation; confirm their build systems tolerate that.

Automation attempt failed: 'uv' CLI missing in the runner — cannot verify. Run the following in a fresh CUDA 12.8 image and paste the output:

#!/bin/bash
set -euo pipefail
python -V || true
pip --version || true
python -m venv .venv
# shellcheck disable=SC1091
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install ".[automodel]" --no-cache-dir
python - <<'PY'
import importlib
for m in ("deep_gemm","deep_ep","grouped_gemm"):
    print(f"{m}: {importlib.util.find_spec(m) is not None}")
PY

Comment on lines +14 to +15
branch = hemil/dsv3-updates
shallow = false

⚠️ Potential issue

🧩 Analysis chain

Verify the branch exists and assess shallow clone impact.

The branch changed from nemo-rl-submodule to hemil/dsv3-updates, and disabling shallow clones will increase repository size and clone time. Ensure the new branch exists and contains the required DSv3 updates.


🏁 Script executed:

#!/bin/bash
# Description: Verify the Automodel branch exists and assess the impact of disabling shallow clones

echo "=== Checking if the new branch exists ==="
git ls-remote --heads https://github.com/NVIDIA-NeMo/Automodel.git hemil/dsv3-updates

echo -e "\n=== Checking if the old branch exists for comparison ==="
git ls-remote --heads https://github.com/NVIDIA-NeMo/Automodel.git nemo-rl-submodule

echo -e "\n=== Getting commit information for the updated submodule pointer ==="
git ls-remote https://github.com/NVIDIA-NeMo/Automodel.git | grep c84b9c401b7633340846c77773b48a7b9be90d1f || echo "Commit not found in remote refs"

echo -e "\n=== Checking repository size to understand shallow clone impact ==="
curl -s https://api.github.com/repos/NVIDIA-NeMo/Automodel | jq '.size'

Length of output: 516


Action: ensure the submodule commit is reachable from the configured branch; shallow=false is OK.

  • hemil/dsv3-updates exists → refs/heads/hemil/dsv3-updates (20698c75f7cca86c60f426147dd50579fb00e18c).
  • The submodule commit c84b9c401b7633340846c77773b48a7b9be90d1f is NOT on that branch; it’s present at refs/heads/user/joyang/dsv3_v2 and refs/pull/476/head — update the submodule pointer to a commit on hemil/dsv3-updates or change the branch to the one containing this commit.
  • GitHub repo size: 5315 KB (~5.3 MB) — disabling shallow clones will have negligible impact.

.gitmodules (lines 14–15) still show:
branch = hemil/dsv3-updates
shallow = false

@@ -1 +1 @@
-Subproject commit 71162c284d315193cbb4011081228da2ba943c27
+Subproject commit c84b9c401b7633340846c77773b48a7b9be90d1f

⚠️ Potential issue

🧩 Analysis chain

Submodule not fast‑forward to main’s pointer; fix by rebasing/merging Automodel and updating gitlink.

GitHub Action failed: current gitlink c84b9c4 is not a descendant of main’s 71162c2. Requirement: the new submodule commit must fast‑forward from main’s pointer. Rebase or merge in the Automodel repo, push, then update this gitlink to the resulting descendant commit. If .gitmodules was switched to a feature branch, confirm this is intentional or revert to a stable branch/tag.

Run to verify ancestry and config:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

SUB_PATH="3rdparty/Automodel-workspace/Automodel"

# Ensure we have the target branch pointer
git fetch origin main:origin-main >/dev/null

NEW_SHA=$(git ls-tree -z HEAD "$SUB_PATH" | awk -F'\t' '{print $1}' | awk '{print $3}')
BASE_SHA=$(git ls-tree -z origin-main "$SUB_PATH" | awk -F'\t' '{print $1}' | awk '{print $3}')

URL=$(git config -f .gitmodules "submodule.$SUB_PATH.url" || true)
BRANCH=$(git config -f .gitmodules "submodule.$SUB_PATH.branch" || true)

echo "Submodule: $SUB_PATH"
echo "URL: ${URL:-<unset>}"
echo "Branch: ${BRANCH:-<unset>}"
echo "HEAD gitlink: $NEW_SHA"
echo "main gitlink: $BASE_SHA"

# Fetch the submodule’s objects to check ancestry
tmpdir="$(mktemp -d)"
git clone --no-checkout --filter=blob:none "${URL:?Missing .gitmodules url}" "$tmpdir/auto" >/dev/null
git -C "$tmpdir/auto" fetch --no-tags --depth=1 origin "$NEW_SHA" >/dev/null || git -C "$tmpdir/auto" fetch origin "$NEW_SHA"
git -C "$tmpdir/auto" fetch --no-tags --depth=1 origin "$BASE_SHA" >/dev/null || git -C "$tmpdir/auto" fetch origin "$BASE_SHA"

if git -C "$tmpdir/auto" merge-base --is-ancestor "$BASE_SHA" "$NEW_SHA"; then
  echo "OK: New submodule commit fast-forwards from main."
  exit 0
else
  echo "FAIL: Not fast-forward. Rebase/merge in Automodel so NEW descends from BASE, then update gitlink."
  exit 1
fi

Length of output: 769


Submodule not fast‑forward to main — rebase/merge Automodel and update gitlink

HEAD gitlink c84b9c401b7633340846c77773b48a7b9be90d1f is NOT a descendant of main’s 71162c284d315193cbb4011081228da2ba943c27 (verification: FAIL).

  • Action: In the Automodel repo, rebase or merge hemil/dsv3-updates onto main (or otherwise create a commit that descends from 71162c2), push it, then update the submodule gitlink at 3rdparty/Automodel-workspace/Automodel to that descendant commit.
  • If .gitmodules’ branch=hemil/dsv3-updates was unintentional, revert to a stable branch/tag (e.g., main) and set the gitlink accordingly.

Submodule: 3rdparty/Automodel-workspace/Automodel
URL: https://github.com/NVIDIA-NeMo/Automodel.git
Branch: hemil/dsv3-updates
HEAD gitlink: c84b9c401b7633340846c77773b48a7b9be90d1f
main gitlink: 71162c284d315193cbb4011081228da2ba943c27

🧰 Tools
🪛 GitHub Actions: Automodel Integration and Submodule Checks

[error] 1-1: Step 'Checking submodules are fast-forwarded' failed. Submodule not fast-forwarded: Automodel. Current commit: c84b9c401b7633340846c77773b48a7b9be90d1f; Target commit: 71162c284d315193cbb4011081228da2ba943c27; Common ancestor: ce8b4b55ca9045fa8c7ad29028f3380d88552e76. This indicates parallel development; manual merge may be required.

🤖 Prompt for AI Agents
In 3rdparty/Automodel-workspace/Automodel around lines 1 to 1: the submodule
gitlink (c84b9c4) is not a descendant of Automodel main (71162c2), so rebase or
merge the hemil/dsv3-updates branch in the Automodel repository onto main (or
create a new commit that descends from 71162c2), push that descendant commit to
the remote, then update the submodule gitlink at
3rdparty/Automodel-workspace/Automodel to point to the new descendant commit;
alternatively, if branch=hemil/dsv3-updates in .gitmodules was unintended,
change the submodule to a stable branch/tag (e.g., main), push changes, and
update the gitlink accordingly.

Comment on lines +49 to +60
  dtensor_cfg:
    _v2: true
    enabled: true
    cpu_offload: False
    sequence_parallel: false
    activation_checkpointing: false
    tensor_parallel_size: 1
    context_parallel_size: 1
    pipeline_parallel_size: 1
    expert_parallel_size: 8
    custom_parallel_plan: null


⚠️ Potential issue

EP=8 with a 1‑GPU cluster will not launch.

dtensor_cfg.expert_parallel_size: 8 with cluster.gpus_per_node: 1, num_nodes: 1 makes world_size (1) incompatible with dp*tp*cp*pp*ep. Set ep=1 for a single‑GPU example, or bump cluster resources accordingly.

-    expert_parallel_size: 8
+    expert_parallel_size: 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
   dtensor_cfg:
     _v2: true
     enabled: true
     cpu_offload: False
     sequence_parallel: false
     activation_checkpointing: false
     tensor_parallel_size: 1
     context_parallel_size: 1
     pipeline_parallel_size: 1
-    expert_parallel_size: 8
+    expert_parallel_size: 1
     custom_parallel_plan: null
🤖 Prompt for AI Agents
In examples/configs/grpo_math_moonlight_te_deepep.yaml around lines 49 to 60,
dtensor_cfg.expert_parallel_size is set to 8 while the example cluster has
cluster.gpus_per_node: 1 and num_nodes: 1, making the computed world_size
incompatible with dp*tp*cp*pp*ep; change expert_parallel_size to 1 for this
single‑GPU example (or alternatively update the example cluster resources by
increasing gpus_per_node and/or num_nodes so that the total world_size is
divisible by the product of data_parallel * tensor_parallel * context_parallel *
pipeline_parallel * expert_parallel).

Comment on lines +191 to +203
# Need to set PYTHONPATH to include transformers downloaded modules.
# Assuming the cache directory is the same cross venvs.
def patch_transformers_module_dir(env_vars: dict[str, str]):
    from transformers.utils.hub import TRANSFORMERS_CACHE

    module_dir = os.path.join(TRANSFORMERS_CACHE, ".." , "modules")
    assert os.path.exists(module_dir), f"TRANSFORMERS_CACHE is not set or does not exist: {module_dir}"
    if "PYTHONPATH" not in env_vars:
        env_vars["PYTHONPATH"] = module_dir
    else:
        env_vars["PYTHONPATH"] = f"{module_dir}:{env_vars['PYTHONPATH']}"

    return env_vars

🛠️ Refactor suggestion

Make PYTHONPATH patch robust and cross‑platform; avoid brittle TRANSFORMERS_CACHE path math.

  • Use the official HF modules cache (HF_MODULES_CACHE) when available.
  • Fall back gracefully (warn/no‑op) instead of asserting and crashing.
  • Use os.pathsep (not hardcoded “:”) and dedupe entries.
  • Don’t mutate the input dict in place; return a new mapping.

Apply:

-def patch_transformers_module_dir(env_vars: dict[str, str]):
-    from transformers.utils.hub import TRANSFORMERS_CACHE
-
-    module_dir =  os.path.join(TRANSFORMERS_CACHE, ".." , "modules")
-    assert os.path.exists(module_dir), f"TRANSFORMERS_CACHE is not set or does not exist: {module_dir}"
-    if "PYTHONPATH" not in env_vars:
-        env_vars["PYTHONPATH"] = module_dir
-    else:
-        env_vars["PYTHONPATH"] = f"{module_dir}:{env_vars['PYTHONPATH']}"
-
-    return env_vars
+def patch_transformers_module_dir(env_vars: dict[str, str]) -> dict[str, str]:
+    # Prefer the HF modules cache; fall back to TRANSFORMERS_CACHE/../modules.
+    modules_dir: str | None = None
+    try:
+        from huggingface_hub.constants import HF_MODULES_CACHE  # type: ignore
+        modules_dir = str(HF_MODULES_CACHE)
+    except Exception:
+        try:
+            from transformers.utils.hub import TRANSFORMERS_CACHE  # type: ignore
+            modules_dir = os.path.normpath(os.path.join(str(TRANSFORMERS_CACHE), "..", "modules"))
+        except Exception:
+            modules_dir = None
+
+    if not modules_dir or not os.path.isdir(modules_dir):
+        logger.warning("HF modules cache not found; skipping PYTHONPATH patch (looked at %s).", modules_dir)
+        return dict(env_vars)
+
+    new_env = dict(env_vars)
+    current = new_env.get("PYTHONPATH", "")
+    parts = [modules_dir] + ([current] if current else [])
+    # Deduplicate while preserving order
+    seen = set()
+    dedup = []
+    for p in parts:
+        if p and p not in seen:
+            dedup.append(p)
+            seen.add(p)
+    new_env["PYTHONPATH"] = os.pathsep.join(dedup)
+    return new_env

Committable suggestion skipped: line range outside the PR's diff.
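
For reference, a self-contained sketch of the portable prepend-and-dedupe behaviour suggested above; prepend_to_pythonpath is a hypothetical helper name (not part of nemo_rl), and the modules path in the example call is made up:

import os

def prepend_to_pythonpath(env_vars: dict[str, str], directory: str) -> dict[str, str]:
    # Return a new mapping instead of mutating the caller's dict.
    new_env = dict(env_vars)
    parts = [directory] + [p for p in new_env.get("PYTHONPATH", "").split(os.pathsep) if p]
    deduped: list[str] = []
    for p in parts:
        if p not in deduped:
            deduped.append(p)
    new_env["PYTHONPATH"] = os.pathsep.join(deduped)   # os.pathsep, not a hardcoded ":"
    return new_env

patched = prepend_to_pythonpath({"PYTHONPATH": "/opt/extra"}, "/cache/huggingface/modules")
print(patched["PYTHONPATH"])   # "/cache/huggingface/modules:/opt/extra" on POSIX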

Signed-off-by: Jonas Yang <[email protected]>

⚠️ File Consistency Check

Check based on commit: 95be6ad (PR #1160 from user/joyang/dsv3_v2)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


❌ Submodule Fast-Forward Check Failed

Check based on commit: 95be6ad (PR #1160 from user/joyang/dsv3_v2)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/71162c284d315193cbb4011081228da2ba943c27/
CURRENT (PR #1160 from user/joyang/dsv3_v2): https://github.com/NVIDIA-NeMo/Automodel/commits/c84b9c401b7633340846c77773b48a7b9be90d1f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
