
Conversation

Contributor

@yuki-97 yuki-97 commented Nov 11, 2025

Follow-up of #1472. Thanks @nv-mmanohara for adding this!

  1. Add GRPO support for HelpSteer3 on LlamaNemotron 49B.
  2. Add SFT support for tulu3 on LlamaNemotron 49B.
  3. Add CodeJaccard environment.
  4. Refactor env and data processor.
  5. Introduce run_grpo.py; run_grpo_math.py and run_grpo_rm.py will be cleaned up in a subsequent PR ([Refactor] Clear run_grpo_math.py and run_grpo_rm.py #1572).

Test Results

grpo math before and after refactor


nemotron 49B


Known Issues

  1. nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 cannot load from Hugging Face: [BUG] nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 cannot load from Hugging Face #1571
  2. GRPO Nemotron HelpSteer3 recipe has very high logprob error: [BUG] GRPO Nemotron HelpSteer3 recipe has very high logprob error #1570

Design explanation

Purpose of task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name

(Answer to #1506 (review).)
Related documentation is added to [docs/guides/grpo.md](docs/guides/grpo.md).


1.1 Enhanced Understandability

  1. In the original run_grpo_math.py, the environment was hard-coded in the code. This file only supported one math environment, and the task_name of all datasets used was uniformly set to "math".
  2. In this scenario, task_name, task_data_processors, and env were in a strict one-to-one binding. For example, the task_name of openmathinstruct2 was hard-coded as "math", the task_data_processors for the math task was bound to math_hf_data_processor, and the environment was bound to math_env.
  3. Under this setup, one dataset could only be paired with one processor and one environment. We could interpret the task of run_grpo_math.py as "math", and the task of run_grpo_rm.py as "reward model".
  4. Currently, we have abstracted run_grpo.py: the environment is no longer hard-coded but is specified via configuration. This makes the binding between datasets, environments, and processors more flexible. For instance, openmathinstruct2 can use either the math environment or the reward model environment.
  5. In this flexible setup, forcing task_name to "math" for all environments would cause confusion.
  6. Our current design is dataset-centric: the dataset name serves as the task_name, and the task corresponding to the dataset can specify its own environment and processor.
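The dataset-centric binding described in points 4-6 can be sketched as a small registry keyed by dataset name. Everything below (the `TaskBinding` class, the processor, the registry contents) is a hypothetical illustration of the idea, not the actual run_grpo.py code:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: the dataset name serves as the task_name, and each
# task binds its own data processor and environment.
@dataclass
class TaskBinding:
    data_processor: Callable[[dict], dict]
    env_name: str

def math_processor(example: dict) -> dict:
    # Trivial stand-in processor: reshape a raw example into prompt/answer.
    return {"prompt": example["question"], "answer": example["expected_answer"]}

# Instead of forcing task_name = "math" for every dataset, the registry is
# keyed by dataset name, so each dataset can pick its own env/processor.
TASK_REGISTRY: dict[str, TaskBinding] = {
    "openmathinstruct2": TaskBinding(math_processor, env_name="math"),
    "dapo_math": TaskBinding(math_processor, env_name="reward_model"),
}

def resolve(task_name: str) -> TaskBinding:
    return TASK_REGISTRY[task_name]
```

Under this scheme the same dataset could be re-bound to a different environment purely through configuration, without touching the entry script.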

1.2 Compatibility with Future Multi-Dataset and Multi-Environment Support

  1. Consider a multi-dataset scenario where we use two datasets: openmathinstruct2 and dapo_math. Both are math-related datasets.
  2. Suppose we want openmathinstruct2 (see: openmathinstruct2.py#L38) to use the math environment, and dapo_math (see: dapo_math.py#L37) to use the reward model environment. We could theoretically specify the environment for each task in task_to_env (see run_grpo_math.py#L123: `task_to_env["math"] = math_env`).
  3. However, since the task_name for both datasets is hard-coded as "task_name": "math" in the code, this multi-environment configuration cannot be implemented.
  4. In the current design, however, each dataset can specify its own task_name, allowing different datasets to use different environments.
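As a minimal illustration of point 4 (the env values here are stand-in strings, not real environment actors):

```python
# Old scheme: both datasets were hard-coded to task_name "math", so the
# task_to_env mapping could only ever route them to a single environment.
old_task_to_env = {"math": "math_env"}

# New scheme: the dataset name is the task_name, so each dataset can be
# routed to its own environment via configuration.
new_task_to_env = {
    "openmathinstruct2": "math_env",
    "dapo_math": "reward_model_env",
}

def pick_env(task_to_env: dict[str, str], task_name: str) -> str:
    # Look up the environment for a given task/dataset name.
    return task_to_env[task_name]
```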

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 11, 2025
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Nov 11, 2025
@yuki-97 yuki-97 force-pushed the yukih/pr-1472 branch 2 times, most recently from c9335d4 to a872ed6 Compare November 11, 2025 09:27
@yuki-97 yuki-97 removed the CI:L1 Run doctests, unit tests, and functional tests label Nov 11, 2025
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Nov 16, 2025
@RayenTian RayenTian removed the CI:L1 Run doctests, unit tests, and functional tests label Nov 16, 2025
@RayenTian RayenTian force-pushed the yukih/pr-1472 branch 2 times, most recently from b7fedb9 to 9078e33 Compare November 16, 2025 03:37
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Nov 16, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 16, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 17, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 17, 2025
nv-mmanohara and others added 15 commits December 1, 2025 17:17
Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: ruit <[email protected]>
… processors. Added raw_dataset.py and path.py for improved dataset processing. Updated project-includes in pyrefly.toml and modified grpo.md to reflect new task-dataset mapping. Cleaned up unused code and configurations in various YAML files.

Signed-off-by: ruit <[email protected]>
…or handling

- Introduced documentation for the new Code Jaccard Environment, detailing its functionality, usage, and configuration.
- Updated RawDataset class to provide a default processor if none is specified in the data configuration.
- Enhanced test coverage for the helpsteer3 data processor to ensure correct functionality and output.

Signed-off-by: ruit <[email protected]>

Signed-off-by: ruit <[email protected]>
- Updated CLEVRCoGenTDataset, OpenAIFormatDataset, and SquadDataset to inherit from the RawDataset class for improved dataset handling.
- Added necessary imports for RawDataset in the respective files.

Signed-off-by: ruit <[email protected]>
…up for vlm grpo

- Added `env_name` to `vlm_grpo_3B_megatron.yaml` and `vlm_grpo_3B.yaml` for environment specification.
- Modified `setup_data` function in `run_vlm_grpo.py` to use `env_name` for environment configuration, enhancing flexibility in dataset processing.

Signed-off-by: ruit <[email protected]>
Signed-off-by: ruit <[email protected]>
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 2, 2025
@github-actions

github-actions bot commented Dec 2, 2025

⚠️ File Consistency Check

Check based on commit: a435ccf (PR #1506 from yukih/pr-1472)

⚠️ Parallel Plans Synchronization Warning

The file nemo_rl/models/dtensor/parallelize.py was modified in this PR, but neither 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py nor 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py was updated.

Why this matters:
These files contain similar parallel plan implementations that should be kept synchronized to ensure consistency across the codebase.

Action required:

  • Please review if the changes in nemo_rl/models/dtensor/parallelize.py should also be applied to 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py or 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py
  • Update the appropriate related file(s) if necessary to maintain functional consistency
  • Request access to the NVIDIA-NeMo/Automodel repository, create a PR against the nemo-rl-submodule branch, and update the Automodel submodule in the nemo-rl index
  • Add @ffrujeri as a reviewer of this PR if you have any questions about the consistency requirements
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/dtensor/parallelize.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 2, 2025
@github-actions

github-actions bot commented Dec 2, 2025

⚠️ File Consistency Check

Check based on commit: 4f4a092 (PR #1506 from yukih/pr-1472)

(Same Parallel Plans Synchronization Warning and DTensor Policy Worker Synchronization Check as the previous report, with the same files to check.)

Contributor

@terrykong terrykong left a comment


thanks @RayenTian . left comments, but i didn't fully finish reviewing. i need a little more time to give feedback on the task spec/dataset change. One high level feedback is it does seem a little complicated at first glance since we have task_names now plumbed throughout


and we allow some flexibility that i'm not sure we want to allow

task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name

Contributor


did you intend to commit this?

Contributor


This is to enhance understandability and stay compatible with future multi-dataset and multi-environment support. More details are in #1506 (comment).



def test_nightly_compute_stays_below_1100_hours(nightly_test_suite, tracker):
def test_nightly_compute_stays_below_1300_hours(nightly_test_suite, tracker):
Contributor


this is a pretty big jump: ~20% more compute to do every night. is it possible to get the same signal with fewer steps if we need to test nightly? alternatively we could test only on release, but first would like to see if we can: (a) shorten the test (b) reduce the model size (c) scale down the experiment

uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS

uv run tests/check_metrics.py $JSON_METRICS \
'max(data["train/token_mult_prob_error"]) < 1.05'
Contributor


could we switch to gen_kl_error?

from typing import Any


def import_class_from_path(name: str) -> Any:
Contributor


isn't there a "get_class" offered by hydra? could we use that instead?

return chunks


def get_env(env_name: str, env_configs: dict) -> EnvironmentInterface:
Contributor


it's probably better to rename to create_env since it conveys that you're creating remotes

return env


def register_env(env_name: str, actor_class_fqn: str) -> None:
Contributor


i didn't see this used anywhere. is that intentional?

max_seq_length // len(message_log), len(chat_message["token_ids"])
)
]
loss_multiplier = 0.1 # Reduce loss for truncated sequences
Contributor


this deviates from what we usually do where we just set to 0. what's the reason to set to 0.1 for this processor?


# load dataset
data: Any = load_response_dataset(data_config, seed)
task_spec = data.task_spec
task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name
Contributor


is there a reason to have this fallback as opposed to just asserting where to find the task_name in all cases?
