
Conversation

Contributor

@yuki-97 yuki-97 commented Nov 11, 2025

Follow-up of #1472. Thanks @nv-mmanohara for adding this!

  1. Add GRPO support for HelpSteer3 on LlamaNemotron 49B.
  2. Add SFT support for tulu3 on LlamaNemotron 49B.
  3. Add CodeJaccard environment.
  4. Refactor env and data processor.
  5. Introduce run_grpo.py; run_grpo_math.py and run_grpo_rm.py will be cleaned up in a subsequent PR ([Refactor] Clear run_grpo_math.py and run_grpo_rm.py #1572).

Test Results

grpo math before and after refactor


nemotron 49B


Known Issues

  1. nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 cannot load from Hugging Face: [BUG] nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 cannot load from Hugging Face #1571
  2. GRPO Nemotron HelpSteer3 recipe has very high logprob error: [BUG] GRPO Nemotron HelpSteer3 recipe has very high logprob error #1570

Design explanation

Purpose of task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name

(Answer to #1506 (review).)
Related documentation is added to [docs/guides/grpo.md](docs/guides/grpo.md).


1.1 Enhanced Understandability

  1. In the original run_grpo_math.py, the environment was hard-coded in the code. This file only supported one math environment, and the task_name of all datasets used was uniformly set to "math".
  2. In this scenario, task_name, task_data_processors, and env were in a strict one-to-one binding. For example, the task_name of openmathinstruct2 was hard-coded as "math", the task_data_processors for the math task was bound to math_hf_data_processor, and the environment was bound to math_env.
  3. Under this setup, one dataset could only be paired with one processor and one environment. We could interpret the task of run_grpo_math.py as "math", and the task of run_grpo_rm.py as "reward model".
  4. Currently, we have abstracted run_grpo.py: the environment is no longer hard-coded but is specified via configuration. This makes the binding between datasets, environments, and processors more flexible. For instance, openmathinstruct2 can use either the math environment or the reward model environment.
  5. In this flexible setup, forcing task_name to "math" for all environments would cause confusion.
  6. Our current design is dataset-centric: the dataset name serves as the task_name, and the task corresponding to the dataset can specify its own environment and processor.
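The dataset-centric binding described in points 4-6 can be sketched as a small registry keyed by dataset name. Everything below (the `TaskBinding` class, the processor, the registry contents) is a hypothetical illustration of the idea, not the actual run_grpo.py code:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: the dataset name serves as the task_name, and each
# task binds its own data processor and environment.
@dataclass
class TaskBinding:
    data_processor: Callable[[dict], dict]
    env_name: str

def math_processor(example: dict) -> dict:
    # Trivial stand-in processor: reshape a raw example into prompt/answer.
    return {"prompt": example["question"], "answer": example["expected_answer"]}

# Instead of forcing task_name = "math" for every dataset, the registry is
# keyed by dataset name, so each dataset can pick its own env/processor.
TASK_REGISTRY: dict[str, TaskBinding] = {
    "openmathinstruct2": TaskBinding(math_processor, env_name="math"),
    "dapo_math": TaskBinding(math_processor, env_name="reward_model"),
}

def resolve(task_name: str) -> TaskBinding:
    return TASK_REGISTRY[task_name]
```

Under this scheme the same dataset could be re-bound to a different environment purely through configuration, without touching the entry script.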

1.2 Compatibility with Future Multi-Dataset and Multi-Environment Support

  1. Consider a multi-dataset scenario where we use two datasets: openmathinstruct2 and dapo_math. Both are math-related datasets.
  2. Suppose we want openmathinstruct2 (see: openmathinstruct2.py#L38) to use the math environment, and dapo_math (see: dapo_math.py#L37) to use the reward model environment. We could theoretically specify the environment for each task in task_to_env (see run_grpo_math.py#L123: `task_to_env["math"] = math_env`).
  3. However, since the task_name for both datasets is hard-coded as "task_name": "math" in the code, this multi-environment configuration cannot be implemented.
  4. In the current design, however, each dataset can specify its own task_name, allowing different datasets to use different environments.
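As a minimal illustration of point 4 (the env values here are stand-in strings, not real environment actors):

```python
# Old scheme: both datasets were hard-coded to task_name "math", so the
# task_to_env mapping could only ever route them to a single environment.
old_task_to_env = {"math": "math_env"}

# New scheme: the dataset name is the task_name, so each dataset can be
# routed to its own environment via configuration.
new_task_to_env = {
    "openmathinstruct2": "math_env",
    "dapo_math": "reward_model_env",
}

def pick_env(task_to_env: dict[str, str], task_name: str) -> str:
    # Look up the environment for a given task/dataset name.
    return task_to_env[task_name]
```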

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 11, 2025
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Nov 11, 2025
@yuki-97 yuki-97 force-pushed the yukih/pr-1472 branch 2 times, most recently from c9335d4 to a872ed6 Compare November 11, 2025 09:27
@yuki-97 yuki-97 removed the CI:L1 Run doctests, unit tests, and functional tests label Nov 11, 2025
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Nov 16, 2025
@RayenTian RayenTian removed the CI:L1 Run doctests, unit tests, and functional tests label Nov 16, 2025
@RayenTian RayenTian force-pushed the yukih/pr-1472 branch 2 times, most recently from b7fedb9 to 9078e33 Compare November 16, 2025 03:37
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Nov 16, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 16, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 17, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 17, 2025
nv-mmanohara and others added 15 commits December 1, 2025 17:17
Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: ruit <[email protected]>
… processors. Added raw_dataset.py and path.py for improved dataset processing. Updated project-includes in pyrefly.toml and modified grpo.md to reflect new task-dataset mapping. Cleaned up unused code and configurations in various YAML files.

Signed-off-by: ruit <[email protected]>
…or handling

- Introduced documentation for the new Code Jaccard Environment, detailing its functionality, usage, and configuration.
- Updated RawDataset class to provide a default processor if none is specified in the data configuration.
- Enhanced test coverage for the helpsteer3 data processor to ensure correct functionality and output.

Signed-off-by: ruit <[email protected]>

Signed-off-by: ruit <[email protected]>
- Updated CLEVRCoGenTDataset, OpenAIFormatDataset, and SquadDataset to inherit from the RawDataset class for improved dataset handling.
- Added necessary imports for RawDataset in the respective files.

Signed-off-by: ruit <[email protected]>
…up for vlm grpo

- Added `env_name` to `vlm_grpo_3B_megatron.yaml` and `vlm_grpo_3B.yaml` for environment specification.
- Modified `setup_data` function in `run_vlm_grpo.py` to use `env_name` for environment configuration, enhancing flexibility in dataset processing.

Signed-off-by: ruit <[email protected]>
Signed-off-by: ruit <[email protected]>
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 2, 2025
@github-actions

github-actions bot commented Dec 2, 2025

⚠️ File Consistency Check

Check based on commit: a435ccf (PR #1506 from yukih/pr-1472)

⚠️ Parallel Plans Synchronization Warning

The file nemo_rl/models/dtensor/parallelize.py was modified in this PR, but neither 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py nor 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py was updated.

Why this matters:
These files contain similar parallel plan implementations that should be kept synchronized to ensure consistency across the codebase.

Action required:

  • Please review if the changes in nemo_rl/models/dtensor/parallelize.py should also be applied to 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py or 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py
  • Update the appropriate related file(s) if necessary to maintain functional consistency
  • Request access to the NVIDIA-NeMo/Automodel repository, create a PR against the nemo-rl-submodule branch, and update the Automodel submodule in the nemo-rl index
  • Add @ffrujeri as a reviewer of this PR if you have any questions about the consistency requirements
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/dtensor/parallelize.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 2, 2025
@github-actions

github-actions bot commented Dec 2, 2025

⚠️ File Consistency Check

Check based on commit: 4f4a092 (PR #1506 from yukih/pr-1472)

(Same Parallel Plans Synchronization Warning and DTensor Policy Worker Synchronization Check as the previous report, with the same files to check.)

Contributor

@terrykong terrykong left a comment


thanks @RayenTian . left comments, but i didn't fully finish reviewing. i need a little more time to give feedback on the task spec/dataset change. One high level feedback is it does seem a little complicated at first glance since we have task_names now plumbed throughout


and we allow some flexibility that i'm not sure we want to allow

task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name

Contributor


did you intend to commit this?

Contributor


This is to enhance understandability and stay compatible with future multi-dataset and multi-environment support. More details are in #1506 (comment).



def test_nightly_compute_stays_below_1100_hours(nightly_test_suite, tracker):
def test_nightly_compute_stays_below_1300_hours(nightly_test_suite, tracker):
Contributor


this is a pretty big jump: ~20% more compute to do every night. is it possible to get the same signal with fewer steps if we need to test nightly? alternatively we could test only on release, but first would like to see if we can: (a) shorten the test (b) reduce the model size (c) scale down the experiment

uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS

uv run tests/check_metrics.py $JSON_METRICS \
'max(data["train/token_mult_prob_error"]) < 1.05'
Contributor


could we switch to gen_kl_error?

from typing import Any


def import_class_from_path(name: str) -> Any:
Contributor


isn't there a "get_class" offered by hydra? could we use that instead?

return chunks


def get_env(env_name: str, env_configs: dict) -> EnvironmentInterface:
Contributor


it's probably better to rename to create_env since it conveys that you're creating remotes

return env


def register_env(env_name: str, actor_class_fqn: str) -> None:
Contributor


i didn't see this used anywhere. is that intentional?

max_seq_length // len(message_log), len(chat_message["token_ids"])
)
]
loss_multiplier = 0.1 # Reduce loss for truncated sequences
Contributor


this deviates from what we usually do where we just set to 0. what's the reason to set to 0.1 for this processor?


# load dataset
data: Any = load_response_dataset(data_config, seed)
task_spec = data.task_spec
task_name = data.task_name if hasattr(data, "task_name") else task_spec.task_name
Contributor


is there a reason to have this fallback as opposed to just asserting where to find the task_name in all cases?
