[megatron] feat: add script for qwen3_235b_grpo training on npu platform#5242
[megatron] feat: add script for qwen3_235b_grpo training on npu platform#5242wangshuyang31 wants to merge 2 commits intoverl-project:mainfrom
Conversation
|
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new training script for qwen3_235b_grpo on an NPU platform. The script configures various environment variables and hyperparameters for a verl.trainer.main_ppo training run. My review identified two critical issues in the shell script that would prevent it from executing correctly. One is a typo in a Hydra override argument, and the other is an incorrect placement of shell redirection, which would lead to some arguments being ignored. I have provided suggestions to resolve these issues.
| trainer.device=npu 2>&1 | tee "logs/$(date +%Y%m%d_%H%M).log" \ | ||
| +actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config.cudagraph_capture_sizes="[8, 16, 32, 64, 128]" \ | ||
| +actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config.cudagraph_mode="FULL_DECODE_ONLY" \ |
There was a problem hiding this comment.
The shell redirection 2>&1 | tee ... is misplaced in the middle of the python3 command's arguments. This will cause all subsequent arguments (lines 154-155) to be piped to tee instead of being passed to the Python script. Additionally, the trailing backslash \ on the last line is a syntax error. The redirection should be moved to the end of the entire command to ensure all arguments are correctly passed to the script.
| trainer.device=npu 2>&1 | tee "logs/$(date +%Y%m%d_%H%M).log" \ | |
| +actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config.cudagraph_capture_sizes="[8, 16, 32, 64, 128]" \ | |
| +actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config.cudagraph_mode="FULL_DECODE_ONLY" \ | |
| trainer.device=npu \ | |
| +actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config.cudagraph_capture_sizes="[8, 16, 32, 64, 128]" \ | |
| +actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config.cudagraph_mode="FULL_DECODE_ONLY" 2>&1 | tee "logs/$(date +%Y%m%d_%H%M).log" |
| trainer.save_freq=100 \ | ||
| trainer.total_epochs=1 \ | ||
| trainer.default_local_dir="${CKPTS_DIR}" \ | ||
| trainer.device=npu 2>&1 | tee "logs/$(date +%Y%m%d_%H%M).log" \ |
There was a problem hiding this comment.
trainer.device=npu 2>&1 | tee "logs/$(date +%Y%m%d_%H%M).log" \这一行 2>&1 | tee "logs/$(date +%Y%m%d_%H%M).log应该放在结尾
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.