[recipe] feat: Qwen3-235B-A22B on Ascend NPU #3628

johnjunjun7 · 2025-09-25T15:37:04Z

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

Signed-off-by: xuyujun <[email protected]>

gemini-code-assist

Code Review

This pull request adds a recipe for running Qwen3-235B-A22B on Ascend NPUs. It includes new documentation and several shell scripts for setting up the environment and running the training. My review focuses on the correctness and robustness of these scripts. I've identified several critical issues, including unvalidated user inputs, incorrect process lifecycle management in a distributed setup, and a likely incorrect module path that would prevent the main training script from running. There are also high-severity issues like conflicting environment variable definitions and security concerns with Docker container privileges. Addressing these points is crucial for the recipe to be usable and reliable.

gemini-code-assist · 2025-09-25T15:38:58Z

recipe/qwen3_ascend/ray_start_grpo_npu.sh

+export HCCL_BUFFSIZE=300                        # the buffer size of HCCL
+
+
+if [ "$MASTER_ADDR" = "$CURRENT_IP" ]; then


This comparison is fragile. If the user forgets to change MASTER_ADDR from its placeholder value "IP FOR MASTER NODE", this condition will likely always evaluate to false. This would cause a node intended as a master to incorrectly act as a worker, leading to cluster setup failure. It is critical to add validation at the beginning of the script to ensure that MASTER_ADDR and SOCKET_IFNAME have been set to valid, non-placeholder values.

gemini-code-assist · 2025-09-25T15:38:59Z

recipe/qwen3_ascend/ray_start_grpo_npu.sh

+  done
+fi
+
+exit 127


The script unconditionally exits with code 127, which is a critical flaw. For worker nodes, this causes the script to terminate immediately after connecting to the Ray cluster, effectively removing the worker. For the master node, the exit code 127 is misleading, as it usually indicates 'command not found'. Worker nodes must remain running to be part of the cluster. This line should be removed to allow worker processes to persist.

gemini-code-assist · 2025-09-25T15:38:59Z

recipe/qwen3_ascend/run_grpo_qwen3_235b_npu_megatron.sh

+
+# Currently, it is necessary to enable `enable_chunked_prefill` in the script. 
+# However, in vLLM ascend, this configuration is off by default and does not take effect.
+python3 -m recipe.r1_ascend.main_ppo \


The module path recipe.r1_ascend.main_ppo appears to be incorrect. This pull request adds files under recipe/qwen3_ascend/, and there is no r1_ascend directory visible. This will likely result in a ModuleNotFoundError at runtime. Please verify and correct the module path. It might need to be verl.trainer.main_ppo or another path corresponding to your project structure.

gemini-code-assist · 2025-09-25T15:38:59Z

recipe/qwen3_ascend/ray_start_grpo_npu.sh

+
+export HCCL_CONNECT_TIMEOUT=600
+export HCCL_EXEC_TIMEOUT=600
+export HCCL_IF_BASE_PORT=64247


The environment variable HCCL_IF_BASE_PORT is defined here again, overwriting the value set on line 10. This is likely an error and can lead to unpredictable behavior in your network configuration. You should remove one of the two declarations to ensure consistency.

gemini-code-assist · 2025-09-25T15:38:59Z

recipe/qwen3_ascend/run_container.sh

+# See LICENSE in the root of the software repository for the full text of the license.
+
+#!/bin/bash
+container_name=$1


The script requires a container name as an argument but doesn't validate its presence. If run without an argument, container_name will be empty, causing the docker run command to fail with an error about the --name flag. You should enforce that this argument is provided.

Suggested change

container_name=$1

container_name=${1:?"Error: Container name not provided. Usage: $0 <container_name>"}

gemini-code-assist · 2025-09-25T15:38:59Z

recipe/qwen3_ascend/run_container.sh

+--device=/dev/hisi_hdc \
+--net=host \
+--name ${container_name} \
+--privileged quay.io/ascend/vllm-ascend:v0.10.1rc1-a3 /bin/bash


Using the --privileged flag grants the container unrestricted root access to the host machine, which poses a significant security risk. It's strongly recommended to avoid this and instead grant only the specific capabilities required by the application (e.g., using --cap-add). If --privileged is absolutely necessary for hardware access, this should be clearly documented with a security warning.

glowwormX · 2025-11-05T07:16:22Z

@johnjunjun7 May I ask what is the current progress of this PR? Can the latest version of the Verl run with this script?
In addition, I have two more questions:

Why is gen_tp set to 1?
Can it run on 376T? If so, how should the parallelism be set up?
Looking forward to your answer. Thank you very much.

[recipe] feat: Qwen3-235B-A22B on Ascend NPU

4dcbf0a

Signed-off-by: xuyujun <[email protected]>

gemini-code-assist bot reviewed Sep 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[recipe] feat: Qwen3-235B-A22B on Ascend NPU #3628

[recipe] feat: Qwen3-235B-A22B on Ascend NPU #3628

johnjunjun7 commented Sep 25, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Sep 25, 2025

Uh oh!

gemini-code-assist bot Sep 25, 2025

Uh oh!

gemini-code-assist bot Sep 25, 2025

Uh oh!

gemini-code-assist bot Sep 25, 2025

Uh oh!

gemini-code-assist bot Sep 25, 2025

Uh oh!

gemini-code-assist bot Sep 25, 2025

Uh oh!

glowwormX commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		export HCCL_BUFFSIZE=300 # the buffer size of HCCL


		if [ "$MASTER_ADDR" = "$CURRENT_IP" ]; then

	container_name=$1
	container_name=${1:?"Error: Container name not provided. Usage: $0 <container_name>"}

[recipe] feat: Qwen3-235B-A22B on Ascend NPU #3628

Are you sure you want to change the base?

[recipe] feat: Qwen3-235B-A22B on Ascend NPU #3628

Conversation

johnjunjun7 commented Sep 25, 2025

What does this PR do?

Checklist Before Starting

Test

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

glowwormX commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants