[recipe] feat: Qwen3-235B-A22B on Ascend NPU #3628
Signed-off-by: xuyujun <[email protected]>
Code Review
This pull request adds a recipe for running Qwen3-235B-A22B on Ascend NPUs. It includes new documentation and several shell scripts for setting up the environment and running the training. My review focuses on the correctness and robustness of these scripts. I've identified several critical issues, including unvalidated user inputs, incorrect process lifecycle management in a distributed setup, and a likely incorrect module path that would prevent the main training script from running. There are also high-severity issues like conflicting environment variable definitions and security concerns with Docker container privileges. Addressing these points is crucial for the recipe to be usable and reliable.
export HCCL_BUFFSIZE=300 # the buffer size of HCCL

if [ "$MASTER_ADDR" = "$CURRENT_IP" ]; then
This comparison is fragile. If the user forgets to change MASTER_ADDR from its placeholder value "IP FOR MASTER NODE", this condition will likely always evaluate to false. This would cause a node intended as a master to incorrectly act as a worker, leading to cluster setup failure. It is critical to add validation at the beginning of the script to ensure that MASTER_ADDR and SOCKET_IFNAME have been set to valid, non-placeholder values.
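The validation requested above could look roughly like the following. This is a minimal sketch, not the PR's code: the helper names `is_ipv4`, `is_set_token`, and `validate_config` are hypothetical, and it assumes `MASTER_ADDR` is meant to be an IPv4 address (since the script compares it with `CURRENT_IP`) and that `SOCKET_IFNAME` is a single interface name without whitespace.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if $1 looks like a dotted-quad IPv4 address.
is_ipv4() {
  local ip=$1 octet
  local -a octets
  [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]] || return 1
  IFS=. read -r -a octets <<< "$ip"
  for octet in "${octets[@]}"; do
    # 10# forces base 10 so that values such as "08" are not read as octal
    (( 10#$octet <= 255 )) || return 1
  done
}

# Hypothetical helper: rejects empty values and values containing whitespace,
# which catches placeholders such as "IP FOR MASTER NODE".
is_set_token() {
  [[ -n $1 && ! $1 =~ [[:space:]] ]]
}

# Assumption: both variables are set near the top of the launch script.
validate_config() {
  if ! is_ipv4 "${MASTER_ADDR:-}"; then
    echo "Error: MASTER_ADDR must be an IPv4 address, got '${MASTER_ADDR:-}'" >&2
    return 1
  fi
  if ! is_set_token "${SOCKET_IFNAME:-}"; then
    echo "Error: SOCKET_IFNAME must name a network interface, got '${SOCKET_IFNAME:-}'" >&2
    return 1
  fi
}
```

Calling `validate_config || exit 1` before the `MASTER_ADDR` comparison would stop the script early instead of silently treating the master as a worker.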
done
fi

exit 127
The script unconditionally exits with code 127, which is a critical flaw. For worker nodes, this causes the script to terminate immediately after connecting to the Ray cluster, effectively removing the worker. For the master node, the exit code 127 is misleading, as it usually indicates 'command not found'. Worker nodes must remain running to be part of the cluster. This line should be removed to allow worker processes to persist.
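One possible way to keep workers attached, sketched under assumptions: the helper `build_ray_args` and the array `RAY_ARGS` are hypothetical names, and port 6379 is assumed as the head port because the PR's actual value is not visible here. The sketch relies on the `--block` option of `ray start`, which keeps the command in the foreground on worker nodes.

```shell
#!/bin/bash
# Hypothetical helper: fills the global array RAY_ARGS with arguments for `ray start`.
# Arguments: master address, this node's IP, optional head port (assumed default 6379).
build_ray_args() {
  local master_addr=$1 current_ip=$2 port=${3:-6379}
  if [ "$master_addr" = "$current_ip" ]; then
    # Head node: start the cluster and return so the script can continue.
    RAY_ARGS=(--head --port="$port")
  else
    # Worker node: --block keeps `ray start` in the foreground, so the script
    # does not fall through to an exit while the worker should stay in the cluster.
    RAY_ARGS=(--address="$master_addr:$port" --block)
  fi
}

# Usage (not executed here):
#   build_ray_args "$MASTER_ADDR" "$CURRENT_IP"
#   ray start "${RAY_ARGS[@]}"
```

Building the arguments in an array keeps each flag as a separate word when it is expanded with `"${RAY_ARGS[@]}"`.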
# Currently, it is necessary to enable `enable_chunked_prefill` in the script.
# However, in vLLM ascend, this configuration is off by default and does not take effect.
python3 -m recipe.r1_ascend.main_ppo \
The module path recipe.r1_ascend.main_ppo appears to be incorrect. This pull request adds files under recipe/qwen3_ascend/, and there is no r1_ascend directory visible. This will likely result in a ModuleNotFoundError at runtime. Please verify and correct the module path. It might need to be verl.trainer.main_ppo or another path corresponding to your project structure.
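A launch script can fail fast on a wrong module path before any cluster resources are used. The sketch below is illustrative only: `module_to_path` and `require_module_file` are hypothetical helpers, and the check assumes the script runs from the repository root and that the module is a plain `.py` file rather than a package with `__main__.py`.

```shell
#!/bin/bash
# Hypothetical helper: converts a dotted module name into a relative file path,
# e.g. recipe.r1_ascend.main_ppo -> recipe/r1_ascend/main_ppo.py
module_to_path() {
  printf '%s.py\n' "${1//.//}"
}

# Hypothetical helper: fails with a readable message if the module file is missing.
# Assumption: the current directory is the repository root.
require_module_file() {
  local path
  path=$(module_to_path "$1")
  if [ ! -f "$path" ]; then
    echo "Error: module '$1' not found (expected $path)" >&2
    return 1
  fi
}

# Usage (not executed here):
#   require_module_file recipe.r1_ascend.main_ppo || exit 1
```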
export HCCL_CONNECT_TIMEOUT=600
export HCCL_EXEC_TIMEOUT=600
export HCCL_IF_BASE_PORT=64247
# See LICENSE in the root of the software repository for the full text of the license.

#!/bin/bash
container_name=$1
The script requires a container name as an argument but doesn't validate its presence. If run without an argument, container_name will be empty, causing the docker run command to fail with an error about the --name flag. You should enforce that this argument is provided.
Suggested change:
- container_name=$1
+ container_name=${1:?"Error: Container name not provided. Usage: $0 <container_name>"}
--device=/dev/hisi_hdc \
--net=host \
--name ${container_name} \
--privileged quay.io/ascend/vllm-ascend:v0.10.1rc1-a3 /bin/bash
Using the --privileged flag grants the container unrestricted root access to the host machine, which poses a significant security risk. It's strongly recommended to avoid this and instead grant only the specific capabilities required by the application (e.g., using --cap-add). If --privileged is absolutely necessary for hardware access, this should be clearly documented with a security warning.
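As a rough illustration of the narrower alternative, the sketch below assembles a `docker run` argument list that passes individual devices instead of `--privileged`. The helper `build_docker_args` and the array `DOCKER_ARGS` are hypothetical. Apart from `/dev/hisi_hdc`, which appears in the PR, the device names (`/dev/davinciN`, `/dev/davinci_manager`, `/dev/devmm_svm`) are assumptions about a typical Ascend host, and whether this workload runs correctly without `--privileged` has not been verified here. Volume mounts and any other flags from the original command are omitted.

```shell
#!/bin/bash
# Hypothetical helper: fills the global array DOCKER_ARGS for a non-privileged run.
# Arguments: container name, image, number of NPUs to expose (default 1).
build_docker_args() {
  local name=$1 image=$2 num_npus=${3:-1} i
  DOCKER_ARGS=(run --net=host --name "$name")
  # Assumption: NPUs are exposed as /dev/davinci0, /dev/davinci1, ...
  for (( i = 0; i < num_npus; i++ )); do
    DOCKER_ARGS+=(--device="/dev/davinci$i")
  done
  # Assumption: management devices commonly needed alongside the NPUs.
  DOCKER_ARGS+=(--device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc)
  DOCKER_ARGS+=("$image" /bin/bash)
}

# Usage (not executed here):
#   build_docker_args "$container_name" quay.io/ascend/vllm-ascend:v0.10.1rc1-a3 8
#   docker "${DOCKER_ARGS[@]}"
```

Quoting `"$name"` inside the array also avoids word splitting if the container name ever contains unexpected characters.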
@johnjunjun7 May I ask what the current status of this PR is? Can the latest version of verl run this script?
What does this PR do?
Checklist Before Starting
PR title format: [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
Add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)