Skip to content

Conversation

@johnjunjun7
Copy link

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a recipe for running Qwen3-235B-A22B on Ascend NPUs. It includes new documentation and several shell scripts for setting up the environment and running the training. My review focuses on the correctness and robustness of these scripts. I've identified several critical issues, including unvalidated user inputs, incorrect process lifecycle management in a distributed setup, and a likely incorrect module path that would prevent the main training script from running. There are also high-severity issues like conflicting environment variable definitions and security concerns with Docker container privileges. Addressing these points is crucial for the recipe to be usable and reliable.

export HCCL_BUFFSIZE=300 # the buffer size of HCCL


if [ "$MASTER_ADDR" = "$CURRENT_IP" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This comparison is fragile. If the user forgets to change MASTER_ADDR from its placeholder value "IP FOR MASTER NODE", this condition will likely always evaluate to false. This would cause a node intended as a master to incorrectly act as a worker, leading to cluster setup failure. It is critical to add validation at the beginning of the script to ensure that MASTER_ADDR and SOCKET_IFNAME have been set to valid, non-placeholder values.

done
fi

exit 127 No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The script unconditionally exits with code 127, which is a critical flaw. For worker nodes, this causes the script to terminate immediately after connecting to the Ray cluster, effectively removing the worker. For the master node, the exit code 127 is misleading, as it usually indicates 'command not found'. Worker nodes must remain running to be part of the cluster. This line should be removed to allow worker processes to persist.


# Currently, it is necessary to enable `enable_chunked_prefill` in the script.
# However, in vLLM ascend, this configuration is off by default and does not take effect.
python3 -m recipe.r1_ascend.main_ppo \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The module path recipe.r1_ascend.main_ppo appears to be incorrect. This pull request adds files under recipe/qwen3_ascend/, and there is no r1_ascend directory visible. This will likely result in a ModuleNotFoundError at runtime. Please verify and correct the module path. It might need to be verl.trainer.main_ppo or another path corresponding to your project structure.


export HCCL_CONNECT_TIMEOUT=600
export HCCL_EXEC_TIMEOUT=600
export HCCL_IF_BASE_PORT=64247
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The environment variable HCCL_IF_BASE_PORT is defined here again, overwriting the value set on line 10. This is likely an error and can lead to unpredictable behavior in your network configuration. You should remove one of the two declarations to ensure consistency.

# See LICENSE in the root of the software repository for the full text of the license.

#!/bin/bash
container_name=$1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The script requires a container name as an argument but doesn't validate its presence. If run without an argument, container_name will be empty, causing the docker run command to fail with an error about the --name flag. You should enforce that this argument is provided.

Suggested change
container_name=$1
container_name=${1:?"Error: Container name not provided. Usage: $0 <container_name>"}

--device=/dev/hisi_hdc \
--net=host \
--name ${container_name} \
--privileged quay.io/ascend/vllm-ascend:v0.10.1rc1-a3 /bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using the --privileged flag grants the container unrestricted root access to the host machine, which poses a significant security risk. It's strongly recommended to avoid this and instead grant only the specific capabilities required by the application (e.g., using --cap-add). If --privileged is absolutely necessary for hardware access, this should be clearly documented with a security warning.

@glowwormX
Copy link

@johnjunjun7 May I ask what is the current progress of this PR? Can the latest version of the Verl run with this script?
In addition, I have two more questions:

  1. Why is gen_tp set to 1?
  2. Can it run on 376T? If so, how should the parallelism be set up?
    Looking forward to your answer. Thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants