[ci] fix: occasional CI failures caused by sglang server port conflicts#5310
[ci] fix: occasional CI failures caused by sglang server port conflicts#5310pengwu22 wants to merge 8 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request aims to fix CI failures from sglang server port conflicts by attempting to bind to a random port from a predefined range before falling back to letting the OS pick a port. This is a good approach to reduce port collision in concurrent environments. The implementation in async_sglang_server.py correctly uses a unique seed for each process. However, I've found a critical issue in net_utils.py where the use of SO_REUSEPORT could lead to multiple processes sharing a port, which undermines the goal of finding a unique free port.
verl/utils/net_utils.py
Outdated
| port = sock.getsockname()[1] | ||
| return port, sock | ||
| # Fallback: let OS choose | ||
| with socket.socket(family=family, type=socket.SOCK_STREAM) as sock: |
There was a problem hiding this comment.
The open sock will be closed when return, then other process may choose the same port.
There was a problem hiding this comment.
Yes, and originally the reuse port flag was turned on for that. Now different async server proc will random choose from completely different port sequences to avoid collision. But still, Looks like still one ci remaining https://github.com/verl-project/verl/actions/runs/21977053742/job/63490946387?pr=5310
There was a problem hiding this comment.
This only ensures multiple processes random choose port using this get_free_port, while there're other places which don't use this function, they may choose port conflict with port choose by this function.
For example, ray worker process random choose port for grpc. So we still need to bind the free port.
There was a problem hiding this comment.
sounds good. changed.
| def _start_zmq_server(self): | ||
| self.ip = ray.util.get_node_ip_address().strip("[]") | ||
| self.listen_port, self.listen_sock = get_free_port(self.ip) | ||
| self.listen_port, _ = get_free_port(self.ip) |
There was a problem hiding this comment.
we should keep sock alive here
|
|
||
| # Only set dist_init_addr for multi-node; for single-node, let SGLang | ||
| # handle port selection internally via nccl_port to avoid conflicts. | ||
| if self.nnodes > 1: |
There was a problem hiding this comment.
Can we unify single/multi node, always choose free port for it?
| for i in range(max_retries): | ||
| try: | ||
| server_port, sock = get_free_port(server_address) | ||
| server_port, _ = get_free_port(server_address) |
There was a problem hiding this comment.
we should keep sock alive here
What does this PR do?
Fix CI failures caused by occasional port conflicts
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.