[ci] fix: occasional CI failures caused by sglang server port conflicts #5310

Open

pengwu22 wants to merge 8 commits into verl-project:main from pengwu22:pw/fix-ci-0212

Conversation

@pengwu22
Collaborator

What does this PR do?

Fix CI failures caused by occasional port conflicts

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

pengwu22 requested a review from chenhaiq as a code owner February 13, 2026 03:29
pengwu22 requested a review from wuxibin89 February 13, 2026 03:29
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request aims to fix CI failures from sglang server port conflicts by attempting to bind to a random port from a predefined range before falling back to letting the OS pick a port. This is a good approach to reducing port collisions in concurrent environments. The implementation in async_sglang_server.py correctly uses a unique seed for each process. However, I've found a critical issue in net_utils.py where the use of SO_REUSEPORT could lead to multiple processes sharing a port, which undermines the goal of finding a unique free port.
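For reference, here is a minimal sketch of the port-selection scheme described above, not the PR's exact code: the port range, retry count, and seeding are assumptions, and the bound socket is returned so the port stays reserved until the caller hands it off to the server.

```python
# Illustrative sketch only; PORT_RANGE, MAX_TRIES, and the seeding are assumptions.
import os
import random
import socket
import time

PORT_RANGE = (20000, 40000)  # assumed predefined range
MAX_TRIES = 32               # assumed number of random candidates


def get_free_port(host: str = "127.0.0.1") -> tuple[int, socket.socket]:
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    # Per-process seed so concurrent servers walk different port sequences.
    rng = random.Random(hash((os.getpid(), time.time_ns())))
    for _ in range(MAX_TRIES):
        port = rng.randint(*PORT_RANGE)
        sock = socket.socket(family, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return port, sock  # keep the socket open so the port stays reserved
        except OSError:
            sock.close()       # candidate in use, try the next one
    # Fallback: let the OS choose an ephemeral port.
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock.getsockname()[1], sock
```

Note that SO_REUSEPORT is deliberately not set in this sketch: with it, two processes can successfully bind the same candidate port, which is exactly the collision flagged above.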

wuxibin89 previously approved these changes Feb 13, 2026
        port = sock.getsockname()[1]
        return port, sock
    # Fallback: let OS choose
    with socket.socket(family=family, type=socket.SOCK_STREAM) as sock:
Collaborator


The open sock will be closed on return, so another process may then choose the same port.

Collaborator Author


Yes, and originally the reuse-port flag was turned on for that. Now each async server process randomly chooses from a completely different port sequence to avoid collisions. Still, it looks like one CI failure remains: https://github.com/verl-project/verl/actions/runs/21977053742/job/63490946387?pr=5310

Collaborator


This only ensures that multiple processes choose random ports via get_free_port; there are other places that don't use this function, and they may choose ports that conflict with the ones chosen by it.

For example, Ray worker processes choose random ports for gRPC. So we still need to keep the free port bound.
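A hedged sketch of the call-site pattern this implies, assuming the get_free_port helper sketched above; build_server_args and launch_server are hypothetical names. Holding the bound socket until just before the real server binds narrows, but does not fully close, the window in which an unrelated process can grab the port.

```python
server_address = "127.0.0.1"                 # illustrative host
port, sock = get_free_port(server_address)   # sock keeps the port reserved
server_args = build_server_args(port=port)   # hypothetical helper
sock.close()                                 # release just before the server binds
launch_server(server_args)                   # hypothetical launcher; binds `port`
```

Closing the window entirely would require handing the bound socket (or its file descriptor) to the server instead of only the port number.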

Collaborator Author


Sounds good, changed.

    def _start_zmq_server(self):
        self.ip = ray.util.get_node_ip_address().strip("[]")
-       self.listen_port, self.listen_sock = get_free_port(self.ip)
+       self.listen_port, _ = get_free_port(self.ip)
Collaborator


We should keep the sock alive here.
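A short sketch of the change being asked for, based on the diff lines shown above: keep the returned socket on self instead of discarding it, and close it only when the ZMQ socket is about to bind the port.

```python
def _start_zmq_server(self):
    self.ip = ray.util.get_node_ip_address().strip("[]")
    # Keep the bound socket alive on self so the port stays reserved;
    # close self.listen_sock right before the ZMQ socket binds listen_port.
    self.listen_port, self.listen_sock = get_free_port(self.ip)
```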


# Only set dist_init_addr for multi-node; for single-node, let SGLang
# handle port selection internally via nccl_port to avoid conflicts.
if self.nnodes > 1:
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify the single-node and multi-node paths, and always choose a free port for it?
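One way the unification could look, sketched under the assumption that an explicit dist_init_addr is also acceptable for single-node runs (variable names are illustrative, and get_free_port is the helper discussed above):

```python
# Always reserve a free port and pass an explicit dist_init_addr, instead of
# branching on nnodes and relying on SGLang's internal nccl_port selection
# for the single-node case.
node_ip = ray.util.get_node_ip_address().strip("[]")
dist_port, dist_sock = get_free_port(node_ip)
dist_init_addr = f"{node_ip}:{dist_port}"
# dist_sock is closed just before the engine launches so it can bind dist_port.
```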

    for i in range(max_retries):
        try:
-           server_port, sock = get_free_port(server_address)
+           server_port, _ = get_free_port(server_address)
Collaborator


We should keep the sock alive here.
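Extending the same keep-alive idea to the retry loop shown in the diff, as a sketch: the socket returned by get_free_port stays bound until just before the launch attempt, and a failure simply moves on to the next candidate. make_server_args and launch_server are hypothetical names.

```python
for attempt in range(max_retries):
    server_port, sock = get_free_port(server_address)  # port stays bound here
    try:
        args = make_server_args(port=server_port)      # hypothetical helper
        sock.close()                                   # free it just before launch
        launch_server(args)                            # hypothetical; binds server_port
        break
    except OSError:
        sock.close()                                   # lost the race, retry
else:
    raise RuntimeError(f"no free port after {max_retries} attempts")
```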
