@burgerkingeater burgerkingeater commented Dec 2, 2025

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Add a reusable port allocation method that avoids port conflicts when a port is allocated in a worker and later grabbed by another process. The port is intentionally left occupied so that whoever actually needs it can bind to it with the port-reuse option enabled.
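A minimal sketch of the idea described above, not the PR's actual code: one socket keeps the port occupied with SO_REUSEPORT set, and a later socket that sets the same option can still bind to it, while a plain socket cannot. The helper names (hold_reusable_port, bind_with_reuseport, plain_bind_succeeds) are hypothetical, and the behavior shown assumes Linux, where SO_REUSEPORT is available and both sockets belong to the same user.

```python
import socket


def hold_reusable_port():
    """Bind an ephemeral port with SO_REUSEPORT and keep the socket open."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("", 0))  # port 0 lets the kernel pick a free port
    return sock, sock.getsockname()[1]


def bind_with_reuseport(port):
    """Bind to a port that is already held by another SO_REUSEPORT socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("", port))
    return sock


def plain_bind_succeeds(port):
    """Try to bind without any reuse option; report whether it worked."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(("", port))
            return True
        except OSError:
            return False


# The holder keeps the port occupied, so an unrelated plain bind is rejected.
holder, port = hold_reusable_port()
blocked = not plain_bind_succeeds(port)

# The real consumer opts in to port reuse and can share the held port.
consumer = bind_with_reuseport(port)
```

The holder socket has to stay open for the reservation to last, which is why the sketch returns it instead of closing it.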

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@burgerkingeater burgerkingeater changed the title add reusable port [worker] add reusable port Dec 2, 2025
@burgerkingeater burgerkingeater changed the title [worker] add reusable port [worker] feat: add reusable port Dec 2, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new method _get_reusable_free_port to find a reusable free port, intended to prevent port conflicts. While the goal is valid, the current implementation has a critical resource leak due to an unclosed socket. Additionally, it suffers from a race condition that undermines its purpose of avoiding port conflicts, as the port can be taken by another process after it's been identified but before it's used. My review includes a critical comment to fix the resource leak and a high-severity comment explaining the design flaw related to the race condition.
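The race condition mentioned above can be illustrated with a small sketch, assuming the common "find a free port, close the socket, use the port later" pattern. The function name find_free_port_then_release and the "intruder" socket are hypothetical stand-ins; in practice the intruder would be another process.

```python
import socket


def find_free_port_then_release():
    """Ask the kernel for a free port, then release it immediately."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("", 0))
        return probe.getsockname()[1]


port = find_free_port_then_release()

# Nothing holds the port now, so any other socket may bind to it first.
intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
intruder.bind(("", port))

# The intended owner arrives late and finds the port already in use.
owner = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    owner.bind(("", port))
    owner_bound = True
except OSError:
    owner_bound = False
```

The window between the probe closing and the owner binding is what lets the conflict happen.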

Comment on lines +66 to +71
listen_sock: socket.socket = socket.socket()
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
listen_sock.bind(("", 0))
_, port, *_ = listen_sock.getsockname()
return port
Contributor

critical

The socket listen_sock is not closed after use, which will lead to a resource leak. This can exhaust file descriptors if the function is called frequently. The socket should be created within a with statement to ensure it is properly closed, similar to the existing _get_free_port method.

Suggested change
listen_sock: socket.socket = socket.socket()
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
listen_sock.bind(("", 0))
_, port, *_ = listen_sock.getsockname()
return port
with socket.socket() as listen_sock:
    listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    listen_sock.bind(("", 0))
    return listen_sock.getsockname()[1]
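As a small illustration of what the with statement in the suggestion does (a sketch, not part of the PR): the socket's file descriptor is valid inside the block and is released on exit, after which fileno() returns -1. The variable names are hypothetical.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listen_sock:
    listen_sock.bind(("", 0))
    port = listen_sock.getsockname()[1]
    open_fd = listen_sock.fileno()  # a valid descriptor while the block runs

# Leaving the block closes the socket, so the descriptor is gone.
closed_fd = listen_sock.fileno()
```

Closing the socket also releases the port, so this fixes the descriptor leak at the cost of no longer keeping the port occupied, which the PR description says is intentional.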

Collaborator

Could you comment on Gemini's review?

def _get_reusable_free_port():
    listen_sock: socket.socket = socket.socket()
    listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
Collaborator

This also needs torch to set the socket.SO_REUSEPORT flag when listening on MASTER_PORT.
