Add a task gpu profile to align with the other profiling functions #59994
Signed-off-by: Alan Guo <[email protected]>
Code Review
This pull request adds a new endpoint for GPU profiling of a specific task, which is a great addition to the profiling capabilities. The implementation is consistent with existing profiling functions for CPU and tracebacks.
My review focuses on improving code maintainability by identifying several areas of code duplication that have become more apparent with the addition of this new function. I've suggested refactoring these duplicated blocks into helper methods. I also have a minor suggestion to improve logging consistency.
    reply = await reporter_stub.GpuProfiling(
        reporter_pb2.GpuProfilingRequest(
            pid=pid, num_iterations=num_iterations
        )
Missing null check causes unclear error for nonexistent task
Medium Severity
When get_worker_details_for_running_task doesn't find a matching task, it returns (None, None) rather than raising an exception. The code only catches ValueError (raised when task exists but isn't running), so when the task doesn't exist at all, pid becomes None. This None value is then passed to GpuProfiling, which will fail at the gRPC layer with an unclear error instead of returning a helpful "task not found" message. The PR description states the endpoint should "fail if the task is no longer running" but this case isn't properly handled.
🔬 Verification Test
Why a verification test was not possible: This bug requires a running Ray cluster with the dashboard to test the HTTP endpoint behavior. The issue is in the control flow logic: when get_worker_details_for_running_task returns (None, None) for a nonexistent task, the code path proceeds to call GpuProfiling with pid=None rather than returning an appropriate error response. The bug can be verified by code inspection: lines 204-205 show that the function returns (None, None) when no tasks are found, and lines 467-471 only catch ValueError, allowing None to flow through to line 479.
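The control flow described in this comment can be sketched in isolation. The following is a minimal, hypothetical model and not the code in this PR: the in-memory _TASKS table, the profile_task_gpu handler, the dict-shaped responses, and the assumption that the returned tuple is (pid, worker id) are all illustrative. Only the behavior of get_worker_details_for_running_task (return (None, None) when no task matches, raise ValueError when the task exists but is not running) is taken from the review text.

```python
import asyncio
from typing import Optional, Tuple

# Hypothetical in-memory stand-in for the task state the dashboard would query.
_TASKS = {
    "task-running": {"state": "RUNNING", "pid": 4242, "worker_id": "w1"},
    "task-finished": {"state": "FINISHED", "pid": 4243, "worker_id": "w2"},
}


async def get_worker_details_for_running_task(
    task_id: str,
) -> Tuple[Optional[int], Optional[str]]:
    """Stub that mirrors the behavior described in the review comment.

    The (pid, worker_id) tuple layout is an assumption made for this sketch.
    """
    task = _TASKS.get(task_id)
    if task is None:
        # No matching task: the function returns (None, None) instead of raising.
        return None, None
    if task["state"] != "RUNNING":
        # Task exists but is not running: the function raises ValueError.
        raise ValueError(f"Task {task_id} is not running")
    return task["pid"], task["worker_id"]


async def profile_task_gpu(task_id: str, num_iterations: int = 4) -> dict:
    """Hypothetical handler showing the extra None check the review asks for."""
    try:
        pid, _worker_id = await get_worker_details_for_running_task(task_id)
    except ValueError as e:
        return {"ok": False, "error": str(e)}

    # The missing guard: without it, pid=None would reach the gRPC request.
    if pid is None:
        return {"ok": False, "error": f"Task {task_id} not found"}

    # A real handler would call reporter_stub.GpuProfiling(...) here;
    # this sketch only echoes the arguments it would pass.
    return {"ok": True, "pid": pid, "num_iterations": num_iterations}
```

With this guard in place, a nonexistent task ID produces an explicit "not found" response rather than a failure inside the gRPC call, while the existing ValueError path for tasks that are no longer running is unchanged.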
Description
This function accepts a task ID instead of a worker ID and fails if the task is no longer running. The worker ID variant, by contrast, profiles whatever is currently running on the worker.