@alanwguo
Contributor

@alanwguo alanwguo commented Jan 9, 2026

Description

This function accepts a task ID instead of a worker ID and fails if the task is no longer running. The worker-ID variant, by contrast, profiles whatever is currently running on the worker.

Related issues

Additional information

@alanwguo alanwguo requested a review from a team as a code owner January 9, 2026 05:26
Signed-off-by: Alan Guo <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new endpoint for GPU profiling of a specific task, which is a great addition to the profiling capabilities. The implementation is consistent with existing profiling functions for CPU and tracebacks.

My review focuses on improving code maintainability by identifying several areas of code duplication that have become more apparent with the addition of this new function. I've suggested refactoring these duplicated blocks into helper methods. I also have a minor suggestion to improve logging consistency.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Jan 9, 2026
Signed-off-by: Alan Guo <[email protected]>
reply = await reporter_stub.GpuProfiling(
    reporter_pb2.GpuProfilingRequest(
        pid=pid, num_iterations=num_iterations
    )
Missing null check causes unclear error for nonexistent task

Medium Severity

When get_worker_details_for_running_task doesn't find a matching task, it returns (None, None) rather than raising an exception. The code only catches ValueError (raised when task exists but isn't running), so when the task doesn't exist at all, pid becomes None. This None value is then passed to GpuProfiling, which will fail at the gRPC layer with an unclear error instead of returning a helpful "task not found" message. The PR description states the endpoint should "fail if the task is no longer running" but this case isn't properly handled.

🔬 Verification Test

Why a verification test was not possible: This bug requires a running Ray cluster with the dashboard to exercise the HTTP endpoint. The issue is in the control flow: when get_worker_details_for_running_task returns (None, None) for a nonexistent task, the code proceeds to call GpuProfiling with pid=None rather than returning an appropriate error response. The bug can be verified by code inspection: lines 204-205 show that the function returns (None, None) when no tasks are found, and lines 467-471 catch only ValueError, allowing None to flow through to line 479.
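
The missing guard described in this comment can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the lookup function below is a hypothetical stand-in that only mirrors the behavior described above (returning (None, None) for an unknown task, raising ValueError for a task that is not running), and the handler name profile_task, the tasks dictionary, and the 400/404 status codes are assumptions chosen for illustration.

```python
import asyncio
from typing import Optional, Tuple


def get_worker_details_for_running_task(
    tasks: dict, task_id: str
) -> Tuple[Optional[int], Optional[str]]:
    """Hypothetical stand-in that mirrors the behavior described in the review."""
    task = tasks.get(task_id)
    if task is None:
        # No matching task at all: signalled by (None, None), not an exception.
        return None, None
    if task["state"] != "RUNNING":
        # Task exists but is no longer running.
        raise ValueError(f"Task {task_id} is not running.")
    return task["pid"], task["ip"]


async def profile_task(tasks: dict, task_id: str) -> Tuple[int, str]:
    """Return (status, message). Status codes are illustrative assumptions."""
    try:
        pid, ip = get_worker_details_for_running_task(tasks, task_id)
    except ValueError as e:
        return 400, str(e)

    # The guard the review asks for: stop before pid=None reaches the RPC.
    if pid is None:
        return 404, f"Task {task_id} not found."

    # The real handler would issue the GpuProfiling RPC here.
    return 200, f"Would profile pid {pid} on {ip}."


if __name__ == "__main__":
    demo_tasks = {"t1": {"state": "RUNNING", "pid": 1234, "ip": "10.0.0.1"}}
    print(asyncio.run(profile_task(demo_tasks, "t1")))
    print(asyncio.run(profile_task(demo_tasks, "does-not-exist")))
```

An alternative with the same effect would be to have the lookup raise for the not-found case as well, so that a single except branch covers both failures. Which option fits better depends on how the other callers of the lookup treat (None, None).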

Signed-off-by: Alan Guo <[email protected]>