
Conversation

Contributor

@codope codope commented Dec 18, 2025

Description

When the autoscaler commits to draining a node, immediately update ClusterResourceManager so the scheduler sees the node as unavailable.

Why

There's a race condition where work can be scheduled to nodes that are committed for draining:

  1. Autoscaler calls HandleDrainNode() -> SetNodeDraining() updates GcsNodeManager::draining_nodes_.
  2. Scheduler checks ClusterResourceManager::IsNodeDraining() which reads from NodeResources::is_draining.
  3. But is_draining is only updated when the raylet broadcasts via RaySyncer.

During this window, the scheduler sees is_draining=false and schedules work to the draining node. To fix this, GcsResourceManager now registers a listener with GcsNodeManager that immediately updates ClusterResourceManager when a node is set to draining. This closes the race window where work could be scheduled to draining nodes.
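
As an illustration of that wiring, here is a minimal self-contained sketch; the class and method names mirror the description above, but the simplified types and signatures are stand-ins rather than the actual Ray code:

```cpp
// Sketch only: simplified stand-ins for GcsNodeManager, GcsResourceManager,
// and ClusterResourceManager to show the listener wiring; not the real Ray types.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using NodeID = std::string;  // placeholder for ray::NodeID

class ClusterResourceManager {
 public:
  // Per the description: directly mark a node as draining in the scheduler's view.
  void SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
    draining_[node_id] = deadline_ms;
  }
  bool IsNodeDraining(const NodeID &node_id) const {
    return draining_.count(node_id) > 0;
  }

 private:
  std::unordered_map<NodeID, int64_t> draining_;
};

class GcsNodeManager {
 public:
  using DrainingListener = std::function<void(const NodeID &, int64_t)>;

  // Per the description: register a callback invoked when a node is committed
  // for draining.
  void AddNodeDrainingListener(DrainingListener listener) {
    listeners_.push_back(std::move(listener));
  }

  void SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
    draining_nodes_.insert(node_id);
    // Notify listeners immediately instead of waiting for the raylet to
    // broadcast its drain state via RaySyncer.
    for (const auto &listener : listeners_) {
      listener(node_id, deadline_ms);
    }
  }

 private:
  std::unordered_set<NodeID> draining_nodes_;
  std::vector<DrainingListener> listeners_;
};

int main() {
  GcsNodeManager node_manager;
  ClusterResourceManager cluster_resources;

  // In the real change, GcsResourceManager's constructor registers this listener.
  node_manager.AddNodeDrainingListener(
      [&cluster_resources](const NodeID &node_id, int64_t deadline_ms) {
        cluster_resources.SetNodeDraining(node_id, deadline_ms);
      });

  node_manager.SetNodeDraining("node-1", /*deadline_ms=*/1234);
  // The scheduler's view is already up to date; no RaySyncer round trip needed.
  std::cout << std::boolalpha << cluster_resources.IsNodeDraining("node-1") << "\n";
  return 0;
}
```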

This test demonstrates the race condition where SetNodeDraining() is called
on GcsNodeManager but ClusterResourceManager::IsNodeDraining() returns false
because the drain state is only propagated asynchronously via RaySyncer.

Timeline from the bug:
- t=14:38:18: Autoscaler decides to drain 81 nodes
- t=14:38:28: Work gets scheduled to draining nodes
- t=14:38:48: Nodes are drained, killing the scheduled work

The fix will ensure ClusterResourceManager is updated synchronously when
SetNodeDraining() is called.
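
A rough illustration of the shape of such a test, using toy stand-in classes and GoogleTest rather than the actual managers and the test added in this PR:

```cpp
// Toy sketch only: stand-in structs instead of the real GCS classes.
#include <unordered_set>
#include "gtest/gtest.h"

namespace {

struct ToyClusterResourceManager {
  std::unordered_set<int> draining;
  bool IsNodeDraining(int node_id) const { return draining.count(node_id) > 0; }
  void SetNodeDraining(int node_id) { draining.insert(node_id); }
};

struct ToyGcsNodeManager {
  ToyClusterResourceManager *cluster_resource_manager = nullptr;
  void SetNodeDraining(int node_id) {
    // With the fix, the drain commit propagates synchronously through the
    // registered listener instead of waiting for a RaySyncer broadcast.
    if (cluster_resource_manager != nullptr) {
      cluster_resource_manager->SetNodeDraining(node_id);
    }
  }
};

TEST(DrainStatePropagation, SchedulerSeesDrainingImmediately) {
  ToyClusterResourceManager cluster_resource_manager;
  ToyGcsNodeManager gcs_node_manager{&cluster_resource_manager};

  gcs_node_manager.SetNodeDraining(/*node_id=*/1);

  // Against the pre-fix code, the equivalent assertion on the real classes
  // could fail because the drain state arrived only asynchronously.
  EXPECT_TRUE(cluster_resource_manager.IsNodeDraining(1));
}

}  // namespace
```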

Signed-off-by: Sagar Sumit <[email protected]>
When autoscaler commits to draining a node via SetNodeDraining(),
immediately update the ClusterResourceManager so the scheduler sees
the node as unavailable for scheduling.

The fix adds:
1. ClusterResourceManager::SetNodeDraining() method to directly set
   the draining state of a node
2. GcsNodeManager::AddNodeDrainingListener() to register a callback
   when a node is set to draining
3. GcsResourceManager registers the listener in its constructor to
   update ClusterResourceManager synchronously

This closes the race window where work could be scheduled to nodes
that were committed for draining but hadn't yet broadcast the state
via RaySyncer.

Timeline from the original bug:
- t=14:38:18: Autoscaler decides to drain 81 nodes
- t=14:38:28: Work gets scheduled to draining nodes (BUG)
- t=14:38:48: Nodes are drained, killing the scheduled work

With this fix, the scheduler immediately sees draining nodes as
unavailable, preventing work from being scheduled to them.

Signed-off-by: Sagar Sumit <[email protected]>
@codope codope requested a review from a team as a code owner December 18, 2025 10:29
@codope codope added the go (add ONLY when ready to merge, run all tests) label Dec 18, 2025
@codope codope changed the title from "Core 2613 scheduler race" to "[core] Fix drain state propagation race condition" Dec 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a race condition where the scheduler could assign work to a node that is in the process of being drained. The solution, which involves introducing a listener in GcsNodeManager to allow GcsResourceManager to immediately update ClusterResourceManager, is well-implemented and closes the race window as described. The changes are logical, and a new test case has been added to verify the fix. I've included one suggestion to enhance the new test's coverage.

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) label Dec 18, 2025
Signed-off-by: Sagar Sumit <[email protected]>
Collaborator

edoakes commented Dec 22, 2025

@ZacAttack PTAL

if (!local_view.is_draining) {
  local_view.is_draining = resource_view_sync_message.is_draining();
  local_view.draining_deadline_timestamp_ms =
      resource_view_sync_message.draining_deadline_timestamp_ms();
Contributor


Maybe a tangent, but why does this class care about the draining_deadline anyway? I'm wondering if we're bloating the protocol somehow with this...

Contributor Author


The scheduler only uses is_draining for scheduling decisions. The deadline is stored but only used for reporting in GcsResourceManager::HandleGetDrainingNodes(). We could potentially separate the scheduling-relevant fields (is_draining) from the reporting-only fields (deadline), but that would require a larger refactoring of NodeResources.
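
For example, the separation could look roughly like the following; the struct names are hypothetical and the real NodeResources carries far more state:

```cpp
// Hypothetical illustration of splitting scheduling-relevant drain state from
// reporting-only drain state; these structs are not the actual Ray NodeResources.
#include <cstdint>

// What the scheduler actually needs in order to skip a node.
struct NodeSchedulingDrainState {
  bool is_draining = false;
};

// What HandleGetDrainingNodes() reports but scheduling never consults.
struct NodeDrainReportInfo {
  int64_t draining_deadline_timestamp_ms = 0;
};

int main() {
  NodeSchedulingDrainState scheduling_state;
  NodeDrainReportInfo report_info;
  scheduling_state.is_draining = true;
  report_info.draining_deadline_timestamp_ms = 1700000000000;
  return 0;
}
```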

Contributor

@ZacAttack ZacAttack left a comment


So some thoughts. I don't think this actually fixes the described race condition. It does tighten up the loop (probably significantly), but my understanding of the issue implies to me that this can still happen.

I'm wondering if the io_context pattern here should apply or not. It seems like there's only a single listener. Why not just do the work to make the resource manager thread safe and then update it synchronously? That'd be more complicated than what we've got here, but it seems like a better solution to the described problem. What do you think?

Contributor Author

codope commented Jan 6, 2026

So some thoughts. I don't think this actually fixes the described race condition. It does tighten up the loop (probably significantly), but my understanding of the issue implies to me that this can still happen.

I'm wondering if the io_context pattern here should apply or not. It seems like there's only a single listener. Why not just do the work to make the resource manager thread safe and then update it synchronously? That'd be more complicated than what we've got here, but it seems like a better solution to the described problem. What do you think?

You're right that the race window still exists due to the Post pattern. My fixes protect against stale syncer messages and timer resets, but not the queuing delay. I had thought about adding a mutex to the resource manager but was not strongly inclined to do so because of the performance overhead (mutex acquisition on every scheduling operation). Also, if we introduce a mutex, we must audit all call paths to eliminate deadlock risk.

Looking at the code, everything in the GCS runs on the same io_context:

  • gRPC handlers run on the io_context
  • ClusterResourceManager is only accessed from the io_context
  • the listener callback targets the same io_context

Given this, I believe we don't need to add thread-safety to ClusterResourceManager. Instead, we can simply call the listener callback synchronously rather than using Post. This eliminates the race window while keeping the code simple:

  • HandleDrainNode calls SetNodeDraining
  • SetNodeDraining directly invokes the callback (not Post)
  • ClusterResourceManager::SetNodeDraining executes
  • Control returns to HandleDrainNode
  • Reply sent to autoscaler

Now when the next scheduling request arrives, IsNodeDraining() returns the correct value. The callback doesn't re-enter GcsNodeManager, so holding the mutex while calling it is safe.
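
To make the ordering difference concrete, here is a self-contained toy in which a plain queue stands in for the io_context; none of this is the actual Ray code:

```cpp
// Toy illustration: a posted update only takes effect once the event loop
// drains its queue, while a direct call is visible before the handler returns.
#include <functional>
#include <iostream>
#include <queue>

int main() {
  std::queue<std::function<void()>> io_context;  // toy single-threaded loop
  bool is_draining = false;                      // scheduler's view of the node

  // Variant 1: Post the update (the original listener behavior).
  io_context.push([&] { is_draining = true; });
  // A scheduling request handled before the loop drains still sees false.
  std::cout << "after Post, before loop drains: " << std::boolalpha
            << is_draining << "\n";  // false -> work can land on the node

  // Drain the queue (what the io_context eventually does).
  while (!io_context.empty()) {
    io_context.front()();
    io_context.pop();
  }
  std::cout << "after loop drains: " << is_draining << "\n";  // true

  // Variant 2: invoke the listener synchronously inside SetNodeDraining().
  is_draining = false;
  auto listener = [&] { is_draining = true; };
  listener();  // completes before HandleDrainNode replies to the autoscaler
  std::cout << "after synchronous call: " << is_draining << "\n";  // true
  return 0;
}
```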

However, if you'd prefer the thread-safety approach for future-proofing and the performance overhead is acceptable, I can add a mutex to ClusterResourceManager::SetNodeDraining and IsNodeDraining. What do you think?

@ZacAttack
Contributor

So if you'd prefer to keep this PR small, we can just do the synchronous call. There are already plenty of other places that use this pattern. That said, it is a 'true north' objective to make all these managers thread safe, so if you can manage it, it would be swell if we could make some headway toward that goal. But be mindful of your time; I won't insist on it for this fix.

Contributor Author

codope commented Jan 9, 2026

So if you'd prefer to keep this PR small, we can just do the synchronous call. There are already plenty of other places that use this pattern. That said, it is a 'true north' objective to make all these managers thread safe, so if you can manage it, it would be swell if we could make some headway toward that goal. But be mindful of your time; I won't insist on it for this fix.

Thanks, I looked into what thread-safety for the resource manager would require. There are ~20 public methods that need locking (and two methods return references, which would require some API changes). The periodic timer callback also accesses nodes_ and needs coordination. I think it deserves a dedicated follow-up PR with proper testing for deadlocks. For this PR, I'll go with the synchronous call approach. I have updated it; PTAL again.
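
For reference, a hypothetical sketch of where that follow-up could start for the two methods discussed here, using std::mutex for illustration; the real change would also have to cover the remaining public methods and the reference-returning accessors:

```cpp
// Hypothetical sketch only; not the actual ClusterResourceManager, which has
// ~20 public methods and accessors that return references.
#include <cstdint>
#include <mutex>
#include <unordered_map>

class ThreadSafeDrainView {
 public:
  void SetNodeDraining(int64_t node_id, int64_t deadline_ms) {
    std::lock_guard<std::mutex> lock(mutex_);
    draining_deadline_ms_[node_id] = deadline_ms;
  }
  bool IsNodeDraining(int64_t node_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return draining_deadline_ms_.count(node_id) > 0;
  }
  // Methods that currently return references would need to return copies or
  // take callbacks while holding the lock, hence the API changes noted above.

 private:
  mutable std::mutex mutex_;
  std::unordered_map<int64_t, int64_t> draining_deadline_ms_;
};

int main() { return 0; }
```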

@codope codope requested a review from ZacAttack January 9, 2026 10:07
Contributor

@ZacAttack ZacAttack left a comment


One nit left, but I think after that's addressed we can merge.

Co-authored-by: Zac Policzer <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
@codope codope requested a review from ZacAttack January 11, 2026 01:25
