
Conversation

Contributor

@codope codope commented Dec 18, 2025

Description

When the autoscaler commits to draining a node, immediately update ClusterResourceManager so the scheduler sees the node as unavailable.

Why

There's a race condition where work can be scheduled to nodes that are committed for draining:

  1. Autoscaler calls HandleDrainNode() -> SetNodeDraining() updates GcsNodeManager::draining_nodes_.
  2. Scheduler checks ClusterResourceManager::IsNodeDraining() which reads from NodeResources::is_draining.
  3. But is_draining is only updated when the raylet broadcasts via RaySyncer.

During this window, the scheduler sees is_draining=false and schedules work to the draining node. To fix this, GcsResourceManager now registers a listener with GcsNodeManager that immediately updates ClusterResourceManager when a node is set to draining. This closes the race window where work could be scheduled to draining nodes.
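
As an illustration of that wiring, here is a minimal self-contained sketch; the class and method names mirror the description above, but the simplified types and signatures are stand-ins rather than the actual Ray code:

```cpp
// Sketch only: simplified stand-ins for GcsNodeManager, GcsResourceManager,
// and ClusterResourceManager to show the listener wiring; not the real Ray types.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using NodeID = std::string;  // placeholder for ray::NodeID

class ClusterResourceManager {
 public:
  // Per the description: directly mark a node as draining in the scheduler's view.
  void SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
    draining_[node_id] = deadline_ms;
  }
  bool IsNodeDraining(const NodeID &node_id) const {
    return draining_.count(node_id) > 0;
  }

 private:
  std::unordered_map<NodeID, int64_t> draining_;
};

class GcsNodeManager {
 public:
  using DrainingListener = std::function<void(const NodeID &, int64_t)>;

  // Per the description: register a callback invoked when a node is committed
  // for draining.
  void AddNodeDrainingListener(DrainingListener listener) {
    listeners_.push_back(std::move(listener));
  }

  void SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
    draining_nodes_.insert(node_id);
    // Notify listeners immediately instead of waiting for the raylet to
    // broadcast its drain state via RaySyncer.
    for (const auto &listener : listeners_) {
      listener(node_id, deadline_ms);
    }
  }

 private:
  std::unordered_set<NodeID> draining_nodes_;
  std::vector<DrainingListener> listeners_;
};

int main() {
  GcsNodeManager node_manager;
  ClusterResourceManager cluster_resources;

  // In the real change, GcsResourceManager's constructor registers this listener.
  node_manager.AddNodeDrainingListener(
      [&cluster_resources](const NodeID &node_id, int64_t deadline_ms) {
        cluster_resources.SetNodeDraining(node_id, deadline_ms);
      });

  node_manager.SetNodeDraining("node-1", /*deadline_ms=*/1234);
  // The scheduler's view is already up to date; no RaySyncer round trip needed.
  std::cout << std::boolalpha << cluster_resources.IsNodeDraining("node-1") << "\n";
  return 0;
}
```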

This test demonstrates the race condition where SetNodeDraining() is called
on GcsNodeManager but ClusterResourceManager::IsNodeDraining() returns false
because the drain state is only propagated asynchronously via RaySyncer.

Timeline from the bug:
- t=14:38:18: Autoscaler decides to drain 81 nodes
- t=14:38:28: Work gets scheduled to draining nodes
- t=14:38:48: Nodes are drained, killing the scheduled work

The fix will ensure ClusterResourceManager is updated synchronously when
SetNodeDraining() is called.
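
A rough illustration of the shape of such a test, using toy stand-in classes and GoogleTest rather than the actual managers and the test added in this PR:

```cpp
// Toy sketch only: stand-in structs instead of the real GCS classes.
#include <unordered_set>
#include "gtest/gtest.h"

namespace {

struct ToyClusterResourceManager {
  std::unordered_set<int> draining;
  bool IsNodeDraining(int node_id) const { return draining.count(node_id) > 0; }
  void SetNodeDraining(int node_id) { draining.insert(node_id); }
};

struct ToyGcsNodeManager {
  ToyClusterResourceManager *cluster_resource_manager = nullptr;
  void SetNodeDraining(int node_id) {
    // With the fix, the drain commit propagates synchronously through the
    // registered listener instead of waiting for a RaySyncer broadcast.
    if (cluster_resource_manager != nullptr) {
      cluster_resource_manager->SetNodeDraining(node_id);
    }
  }
};

TEST(DrainStatePropagation, SchedulerSeesDrainingImmediately) {
  ToyClusterResourceManager cluster_resource_manager;
  ToyGcsNodeManager gcs_node_manager{&cluster_resource_manager};

  gcs_node_manager.SetNodeDraining(/*node_id=*/1);

  // Against the pre-fix code, the equivalent assertion on the real classes
  // could fail because the drain state arrived only asynchronously.
  EXPECT_TRUE(cluster_resource_manager.IsNodeDraining(1));
}

}  // namespace
```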

Signed-off-by: Sagar Sumit <[email protected]>
When autoscaler commits to draining a node via SetNodeDraining(),
immediately update the ClusterResourceManager so the scheduler sees
the node as unavailable for scheduling.

The fix adds:
1. ClusterResourceManager::SetNodeDraining() method to directly set
   the draining state of a node
2. GcsNodeManager::AddNodeDrainingListener() to register a callback
   when a node is set to draining
3. GcsResourceManager registers the listener in its constructor to
   update ClusterResourceManager synchronously

This closes the race window where work could be scheduled to nodes
that were committed for draining but hadn't yet broadcast the state
via RaySyncer.

Timeline from the original bug:
- t=14:38:18: Autoscaler decides to drain 81 nodes
- t=14:38:28: Work gets scheduled to draining nodes (BUG)
- t=14:38:48: Nodes are drained, killing the scheduled work

With this fix, the scheduler immediately sees draining nodes as
unavailable, preventing work from being scheduled to them.

Signed-off-by: Sagar Sumit <[email protected]>
@codope codope requested a review from a team as a code owner December 18, 2025 10:29
@codope codope added the go (add ONLY when ready to merge, run all tests) label Dec 18, 2025
@codope codope changed the title from "Core 2613 scheduler race" to "[core] Fix drain state propagation race condition" Dec 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a race condition where the scheduler could assign work to a node that is in the process of being drained. The solution, which involves introducing a listener in GcsNodeManager to allow GcsResourceManager to immediately update ClusterResourceManager, is well-implemented and closes the race window as described. The changes are logical, and a new test case has been added to verify the fix. I've included one suggestion to enhance the new test's coverage.

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) label Dec 18, 2025
Signed-off-by: Sagar Sumit <[email protected]>
Collaborator

edoakes commented Dec 22, 2025

@ZacAttack PTAL

if (!local_view.is_draining) {
  local_view.is_draining = resource_view_sync_message.is_draining();
  local_view.draining_deadline_timestamp_ms =
      resource_view_sync_message.draining_deadline_timestamp_ms();
Contributor


Maybe a tangent, but why does this class care about the draining_deadline anyway? I'm wondering if we're bloating the protocol somehow with this...

Contributor Author


The scheduler only uses is_draining for scheduling decisions. The deadline is stored but only used for reporting in GcsResourceManager::HandleGetDrainingNodes(). We could potentially separate the scheduling-relevant fields (is_draining) from the reporting-only fields (deadline), but that would require a larger refactoring of NodeResources.
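
For example, the separation could look roughly like the following; the struct names are hypothetical and the real NodeResources carries far more state:

```cpp
// Hypothetical illustration of splitting scheduling-relevant drain state from
// reporting-only drain state; these structs are not the actual Ray NodeResources.
#include <cstdint>

// What the scheduler actually needs in order to skip a node.
struct NodeSchedulingDrainState {
  bool is_draining = false;
};

// What HandleGetDrainingNodes() reports but scheduling never consults.
struct NodeDrainReportInfo {
  int64_t draining_deadline_timestamp_ms = 0;
};

int main() {
  NodeSchedulingDrainState scheduling_state;
  NodeDrainReportInfo report_info;
  scheduling_state.is_draining = true;
  report_info.draining_deadline_timestamp_ms = 1700000000000;
  return 0;
}
```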

Contributor

@ZacAttack ZacAttack left a comment


So some thoughts. I don't think this actually fixes the described race condition. It does tighten up the loop (probably significantly), but my understanding of the issue implies to me that this can still happen.

I'm wondering if the io_context pattern here should apply or not. It seems like there's only a single listener. Why not just do the work to make the resource manager thread safe and then update it synchronously? That'd be more complicated than what we've got here, but it seems like a better solution to the described problem. What do you think?

Contributor Author

codope commented Jan 6, 2026

So some thoughts. I don't think this actually fixes the described race condition. It does tighten up the loop (probably significantly), but my understanding of the issue implies to me that this can still happen.

I'm wondering if the io_context pattern here should apply or not. It seems like there's only a single listener. Why not just do the work to make the resource manager thread safe and then update it synchronously? That'd be more complicated than what we've got here, but it seems like a better solution to the described problem. What do you think?

You're right that the race window still exists due to the Post pattern. My fixes protect against stale syncer messages and timer resets, but not the queuing delay. I had thought about adding a mutex to the resource manager but was not strongly inclined to do so because of the performance overhead (mutex acquisition on every scheduling operation). Also, if we introduce a mutex, we must audit all call paths to eliminate deadlock risk.

Looking at the code, everything in the GCS runs on the same io_context:

  • gRPC handlers run on the io_context
  • ClusterResourceManager is only accessed from the io_context
  • the listener callback targets the same io_context

Given this, I believe we don't need to add thread-safety to ClusterResourceManager. Instead, we can simply call the listener callback synchronously rather than using Post. This eliminates the race window while keeping the code simple:

  • HandleDrainNode calls SetNodeDraining
  • SetNodeDraining directly invokes the callback (not Post)
  • ClusterResourceManager::SetNodeDraining executes
  • Control returns to HandleDrainNode
  • Reply sent to autoscaler

Now when the next scheduling request arrives, IsNodeDraining() returns the correct value. The callback doesn't re-enter GcsNodeManager, so holding the mutex while calling it is safe.
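
To make the ordering difference concrete, here is a self-contained toy in which a plain queue stands in for the io_context; none of this is the actual Ray code:

```cpp
// Toy illustration: a posted update only takes effect once the event loop
// drains its queue, while a direct call is visible before the handler returns.
#include <functional>
#include <iostream>
#include <queue>

int main() {
  std::queue<std::function<void()>> io_context;  // toy single-threaded loop
  bool is_draining = false;                      // scheduler's view of the node

  // Variant 1: Post the update (the original listener behavior).
  io_context.push([&] { is_draining = true; });
  // A scheduling request handled before the loop drains still sees false.
  std::cout << "after Post, before loop drains: " << std::boolalpha
            << is_draining << "\n";  // false -> work can land on the node

  // Drain the queue (what the io_context eventually does).
  while (!io_context.empty()) {
    io_context.front()();
    io_context.pop();
  }
  std::cout << "after loop drains: " << is_draining << "\n";  // true

  // Variant 2: invoke the listener synchronously inside SetNodeDraining().
  is_draining = false;
  auto listener = [&] { is_draining = true; };
  listener();  // completes before HandleDrainNode replies to the autoscaler
  std::cout << "after synchronous call: " << is_draining << "\n";  // true
  return 0;
}
```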

However, if you'd prefer the thread-safety approach for future-proofing and the performance overhead is acceptable, I can add a mutex to ClusterResourceManager::SetNodeDraining and IsNodeDraining. What do you think?

@ZacAttack
Contributor

So if you'd prefer to keep this PR small, we can just do the synchronous call. There are already plenty of other places that use this pattern. That said, it is a 'true north' objective to make all these managers thread safe, so if you can manage it, it would be swell if we could make some headway toward that goal. But be mindful of your time; I won't insist on it for this fix.

Contributor Author

codope commented Jan 9, 2026

So if you'd prefer to keep this PR small, we can just do the synchronous call. There are already plenty of other places that use this pattern. That said, it is a 'true north' objective to make all these managers thread safe, so if you can manage it, it would be swell if we could make some headway toward that goal. But be mindful of your time; I won't insist on it for this fix.

Thanks, I looked into what thread-safety for the resource manager would require. There are ~20 public methods that need locking (and two methods return references, which would require some API changes). The periodic timer callback also accesses nodes_ and needs coordination. I think it deserves a dedicated follow-up PR with proper testing for deadlocks. For this PR, I'll go with the synchronous call approach. I have updated it; PTAL again.
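
For reference, a hypothetical sketch of where that follow-up could start for the two methods discussed here, using std::mutex for illustration; the real change would also have to cover the remaining public methods and the reference-returning accessors:

```cpp
// Hypothetical sketch only; not the actual ClusterResourceManager, which has
// ~20 public methods and accessors that return references.
#include <cstdint>
#include <mutex>
#include <unordered_map>

class ThreadSafeDrainView {
 public:
  void SetNodeDraining(int64_t node_id, int64_t deadline_ms) {
    std::lock_guard<std::mutex> lock(mutex_);
    draining_deadline_ms_[node_id] = deadline_ms;
  }
  bool IsNodeDraining(int64_t node_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return draining_deadline_ms_.count(node_id) > 0;
  }
  // Methods that currently return references would need to return copies or
  // take callbacks while holding the lock, hence the API changes noted above.

 private:
  mutable std::mutex mutex_;
  std::unordered_map<int64_t, int64_t> draining_deadline_ms_;
};

int main() { return 0; }
```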

@codope codope requested a review from ZacAttack January 9, 2026 10:07
Contributor

@ZacAttack ZacAttack left a comment


One nit left, but I think after that's addressed we can merge.

Co-authored-by: Zac Policzer <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
@codope codope requested a review from ZacAttack January 11, 2026 01:25
