[core] Fix drain state propagation race condition #59536
base: master
Conversation
This test demonstrates the race condition where SetNodeDraining() is called on GcsNodeManager but ClusterResourceManager::IsNodeDraining() returns false, because the drain state is only propagated asynchronously via RaySyncer.

Timeline from the bug:
- t=14:38:18: Autoscaler decides to drain 81 nodes
- t=14:38:28: Work gets scheduled to draining nodes
- t=14:38:48: Nodes are drained, killing the scheduled work

The fix will ensure ClusterResourceManager is updated synchronously when SetNodeDraining() is called.

Signed-off-by: Sagar Sumit <[email protected]>
When the autoscaler commits to draining a node via SetNodeDraining(), immediately update the ClusterResourceManager so the scheduler sees the node as unavailable for scheduling.

The fix adds:
1. ClusterResourceManager::SetNodeDraining(), to directly set the draining state of a node
2. GcsNodeManager::AddNodeDrainingListener(), to register a callback invoked when a node is set to draining
3. GcsResourceManager registering the listener in its constructor, so ClusterResourceManager is updated synchronously

This closes the race window where work could be scheduled to nodes that were committed for draining but hadn't yet broadcast the state via RaySyncer.

Timeline from the original bug:
- t=14:38:18: Autoscaler decides to drain 81 nodes
- t=14:38:28: Work gets scheduled to draining nodes (BUG)
- t=14:38:48: Nodes are drained, killing the scheduled work

With this fix, the scheduler immediately sees draining nodes as unavailable, preventing work from being scheduled to them.

Signed-off-by: Sagar Sumit <[email protected]>
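A rough sketch of the wiring described in this commit message (the three method/class names come from the PR; every other member, type, and signature is an illustrative guess, not the actual Ray code):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the real Ray classes; only the pieces named in
// the commit message are sketched.
using NodeID = int64_t;

class ClusterResourceManager {
 public:
  // (1) New: directly mark a node as draining so the scheduler sees it as
  // unavailable without waiting for the RaySyncer broadcast.
  void SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
    auto &view = node_views_[node_id];
    view.is_draining = true;
    view.draining_deadline_timestamp_ms = deadline_ms;
  }

  // Scheduling consults this flag when deciding whether a node is usable.
  bool IsNodeDraining(const NodeID &node_id) const {
    auto it = node_views_.find(node_id);
    return it != node_views_.end() && it->second.is_draining;
  }

 private:
  struct NodeView {
    bool is_draining = false;
    int64_t draining_deadline_timestamp_ms = 0;
  };
  std::unordered_map<NodeID, NodeView> node_views_;
};

class GcsNodeManager {
 public:
  using DrainingListener = std::function<void(const NodeID &, int64_t)>;

  // (2) New: register a callback fired when SetNodeDraining() commits a node
  // for draining (how the callback is invoked is discussed in the review
  // thread below).
  void AddNodeDrainingListener(DrainingListener listener) {
    draining_listeners_.push_back(std::move(listener));
  }

 private:
  std::vector<DrainingListener> draining_listeners_;
};

class GcsResourceManager {
 public:
  // (3) New: the listener is registered in the constructor so the drain
  // state reaches the scheduler's resource view as part of the same flow.
  GcsResourceManager(GcsNodeManager &node_manager,
                     ClusterResourceManager &cluster_resource_manager)
      : cluster_resource_manager_(cluster_resource_manager) {
    node_manager.AddNodeDrainingListener(
        [this](const NodeID &node_id, int64_t deadline_ms) {
          cluster_resource_manager_.SetNodeDraining(node_id, deadline_ms);
        });
  }

 private:
  ClusterResourceManager &cluster_resource_manager_;
};
```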
Code Review
This pull request effectively addresses a race condition where the scheduler could assign work to a node that is in the process of being drained. The solution, which involves introducing a listener in GcsNodeManager to allow GcsResourceManager to immediately update ClusterResourceManager, is well-implemented and closes the race window as described. The changes are logical, and a new test case has been added to verify the fix. I've included one suggestion to enhance the new test's coverage.
@ZacAttack PTAL
if (!local_view.is_draining) {
  local_view.is_draining = resource_view_sync_message.is_draining();
  local_view.draining_deadline_timestamp_ms =
      resource_view_sync_message.draining_deadline_timestamp_ms();
Maybe a tangent, but why does this class care about the draining_deadline anyway? I'm wondering if we're bloating the protocol somehow with this...
The scheduler only uses is_draining for scheduling decisions. The deadline is stored but only used for reporting in GcsResourceManager::HandleGetDrainingNodes(). We could potentially separate the scheduling-relevant field (is_draining) from the reporting-only field (the deadline), but that would require a larger refactoring of NodeResources.
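Illustrative only, a trimmed-down stand-in for the fields in question rather than the actual NodeResources definition, just to show which one scheduling depends on:

```cpp
#include <cstdint>

// Hypothetical, simplified view of the draining-related state.
struct NodeResources {
  bool is_draining = false;                    // consulted by scheduling decisions
  int64_t draining_deadline_timestamp_ms = 0;  // only read back for reporting,
                                               // e.g. via HandleGetDrainingNodes()
};

// A scheduling-side check only needs the boolean flag.
bool IsSchedulable(const NodeResources &node) { return !node.is_draining; }
```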
ZacAttack
left a comment
So some thoughts. I don't think this actually fixes the described race condition. It does tighten up the loop (probably significantly), but my understanding of the issue implies that this can still happen.
I'm wondering whether the io_context pattern here should apply or not. It seems like there's only a single listener. Why not just do the work to make the resource manager thread safe and then update it synchronously? That'd be more complicated than what we've got here, but it seems like a better solution to the described problem. What do you think?
You're right that the race window still exists due to the Post pattern. My fixes protect against stale syncer messages and timer resets, but not the queuing delay. I had thought about adding a mutex to the resource manager but was not strongly inclined to do so because of the performance overhead (mutex acquisition on every scheduling operation). Also, if we introduce a mutex, we must audit all call paths to eliminate deadlock risk. Looking at the code, everything in GCS runs on the same io_context:
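As a generic illustration of that single-io_context pattern (plain boost::asio, not Ray's actual GCS code): handlers posted to one io_context that is run by one thread execute one at a time, so they never race with each other.

```cpp
#include <boost/asio.hpp>
#include <iostream>

int main() {
  boost::asio::io_context io;  // one event loop, run by a single thread

  // Handlers posted to the same io_context run one after another on the
  // thread calling run(), so they never execute concurrently.
  boost::asio::post(io, [] { std::cout << "drain-node handler\n"; });
  boost::asio::post(io, [] { std::cout << "scheduling handler\n"; });

  io.run();
  return 0;
}
```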
Given this, I believe we don't need to add thread-safety to ClusterResourceManager. Instead, we can simply call the listener callback synchronously rather than using Post. This eliminates the race window while keeping the code simple:
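In terms of the sketch posted earlier in this thread (still hypothetical member names, and assuming GcsNodeManager guards draining_nodes_ with an absl::Mutex named mutex_):

```cpp
void GcsNodeManager::SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
  absl::MutexLock lock(&mutex_);
  draining_nodes_.insert(node_id);
  // Invoke the listener directly instead of posting it to the io_context.
  // By the time this call returns, ClusterResourceManager::SetNodeDraining()
  // has already run, so the very next scheduling pass sees the node as draining.
  for (const auto &listener : draining_listeners_) {
    listener(node_id, deadline_ms);
  }
}
```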
Now when the next scheduling request arrives, IsNodeDraining() returns the correct value. The callback doesn't re-enter GcsNodeManager, so holding the mutex while calling it is safe. However, if you'd prefer the thread-safety approach for future-proofing and the performance overhead is acceptable, I can add a mutex to ClusterResourceManager::SetNodeDraining and IsNodeDraining. What do you think?
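For reference, a minimal sketch of what that mutex-based alternative might look like; this is not code from the PR, and the absl::Mutex member is an assumption:

```cpp
#include <absl/synchronization/mutex.h>
#include <cstdint>
#include <unordered_map>

using NodeID = int64_t;

class ClusterResourceManager {
 public:
  void SetNodeDraining(const NodeID &node_id, int64_t deadline_ms) {
    absl::MutexLock lock(&mutex_);  // serialize writers with readers
    auto &view = node_views_[node_id];
    view.is_draining = true;
    view.draining_deadline_timestamp_ms = deadline_ms;
  }

  bool IsNodeDraining(const NodeID &node_id) const {
    absl::MutexLock lock(&mutex_);  // every scheduling check pays this cost
    auto it = node_views_.find(node_id);
    return it != node_views_.end() && it->second.is_draining;
  }

 private:
  struct NodeView {
    bool is_draining = false;
    int64_t draining_deadline_timestamp_ms = 0;
  };
  mutable absl::Mutex mutex_;  // mutable so the const reader can lock it
  std::unordered_map<NodeID, NodeView> node_views_;
};
```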
So if you'd prefer to keep this PR small, we can just do the synchronous call. There are already plenty of other places which use this pattern. That said, it is a 'true north' objective to make all these managers thread safe, so if you can manage it, it would be swell if we could make some headway toward that goal. But be mindful of your time; I won't insist on it for this fix.
Thanks, I looked into what thread-safety for the resource manager would require. There are ~20 public methods that need locking (and two methods return references, which would need some API changes). The periodic timer callback also accesses …
ZacAttack
left a comment
One nit left, but I think after that's addressed we can merge.
Co-authored-by: Zac Policzer <[email protected]> Signed-off-by: Sagar Sumit <[email protected]>
Description
When the autoscaler commits to draining a node, immediately update `ClusterResourceManager` so the scheduler sees the node as unavailable.
Why
There's a race condition where work can be scheduled to nodes that are committed for draining:
1. `HandleDrainNode()` -> `SetNodeDraining()` updates `GcsNodeManager::draining_nodes_`.
2. The scheduler checks `ClusterResourceManager::IsNodeDraining()`, which reads from `NodeResources::is_draining`.
3. `is_draining` is only updated when the raylet broadcasts via RaySyncer.

During this window, the scheduler sees `is_draining=false` and schedules work to the draining node.

To fix this, `GcsResourceManager` now registers a listener with `GcsNodeManager` that immediately updates `ClusterResourceManager` when a node is set to draining. This closes the race window where work could be scheduled to draining nodes.
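A gtest-style sketch of the behavior the added test verifies; the fixture, helper, and member names here are illustrative, not the actual test code from this PR:

```cpp
#include <gtest/gtest.h>

// Fixture name, AddFakeNode(), and the deadline argument are assumptions.
TEST_F(GcsDrainStateTest, SchedulerSeesDrainImmediately) {
  // Register a node, then commit it for draining through the node manager.
  const NodeID node_id = AddFakeNode();
  gcs_node_manager_->SetNodeDraining(node_id, /*deadline_ms=*/0);

  // With the fix, the listener updates ClusterResourceManager in the same
  // call, before any RaySyncer broadcast, so the scheduler-facing view
  // already reports the node as draining.
  EXPECT_TRUE(cluster_resource_manager_->IsNodeDraining(node_id));
}
```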