[core] Fix drain state propagation race condition #59536
Changes from 7 commits: e9ebc9a, c592d42, f3bde88, 8b9d50f, 0a3a954, 7e96f3e, b05683f, 55eba8c
```diff
@@ -101,9 +101,11 @@ bool ClusterResourceManager::UpdateNode(
   // Last update time to the local node resources view.
   local_view.last_resource_update_time = absl::Now();
 
-  local_view.is_draining = resource_view_sync_message.is_draining();
-  local_view.draining_deadline_timestamp_ms =
-      resource_view_sync_message.draining_deadline_timestamp_ms();
+  if (!local_view.is_draining) {
+    local_view.is_draining = resource_view_sync_message.is_draining();
+    local_view.draining_deadline_timestamp_ms =
+        resource_view_sync_message.draining_deadline_timestamp_ms();
+  }
```
Contributor:
Maybe a tangent, but why does this class care about the draining_deadline anyway? I'm wondering if we're bloating the protocol somehow with this...

Contributor (Author):
The scheduler only uses `is_draining` for scheduling decisions. The deadline is stored but only used for reporting in `GcsResourceManager::HandleGetDrainingNodes()`. We could potentially separate the scheduling-relevant fields (`is_draining`) from the reporting-only fields (the deadline), but that would require a larger refactoring of `NodeResources`.
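For illustration, a minimal stand-alone sketch (hypothetical types and names, not Ray's actual scheduling code) of the point above: a feasibility check only needs `is_draining`, while the deadline is carried along purely for reporting.

```cpp
#include <cassert>
#include <cstdint>

struct NodeResourcesView {
  bool is_draining = false;
  int64_t draining_deadline_timestamp_ms = 0;  // carried for reporting only
  double available_cpu = 0.0;
};

// A draining node never accepts new work; the deadline is irrelevant here.
bool IsSchedulable(const NodeResourcesView &node, double cpu_demand) {
  if (node.is_draining) {
    return false;
  }
  return node.available_cpu >= cpu_demand;
}

int main() {
  NodeResourcesView node;
  node.available_cpu = 8.0;
  node.is_draining = true;
  node.draining_deadline_timestamp_ms = 123;  // ignored by the scheduling check
  assert(!IsSchedulable(node, 1.0));          // draining wins even though CPU is free
  node.is_draining = false;
  assert(IsSchedulable(node, 1.0));
  return 0;
}
```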
```diff
 
   AddOrUpdateNode(node_id, local_view);
   received_node_resources_[node_id] = std::move(local_view);
```
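A self-contained sketch of the race the guard above addresses (simplified stand-in types, not Ray's `ClusterResourceManager`): once a node is marked as draining locally, a sync message generated before the drain request must not clear the flag.

```cpp
#include <cassert>
#include <cstdint>

struct ViewSyncMessage {  // stand-in for the ResourceViewSyncMessage fields used here
  bool is_draining = false;
  int64_t draining_deadline_timestamp_ms = 0;
};

struct LocalView {        // stand-in for the node's local resource view
  bool is_draining = false;
  int64_t draining_deadline_timestamp_ms = 0;
};

// Mirrors the guarded merge from the hunk above: only adopt the message's
// drain fields if the node is not already marked as draining locally.
void MergeDrainState(LocalView &local_view, const ViewSyncMessage &msg) {
  if (!local_view.is_draining) {
    local_view.is_draining = msg.is_draining;
    local_view.draining_deadline_timestamp_ms = msg.draining_deadline_timestamp_ms;
  }
}

int main() {
  LocalView view;
  // Drain request applied locally (analogous to SetNodeDraining below).
  view.is_draining = true;
  view.draining_deadline_timestamp_ms = 1700000000000;

  // A sync message generated before the drain request arrives afterwards.
  ViewSyncMessage stale;  // is_draining == false
  MergeDrainState(view, stale);

  // Without the guard, the stale message would reset is_draining to false
  // and the scheduler could keep placing work on the draining node.
  assert(view.is_draining);
  return 0;
}
```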
```diff
@@ -115,6 +117,26 @@ bool ClusterResourceManager::RemoveNode(scheduling::NodeID node_id) {
   return nodes_.erase(node_id) != 0;
 }
 
+bool ClusterResourceManager::SetNodeDraining(const scheduling::NodeID &node_id,
+                                             bool is_draining,
+                                             int64_t draining_deadline_timestamp_ms) {
+  auto it = nodes_.find(node_id);
+  if (it == nodes_.end()) {
+    return false;
+  }
+
+  auto *local_view = it->second.GetMutableLocalView();
+  local_view->is_draining = is_draining;
+  local_view->draining_deadline_timestamp_ms = draining_deadline_timestamp_ms;
+  auto rnr_it = received_node_resources_.find(node_id);
+  if (rnr_it != received_node_resources_.end()) {
+    rnr_it->second.is_draining = is_draining;
+    rnr_it->second.draining_deadline_timestamp_ms = draining_deadline_timestamp_ms;
+  }
+
+  return true;
+}
+
 bool ClusterResourceManager::GetNodeResources(scheduling::NodeID node_id,
                                               NodeResources *ret_resources) const {
   auto it = nodes_.find(node_id);
```
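A simplified, self-contained sketch (hypothetical `Manager` with int node ids, not Ray's types) of the behavior the new `SetNodeDraining` implements: write the drain state to the live node view, keep the cached `received_node_resources_` entry consistent when it exists, and return false for unknown nodes.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct NodeView {
  bool is_draining = false;
  int64_t draining_deadline_timestamp_ms = 0;
};

struct Manager {
  std::unordered_map<int, NodeView> nodes;                    // live node views
  std::unordered_map<int, NodeView> received_node_resources;  // cached per-node snapshots

  // Mirrors the shape of the new SetNodeDraining(): write the drain state to
  // the live view and, if a cached snapshot exists, keep it consistent too.
  bool SetNodeDraining(int node_id, bool is_draining, int64_t deadline_ms) {
    auto it = nodes.find(node_id);
    if (it == nodes.end()) {
      return false;  // unknown node: report failure instead of inserting one
    }
    it->second.is_draining = is_draining;
    it->second.draining_deadline_timestamp_ms = deadline_ms;
    auto rnr = received_node_resources.find(node_id);
    if (rnr != received_node_resources.end()) {
      rnr->second.is_draining = is_draining;
      rnr->second.draining_deadline_timestamp_ms = deadline_ms;
    }
    return true;
  }
};

int main() {
  Manager m;
  m.nodes[1] = NodeView{};
  m.received_node_resources[1] = NodeView{};

  assert(m.SetNodeDraining(1, true, 123));
  assert(m.nodes[1].is_draining);
  assert(m.received_node_resources[1].is_draining);

  // Unknown nodes are rejected rather than silently created.
  assert(!m.SetNodeDraining(42, true, 0));
  return 0;
}
```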