Avoid deleting etcd member before node is drained #434
Conversation
Control plane machines should not be removed from the etcd cluster until the node is drained, otherwise the kubelet would lose connectivity to the API, as it relies on the local api-server, which in turn relies on the local etcd pod.
Hi @zioc, thanks for the PR. I'm afraid simply removing the etcd management logic will cause a regression on #263. This logic is consistent with upstream.
@@ -151,20 +151,6 @@ func (r *RKE2ControlPlaneReconciler) scaleDownControlPlane(
 		return ctrl.Result{}, errors.New("failed to pick control plane Machine to delete")
 	}

-	// If etcd leadership is on machine that is about to be deleted, move it to the newest member available.
Potentially this change could also check that no more than one CP machine is currently tagged for deletion, and perform the immediate etcd leadership transfer only when scaling down to 1 CP. Then the regular membership removal can be done periodically. But I can't tell whether, with >2 replicas, leadership can still be transferred off the removed leader at the moment of the scale down to 1.
We should be testing this scenario in e2e, unless there were changes.
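For illustration, a minimal sketch of what an immediate leadership transfer could look like with the etcd v3 Go client. The function and parameter names (moveLeadershipOffMember, deletingMemberName) are made up, and the client's endpoint/TLS wiring is omitted; this is not the controller's actual helper:

```go
package controllers

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// moveLeadershipOffMember transfers etcd leadership away from the member that
// is about to be removed, picking any other member as the transferee.
// deletingMemberName is assumed to match the etcd member name of the Machine
// selected for deletion.
func moveLeadershipOffMember(ctx context.Context, etcdClient *clientv3.Client, deletingMemberName string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	members, err := etcdClient.MemberList(ctx)
	if err != nil {
		return err
	}

	var deletingID, transfereeID uint64
	for _, m := range members.Members {
		if m.Name == deletingMemberName {
			deletingID = m.ID
		} else if transfereeID == 0 {
			transfereeID = m.ID
		}
	}
	if deletingID == 0 || transfereeID == 0 {
		// Member not found, or it is the only member left: nothing to transfer.
		return nil
	}

	// Only act if the member being deleted currently holds leadership.
	status, err := etcdClient.Status(ctx, etcdClient.Endpoints()[0])
	if err != nil {
		return err
	}
	if status.Leader != deletingID {
		return nil
	}

	// In practice MoveLeader has to reach the current leader; the
	// single-endpoint status check above is a simplification.
	_, err = etcdClient.MoveLeader(ctx, transfereeID)
	return err
}
```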
You're right, it would probably be preferable to keep the immediate etcd leadership transfer part, as it'll prevent the transient leader failures documented here.
There could indeed be some regression on #263, but on the other hand, I tend to believe that this issue is far less critical than having rolling updates of control planes blocked by node drain.
To that extent, the current e2e tests are not representative of typical deployments: with nodeDrainTimeout set to 30s, the described issue has no chance to appear, as nodes will be deleted quickly by the capi controller. But in that case stateful applications get only a small window to hand over cleanly, and podDisruptionBudgets won't be honored.
That being said, from what I understand, etcd will become read-only when the control plane is scaled down from 2 to 1, as it will lose its quorum. But in that case the API server should still be available, so couldn't we just figure out a way to ensure that the rke2 controller still succeeds in removing the member corresponding to the deleted machine?
Otherwise, the only solution that I can foresee to preserve this scale-down capability would be to add a pre-terminate hook to the control plane machines. This way the controller could (sketched below):
- Wait for the machine to expose the DrainingSucceededCondition or to have exceeded its nodeDrainTimeout
- Then remove the corresponding member from the etcd cluster
- Finally, remove the pre-terminate hook annotation to let the capi controller delete the infra machine and the corresponding node.
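A rough Go sketch of that flow, assuming a removeEtcdMember callback that wraps the provider's existing etcd client logic and a hypothetical rke2-specific hook annotation key; the drain-timeout check is simplified compared to what CAPI actually tracks:

```go
package controllers

import (
	"context"
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical annotation key for an rke2-specific pre-terminate hook.
const rke2EtcdCleanupHookAnnotation = clusterv1.PreTerminateDeleteHookAnnotationPrefix + "/rke2-etcd-cleanup"

// reconcilePreTerminateHook sketches the proposed flow: once CAPI starts
// deleting the Machine, wait for the drain to succeed (or time out), remove
// the corresponding etcd member, then drop the hook annotation so CAPI can
// delete the InfraMachine and the node.
func reconcilePreTerminateHook(
	ctx context.Context,
	c client.Client,
	machine *clusterv1.Machine,
	removeEtcdMember func(context.Context, *clusterv1.Machine) error, // placeholder for the provider's etcd helper
) error {
	if machine.DeletionTimestamp.IsZero() {
		return nil // machine is not being deleted yet
	}

	drained := conditions.IsTrue(machine, clusterv1.DrainingSucceededCondition)
	// Rough approximation of "nodeDrainTimeout exceeded"; the real CAPI check
	// is based on the DrainingSucceeded condition's last transition time.
	timedOut := machine.Spec.NodeDrainTimeout != nil &&
		time.Since(machine.DeletionTimestamp.Time) > machine.Spec.NodeDrainTimeout.Duration
	if !drained && !timedOut {
		return nil // requeue and re-check on the next reconcile
	}

	// Remove the etcd member corresponding to this machine.
	if err := removeEtcdMember(ctx, machine); err != nil {
		return err
	}

	// Drop our pre-terminate hook annotation to unblock CAPI.
	patch := client.MergeFrom(machine.DeepCopy())
	delete(machine.Annotations, rke2EtcdCleanupHookAnnotation)
	return c.Patch(ctx, machine, patch)
}
```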
While searching for the pre-terminate hook in the cluster-api code, I found this PR that landed a few days ago: kubernetes-sigs/cluster-api#11137.
It seems to implement something similar to the second proposition above; it could be a nice example to follow!
Closing this PR in favor of #435, as it addresses the issue in a better way (following kubernetes-sigs/cluster-api#11137 as discussed above).
Control plane machines should not be removed from the etcd cluster until the node is drained, otherwise the kubelet would lose connectivity to the API, as it relies on the local api-server, which in turn relies on the local etcd pod.
If this etcd member is removed from the cluster, the kubelet won't be able to report status to the API, and the node won't be properly drained.
The cleanup of the removed node's etcd member will be handled by the periodic call to reconcileEtcdMembers once the node is removed from the API by the capi controller (a rough sketch of that cleanup follows below).
We've observed significant improvements while testing the proposed change in https://gitlab.com/sylva-projects/sylva-core/-/merge_requests/2845.
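For reference, a minimal sketch of what that periodic cleanup could amount to, assuming etcd member names match node names; removeOrphanedEtcdMembers is a made-up name, not the provider's actual reconcileEtcdMembers implementation:

```go
package controllers

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// removeOrphanedEtcdMembers removes any etcd member whose name no longer
// matches an existing Node in the workload cluster. This assumes member
// names equal node names, which is how the local etcd pods register.
func removeOrphanedEtcdMembers(ctx context.Context, workloadClient client.Client, etcdClient *clientv3.Client) error {
	nodes := &corev1.NodeList{}
	if err := workloadClient.List(ctx, nodes); err != nil {
		return err
	}
	nodeNames := map[string]bool{}
	for _, n := range nodes.Items {
		nodeNames[n.Name] = true
	}

	members, err := etcdClient.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range members.Members {
		if m.Name == "" || nodeNames[m.Name] {
			continue // member still backs an existing node (or has not started); keep it
		}
		// The node was deleted by the capi controller after the drain, so the
		// member can now be removed safely.
		if _, err := etcdClient.MemberRemove(ctx, m.ID); err != nil {
			return err
		}
	}
	return nil
}
```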
kind/bug
fixes #431