Avoid deleting etcd member before node is drained #434

Closed
zioc wants to merge 1 commit

Conversation

@zioc commented Sep 9, 2024

Control plane machines should not be removed from the etcd cluster until the node is drained, otherwise the kubelet would lose connectivity to the API, as it relies on the local api-server, which in turn relies on the local etcd pod.

If this etcd member is removed from the cluster, the kubelet won't be able to report status to the API, and the node won't be properly drained.

The cleanup of the removed node's etcd member will be done by the periodic call of reconcileEtcdMembers once the node is removed from the API by the CAPI controller.
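
To give an idea of what that periodic cleanup amounts to, here is a minimal sketch using the etcd v3 client (this is not the provider's actual reconcileEtcdMembers code; the function name and wiring are assumptions): it removes every etcd member whose name no longer matches an existing control plane node.

```go
package etcdcleanup

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// removeOrphanedMembers drops etcd members whose node has already been deleted
// from the API by the CAPI controller, i.e. members not present in nodeNames.
func removeOrphanedMembers(ctx context.Context, cli *clientv3.Client, nodeNames map[string]bool) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Members {
		// Skip unnamed (not yet started) members and members that still have a node.
		if m.Name == "" || nodeNames[m.Name] {
			continue
		}
		if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
			return err
		}
	}
	return nil
}
```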

We've observed significant improvements while testing the proposed change in https://gitlab.com/sylva-projects/sylva-core/-/merge_requests/2845.

kind/bug
fixes #431

Control plane machines should not be removed from the etcd cluster until the node is drained, otherwise the kubelet would lose connectivity to the API, as it relies on the local api-server, which in turn relies on the local etcd pod.
@zioc requested a review from a team as a code owner on September 9, 2024 at 09:14
@Danil-Grigorev (Contributor) commented Sep 9, 2024

Hi @zioc, thanks for the PR. I'm afraid simply removing the etcd management logic will cause a regression of #263. This logic is consistent with the upstream kubeadm implementation, and so it needs to be handled as it is done upstream.

@@ -151,20 +151,6 @@ func (r *RKE2ControlPlaneReconciler) scaleDownControlPlane(
return ctrl.Result{}, errors.New("failed to pick control plane Machine to delete")
}

// If etcd leadership is on machine that is about to be deleted, move it to the newest member available.
@Danil-Grigorev (Contributor) commented on this diff:

Potentially this change could also check that no more than 1 CP machine is currently tagged for deletion, and perform the immediate etcd leadership transfer only when scaling down to 1 CP. Then the regular membership removal can be done in a periodic reconcile. But I can't tell whether the leader, removed with >2 replicas, can be transferred at the moment of scale down to 1.

We should be testing this scenario in e2e, unless there were changes.
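
For reference, such an immediate leadership transfer could look roughly like the sketch below with the etcd v3 client (the endpoint handling and transferee selection are assumptions, not the provider's actual code):

```go
package etcdutil

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// moveLeaderOffMember transfers etcd leadership away from the member that is
// about to be deleted, picking any other member as the transferee.
// Note: MoveLeader has to reach the current leader, so the client is assumed
// to be connected to the leader's endpoint.
func moveLeaderOffMember(ctx context.Context, cli *clientv3.Client, leaderEndpoint string, deletedID uint64) error {
	status, err := cli.Status(ctx, leaderEndpoint)
	if err != nil {
		return err
	}
	// Nothing to do if the member being deleted is not the current leader.
	if status.Leader != deletedID {
		return nil
	}
	members, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range members.Members {
		if m.ID != deletedID {
			_, err := cli.MoveLeader(ctx, m.ID)
			return err
		}
	}
	return nil
}
```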

@zioc (Author) replied Sep 9, 2024:


You're right, it would probably be preferable to keep the immediate etcd leadership transfer part, as it'll prevent the transient leader failures documented here.

There could indeed be some regression on #263, but on the other hand I tend to believe that issue is far less critical than having rolling updates of control planes blocked by node drain.

To that extent, the current e2e tests are not representative of typical deployments: with nodeDrainTimeout set to 30s, the described issue has no chance to appear, as nodes will be deleted quickly by the CAPI controller. But in that case stateful applications get only a small chance to hand over cleanly, and PodDisruptionBudgets won't be honored.

That being said, from what I understand, etcd will become read-only when the control plane is scaled down from 2 to 1, as it will lose its quorum. But in that case the API server should still be available, so couldn't we just find a way to ensure that the rke2 controller still succeeds in removing the member corresponding to the deleted machine?

Otherwise, the only solution that I can foresee to preserve this scale-down capability would be to add a pre-terminate hook to the control plane machines. This way the controller could (see the sketch after this list):

  • Wait for the machine to expose the DrainingSucceededCondition or to have exceeded its nodeDrainTimeout
  • Then remove the corresponding member from the etcd cluster
  • Finally, remove the pre-terminate hook annotation so the CAPI controller can delete the infra machine and the corresponding node.
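
A rough Go sketch of that flow, assuming cluster-api's pre-terminate hook annotation and machine conditions (the hook name and the drain/etcd helpers are hypothetical, not the final implementation):

```go
package controllers

import (
	"context"
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed pre-terminate hook key that would be owned by the RKE2 control plane controller.
const etcdCleanupHook = clusterv1.PreTerminateDeleteHookAnnotationPrefix + "/rke2-etcd-cleanup"

// Reconciler is a stripped-down stand-in for the real control plane reconciler;
// the function fields are hypothetical helpers, used only to keep the sketch self-contained.
type Reconciler struct {
	Client               client.Client
	drainTimeoutExceeded func(*clusterv1.Machine) bool
	removeEtcdMember     func(context.Context, *clusterv1.Machine) error
}

func (r *Reconciler) reconcilePreTerminateHook(ctx context.Context, machine *clusterv1.Machine) (ctrl.Result, error) {
	// 1. Wait until the node has been drained, or the drain timeout has been exceeded.
	if !conditions.IsTrue(machine, clusterv1.DrainingSucceededCondition) && !r.drainTimeoutExceeded(machine) {
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}

	// 2. Only now remove the member corresponding to the deleted machine from the etcd cluster.
	if err := r.removeEtcdMember(ctx, machine); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Release the pre-terminate hook so the CAPI controller can delete the
	// infra machine and the corresponding node.
	patch := client.MergeFrom(machine.DeepCopy())
	delete(machine.Annotations, etcdCleanupHook)

	return ctrl.Result{}, r.Client.Patch(ctx, machine, patch)
}
```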

@zioc (Author) added:

While searching for the pre-terminate hook in the cluster-api code, I found this PR, which landed a few days ago: kubernetes-sigs/cluster-api#11137.
It seems to implement something similar to the second proposal above; it could be a nice example to follow!

@zioc (Author) commented:

Closing this PR in favor of #435, as it addresses the issue in a better way (following kubernetes-sigs/cluster-api#11137, as discussed above).

@zioc closed this Sep 11, 2024

Successfully merging this pull request may close these issues.

Rolling upgrades are blocked by nodes that are not properly drained