Avoid deleting etcd member before node is drained #434

Closed
zioc wants to merge 1 commit

Conversation

@zioc commented Sep 9, 2024

Control plane machines should not be removed from the etcd cluster until the node is drained, otherwise the kubelet would lose connectivity to the API, as it relies on the local api-server, which in turn relies on the local etcd pod.

If this etcd member is removed from the cluster, the kubelet won't be able to report status to the API, and the node won't be properly drained.

The cleanup of the removed node's etcd member will be done by the periodic call of reconcileEtcdMembers once the node is removed from the API by the CAPI controller.
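
To give an idea of what that periodic cleanup amounts to, here is a minimal sketch using the etcd v3 client (this is not the provider's actual reconcileEtcdMembers code; the function name and wiring are assumptions): it removes every etcd member whose name no longer matches an existing control plane node.

```go
package etcdcleanup

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// removeOrphanedMembers drops etcd members whose node has already been deleted
// from the API by the CAPI controller, i.e. members not present in nodeNames.
func removeOrphanedMembers(ctx context.Context, cli *clientv3.Client, nodeNames map[string]bool) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Members {
		// Skip unnamed (not yet started) members and members that still have a node.
		if m.Name == "" || nodeNames[m.Name] {
			continue
		}
		if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
			return err
		}
	}
	return nil
}
```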

We've observed significant improvements while testing the proposed change in https://gitlab.com/sylva-projects/sylva-core/-/merge_requests/2845.

kind/bug
fixes #431

Control plane machines should not be removed from the etcd cluster until the node is drained, otherwise the kubelet would lose connectivity to the API, as it relies on the local api-server, which in turn relies on the local etcd pod.
@zioc requested a review from a team as a code owner on September 9, 2024 at 09:14
@Danil-Grigorev (Contributor) commented Sep 9, 2024

Hi @zioc, thanks for the PR. I'm afraid simply removing the etcd management logic will cause a regression of #263. This logic is consistent with the upstream kubeadm implementation, and so it needs to be handled as it is done upstream.

@@ -151,20 +151,6 @@ func (r *RKE2ControlPlaneReconciler) scaleDownControlPlane(
return ctrl.Result{}, errors.New("failed to pick control plane Machine to delete")
}

// If etcd leadership is on machine that is about to be deleted, move it to the newest member available.
@Danil-Grigorev (Contributor) commented on this diff:

Potentially this change could also check that no more than 1 CP machine is currently tagged for deletion, and perform the immediate etcd leadership transfer only when scaling down to 1 CP. Then the regular membership removal can be done in a periodic reconcile. But I can't tell whether the leader, removed with >2 replicas, can be transferred at the moment of scale down to 1.

We should be testing this scenario in e2e, unless there were changes.
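
For reference, such an immediate leadership transfer could look roughly like the sketch below with the etcd v3 client (the endpoint handling and transferee selection are assumptions, not the provider's actual code):

```go
package etcdutil

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// moveLeaderOffMember transfers etcd leadership away from the member that is
// about to be deleted, picking any other member as the transferee.
// Note: MoveLeader has to reach the current leader, so the client is assumed
// to be connected to the leader's endpoint.
func moveLeaderOffMember(ctx context.Context, cli *clientv3.Client, leaderEndpoint string, deletedID uint64) error {
	status, err := cli.Status(ctx, leaderEndpoint)
	if err != nil {
		return err
	}
	// Nothing to do if the member being deleted is not the current leader.
	if status.Leader != deletedID {
		return nil
	}
	members, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range members.Members {
		if m.ID != deletedID {
			_, err := cli.MoveLeader(ctx, m.ID)
			return err
		}
	}
	return nil
}
```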

@zioc (Author) replied Sep 9, 2024:


You're right, it would probably be preferable to keep the immediate etcd leadership transfer part, as it'll prevent the transient leader failures documented here.

There could indeed be some regression on #263, but on the other hand I tend to believe that issue is far less critical than having rolling updates of control planes blocked by node drain.

To that extent, the current e2e tests are not representative of typical deployments: with nodeDrainTimeout set to 30s, the described issue has no chance to appear, as nodes will be deleted quickly by the CAPI controller. But in that case stateful applications get only a small chance to hand over cleanly, and PodDisruptionBudgets won't be honored.

That being said, from what I understand, etcd will become read-only when the control plane is scaled down from 2 to 1, as it will lose its quorum. But in that case the API server should still be available, so couldn't we just find a way to ensure that the rke2 controller still succeeds in removing the member corresponding to the deleted machine?

Otherwise, the only solution that I can foresee to preserve this scale-down capability would be to add a pre-terminate hook to the control plane machines. This way the controller could (see the sketch after this list):

  • Wait for the machine to expose the DrainingSucceededCondition or to have exceeded its nodeDrainTimeout
  • Then remove the corresponding member from the etcd cluster
  • Finally, remove the pre-terminate hook annotation so the CAPI controller can delete the infra machine and the corresponding node.
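
A rough Go sketch of that flow, assuming cluster-api's pre-terminate hook annotation and machine conditions (the hook name and the drain/etcd helpers are hypothetical, not the final implementation):

```go
package controllers

import (
	"context"
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed pre-terminate hook key that would be owned by the RKE2 control plane controller.
const etcdCleanupHook = clusterv1.PreTerminateDeleteHookAnnotationPrefix + "/rke2-etcd-cleanup"

// Reconciler is a stripped-down stand-in for the real control plane reconciler;
// the function fields are hypothetical helpers, used only to keep the sketch self-contained.
type Reconciler struct {
	Client               client.Client
	drainTimeoutExceeded func(*clusterv1.Machine) bool
	removeEtcdMember     func(context.Context, *clusterv1.Machine) error
}

func (r *Reconciler) reconcilePreTerminateHook(ctx context.Context, machine *clusterv1.Machine) (ctrl.Result, error) {
	// 1. Wait until the node has been drained, or the drain timeout has been exceeded.
	if !conditions.IsTrue(machine, clusterv1.DrainingSucceededCondition) && !r.drainTimeoutExceeded(machine) {
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}

	// 2. Only now remove the member corresponding to the deleted machine from the etcd cluster.
	if err := r.removeEtcdMember(ctx, machine); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Release the pre-terminate hook so the CAPI controller can delete the
	// infra machine and the corresponding node.
	patch := client.MergeFrom(machine.DeepCopy())
	delete(machine.Annotations, etcdCleanupHook)

	return ctrl.Result{}, r.Client.Patch(ctx, machine, patch)
}
```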

@zioc (Author) added:

While searching for the pre-terminate hook in the cluster-api code, I found this PR, which landed a few days ago: kubernetes-sigs/cluster-api#11137.
It seems to implement something similar to the second proposal above; it could be a nice example to follow!

@zioc (Author) commented:

Closing this PR in favor of #435, as it addresses the issue in a better way (following kubernetes-sigs/cluster-api#11137, as discussed above).

@zioc closed this Sep 11, 2024

Successfully merging this pull request may close these issues.

Rolling upgrades are blocked by nodes that are not properly drained