
Consistently propagate down timeouts from MD => MS => Machines #10753

sbueringer opened this issue Jun 13, 2024 · 8 comments
Labels
  • area/machinedeployment: Issues or PRs related to machinedeployments
  • area/machineset: Issues or PRs related to machinesets
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer
Member

sbueringer commented Jun 13, 2024

Goal

The goal of this issue is to consistently propagate timeouts (NodeDrainTimeout, NodeDeletionTimeout, ...) down from MDs to MSs to Machines. This is desirable so that users can still change timeouts even if a Machine is, for example, stuck in draining.

We had a first PR that ensures a MachineSet propagates timeouts down to Machines that are being deleted: #10589

But there are a few other cases, as described in #10589 (inlined below for convenience).

The following specifically focuses on cases where Machines are deleted by the MS controller.

Case 1. MD is deleted

The following happens:

  • MD goes away
  • ownerRef triggers MS deletion
  • MS goes away
  • ownerRef triggers Machine deletion

=> The MS will already be gone by the time the deletionTimestamp is set on the Machines. In this case folks would have to modify the timeouts on each Machine individually: because the MS no longer exists, it is not possible to propagate timeouts down from the MS to the Machines.

Case 2. MD is scaled down to 0

The following happens:

  • MD scales down MS to 0
  • MS deletes Machine

This use case was addressed by: #10589

Case 3. MD rollout

The following happens:

  • Someone updates the MD (e.g. bumps the Kubernetes version)
  • MD creates a new MS and scales it up
  • In parallel MD scales down the old MS to 0

=> In this scenario today the MD controller does not propagate the timeouts from the MD to all MSs (only to the new/current one, not to the old ones). So the Machines of the old MS won't get the new timeouts set on the MD.

Implementation

To address all scenarios, I would propose to always propagate timeouts from MD => MS => Machine. To make that happen, we have to implement the following:

Follow-up:

  • We should also check other objects like Cluster (topology), KCP, ...
@sbueringer sbueringer added the area/machinedeployment Issues or PRs related to machinedeployments label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer sbueringer added area/machineset Issues or PRs related to machinesets and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@sbueringer
Member Author

@enxebre @fabriziopandini @chrischdi @vincepri
It would be good if you could double-check whether I missed something and whether my proposed implementation a) solves the entire problem and b) sounds good to you.

@vincepri
Member

So the Machines of the old MS won't get new timeouts set in the MD

Isn't this good default behavior? Ideally, if a MachineSet is stuck in deleting, I should be able to change only that specific one without impacting the new one?

@sbueringer
Member Author

Good question. Today, timeouts don't trigger a rollout (i.e. a new MS) and also don't trigger a separate revision.

It might be annoying to have to manually go to the MS and update the timeouts there when you otherwise primarily/only interact with the MD.

@sbueringer
Member Author

Maybe one important point to consider: going forward we won't have old MachineSets that stay around forever. All of them will be in "scaling down" until they are gone (once we drop revision management).

So I think our only goal for them is to properly scale them down. There is probably no reason for them to preserve their previous timeout configuration.

@enxebre
Member

enxebre commented Jun 19, 2024

The proposed steps sgtm Stefan.

Isn't this a good default behavior? Ideally, if a MachineSet is stuck in deleting, I should be able to change only that specific one without impacting the new one?

I think it would be suboptimal UX, for the use cases where the MachineDeployment is the consumer-facing API, to force users to interact with the MachineSets that are still managed by it.

@vincepri
Member

(once we dropped revision management).

I missed this part, what are we doing for rollbacks?

@vincepri
Member

Found #10479, which I wasn't aware of. Now the reasoning behind this issue makes a bit more sense to me 😄

@fabriziopandini fabriziopandini added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 3, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 3, 2024
@sbueringer sbueringer self-assigned this Aug 29, 2024