
Deploying your own VPA with leader election enabled in GKE conflicts with the GKE system component #7461

Open
raywainman opened this issue Nov 4, 2024 · 17 comments
Labels
area/vertical-pod-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@raywainman
Contributor

raywainman commented Nov 4, 2024

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: vertical-pod-autoscaler-1.2.0+

Leader election functionality was added in #6985 and is turned off by default

What k8s version are you using (kubectl version)?:

Any version 1.27+

What environment is this in?:

GKE

What did you expect to happen?:

I expected both the self-deployed VPA recommender and the GKE implementation of HPA to continue working.

What happened instead?:

Both the self-deployed VPA recommender and the GKE version use a lease called vpa-recommender in kube-system.

If you deploy your own VPA recommender, it might "steal" the lease and prevent the GKE implementation of HPA from working.

How to reproduce it (as minimally and precisely as possible):

  • Create a cluster
  • Deploy your own vpa-recommender (e.g. using vpa-up.sh). Make sure leader election is enabled (--leader-elect=true).
  • Do something to disrupt the control plane (for example upgrade the version).
  • Check whether your self-deployed vpa-recommender has grabbed the lease (a quick check is sketched below this list); if it has, HPA could stop working.
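
A minimal sketch of that last check, assuming kubectl is pointed at the affected cluster (the lease name and namespace are the defaults shown in the --help output later in this thread):

  # Print the current holder of the shared lease; if it is one of your own pods
  # rather than the GKE system component, you have hit this collision.
  % kubectl get lease vpa-recommender -n kube-system -o jsonpath='{.spec.holderIdentity}{"\n"}'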

Anything else?

This is due to the unfortunate naming collision between GKE's system controller (also called vpa-recommender) and the one provided here.

@raywainman added the kind/bug label Nov 4, 2024
@raywainman
Contributor Author

cc @adrianmoisey

Some thoughts on mitigating this:

  • Document the flags very carefully to encourage folks not to use the default when deploying on GKE.
  • Change the default here, which could possibly break some users.

@adrianmoisey
Member

I see two paths forward to fixing this:

  1. GKE stops using vpa-recommender as a lease name
  2. VPA stops using vpa-recommender as a lease name

I can't speak to what that lease is being used for in GKE, but I assume that changing it on the GKE side is difficult or impossible.

Given that the lease(s) in VPA are only used for VPA components, and running multiple recommenders and updaters for a brief period isn't the worst thing in the world, my vote is that we change the default lease name in the VPA.

Any VPA configured with the lease enabled will only be running multiple pods for a short period of time, which should be fine.

It's obviously not an amazing path forward, but may be worth doing.

I'm curious what @voelzmo and @kwiesmueller think, as they may be the ones approving that controversial PR.

@adrianmoisey
Member

adrianmoisey commented Nov 5, 2024

my vote is that we change the default lease name in the VPA.

If we do go down this path, I suggest we also make PRs into third-party Helm charts to ensure they support the new default name. Some of them hardcode the lease name:

@kwiesmueller
Member

I'm not sure if changing the default in OSS is the right approach because GKE claims it too.
Having documentation (GKE should maybe have a docs page for that too) seems like it should be enough.
Is there anything preventing GKE users from deploying a custom VPA with a different lease name?

cc. @raywainman

@adrianmoisey
Member

I'm not sure if changing the default in OSS is the right approach because GKE claims it too.

My concern is that if someone follows the happy path of deployment and enables the lease without reading the documentation, GKE breaks in ways that are not obvious.

It took us four days and countless messages with GCP support before we found out what GCP was doing with the vpa-recommender lease (namely, that the GKE HPA uses it).
It's also not possible (from what I understand) for a user to self-troubleshoot this problem (i.e. GKE users can't see the logs of the HPA controller).

The pain of renaming the default OSS lease is pretty small.

@adrianmoisey
Member

/area vertical-pod-autoscaler

@raywainman
Contributor Author

Just thinking out loud - what is the worst that can happen when changing a lease name?

On upgrade, we would have two vpa-recommender pods (one from the old version, one from the new version) that both believe they are the leader.

For a short period of time, both of them will try to compute and apply VPA recommendations. Off the top of my head, these should generally be the same recommendations; we would just have the possibility of racing/duplication.

Is there any other risk anyone else can think of?

@adrianmoisey
Member

Just thinking out loud - what is the worst that can happen when changing a lease name?

Yup! That is my thought too.

For a short period of time, both of them will try to compute and apply VPA recommendations. Off the top of my head, these should generally be the same recommendations; we would just have the possibility of racing/duplication.

Correct! My understanding is that the worst that could happen is that two VPA recommenders give roughly the same recommendation for a very short period of time, after which only one will be active.

Also, the VPA recommender's --recommender-interval flag is set to 1 minute by default, which lowers that risk. There may be a few other parts I'm forgetting, such as the OOM detector, but I don't think that's a big deal either.

@raywainman
Contributor Author

@ialidzhikov what do you think?

My vote is to change it now before the next release (we could even consider a cherrypick here) given that:

  • This feature is fairly new, we have an opportunity to fix this before it gains wider adoption.
  • Leader election is off by default.
  • We can try and be loud in the release notes about this change to warn users in case they see strange behavior.
  • The theoretical impact of having two leaders running temporarily seems quite minimal.

@raywainman
Contributor Author

cc @voelzmo as well

@raywainman
Contributor Author

raywainman commented Nov 6, 2024

Could something go wrong with the checkpoint?

Perhaps if the new-version leader writes a checkpoint that the old-version leader doesn't understand?

EDIT: Looks like we have support for this case.
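
For anyone who wants to eyeball this during an upgrade window, a rough sanity check (a sketch, assuming the default CRD names are installed) is to watch the checkpoint objects both leaders would be writing:

  # List the VerticalPodAutoscalerCheckpoint objects and watch for unexpected churn
  # while the old and new recommender briefly both act as leader.
  % kubectl get verticalpodautoscalercheckpoints --all-namespaces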

@adrianmoisey
Member

@ialidzhikov what do you think?

My vote is to change it now before the next release (we could even consider a cherrypick here) given that:

  • This feature is fairly new, we have an opportunity to fix this before it gains wider adoption.
  • Leader election is off by default.
  • We can try and be loud in the release notes about this change to warn users in case they see strange behavior.
  • The theoretical impact of having two leaders running temporarily seems quite minimal.

I agree with all these points.
Also to note: the VPA documentation isn't very good at telling people how to run the VPA; the upside is that no documentation was really updated to explain the lease feature (besides the update to the parameters: 211ecdc). I recently added #7463, but that's only a few days old.

If we're going to change the lease name, I think now is the perfect time to do so.

@ialidzhikov
Contributor

ialidzhikov commented Nov 7, 2024

Hi folks,

% docker run registry.k8s.io/autoscaling/vpa-recommender:1.2.1 --help

      --leader-elect                                         Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability.
      --leader-elect-lease-duration duration                 The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
      --leader-elect-renew-deadline duration                 The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled. (default 10s)
      --leader-elect-resource-lock string                    The type of resource object that is used for locking during leader election. Supported options are 'leases', 'endpointsleases' and 'configmapsleases'. (default "leases")
      --leader-elect-resource-name string                    The name of resource object that is used for locking during leader election. (default "vpa-recommender")
      --leader-elect-resource-namespace string               The namespace of resource object that is used for locking during leader election. (default "kube-system")
      --leader-elect-retry-period duration                   The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)

The leader election resource name and namespace are configurable. If you are installing a custom VPA to a GKE cluster, please use the --leader-elect-resource-namespace and --leader-elect-resource-name flags to avoid using the same lease that GKE's VPA is using.
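
For example (a sketch only: my-vpa-recommender is just an illustrative name, and in a real install these flags would go into the container args of your recommender Deployment rather than a docker run):

  # Keep leader election on, but lock on a lease that GKE's own recommender does not use.
  % docker run registry.k8s.io/autoscaling/vpa-recommender:1.2.1 \
      --leader-elect=true \
      --leader-elect-resource-name=my-vpa-recommender \
      --leader-elect-resource-namespace=kube-system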

@ialidzhikov
Contributor

I am also open to renaming the default lease in the VPA repo to avoid this issue.

@kwiesmueller
Member

Alright, as it's still new and this is an easy way to avoid issues before too many people use it, I guess changing it now is fine.
I am a little worried about breaking people who already use it, so I'm not sure about backporting it (it doesn't seem correct under semantic versioning).

@raywainman
Contributor Author

We'll need to update this: https://github.com/cowboysysop/charts/blob/master/charts/vertical-pod-autoscaler/templates/recommender/clusterrole.yaml#L141-L142, which was added in cowboysysop/charts#705.

If we combine it with a release, this should be less painful.
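
A quick way to spot charts that would break after a rename is to ask the API server whether the recommender's ServiceAccount may update a lease under the new name (a sketch; the kube-system/vpa-recommender ServiceAccount is the upstream default and may differ per chart, and <new-lease-name> is a placeholder for whatever default we pick):

  # Returns "no" if the chart's ClusterRole still pins resourceNames to the old lease.
  % kubectl auth can-i update leases.coordination.k8s.io/<new-lease-name> \
      -n kube-system --as=system:serviceaccount:kube-system:vpa-recommender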

@raywainman
Contributor Author

Opened cowboysysop/charts#781 to update the Helm chart.
