[Bug] Scale from Zero does not work on managed nodegroups even with propagateASGTags enabled #7543
Comments
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@rwilson-release However, ClusterAutoscaler (CAS) does not have any hard requirement to propagate these tags to Nodes. CAS only requires the tags to be present on the ASG so that it can scale from 0, and after launching the node, CAS will add those taints and labels. Having On the other hand for unmanaged nodes the Update:
Test Deployment:
Scale up log:
@rwilson-release
That code is for tagging the backing ASGs for managed nodegroups and For managed nodegroups, EKS does not propagate any tags to the ASG resource; they only apply to the EKS Nodegroup resource and to the EC2 instances launched as part of the nodegroup.
You are viewing tags for the ASG resource itself; those tags do not need to be propagated anywhere for scale-from-zero to work. As @punkwalker noted, this is not an eksctl bug and you might be facing other issues. Can you try upgrading CAS? Additionally, if the IAM role for Cluster Autoscaler has the
I created a brand new cluster on 1.29 with CAS version 1.29.0, which I am pretty sure is relatively recent. A new version was released a few days ago, so maybe there is a long shot there.
I was initially excited about this possibility, since we have a few clusters of different ages and different eksctl versions and this seemed like a very simple fix! Unfortunately, we had already updated the policy, and we use the Helm charts with the correct policy. Verifying all affected clusters confirmed the policy was correct, including the one you mentioned. This policy has been correct in our configs since December 2022.
I almost agreed this was not related to eksctl, but all roads lead back to the labels or tags -- and those are created by eksctl, so please bear with me. If you investigate the errors that I posted, you will find several GitHub issues -- in particular this thread, which describes almost identical problems with scaling from 0 for GitHub self-hosted runners (our use case). See kubernetes/autoscaler#3780 (comment), but none of the fixes in that thread help. Another clue is here in the README: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup (scroll down to the sections labeled
The actual tag I see is
Notice it might need to be
This is a bit of a reach -- but is it possible the unmanaged node groups are labeled correctly vs the managed?
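For reference, the tag naming that the Cluster Autoscaler AWS README documents for scale-from-zero can be sketched as below. This is only an illustration and not eksctl's or CAS's actual code: the helper name `node_template_tags` and the sample label and taint in the usage note are hypothetical, and the `value:effect` form for taint tags is taken from the README's documented format.

```python
# Tag key prefixes documented in the Cluster Autoscaler AWS README.
LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"
TAINT_PREFIX = "k8s.io/cluster-autoscaler/node-template/taint/"


def node_template_tags(labels, taints):
    """Build the ASG tags CAS reads when a node group has zero nodes.

    labels: dict mapping node label name -> value
    taints: list of dicts with "key", "value" and "effect"
    (hypothetical helper, for illustration only)
    """
    tags = {}
    for name, value in labels.items():
        tags[LABEL_PREFIX + name] = value
    for taint in taints:
        # Taint tags carry the effect after a colon, e.g. "true:NoSchedule".
        tags[TAINT_PREFIX + taint["key"]] = f'{taint["value"]}:{taint["effect"]}'
    return tags
```

Calling `node_template_tags({"role": "runner"}, [{"key": "dedicated", "value": "true", "effect": "NoSchedule"}])` would give one label tag and one taint tag whose value includes the effect, which is the part under discussion here.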
@rwilson-release
Even for unmanaged nodegroups, the taint effect was never added to the ASG tag 🙂. Ref
I think it has to be changed to something like this:
@cPu1 What do you think?
What were you trying to accomplish?
Creating a managed nodegroup that supports autoscaling from zero with taints and labels does not appear to work properly, even with the recommended taints, labels, and propagateASGTags: true. Such a managed nodegroup fails to scale up from zero, and the cluster autoscaler reports an error.
What happened?
This used to work with the same configuration, except that we were previously using unmanaged nodegroups. We switched to managed nodegroups across our fleet, and this error was only discovered long after we had made the transition.
Further investigation showed that the correct autoscaling taint and label tags are missing from the instances and are not being propagated correctly.
How to reproduce it?
We have created a cluster and two nodegroups (one for regular workloads, and one with the taints and labels needed to support a GitHub Actions runner workload):
Here is the output from the autoscaler group:
Logs
Anything else we need to know?
Notice in particular
Should be
"PropagateAtLaunch": true
I'll update as I get this information.
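One way to check an ASG for the tags CAS needs is to compare its tag list against the expected node-template tags. The sketch below is a hypothetical helper, not part of eksctl or CAS; it assumes the tag list has the `Key`/`Value` shape shown in AWS Auto Scaling describe output, and it deliberately ignores `PropagateAtLaunch`, since per the maintainers' comments in this thread the tags only need to exist on the ASG itself.

```python
# Common prefix of the tags CAS uses as a node template for empty groups.
PREFIX = "k8s.io/cluster-autoscaler/node-template/"


def missing_node_template_tags(asg_tags, expected):
    """Return expected tags that the ASG lacks or carries with another value.

    asg_tags: list of {"Key": ..., "Value": ...} dicts (extra fields ignored)
    expected: dict mapping full tag key -> expected value
    (hypothetical helper, for illustration only)
    """
    present = {
        tag["Key"]: tag["Value"]
        for tag in asg_tags
        if tag["Key"].startswith(PREFIX)
    }
    return {key: value for key, value in expected.items() if present.get(key) != value}
```

An empty result would mean every expected label and taint tag is on the ASG with the expected value; anything returned is a candidate for the scale-from-zero failure.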
Versions