HPA metric got stuck at a random value and not scaling down after reaching max replica count #597
It's intentional that the HPA doesn't scale down if one metric is invalid. This is a safety feature to avoid scaling down when the metric source is unavailable and the HPA can't know whether it's safe to do so. Imagine the issue is a network problem between kube-metrics-adapter and Prometheus, or Prometheus is temporarily unavailable. In that case you don't want to scale down just because the metric was not available. This feature has been in the HPA since the beginning but was temporarily broken in v1.16. I made a PR to fix it here, which fixed it from v1.21 onwards: kubernetes/kubernetes#99514
Ideally you should construct your Prometheus query such that it returns 0 when there are no 502 or 503 responses, if you are sure that means there were none. You also want to avoid a situation where Prometheus simply didn't collect the metrics from whatever component exposes them, as that could be the same issue I described above.
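A minimal sketch of that idea, assuming the 502/503 count comes from Istio's istio_requests_total metric (the metric and label names here are assumptions; adjust them to whatever your setup actually exposes):

```
# Returns an empty result when no 502/503 requests were recorded in the
# window, which surfaces as <unknown> in the HPA:
sum(rate(istio_requests_total{response_code=~"502|503"}[5m]))

# Appending "or vector(0)" falls back to a literal 0 sample when the
# left-hand side is empty, so the HPA always receives a value:
sum(rate(istio_requests_total{response_code=~"502|503"}[5m])) or vector(0)
```

Note that the fallback also hides genuine scrape gaps, which is why it should only be used when an empty result really does mean "no failed requests".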
Hi @mikkeloscar, thanks for the explanation. Modifying the Prometheus query to return zero when the metric is unavailable does fix the issue. One thing I still want to understand is why the metric value gets stuck at a stale value slightly above or below the target in the kubectl describe output of the HPA. Also, this issue only seems to happen after the replica count has reached the maximum and stayed there for some time. At the same time I can see the correct metric values in Prometheus, while in the HPA the value is stuck, and it is only resolved when the HPA is manually edited. Can you help clarify this?
I'm not sure I understand your question. If one metric is unavailable, what you describe sounds like what I mentioned above. If you mean it's stuck even with both metrics working, then it's some other issue and we likely need to look at the logs of kube-metrics-adapter to understand it.
Hi @mikkeloscar, I just wanted some clarification. During this issue, the value of metric A got stuck at a random value as shown below. But if I run the same query directly on Prometheus, I get the correct metric value. This behavior seems strange, hence I want to understand what is happening here. Ignore metric B; its value shouldn't affect the other metrics in the HPA anyway, right?
I want to understand why the value of metric A gets stuck at a random value below the target (say 457) or above it (843); this was observed consistently when reproducing the issue using the steps in the issue description. Could you help with this?
Can you share the logs of kube-metrics-adapter from when this happens? I think that would be helpful for understanding it.
Hi @mikkeloscar, do you want kube-metrics-adapter logs from the issue timeframe alone, or also logs from before and after it? I will check with my team and share the requested logs for troubleshooting.
More context is better, but the logs from when the problem started to happen and some 10-15 minutes forward are probably enough to get an idea.
Hi @mikkeloscar, I reproduced the issue as mentioned in the description; this time the sla-metric value got stuck at 375m/500. But when I queried Prometheus, it showed zero (0). In the output of "kubectl describe hpa my-hpa", the value is still stuck at 375m. Attaching the kube-metrics-adapter logs for your reference. Prometheus query output: (metric value)
Let me know if any other details are needed. Sorry for the delay in responding; I had to check with my team before sharing the logs.
@Naveen-oops Thanks for sharing the logs. It looks like kube-metrics-adapter is getting the new metrics all the time, so that doesn't look wrong. I would be curious if you could also share the output of kubectl describe hpa and kubectl get hpa -o yaml.
Hi @mikkeloscar, sure. Output of kubectl describe hpa:
Output of kubectl get hpa -o yaml:
@Naveen-oops Thanks for sharing all the information. It looks like an HPA issue. I think I need to replicate it to better understand where this happens, but I will need to find time to do that, so I can't promise when I will get to it. What version of Kubernetes are you running? Maybe there are upstream issues about this? 🤔
Thanks for the support @mikkeloscar. We are currently using Kubernetes version v1.24.7. I am also not able to find any relevant issues in the Kubernetes project. Anyway, I have already created this same issue in the Kubernetes project as well, kubernetes/kubernetes#119788, which can be used to track it there. Just let me know if any fix is provided for this issue; I am curious about the fix and eager to know what is happening behind the scenes.
What happened?
I use two custom metrics, A and B, in my HPA setup. A is a gauge-based metric called the SLA Metric, while B is a count-based metric that tracks failed requests with HTTP status code 502 or 503 from Istio. These metrics are scraped by Prometheus.
To use custom metrics in HPA, we're using Kube Metrics Adapter. When the application load increases, the value of the SLA Metric also increases, and the pods scale up until they reach the maximum replica count, as expected.
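An illustrative sketch of how such an HPA with two Prometheus-backed metrics can be wired up via kube-metrics-adapter's external metric collector. The metric names, queries, and targets below are placeholders, not our actual manifest, and the annotation keys follow the kube-metrics-adapter README for recent versions, so they may need adjusting:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
  annotations:
    # kube-metrics-adapter Prometheus collector: one query per external metric.
    # Annotation format: metric-config.<metricType>.<metricName>.<collectorType>/<key>
    metric-config.external.sla-metric.prometheus/query: |
      avg(my_app_sla_metric{app="my-app"})
    metric-config.external.failed-requests.prometheus/query: |
      sum(rate(istio_requests_total{response_code=~"502|503"}[5m]))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: sla-metric
        selector:
          matchLabels:
            type: prometheus
      target:
        type: AverageValue
        averageValue: "500"
  - type: External
    external:
      metric:
        name: failed-requests
        selector:
          matchLabels:
            type: prometheus
      target:
        type: AverageValue
        averageValue: "10"
```

If the second query returns an empty result, that metric shows up as unknown in the HPA, which is the situation described below.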
However, the problem arises when the load dissipates: the pods never scale down. Despite the SLA Metric's value being below the target in Prometheus, the HPA description still displays a stale metric value that can be above or below the target.
One possible reason for this is that metric B, which relies on Istio requests, shows up as unknown since there have been no failed requests with 502 or 503 status codes, so the Prometheus query returns no data.
We noticed this behavior after upgrading Kubernetes from version 1.21 to 1.24, changing the HPA API version from autoscaling/v2beta2 to autoscaling/v2, and upgrading kube-metrics-adapter from v0.1.16 to v0.1.19.
kubectl describe hpa my-hpa
To troubleshoot this further, we checked the metric value using:
Output:
Although the metric value appears as zero in that output, the HPA description still displays a stagnant value.
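One way to compare the two views is to query the metrics API served by kube-metrics-adapter directly and compare it with what the HPA controller last recorded in its status. A rough sketch, assuming an external metric named sla-metric in the default namespace and an HPA named my-hpa (all placeholders):

```sh
# Value currently served by kube-metrics-adapter through the external metrics API
# (use the /apis/custom.metrics.k8s.io/... paths instead for pod or object metrics):
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/sla-metric"

# Value the HPA controller last recorded in the HPA status:
kubectl get hpa my-hpa -o jsonpath='{.status.currentMetrics}'
```

If the first command returns a fresh value while the second stays frozen, the staleness is on the HPA controller side rather than in the adapter.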
Workaround:
HPA behaves as expected when the second metric B is completely removed or modified to return 0 when the query fails.
What did you expect to happen?
HPA should scale down properly based on one of the metrics, even when the other metric value is not available.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
Has anyone faced similar issues with HPA, or has the behavior of HPA with multiple metrics changed recently, especially for scale-down events? Can anyone from the community look into the issue and provide some clarity?
Kubernetes version
Cloud provider
OS version