-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kube-metrics-adapter (with pod collector enabled) returns stale data #97
Comments
Thanks for the detailed bug report. I would suggest we add the changes you propose first and then think about a more robust solution. I'm surprised the HPA controller is asking for metrics of old pods, even if the metrics server is storing them it seems like the wrong behavior to get those. I would have to check how the official "metrics-server" behaves for CPU and Memory metrics because I also believe it would have similar TTLs. |
Thanks @mikkeloscar for your reply! I will create a PR for this. I looked into why HPA was getting old metrics. In my case, at least, it was querying using labels. So, kube-metrics-adapter was correctly returning objects that matched the labels. Whether they should be there or not is the issue here. I have not looked into metrics-server, but will check it out after I prepare a PR. |
Hello, Just providing an update on the issue and the fix. I am currently working on an approval from my employer to contribute to this project. As soon as that's done, I will make a PR with the fix. If anyone else wants to do it, feel free to create a PR. Thanks, |
Thanks for the update, did you have a chance to look at how the metrics-server handles this issue? |
Hello!! Thanks, |
@lsievert can you provide more data, please? |
Following up on the original issue. The metrics-server has a list of active pods and only serve metrics for those pods: https://github.com/kubernetes-sigs/metrics-server/blob/2a1d1385123b9b7eed36f8b3efe8b78175db5b28/pkg/api/pod.go#L77-L124 We could do something similar to avoid this issue. |
Hello, when do you plan to release these changes? |
@ermakov-oleg Hi, we're just testing another feature for ScheduledScaling and will release it all together. We aim to make the release next week. |
@ermakov-oleg as Mikkel said we plan to release it soon, although, if you want you can test the pre-release version with the same image version we are running in our clusters:
|
Thank you! I already looked at the diff with the latest version and used |
I am experiencing the same issue at the newest |
@wildermesser I think the only thing we have here is this: #309 |
@mikkeloscar thank you! I saw these arguments before. But I could not realize its purpose. After reading this issue I can set proper values for these arguments. And it resolves this issue for me. I think a small description of all arguments in README will be useful for people like me :) |
Some general guidelines for configuring these options:
|
Hello,
I have been testing kube-metrics-adapter (with pod collector enabled) and noticed that it returns stale data during specific conditions, which results in HPA misbehaving. The conditions occur when HPA scales up a deployment due to high load, then load subsides and it scales the deployment back to min-replica. However, if I increase the load again that should result in a scale-up, the HPA would not scale up because the data it is acting upon is stale. When it queries kube-metrics-adapter for custom metrics, the adapter is returning old data that contains non-existent pods and old values. This results in incorrect averages for the HPA.
I tracked it down to the in-memory cache that kube-metrics-adapter uses to track pods and metrics. The metrics have TTL of 15 mins and the GC is run every 10 mins. The issue mentioned above disappears after old metrics are cleaned up by GC. To fix the issue permanently, I modified TTL to 30s and GC to run every minute. HPA is able to act on new conditions much better. This should cover other edge cases, such as if HPA is in a middle of scale-down and there is a new load, it should able to act on it faster.
I think this is a simple solution for now that solves the issue and there is a better way of dealing with it. The cache can be updated on each run, so HPA can have an up-to-date view of the cluster without a delay of up to 1 minute.
Expected Behavior
kube-metrics-adapter doesn't return stale data to HPA that contains information about non-existent pods and their values.
Actual Behavior
kube-metrics-adapter returns stale data to HPA due to 15min TTL and 10 mins garbage collection. This causes HPA to calculate wrong averages based on custom metrics.
Steps to Reproduce the Problem
Specifications
I have a fix for this, but would like to discuss if it is appropriate. I propose to make TTL and GC run configurable and set the default values to 30s and 1m respectively. A better long term solution can be keeping the the cache up-to-date on each run.
The text was updated successfully, but these errors were encountered: