Add metrics for realtime investigation of disconnected wl scenarios #7147
Comments
Would it be better to raise a metric for this instead? Using logs to capture metrics seems like a strange approach.
... logs work in small clusters that don't have full-blown monitoring. In my case, that is likely the norm as opposed to the exception.
Maybe we can capture this in a Prometheus metric and just print the metric out :)
I understand that small clusters might not have full-blown monitoring, but at the same time I don't think we should build into our controllers/API a system to count failures across different reconciles, because this comes with some immediate drawbacks.
Accordingly, I'm +1 to address this using metrics.
+1 Sounds reasonable. Apparently there is an integration between gRPC and Prometheus, and etcd has an example of how to combine them for the etcd client: https://github.com/etcd-io/etcd/blob/main/tests/integration/clientv3/examples/example_metrics_test.go
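For reference, a minimal sketch of the wiring that example uses, assuming the default go-grpc-prometheus client metrics and a plain promhttp endpoint (none of this is CAPI's actual etcd client code):

```go
package main

import (
	"net/http"
	"time"

	grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
)

func main() {
	// Opt in to the client handling-time histogram in addition to the
	// default request/stream counters.
	grpcprom.EnableClientHandlingTimeHistogram()

	// Instrument every etcd client call with gRPC client metrics, broken
	// down by service, method, and status code.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		DialOptions: []grpc.DialOption{
			grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
			grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
		},
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// go-grpc-prometheus registers its default client metrics with the
	// default Prometheus registry, so a plain promhttp handler exposes them.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```

In a CAPI controller, one would presumably register grpcprom.DefaultClientMetrics with controller-runtime's metrics registry instead, so the data shows up on the manager's existing /metrics endpoint.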
It doesn't require persistent state really, it's just a heuristic. And really it's just a small hashtable at that :), and I don't think it needs to be perfect. But "metrics" is fine too - OK if we just "print" the metrics histogram out periodically? Then everyone wins :)
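For what it's worth, a rough sketch of what "printing the metrics out" could look like: periodically gather everything registered with controller-runtime's metrics registry and write it to the log in the text exposition format (illustrative only, not existing CAPI behaviour):

```go
package metricsdump

import (
	"bytes"
	"time"

	"github.com/prometheus/common/expfmt"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// DumpPeriodically logs a snapshot of all registered metric families every
// interval. A real implementation would likely gate this behind a flag and
// filter to the etcd-related families only.
func DumpPeriodically(interval time.Duration) {
	log := ctrl.Log.WithName("metrics-dump")
	for range time.Tick(interval) {
		mfs, err := metrics.Registry.Gather()
		if err != nil {
			log.Error(err, "failed to gather metrics")
			continue
		}
		var buf bytes.Buffer
		enc := expfmt.NewEncoder(&buf, expfmt.FmtText)
		for _, mf := range mfs {
			if err := enc.Encode(mf); err != nil {
				log.Error(err, "failed to encode metric family")
			}
		}
		log.Info("metrics snapshot", "metrics", buf.String())
	}
}
```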
To be honest, printing metrics really sounds like an anti-pattern to me.
Touché, but it has a low cost... like... maybe ten lines of code... no functional changes to the API... Agreed, nothing beats a lavish Grafana dashboard, but in chaotic situations a "dumb" alternative is needed... But fair enough: what's the workaround for
Open to other creative ideas here that don't require running extra tools to get a quick topology of etcd polling failures for large / edge-like scenarios. Anything come to mind?
@fabriziopandini Prometheus histograms... don't they already give us this for free in memory? I don't see the persistent state corollary. But I agree this shouldn't be an overblown CRD / API-level thing. I didn't mean to suggest that ... it's more like a logging point-in-time heuristic.
They don't need Prometheus specifically, but running Kubernetes clusters at that scale in production without monitoring doesn't sound like a sane approach to me.
Yup, they should. I think Fabrizio meant if we build our own instead of just using metrics.
If solved with normal metrics, that's usually fine. For normal counters, rates are usually used (e.g. errors per minute vs an absolute error count), and Prometheus usually handles it well if a counter starts from 0 after a restart. Not exactly sure how it works with histograms, but I don't expect problems there if we implement normal metrics, as it's a very common case.
I think the first step should be to implement the regular metrics. Then we can think of how users could consume them if they don't want to use Prometheus / a regular metrics tool.
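A minimal sketch of what such a regular metric could look like in the KCP controller, with a hypothetical counter name and labels (not an actual CAPI metric):

```go
package controllers

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// etcdClientConnectionErrors counts failed etcd client connection attempts,
// labelled by cluster and target node. Name and labels are hypothetical.
var etcdClientConnectionErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "capi_kcp_etcd_client_connection_errors_total",
		Help: "Number of failed etcd client connection attempts, per target node.",
	},
	[]string{"cluster", "node"},
)

func init() {
	// Registering with controller-runtime's registry exposes the counter on
	// the manager's /metrics endpoint alongside the built-in metrics.
	metrics.Registry.MustRegister(etcdClientConnectionErrors)
}

// In the health-check / reconcile path, on a failed connection:
//
//	etcdClientConnectionErrors.WithLabelValues(clusterName, nodeName).Inc()
```

A query such as rate(capi_kcp_etcd_client_connection_errors_total[5m]) then yields errors per second per node, and rate() tolerates the counter resetting to zero when the controller restarts.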
The answer to any logging problem can be better metrics and alerting, but logs are just a batteries-included solution that works anywhere. Why don't we split the difference:
yup
Renamed since there is agreement on the first step, which is implementing metrics to help investigate connection problems.
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/priority important-longterm
/assign @sbueringer
/close in favor of #11272
@sbueringer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
User Story
As a developer running CAPI in far-separated networks, I'd like the kubeadm control plane manager to give me a delta between healthy nodes that's easy to read, i.e.
if I do a simple disconnect experiment on a WL cluster (with a CP node with etcd on it),
I can see that the disconnect is logged easily... but I can't easily ascertain how many nodes are / aren't making that etcd connection (and yes, in general, I understand there are higher-level declarative constructs and that using logs for everything is an antipattern, but... in the real world, being able to see etcd client statistics in realtime is much more useful for quickly hypothesizing a failure mode).
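Purely as an illustration of the kind of lightweight, in-memory tally that could back such an easy-to-read delta (hypothetical names, not existing CAPI code):

```go
package heuristics

import (
	"sync"

	ctrl "sigs.k8s.io/controller-runtime"
)

// EtcdDialFailures is a hypothetical in-memory tally of etcd dial failures
// per control plane node since the controller started: no persistence, no
// API changes, just enough state to log a readable delta.
type EtcdDialFailures struct {
	mu       sync.Mutex
	failures map[string]int
}

// Record increments the failure count for a node.
func (t *EtcdDialFailures) Record(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.failures == nil {
		t.failures = map[string]int{}
	}
	t.failures[node]++
}

// LogDelta emits a single line contrasting failing nodes with healthy ones.
func (t *EtcdDialFailures) LogDelta(allNodes []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	healthy := []string{}
	for _, n := range allNodes {
		if t.failures[n] == 0 {
			healthy = append(healthy, n)
		}
	}
	ctrl.Log.WithName("etcd-connectivity").Info("etcd connectivity delta",
		"failing", t.failures, "healthy", healthy)
}
```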
So my suggestion would be, I think, some kind of
in the logs ...
Desired output
Current output
Detailed Description
[A clear and concise description of what you want to happen.]
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
/kind feature