
Add metrics for realtime investigation of disconnected wl scenarios #7147

Closed
jayunit100 opened this issue Sep 1, 2022 · 16 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments


jayunit100 commented Sep 1, 2022

User Story

As a developer running CAPI across far-separated networks, I'd like the kubeadm control plane manager to give me an easy-to-read delta of which nodes are healthy vs. unhealthy, i.e.

if I do a simple disconnect experiment on a WL cluster (with a CP node that has etcd on it),

sudo iptables -i eth0 -A INPUT -s 10.180.83.152 -j DROP 

I can see that the disconnect is logged... but I can't easily ascertain how many nodes are / aren't making that etcd connection (and yes, in general, I understand there are higher-level declarative constructs and that using logs for everything is an antipattern, but... in the real world, being able to see etcd client statistics in realtime is much more useful for quickly hypothesizing a failure mode).

So my suggestion would be, I think, something like

failures := 0
for each etcd in my WL children:
   passed := check the etcd
   ip := etcd.ip 
   stats[ip] = passed 
   if !passed:
      failures += 1

log.Infof("etcd polling results: %v, failures %v", stats, failures)

in the logs ...
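
To make the shape concrete, a minimal self-contained Go sketch of that idea; etcdMember and pollEtcdMember are hypothetical stand-ins for whatever the KCP controller already uses to reach each workload-cluster etcd member, not actual Cluster API code:

package main

import (
    "context"
    "log"
    "time"
)

// etcdMember and pollEtcdMember are hypothetical stand-ins for whatever the
// KCP controller already uses to reach each workload-cluster etcd member.
type etcdMember struct {
    PeerIP string
}

func pollEtcdMember(ctx context.Context, m etcdMember) bool {
    // e.g. an etcd status/health call bounded by a short timeout
    _, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    return true // placeholder result
}

// pollSummary checks each member once and logs a per-member pass/fail map
// plus a failure count, as described above.
func pollSummary(ctx context.Context, members []etcdMember) {
    stats := map[string]bool{}
    failures := 0
    for _, m := range members {
        passed := pollEtcdMember(ctx, m)
        stats[m.PeerIP] = passed
        if !passed {
            failures++
        }
    }
    log.Printf("etcd polling results: %v, failures: %d", stats, failures)
}

func main() {
    pollSummary(context.Background(), []etcdMember{{PeerIP: "1.2.3.4"}, {PeerIP: "1.2.3.5"}})
}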

Desired output

I0901 14:00:29.382055       1 round_trippers.go:553] GET https://10.180.81.244:6443/api?timeout=10s  in 10000 milliseconds
... 
E0901 14:00:29.384611       1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="[cannot get remote client... 
... 
I0901 14:00:29.382055       1 controller.go:553] etcd polling results 
[1.2.3.4: true, 
 1.2.3.5: false, 
 1.2.3.6: true
], 
 failures: 1 

Current output

I0901 14:00:29.382055       1 round_trippers.go:553] GET https://10.180.81.244:6443/api?timeout=10s  in 10000 milliseconds
E0901 14:00:29.382397       1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" "cluster"="tkg-vc-antrea" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0901 14:00:29.384611       1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="[cannot get remote client to workload cluster: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)]" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0901 14:00:29.385855       1 round_trippers.go:553] GET https://10.180.80.169:6443/?timeout=5s 200 OK in 3 milliseconds
I0901 14:00:29.388442       1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="tkg-vc-antrea" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"


/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 1, 2022
@killianmuldoon (Contributor) commented

Would it be better to raise a metric for this instead? Using logs to capture metrics seems like a strange approach.


jayunit100 commented Sep 1, 2022

... logs work in small clusters that don't have full-blown monitoring. In my case, that is likely the norm as opposed to the exception.

  • but I guess a conditions rollup of metadata could work?... especially in cases of 1000s of clusters... I don't want to rely on dashboards for everything.
  • just being able to see something in realtime about how capi->etcd connectivity is doing at scale, quickly, without needing dashboard access etc., is what I'm really looking for.

maybe we can capture this in a prometheus metric and just print the metric out :)

@jayunit100 jayunit100 changed the title Add a polling summary to the kubeadm-controlplane logs for realtime investigation of disconnected wl scenarios Add a polling summary of some sort to the kubeadm-controlplane-manager, for realtime investigation of disconnected wl scenarios Sep 1, 2022

fabriziopandini commented Sep 6, 2022

I understand that small clusters may not have full-blown monitoring, but at the same time I don't think we should build into our controllers/API a system to count failures across different reconciles, because this comes with some immediate drawbacks:

  • it requires some persistent state to ensure counters survive controller restarts
  • if we use existing API objects as that state, this will impact system performance (a reconcile would be triggered in many controllers as soon as we increase the counters)
  • if we use some other sort of persistent storage, it will impact how we deploy things / add more constraints

Accordingly, I'm +1 on addressing this using metrics


sbueringer commented Sep 6, 2022

+1 Sounds reasonable.

Apparently there is an integration between gRPC and Prometheus, and etcd has an example of how to combine them for the etcd client: https://github.com/etcd-io/etcd/blob/main/tests/integration/clientv3/examples/example_metrics_test.go
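
Loosely following that etcd example, a hedged sketch of the wiring using the go-grpc-prometheus client interceptors (the endpoint and listen address below are placeholders, and this is not Cluster API code):

package main

import (
    "log"
    "net/http"
    "time"

    grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    clientv3 "go.etcd.io/etcd/client/v3"
    "google.golang.org/grpc"
)

func main() {
    // Attach the go-grpc-prometheus client interceptors so every etcd RPC is
    // counted (started/handled totals, latency, per gRPC status code).
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
        DialTimeout: 5 * time.Second,
        DialOptions: []grpc.DialOption{
            grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
            grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    // go-grpc-prometheus registers its default client metrics with the default
    // registry, so the standard handler exposes them on /metrics.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}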


jayunit100 commented Oct 25, 2022

it doesn't require persistent state really, it's just a heuristic... and really it's just a small hashtable at that :), and I don't think it needs to be perfect.

But "metrics" is fine too - ok if we just "print" the metrics histogram out periodically? Then everyone wins :)

@sbueringer (Member) commented

But "metrics" is fine to - ok if we just "print" the metrics histogram out periodically ? Then everyone wins :)

To be honest, printing metrics really sounds like an anti-pattern to me.


jayunit100 commented Nov 5, 2022

antipattern

Touché, but it has a low cost... like... maybe ten lines of code... no functional changes to the API...

Agreed, nothing beats a lavish Grafana dashboard, but in chaotic situations a "dumb" alternative is needed...

But fair enough: what's the workaround for admin people who are

  • experiencing a netsplit from WL etcd ...
  • running hundreds of clusters
  • not enabling Prometheus at the mgmt layer?

Open to other creative ideas here that don't require running extra tools to get a quick topology of etcd polling failures for large / edge-like scenarios. Anything come to mind?


jayunit100 commented Nov 6, 2022

@fabriziopandini prometheus histograms... don't they already give us this for free in memory? I don't see the persistent-state corollary. But I agree this shouldn't be an overblown CRD / API-level thing. I didn't mean to suggest that... it's more like a point-in-time logging heuristic.


sbueringer commented Nov 29, 2022

running hundreds of clusters
not enabling prometheus at the mgmt layer?

They don't need Prometheus specifically, but running Kubernetes clusters at that scale in production without monitoring doesn't sound like a sane approach to me.

prometheus histograms... don't they already give us this for free in memory?

Yup, they should. I think Fabrizio meant the case where we build our own counting instead of just using metrics.

it requires some persistent state to ensure counters survive controller restarts

If solved with normal metrics, that's usually fine. For counters, rates are usually used (e.g. errors per minute vs. an absolute error count), and Prometheus usually handles a counter restarting from 0 after a controller restart well. Not exactly sure how it works with histograms, but I don't expect problems there if we implement normal metrics, as it's a very common case.

that don't require running extra tools to get a quick topology of etcd polling failures for large / edge-like scenarios. Anything come to mind?

I think the first step should be to implement the regular metrics.

Then we can think about how users could consume them if they don't want to use Prometheus / a regular metrics tool.
I could imagine folks adding a sidecar container which curls the metrics endpoint regularly and prints the results. I'm not a big fan of it, but as that would be outside the scope of Cluster API itself... I'm fine with that.
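
Purely for illustration, a tiny Go stand-in for that curl sidecar (the metrics address is an assumption; nothing like this ships with Cluster API):

package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "time"
)

func main() {
    // Hypothetical sidecar: scrape the manager's metrics endpoint every minute
    // and dump the raw exposition text to stdout so it shows up in kubectl logs.
    endpoint := "http://127.0.0.1:8080/metrics" // assumed metrics address
    for {
        resp, err := http.Get(endpoint)
        if err != nil {
            log.Printf("scrape failed: %v", err)
        } else {
            io.Copy(os.Stdout, resp.Body)
            resp.Body.Close()
        }
        time.Sleep(60 * time.Second)
    }
}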


jayunit100 commented Nov 29, 2022

The answer to any logging problem can be better metrics and alerting, but logs are just a batteries-included solution that works anywhere.

Why don't we split the difference:

  • collect prometheus metrics
  • print the in-memory histogram (basically a one-liner; see the sketch after this comment)

first step should be to implement regular metrics

yup
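
For what it's worth, a minimal sketch of what "print the in-memory metrics" could mean, assuming the metrics live in a standard Prometheus Gatherer (here the default registry):

package main

import (
    "log"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/common/expfmt"
)

// dumpMetrics writes every metric family held in the given in-memory registry
// to stdout in the Prometheus text exposition format.
func dumpMetrics(g prometheus.Gatherer) error {
    mfs, err := g.Gather()
    if err != nil {
        return err
    }
    enc := expfmt.NewEncoder(os.Stdout, expfmt.FmtText)
    for _, mf := range mfs {
        if err := enc.Encode(mf); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    if err := dumpMetrics(prometheus.DefaultGatherer); err != nil {
        log.Fatal(err)
    }
}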

@fabriziopandini fabriziopandini changed the title Add a polling summary of some sort to the kubeadm-controlplane-manager, for realtime investigation of disconnected wl scenarios Add metrics for realtime investigation of disconnected wl scenarios Dec 28, 2022
@fabriziopandini (Member) commented

Renamed, since there is agreement on the first step, which is implementing metrics that help investigate connection problems.
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 28, 2022
@k8s-triage-robot commented

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 20, 2024
@fabriziopandini (Member) commented

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@fabriziopandini (Member) commented

/assign @sbueringer
To figure out how we can implement custom metrics for the cluster cache tracker (probably health checking + connection creation).
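
Purely as a sketch of that direction (metric name, labels, and placement are made up here, not the eventual implementation), a custom counter registered with controller-runtime's metrics registry could look like:

package clustercache // hypothetical placement

import (
    "github.com/prometheus/client_golang/prometheus"
    ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// healthCheckFailures is a made-up metric counting failed workload cluster
// health checks, labeled per cluster so a netsplit shows up as a growing
// counter for just the affected clusters.
var healthCheckFailures = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "capi_clustercache_healthcheck_failures_total", // hypothetical name
        Help: "Number of failed workload cluster health checks.",
    },
    []string{"cluster", "namespace"},
)

func init() {
    // controller-runtime serves everything in this registry on /metrics.
    ctrlmetrics.Registry.MustRegister(healthCheckFailures)
}

// recordHealthCheck would be called from the existing health-check loop.
func recordHealthCheck(namespace, cluster string, healthy bool) {
    if !healthy {
        healthCheckFailures.WithLabelValues(cluster, namespace).Inc()
    }
}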

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 5, 2024
@sbueringer (Member) commented

/close

in favor of #11272

@k8s-ci-robot commented

@sbueringer: Closing this issue.

In response to this:

/close

in favor of #11272

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
