
Add metrics for realtime investigation of disconnected wl scenarios #7147

Closed
jayunit100 opened this issue Sep 1, 2022 · 16 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments


jayunit100 commented Sep 1, 2022

User Story

As a developer running CAPI across far-separated networks, I'd like the kubeadm control plane manager to give me an easy-to-read delta of which nodes are healthy vs. unhealthy, i.e.

if I do a simple disconnect experiment on a WL cluster (with a CP node that has etcd on it),

sudo iptables -i eth0 -A INPUT -s 10.180.83.152 -j DROP 

I can see that the disconnect is logged... but I can't easily ascertain how many nodes are / aren't making that etcd connection (and yes, in general, I understand there are higher-level declarative constructs and that using logs for everything is an antipattern, but... in the real world, being able to see etcd client statistics in realtime is much more useful for quickly hypothesizing a failure mode).

So my suggestion would be, I think, something like

failures := 0
for each etcd in my WL children:
   passed := check the etcd
   ip := etcd.ip 
   stats[ip] = passed 
   if !passed:
      failures += 1

log.Infof("etcd polling results: %v, failures %v", stats, failures)

in the logs ...
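
To make the shape concrete, a minimal self-contained Go sketch of that idea; etcdMember and pollEtcdMember are hypothetical stand-ins for whatever the KCP controller already uses to reach each workload-cluster etcd member, not actual Cluster API code:

package main

import (
    "context"
    "log"
    "time"
)

// etcdMember and pollEtcdMember are hypothetical stand-ins for whatever the
// KCP controller already uses to reach each workload-cluster etcd member.
type etcdMember struct {
    PeerIP string
}

func pollEtcdMember(ctx context.Context, m etcdMember) bool {
    // e.g. an etcd status/health call bounded by a short timeout
    _, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    return true // placeholder result
}

// pollSummary checks each member once and logs a per-member pass/fail map
// plus a failure count, as described above.
func pollSummary(ctx context.Context, members []etcdMember) {
    stats := map[string]bool{}
    failures := 0
    for _, m := range members {
        passed := pollEtcdMember(ctx, m)
        stats[m.PeerIP] = passed
        if !passed {
            failures++
        }
    }
    log.Printf("etcd polling results: %v, failures: %d", stats, failures)
}

func main() {
    pollSummary(context.Background(), []etcdMember{{PeerIP: "1.2.3.4"}, {PeerIP: "1.2.3.5"}})
}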

Desired output

I0901 14:00:29.382055       1 round_trippers.go:553] GET https://10.180.81.244:6443/api?timeout=10s  in 10000 milliseconds
... 
E0901 14:00:29.384611       1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="[cannot get remote client... 
... 
I0901 14:00:29.382055       1 controller.go:553] etcd polling results 
[1.2.3.4: true, 
 1.2.3.5: false, 
 1.2.3.6: true
], 
 failures: 1 

Current output

I0901 14:00:29.382055       1 round_trippers.go:553] GET https://10.180.81.244:6443/api?timeout=10s  in 10000 milliseconds
E0901 14:00:29.382397       1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" "cluster"="tkg-vc-antrea" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0901 14:00:29.384611       1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="[cannot get remote client to workload cluster: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)]" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0901 14:00:29.385855       1 round_trippers.go:553] GET https://10.180.80.169:6443/?timeout=5s 200 OK in 3 milliseconds
I0901 14:00:29.388442       1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="tkg-vc-antrea" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"


/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 1, 2022
@killianmuldoon (Contributor) commented

Would it be better to raise a metric for this instead? Using logs to capture metrics seems like a strange approach.


jayunit100 commented Sep 1, 2022

... logs work in small clusters that don't have full-blown monitoring. In my case, that is likely the norm as opposed to the exception.

  • but I guess a conditions rollup of metadata could work?... especially in cases of 1000s of clusters... I don't want to rely on dashboards for everything.
  • just being able to see something in realtime about how capi->etcd connectivity is doing at scale, quickly, without needing dashboard access etc., is what I'm really looking for.

maybe we can capture this in a prometheus metric and just print the metric out :)

@jayunit100 jayunit100 changed the title Add a polling summary to the kubeadm-controlplane logs for realtime investigation of disconnected wl scenarios Add a polling summary of some sort to the kubeadm-controlplane-manager, for realtime investigation of disconnected wl scenarios Sep 1, 2022

fabriziopandini commented Sep 6, 2022

I understand that small clusters may not have full-blown monitoring, but at the same time I don't think we should build into our controllers/API a system to count failures across different reconciles, because this comes with some immediate drawbacks:

  • it requires some persistent state to ensure counters survive controller restarts
  • if we use existing API objects as that state, this will impact system performance (a reconcile would be triggered in many controllers as soon as we increase the counters)
  • if we use some other sort of persistent storage, it will impact how we deploy things / add more constraints

Accordingly, I'm +1 on addressing this using metrics


sbueringer commented Sep 6, 2022

+1 Sounds reasonable.

Apparently there is an integration between gRPC and Prometheus, and etcd has an example of how to combine them for the etcd client: https://github.com/etcd-io/etcd/blob/main/tests/integration/clientv3/examples/example_metrics_test.go
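
Loosely following that etcd example, a hedged sketch of the wiring using the go-grpc-prometheus client interceptors (the endpoint and listen address below are placeholders, and this is not Cluster API code):

package main

import (
    "log"
    "net/http"
    "time"

    grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    clientv3 "go.etcd.io/etcd/client/v3"
    "google.golang.org/grpc"
)

func main() {
    // Attach the go-grpc-prometheus client interceptors so every etcd RPC is
    // counted (started/handled totals, latency, per gRPC status code).
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
        DialTimeout: 5 * time.Second,
        DialOptions: []grpc.DialOption{
            grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
            grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    // go-grpc-prometheus registers its default client metrics with the default
    // registry, so the standard handler exposes them on /metrics.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}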


jayunit100 commented Oct 25, 2022

it doesn't require persistent state really, it's just a heuristic... and really it's just a small hashtable at that :), and I don't think it needs to be perfect.

But "metrics" is fine too - ok if we just "print" the metrics histogram out periodically? Then everyone wins :)

@sbueringer (Member) commented

But "metrics" is fine to - ok if we just "print" the metrics histogram out periodically ? Then everyone wins :)

To be honest, printing metrics really sounds like an anti-pattern to me.


jayunit100 commented Nov 5, 2022

antipattern

Touché, but it has a low cost... like... maybe ten lines of code... no functional changes to the API...

Agreed, nothing beats a lavish Grafana dashboard, but in chaotic situations a "dumb" alternative is needed...

But fair enough: what's the workaround for admin people who are

  • experiencing a netsplit from WL etcd ...
  • running hundreds of clusters
  • not enabling Prometheus at the mgmt layer?

Open to other creative ideas here that don't require running extra tools to get a quick topology of etcd polling failures for large / edge-like scenarios. Anything come to mind?


jayunit100 commented Nov 6, 2022

@fabriziopandini prometheus histograms... don't they already give us this for free in memory? I don't see the persistent-state corollary. But I agree this shouldn't be an overblown CRD / API-level thing. I didn't mean to suggest that... it's more like a point-in-time logging heuristic.


sbueringer commented Nov 29, 2022

running hundreds of clusters
not enabling prometheus at the mgmt layer?

They don't need Prometheus specifically, but running Kubernetes clusters at that scale in production without monitoring doesn't sound like a sane approach to me.

prometheus histograms... don't they already give us this for free in memory?

Yup, they should. I think Fabrizio meant the case where we build our own counting instead of just using metrics.

it requires some persistent state to ensure counters survive controller restarts

If solved with normal metrics, that's usually fine. For counters, rates are usually used (e.g. errors per minute vs. an absolute error count), and Prometheus usually handles a counter restarting from 0 after a controller restart well. Not exactly sure how it works with histograms, but I don't expect problems there if we implement normal metrics, as it's a very common case.

that don't require running extra tools to get a quick topology of etcd polling failures for large / edge-like scenarios. Anything come to mind?

I think the first step should be to implement the regular metrics.

Then we can think about how users could consume them if they don't want to use Prometheus / a regular metrics tool.
I could imagine folks adding a sidecar container which curls the metrics endpoint regularly and prints the results. I'm not a big fan of it, but as that would be outside the scope of Cluster API itself... I'm fine with that.
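
Purely for illustration, a tiny Go stand-in for that curl sidecar (the metrics address is an assumption; nothing like this ships with Cluster API):

package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "time"
)

func main() {
    // Hypothetical sidecar: scrape the manager's metrics endpoint every minute
    // and dump the raw exposition text to stdout so it shows up in kubectl logs.
    endpoint := "http://127.0.0.1:8080/metrics" // assumed metrics address
    for {
        resp, err := http.Get(endpoint)
        if err != nil {
            log.Printf("scrape failed: %v", err)
        } else {
            io.Copy(os.Stdout, resp.Body)
            resp.Body.Close()
        }
        time.Sleep(60 * time.Second)
    }
}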


jayunit100 commented Nov 29, 2022

The answer to any logging problem can be better metrics and alerting, but logs are just a batteries-included solution that works anywhere.

Why don't we split the difference:

  • collect prometheus metrics
  • print the in-memory histogram (basically a one-liner; see the sketch after this comment)

first step should be to implement regular metrics

yup
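
For what it's worth, a minimal sketch of what "print the in-memory metrics" could mean, assuming the metrics live in a standard Prometheus Gatherer (here the default registry):

package main

import (
    "log"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/common/expfmt"
)

// dumpMetrics writes every metric family held in the given in-memory registry
// to stdout in the Prometheus text exposition format.
func dumpMetrics(g prometheus.Gatherer) error {
    mfs, err := g.Gather()
    if err != nil {
        return err
    }
    enc := expfmt.NewEncoder(os.Stdout, expfmt.FmtText)
    for _, mf := range mfs {
        if err := enc.Encode(mf); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    if err := dumpMetrics(prometheus.DefaultGatherer); err != nil {
        log.Fatal(err)
    }
}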

@fabriziopandini fabriziopandini changed the title Add a polling summary of some sort to the kubeadm-controlplane-manager, for realtime investigation of disconnected wl scenarios Add metrics for realtime investigation of disconnected wl scenarios Dec 28, 2022
@fabriziopandini (Member) commented

Renamed, since there is agreement on the first step, which is implementing metrics that help investigate connection problems.
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 28, 2022
@k8s-triage-robot commented

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 20, 2024
@fabriziopandini (Member) commented

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@fabriziopandini (Member) commented

/assign @sbueringer
To figure out how we can implement custom metrics for the cluster cache tracker (probably health checking + connection creation).
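
Purely as a sketch of that direction (metric name, labels, and placement are made up here, not the eventual implementation), a custom counter registered with controller-runtime's metrics registry could look like:

package clustercache // hypothetical placement

import (
    "github.com/prometheus/client_golang/prometheus"
    ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// healthCheckFailures is a made-up metric counting failed workload cluster
// health checks, labeled per cluster so a netsplit shows up as a growing
// counter for just the affected clusters.
var healthCheckFailures = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "capi_clustercache_healthcheck_failures_total", // hypothetical name
        Help: "Number of failed workload cluster health checks.",
    },
    []string{"cluster", "namespace"},
)

func init() {
    // controller-runtime serves everything in this registry on /metrics.
    ctrlmetrics.Registry.MustRegister(healthCheckFailures)
}

// recordHealthCheck would be called from the existing health-check loop.
func recordHealthCheck(namespace, cluster string, healthy bool) {
    if !healthy {
        healthCheckFailures.WithLabelValues(cluster, namespace).Inc()
    }
}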

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 5, 2024
@sbueringer (Member) commented

/close

in favor of #11272

@k8s-ci-robot commented

@sbueringer: Closing this issue.

In response to this:

/close

in favor of #11272

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
