
Couldn't get resource list for metrics.k8s.io; connect: no route to host #3288

Closed

HectorB-2020 opened this issue Jul 8, 2023 · 12 comments

@HectorB-2020

HectorB-2020 commented Jul 8, 2023

RKE version:

rke --version
rke version v1.4.6

Kubernetes version:
As reported by kubectl get nodes

  • VERSION: v1.26.4

Docker version: (docker version, docker info preferred)
As reported by kubectl get nodes

  • CONTAINER-RUNTIME: docker://20.10.24

Operating system and kernel:
uname: 3.10.0-1160.el7.x86_64

Created anew with rke up on four VMs (QEMU/KVM) running CentOS 7.9.
The testing environment consists of three nodes with the controlplane and etcd roles and one worker.

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
QEMU/KVM (kernel 5.10.152)

Steps to Reproduce:
Frankly, it's hard to report anything specific, as the setup was very straightforward.

  • rke config
    The official instructions were followed: https://rke.docs.rancher.com/installation. Certificates were not customized; default options were used.
    Most answers were left at their defaults. Only Calico was chosen instead of the default Flannel.
  • rke up
    All images were pulled from Docker Hub. These nodes had been fresh until I started running rke up on them.

Results:
Initially we were puzzled by the behaviour of the kubectl tool, which kept printing the same four error lines complaining about metrics-server:

13943 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
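A diagnostic sketch that narrows this down (assuming kubectl is pointed at this cluster): the errors come from kubectl's API discovery, so checking the aggregated APIService and the metrics-server deployment shows whether they are just noise from a broken metrics-server.

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl -n kube-system get deployment metrics-server

If the APIService shows AVAILABLE=False, the four lines above are merely discovery noise caused by the failing metrics-server rather than a separate problem.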

We started looking into the issue and our investigation led us to several other problems.

  1. metrics-server-xxxxxxxxxx-xxxxx is in status CrashLoopBackOff.
    Its logs (k logs metrics-server-xxxxxxxxxx-xxxxx) show the lines below:
Error: unable to load configmap based request-header-client-ca-file: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.43.0.1:443: connect: no route to host
Usage:
   [flags]
  2. There are other issues with deployments on the fresh cluster:
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
calico-kube-controllers   0/1     1            0           6h56m
coredns                   0/1     1            0           6h56m
coredns-autoscaler        1/1     1            1           6h56m
metrics-server            0/1     1            0           6h56m

While calico-node itself looks healthy.

NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
calico-node   4         4         4       4            4           kubernetes.io/os=linux   6h57m
  3. Logs of calico-kube-controllers are full of these messages:
[ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host
[INFO][1] main.go 138: Failed to initialize datastore error=Get "https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host
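Both errors point at the service VIP 10.43.0.1:443 being unreachable from pods, which usually implicates kube-proxy / iptables on the nodes rather than the applications themselves. A rough sketch of what can be checked on a node (the kube-proxy container name is RKE's usual default and is an assumption here):

iptables-save | grep 10.43.0.1           # KUBE-SERVICES rules for the kubernetes service should be present
docker logs --tail 50 kube-proxy         # assuming RKE's default container name
kubectl get endpoints kubernetes -n default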

Interestingly, despite these issues, this cluster is still capable of running workloads.

I'm looking for a solution. Meanwhile I've found something resembling our issues:

@HectorB-2020
Author

Initially we used

network:
  plugin: calico

Now we've switched to the default

network:
  plugin: flannel
  options: {}

and redeployed this testing cluster: rke remove, rke up.
cluster.rkestate now contains:

      "network": {
        "plugin": "flannel",
        "options": {
          "flannel_backend_port": "8472",
          "flannel_backend_type": "vxlan",
          "flannel_backend_vni": "1"
        },
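With the vxlan backend, a quick per-node sanity check is to look at the VXLAN interface flannel creates (flannel.1 is its default name, matching the VNI of 1 above):

ip -d link show flannel.1    # should exist and report 'vxlan id 1 ... dstport 8472'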

Somebody recommended adding more resources to the ClusterRole system:metrics-server:

- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - nodes/stats  # added
  - namespaces   # added
  - configmaps   # added
  verbs:
  - get
  - list
  - watch

but I'm afraid that didn't help.
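For completeness, the effective permissions of the metrics-server ServiceAccount can be checked directly; a sketch, assuming the default ServiceAccount name metrics-server in kube-system:

kubectl get clusterrole system:metrics-server -o yaml
kubectl auth can-i get configmaps -n kube-system --as=system:serviceaccount:kube-system:metrics-server

In hindsight this makes sense: the pod was failing with "no route to host", a network problem, so no amount of extra RBAC rules could have helped.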

@HectorB-2020
Author

Here are some other steps I've taken, without much success.

  1. SuSE KB ID 000019641.
    But we've had these settings on all hosts from the beginning (verification sketch after this list):
    /etc/sysctl.d/
    net.bridge.bridge-nf-call-iptables = 1
    net.ipv4.ip_forward = 1
    
    Note: IPv6 is disabled on ALL hosts in our environment.
    net.ipv6.conf.lo.disable_ipv6 = 1
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    
  2. Prevented NetworkManager from touching CNI:
    /etc/NetworkManager/conf.d/calico.conf
    [keyfile]
    unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico;interface-name:vxlan-v6.calico;interface-name:wireguard.cali;interface-name:wg-v6.cali
    /etc/NetworkManager/conf.d/flannel.conf (SuSE KB ID 000020017)
    [keyfile]
    unmanaged-devices=interface-name:flannel.1;interface-name:veth*;interface-name:cni0
    The funny thing is that after adjusting NetworkManager as described above, those nasty couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request errors disappeared. Was that a coincidence? I cannot be certain.
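Regarding the sysctls in step 1, the verification mentioned there: on CentOS 7 the bridge-nf settings only exist while the br_netfilter module is loaded, so the check is roughly:

lsmod | grep br_netfilter                                        # module must be loaded
sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward    # both should print 1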

@manuelbuil
Contributor

It was working with cni: calico and it fails with cni: flannel?

@HectorB-2020
Author

Finally, it has started working both with Calico and Flannel, though of course not at the same time. 😄
However, I've tried so many options that I'm unable to tell for sure which one helped.
I'm inclined to think that proper time synchronization and tuning NetworkManager did the trick.

@HectorB-2020
Author

@manuelbuil, interestingly, today we hit this issue again with cni: calico.
At first sight the cluster seemed to deploy normally, but then we noticed that kubectl get no reported all nodes as 'Not Ready'.
That cluster had initially been deployed with Flannel, but then we decided to switch to Calico.
Before switching to Calico we'd gone through rke remove and docker system prune --all, followed by reboots etc. (I described that in #2632).
Still, exactly the same error messages were printed when the cluster finally started.

13943 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

However, upon investigation we didn't find any cali* interfaces. Is there anything we should look at more closely?
Weird. The only difference was cni: calico.
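A couple of quick checks for this (a sketch; the interface names and the pod label assume the standard Calico manifests):

ip -br link | grep -E 'cali|tunl|vxlan'                        # empty output confirms no Calico interfaces exist
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide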

@HectorB-2020
Author

This looks like a problem with the Calico deployment.
Like I wrote above, there are no cali* interfaces and all nodes are 'Not Ready'.
However, kubectl somehow works and it shows this error:

NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Taints:             node.kubernetes.io/not-ready:NoSchedule

Unschedulable:      false

Conditions:
  Type             Status  Reason                       Message
  ----             ------  ------                       -------
  MemoryPressure   False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Capacity:
  cpu:                8
  ephemeral-storage:  48370388Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65807988Ki
  pods:               110

Allocatable:
  cpu:                8
  ephemeral-storage:  44578149507
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65705588Ki
  pods:               110

System Info:

  Kernel Version:             3.10.0-1160.59.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.12
  Kubelet Version:            v1.24.15
  Kube-Proxy Version:         v1.24.15

PodCIDR:                      10.42.10.0/24
PodCIDRs:                     10.42.10.0/24

Non-terminated Pods:          (2 in total)
  Namespace                   Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                              ------------  ----------  ---------------  -------------  ---
  ingress-nginx               nginx-ingress-controller-k6vf6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 calico-node-mnkgh                 250m (3%)     0 (0%)      0 (0%)           0 (0%)         14h

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                250m (3%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:              <none>

One thing concerns me. AFAIK the file /etc/cni/net.d/10-calico.conflist has to be present on all nodes, am I right? Which process/container should create it?
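A plain check on the nodes (nothing Calico-specific assumed) shows whether the kubelet has any CNI config at all:

ls -l /etc/cni/net.d/ /opt/cni/bin/    # an absent 10-calico.conflist here matches the 'cni config uninitialized' message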

@HectorB-2020
Author

Let me provide another piece of detail related to the issues with Calico.

kubectl get po -A reports pods as either ContainerCreating or CrashLoopBackOff:
E0810 11:12:41.393691   18642 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0810 11:12:41.399338   18642 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0810 11:12:41.402085   18642 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0810 11:12:41.403904   18642 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAMESPACE       NAME                                       READY   STATUS                  RESTARTS          AGE
ingress-nginx   ingress-nginx-admission-create-mtt8g       0/1     ContainerCreating       0                 14h
ingress-nginx   ingress-nginx-admission-patch-2h7l2        0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-2gnq8             0/1     ContainerCreating       0                 14h
...
ingress-nginx   nginx-ingress-controller-xn976             0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-z5h8b             0/1     ContainerCreating       0                 14h
kube-system     calico-kube-controllers-5fdcc56bb7-b86z9   0/1     ContainerCreating       0                 14h
kube-system     calico-node-2j5gx                          0/1     Init:CrashLoopBackOff   173 (3m28s ago)   14h
kube-system     calico-node-4ft7p                          0/1     Init:CrashLoopBackOff   173 (4m8s ago)    14h
kube-system     calico-node-7lzjs                          0/1     Init:CrashLoopBackOff   173 (5m2s ago)    14h
kube-system     calico-node-9msc8                          0/1     Init:Error              174 (5m13s ago)   14h
kube-system     calico-node-9nv7x                          0/1     Init:CrashLoopBackOff   173 (3m37s ago)   14h
...
kube-system     calico-node-v79b9                          0/1     Init:CrashLoopBackOff   173 (2m40s ago)   14h
kube-system     coredns-55d59c776b-4wxz7                   0/1     ContainerCreating       0                 14h
kube-system     coredns-autoscaler-6fc6b5cc8c-2btn6        0/1     ContainerCreating       0                 14h
kube-system     metrics-server-75dbbd96bc-25ntp            0/1     ContainerCreating       0                 14h
kube-system     rke-coredns-addon-deploy-job-7r4hh         0/1     Completed               0                 14h
kube-system     rke-ingress-controller-deploy-job-mjx4l    0/1     Completed               0                 14h
kube-system     rke-metrics-addon-deploy-job-vtqnn         0/1     Completed               0                 14h
kube-system     rke-network-plugin-deploy-job-8jrlv        0/1     Completed               0                 14h

As I've learnt so far, 10-calico.conflist is created by one of the init containers, called install-cni.
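So the next logs to look at are those of that init container; a sketch, using one of the pod names from the listing above:

kubectl -n kube-system logs calico-node-2j5gx -c install-cni
kubectl -n kube-system logs calico-node-2j5gx -c install-cni --previous    # output of the last crashed attempt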

Here is what kubectl describe shows for the install-cni init container on one of the nodes:
  install-cni:
    Container ID:  docker://0ee3b7552eadf53e01b3cd8c428e47a23ab914746c2880e500bcf1804f7b5125
    Image:         srv64.company.net:5000/rancher/calico-cni:v3.25.0-rancher1
    Image ID:      docker-pullable://srv64.company.net:5000/rancher/calico-cni@sha256:afb7ab68e52d4bf4d401179f62749701704ae8eed987116d65150325db81fabb
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
    Ready:          False
    Restart Count:  173
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qwtgv (ro)

The image is pullable from our private registry, so I decided to play a bit with that image and started it locally.

docker run -it -v $(pwd)/opt:/host/opt/cni/bin \
               -v $(pwd)/etc:/host/etc/cni/net.d \
               rancher/calico-cni:v3.25.0-rancher1
And here is what I got. At first sight, it looks like the image somehow works:
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/install
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
[INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

[INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.25.0

[INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
[INFO][1] cni-installer/<nil> <nil>: Created /host/etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "datastore_type": "kubernetes",
      "nodename": "9743a9c291e8",
      "mtu": 1500,
      "ipam": {"type": "calico-ipam"},
      "policy": {"type": "k8s"},
      "kubernetes": {"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"}
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}
[INFO][1] cni-installer/<nil> <nil>: Done configuring CNI.  Sleep= true
FATA[0001] no such file or directory                     source="install.go:294"

It copied binaries to /opt/cni/bin and created 10-calico.conflist. So what's wrong?

@manuelbuil
Contributor

Could you get the logs from the image that is crashing, please? I wonder if there is something like AppArmor or SELinux making it impossible for Calico to write binaries to /host/opt/cni/bin. We should see that in the logs.
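On CentOS/RHEL a quick way to check that (AppArmor is normally not present there, so SELinux is the usual suspect) would be:

getenforce    # prints Enforcing / Permissive / Disabled
sestatus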

@HectorB-2020
Author

@manuelbuil, this is CentOS 7.9 and I'm positive it has SELinux disabled.
My concern is that Calico is picky about the versions of the other images in the system_images: section. Once I removed that section, as advised by @superseb, the cluster started building successfully. Looked like a kind of magic to me.
Interestingly, Flannel is less capricious about the surrounding image versions: I deployed it numerous times with absolutely the same cluster.yml, including that system_images: section.
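For anyone hitting the same thing, the working cluster.yml looks roughly like this (a sketch: addresses and user are placeholders, and the whole system_images: section is simply left out so rke picks the image versions matching its own release):

nodes:
  - address: 10.0.0.11            # placeholder
    user: rancher                 # placeholder
    role: [controlplane, etcd]
  # two more controlplane/etcd nodes defined the same way
  - address: 10.0.0.14            # placeholder
    user: rancher
    role: [worker]
network:
  plugin: calico
# note: no system_images: section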

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@DeyvsonL

I'm still facing this issue. Do we have a fix?

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
