Couldn't get resource list for; connect: no route to host #3288

HectorB-2020 opened this issue Jul 8, 2023 · 12 comments


HectorB-2020 commented Jul 8, 2023

RKE version:

rke --version
rke version v1.4.6

Kubernetes version:
As reported by kubectl get nodes

  • VERSION: v1.26.4

Docker version: (docker version,docker info preferred)
As reported by kubectl get nodes

  • CONTAINER-RUNTIME: docker://20.10.24

Operating system and kernel:
uname: 3.10.0-1160.el7.x86_64

Created anew with rke up on four VMs (QEMU/KVM) CentOS 7.9.
The testing environment comprises three nodes with the controlplane and etcd roles, plus one worker.

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
QEMU/KVM (kernel 5.10.152)

Steps to Reproduce:
Frankly, it is hard to point to any specific step, as the setup was very straightforward.

  • rke config
    The official instructions were followed: certificates were not customized and default options were used.
    Most answers were left at their defaults. Only calico was chosen instead of the default flannel.
  • rke up
    All images were pulled from Docker Hub. These nodes had been fresh until I started running rke up on them.

Initially we were puzzled by the behaviour of the kubectl tool, which printed the same four error lines complaining about metrics-server:

13943 memcache.go:287] couldn't get resource list for the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
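
For anyone hitting the same lines: as far as I understand, kubectl typically prints these discovery errors when an aggregated API group is registered but the service behind it is not reachable. Below is a minimal sketch for spotting such groups, assuming the JSON comes from `kubectl get apiservices -o json`. The helper name `unavailable_apiservices` and the sample data (including the `v1beta1.metrics.k8s.io` name) are hypothetical and only for illustration.

```python
import json


def unavailable_apiservices(apiservices_json):
    """Return (name, reason) for every APIService whose Available condition is not True."""
    result = []
    for item in json.loads(apiservices_json).get("items", []):
        name = item["metadata"]["name"]
        conditions = item.get("status", {}).get("conditions", [])
        available = next((c for c in conditions if c.get("type") == "Available"), None)
        if available is None:
            # No Available condition reported at all.
            result.append((name, "NoCondition"))
        elif available.get("status") != "True":
            result.append((name, available.get("reason", "Unknown")))
    return result


# Hypothetical sample, shaped like the output of `kubectl get apiservices -o json`.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "v1."},
         "status": {"conditions": [
             {"type": "Available", "status": "True", "reason": "Local"}]}},
        {"metadata": {"name": "v1beta1.metrics.k8s.io"},
         "status": {"conditions": [
             {"type": "Available", "status": "False", "reason": "MissingEndpoints"}]}},
    ]
})
```

Calling `unavailable_apiservices(SAMPLE)` lists only the entries that would make discovery complain, which narrows the search to the pods backing that service.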

We started looking into the issue and our investigation led us to several other problems.

  1. metrics-server-xxxxxxxxxx-xxxxx is in status CrashLoopBackOff.
    Its logs (k logs metrics-server-xxxxxxxxxx-xxxxx) show this line:
Error: unable to load configmap based request-header-client-ca-file: Get "": dial tcp connect: no route to host
  2. There are other issues with the deployments on the fresh cluster:
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
calico-kube-controllers   0/1     1            0           6h56m
coredns                   0/1     1            0           6h56m
coredns-autoscaler        1/1     1            1           6h56m
metrics-server            0/1     1            0           6h56m

    Meanwhile, calico-node itself looks healthy:

calico-node   4         4         4       4            4    6h57m
  3. The logs of calico-kube-controllers are full of these messages:
[ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "": dial tcp connect: no route to host
[INFO][1] main.go 138: Failed to initialize datastore error=Get "": dial tcp connect: no route to host
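
To tell "no route to host" apart from a refused connection or a timeout, a plain TCP probe can be run from the affected node or pod. This is only a sketch: the function name `probe_tcp` is made up, and the address to probe is whatever the failing pod tried to reach (elided in the logs above), so it is left to the caller.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect and return 'ok' or a short description of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No answer within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Same condition the pods report as "no route to host".
            return "no route to host"
        if exc.errno == errno.ECONNREFUSED:
            return "connection refused"
        return f"error: {exc}"
```

A result of "no route to host" from a node, while the target address is known to be up, would point towards routing or firewall rules on the hosts rather than towards the pod itself.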

Interestingly, despite these issues, this cluster is still capable of running workloads.

I'm looking for a solution. Meanwhile I've found something resembling our issues:

Initially we used

  plugin: calico

Now we've switched to default

  plugin: flannel
  options: {}

and redeployed this testing cluster: rke remove, rke up.

      "network": {
        "plugin": "flannel",
        "options": {
          "flannel_backend_port": "8472",
          "flannel_backend_type": "vxlan",
          "flannel_backend_vni": "1"

Somebody recommended adding more resources to kind: ClusterRole, name: system:metrics-server

- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - nodes/stats  # added
  - namespaces   # added
  - configmaps   # added
  verbs:
  - get
  - list
  - watch

but I'm afraid that didn't help.

Here are some other steps I've taken without much success.

  1. SUSE KB ID 000019641.
    But we've had these settings on all hosts from the beginning.
    net.bridge.bridge-nf-call-iptables = 1
    net.ipv4.ip_forward = 1
    Note: IPv6 is disabled on ALL hosts in our environment.
    net.ipv6.conf.lo.disable_ipv6 = 1
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
  2. Prevented NetworkManager from touching CNI:
    /etc/NetworkManager/conf.d/flannel.conf (SuSE KB ID 000020017)
    The funny thing is that after adjusting NetworkManager as described above, those nasty errors (couldn't get resource list for the server is currently unable to handle the request) disappeared. Was that a coincidence? I cannot be certain.
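
Since the kernel settings have to match on every host, it may be easier to compare them mechanically than by eye. A minimal sketch, assuming the text is the output of running sysctl with the two keys listed above as arguments; the name `sysctl_mismatches` and the `REQUIRED` table are hypothetical, and only the two settings quoted above are checked.

```python
# Settings quoted above; extend the table if more keys need checking.
REQUIRED = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.ipv4.ip_forward": "1",
}


def sysctl_mismatches(sysctl_output, required=REQUIRED):
    """Map each required key to its actual value (None if absent) when it differs."""
    actual = {}
    for line in sysctl_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not "key = value"
            actual[key.strip()] = value.strip()
    return {
        key: actual.get(key)
        for key, wanted in required.items()
        if actual.get(key) != wanted
    }
```

An empty result means the host matches; a key mapped to None means the setting was not present in the output at all (for example because the br_netfilter module is not loaded).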

It was working with cni: calico and it fails with cni: flannel?

Finally it has started working both with Calico and with Flannel, though surely not at the same time. 😄
However, I've tried so many options that I'm unable to tell for sure which one helped.
I'm inclined to think that proper time synchronization and tuning NetworkManager did the trick.

@manuelbuil, interestingly today we hit this issue again with cni: calico.
At first sight the cluster seemed to deploy normally. But then we noticed that kubectl get no reported all nodes as 'Not Ready'.
That cluster had initially been deployed with Flannel, but then we decided to switch to Calico.
Before switching to Calico we'd gone through rke remove and docker system prune --all, followed by reboots etc. (I described that in #2632)
Still, exactly the same error messages were printed when the cluster finally started.

13943 memcache.go:287] couldn't get resource list for the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request

However, upon investigation we didn't find any cali* interfaces. Is there anything we should look at more closely?
Weird. The only difference was cni: calico.

This looks like a problem with Calico deployment.
Like I wrote above, there are no cali* interfaces, all nodes are 'Not Ready'.
However kubectl works somehow and it shows this error:

NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Unschedulable:      false

Conditions:
  Type             Status  Reason                       Message
  ----             ------  ------                       -------
  MemoryPressure   False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Capacity:
  cpu:                8
  ephemeral-storage:  48370388Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65807988Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  44578149507
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65705588Ki
  pods:               110

System Info:

  Kernel Version:             3.10.0-1160.59.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.12
  Kubelet Version:            v1.24.15
  Kube-Proxy Version:         v1.24.15


Non-terminated Pods:          (2 in total)
  Namespace                   Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                              ------------  ----------  ---------------  -------------  ---
  ingress-nginx               nginx-ingress-controller-k6vf6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 calico-node-mnkgh                 250m (3%)     0 (0%)      0 (0%)           0 (0%)         14h

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                250m (3%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:              <none>

One thing concerns me. AFAIK file /etc/cni/net.d/10-calico.conflist has to be on all nodes, am I right? What process/container should create it?

Let me provide some more details related to the issues with Calico.

kubectl get po -A reports pods as either ContainerCreating or CrashLoopBackOff:
E0810 11:12:41.393691   18642 memcache.go:287] couldn't get resource list for the server is currently unable to handle the request
E0810 11:12:41.399338   18642 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
E0810 11:12:41.402085   18642 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
E0810 11:12:41.403904   18642 memcache.go:121] couldn't get resource list for the server is currently unable to handle the request
NAMESPACE       NAME                                       READY   STATUS                  RESTARTS          AGE
ingress-nginx   ingress-nginx-admission-create-mtt8g       0/1     ContainerCreating       0                 14h
ingress-nginx   ingress-nginx-admission-patch-2h7l2        0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-2gnq8             0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-xn976             0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-z5h8b             0/1     ContainerCreating       0                 14h
kube-system     calico-kube-controllers-5fdcc56bb7-b86z9   0/1     ContainerCreating       0                 14h
kube-system     calico-node-2j5gx                          0/1     Init:CrashLoopBackOff   173 (3m28s ago)   14h
kube-system     calico-node-4ft7p                          0/1     Init:CrashLoopBackOff   173 (4m8s ago)    14h
kube-system     calico-node-7lzjs                          0/1     Init:CrashLoopBackOff   173 (5m2s ago)    14h
kube-system     calico-node-9msc8                          0/1     Init:Error              174 (5m13s ago)   14h
kube-system     calico-node-9nv7x                          0/1     Init:CrashLoopBackOff   173 (3m37s ago)   14h
kube-system     calico-node-v79b9                          0/1     Init:CrashLoopBackOff   173 (2m40s ago)   14h
kube-system     coredns-55d59c776b-4wxz7                   0/1     ContainerCreating       0                 14h
kube-system     coredns-autoscaler-6fc6b5cc8c-2btn6        0/1     ContainerCreating       0                 14h
kube-system     metrics-server-75dbbd96bc-25ntp            0/1     ContainerCreating       0                 14h
kube-system     rke-coredns-addon-deploy-job-7r4hh         0/1     Completed               0                 14h
kube-system     rke-ingress-controller-deploy-job-mjx4l    0/1     Completed               0                 14h
kube-system     rke-metrics-addon-deploy-job-vtqnn         0/1     Completed               0                 14h
kube-system     rke-network-plugin-deploy-job-8jrlv        0/1     Completed               0                 14h
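
With this many rows it can help to filter the listing down to the pods that are not healthy. A minimal sketch, assuming the text comes from kubectl get po -A and may have kubectl's own error lines mixed in before the table; the function name `unhealthy_pods` is made up for illustration.

```python
HEADER = ["NAMESPACE", "NAME", "READY", "STATUS"]


def unhealthy_pods(get_pods_output):
    """Return (namespace, name, status) for pods that are neither Running nor Completed."""
    lines = get_pods_output.splitlines()
    # Everything before the table header (e.g. memcache.go errors) is ignored.
    header_index = next(
        (i for i, line in enumerate(lines) if line.split()[:4] == HEADER), None
    )
    if header_index is None:
        return []
    rows = []
    for line in lines[header_index + 1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # Only the first four columns are used: RESTARTS may contain spaces,
        # as in "173 (3m28s ago)", so later columns cannot be split reliably.
        namespace, name, _ready, status = fields[:4]
        if status not in ("Running", "Completed"):
            rows.append((namespace, name, status))
    return rows
```

Note that this only looks at STATUS; a pod that is Running but shows 0/1 in READY would need an additional check.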

As I've learnt so far, 10-calico.conflist is created by one of the init containers, called install-cni.
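
A quick way to confirm on each node whether that file (or any other CNI network config) actually exists is sketched below. It assumes the runtime only considers files ending in .conf, .conflist or .json, which is my understanding of how the CNI config directory is scanned; the function name `cni_config_files` is hypothetical.

```python
from pathlib import Path

# Assumed set of suffixes the runtime treats as CNI network configs.
CNI_CONFIG_SUFFIXES = {".conf", ".conflist", ".json"}


def cni_config_files(conf_dir="/etc/cni/net.d"):
    """List candidate CNI network config files in conf_dir, sorted by name."""
    directory = Path(conf_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix in CNI_CONFIG_SUFFIXES
    )
```

An empty list on a node would be consistent with the "cni config uninitialized" message, because files such as calico-kubeconfig carry no matching suffix and do not count as a network config.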

Here is what kubectl describe pod reports for this init container on one of the nodes:
    Container ID:  docker://0ee3b7552eadf53e01b3cd8c428e47a23ab914746c2880e500bcf1804f7b5125
    Image ID:      docker-pullable://
    Port:          <none>
    Host Port:     <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
    Ready:          False
    Restart Count:  173
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/ from kube-api-access-qwtgv (ro)

The image is pullable from our private registry, so I decided to play a bit with that image and started it locally.

docker run -it -v $(pwd)/opt:/host/opt/cni/bin 
                      -v $(pwd)/etc:/host/etc/cni/net.d 
And here is what I got. At first sight, the image seems to work.
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/install
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
[INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

[INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.25.0

[INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
[INFO][1] cni-installer/<nil> <nil>: Created /host/etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "datastore_type": "kubernetes",
      "nodename": "9743a9c291e8",
      "mtu": 1500,
      "ipam": {"type": "calico-ipam"},
      "policy": {"type": "k8s"},
      "kubernetes": {"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"}
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}
[INFO][1] cni-installer/<nil> <nil>: Done configuring CNI.  Sleep= true
FATA[0001] no such file or directory                     source="install.go:294"

It copied binaries to /opt/cni/bin and created 10-calico.conflist. So what's wrong?
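
One more thing that can be checked mechanically is whether every plugin named in the generated conflist has a matching binary in the CNI bin directory. This is only a sketch of that cross-check and not a diagnosis of the crash: the function name `missing_plugin_binaries` is hypothetical, and it assumes each plugin "type" (and its "ipam" type, if any) maps to a file of the same name.

```python
import json
from pathlib import Path


def missing_plugin_binaries(conflist_text, bin_dir):
    """Return plugin and IPAM types from a conflist that have no file in bin_dir."""
    config = json.loads(conflist_text)
    needed = []
    for plugin in config.get("plugins", []):
        needed.append(plugin["type"])
        ipam_type = plugin.get("ipam", {}).get("type")
        if ipam_type:
            needed.append(ipam_type)
    return [name for name in needed if not (Path(bin_dir) / name).is_file()]
```

On a node this would be called with the contents of /etc/cni/net.d/10-calico.conflist and /opt/cni/bin; an empty list means the config and the binaries at least agree with each other.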

Could you get the logs from the image that is crashing please? I wonder if there is something like apparmor or selinux making it impossible for Calico to write binaries in /host/opt/cni/bin. We should see that in the logs

@manuelbuil, this is CentOS 7.9 and I'm positive it has SELinux disabled.
My concern is that Calico is picky about the versions of the other images in the system_images: section. Once I removed this section as advised by @superseb, the cluster started building successfully. It looked like a kind of magic.
Interestingly, Flannel is less capricious about the surrounding image versions. I deployed it numerous times with exactly the same cluster.yml, including that system_images: section.

I'm still facing this issue. Do we have a fix?

