
Couldn't get resource list for metrics.k8s.io; connect: no route to host #3288

Closed

HectorB-2020 opened this issue Jul 8, 2023 · 12 comments

@HectorB-2020

HectorB-2020 commented Jul 8, 2023

RKE version:

rke --version
rke version v1.4.6

Kubernetes version:
As reported by kubectl get nodes

  • VERSION: v1.26.4

Docker version: (docker version, docker info preferred)
As reported by kubectl get nodes

  • CONTAINER-RUNTIME: docker://20.10.24

Operating system and kernel:
uname: 3.10.0-1160.el7.x86_64

Created anew with rke up on four VMs (QEMU/KVM) running CentOS 7.9.
The testing environment consists of three nodes with the controlplane and etcd roles and one worker.

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
QEMU/KVM (kernel 5.10.152)

Steps to Reproduce:
Frankly, it's hard to report anything specific, as the setup was very straightforward.

  • rke config
    The official instructions were followed: https://rke.docs.rancher.com/installation. Certificates were not customized; default options were used.
    Most answers were left at their defaults. Only Calico was chosen instead of the default Flannel.
  • rke up
    All images were pulled from Docker Hub. These nodes had been fresh until I started running rke up on them.

Results:
Initially we were puzzled by the behaviour of the kubectl tool, which kept printing the same four error lines complaining about metrics-server:

13943 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
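A diagnostic sketch that narrows this down (assuming kubectl is pointed at this cluster): the errors come from kubectl's API discovery, so checking the aggregated APIService and the metrics-server deployment shows whether they are just noise from a broken metrics-server.

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl -n kube-system get deployment metrics-server

If the APIService shows AVAILABLE=False, the four lines above are merely discovery noise caused by the failing metrics-server rather than a separate problem.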

We started looking into the issue and our investigation led us to several other problems.

  1. metrics-server-xxxxxxxxxx-xxxxx is in status CrashLoopBackOff.
    Its logs (k logs metrics-server-xxxxxxxxxx-xxxxx) show the lines below:
Error: unable to load configmap based request-header-client-ca-file: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.43.0.1:443: connect: no route to host
Usage:
   [flags]
  2. There are other issues with deployments on the fresh cluster:
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
calico-kube-controllers   0/1     1            0           6h56m
coredns                   0/1     1            0           6h56m
coredns-autoscaler        1/1     1            1           6h56m
metrics-server            0/1     1            0           6h56m

While calico-node itself looks healthy.

NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
calico-node   4         4         4       4            4           kubernetes.io/os=linux   6h57m
  3. Logs of calico-kube-controllers are full of these messages:
[ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host
[INFO][1] main.go 138: Failed to initialize datastore error=Get "https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host
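Both errors point at the service VIP 10.43.0.1:443 being unreachable from pods, which usually implicates kube-proxy / iptables on the nodes rather than the applications themselves. A rough sketch of what can be checked on a node (the kube-proxy container name is RKE's usual default and is an assumption here):

iptables-save | grep 10.43.0.1           # KUBE-SERVICES rules for the kubernetes service should be present
docker logs --tail 50 kube-proxy         # assuming RKE's default container name
kubectl get endpoints kubernetes -n default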

Interestingly, despite these issues, this cluster is still capable of running workloads.

I'm looking for a solution. Meanwhile I've found something resembling our issues:

@HectorB-2020
Author

Initially we used

network:
  plugin: calico

Now we've switched to the default

network:
  plugin: flannel
  options: {}

and redeployed this testing cluster: rke remove, rke up.
cluster.rkestate now contains:

      "network": {
        "plugin": "flannel",
        "options": {
          "flannel_backend_port": "8472",
          "flannel_backend_type": "vxlan",
          "flannel_backend_vni": "1"
        },
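With the vxlan backend, a quick per-node sanity check is to look at the VXLAN interface flannel creates (flannel.1 is its default name, matching the VNI of 1 above):

ip -d link show flannel.1    # should exist and report 'vxlan id 1 ... dstport 8472'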

Somebody recommended adding more resources to the ClusterRole system:metrics-server:

- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - nodes/stats  # added
  - namespaces   # added
  - configmaps   # added
  verbs:
  - get
  - list
  - watch

but I'm afraid that didn't help.
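For completeness, the effective permissions of the metrics-server ServiceAccount can be checked directly; a sketch, assuming the default ServiceAccount name metrics-server in kube-system:

kubectl get clusterrole system:metrics-server -o yaml
kubectl auth can-i get configmaps -n kube-system --as=system:serviceaccount:kube-system:metrics-server

In hindsight this makes sense: the pod was failing with "no route to host", a network problem, so no amount of extra RBAC rules could have helped.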

@HectorB-2020
Author

Here are some other steps I've taken, without much success.

  1. SuSE KB ID 000019641.
    But we've had these settings on all hosts from the beginning (verification sketch after this list):
    /etc/sysctl.d/
    net.bridge.bridge-nf-call-iptables = 1
    net.ipv4.ip_forward = 1
    
    Note: IPv6 is disabled on ALL hosts in our environment.
    net.ipv6.conf.lo.disable_ipv6 = 1
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    
  2. Prevented NetworkManager from touching CNI:
    /etc/NetworkManager/conf.d/calico.conf
    [keyfile]
    unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico;interface-name:vxlan-v6.calico;interface-name:wireguard.cali;interface-name:wg-v6.cali
    /etc/NetworkManager/conf.d/flannel.conf (SuSE KB ID 000020017)
    [keyfile]
    unmanaged-devices=interface-name:flannel.1;interface-name:veth*;interface-name:cni0
    The funny thing is that after adjusting NetworkManager as described above, those nasty couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request errors disappeared. Was that a coincidence? I cannot be certain.
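Regarding the sysctls in step 1, the verification mentioned there: on CentOS 7 the bridge-nf settings only exist while the br_netfilter module is loaded, so the check is roughly:

lsmod | grep br_netfilter                                        # module must be loaded
sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward    # both should print 1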

@manuelbuil
Contributor

It was working with cni: calico and it fails with cni: flannel?

@HectorB-2020
Author

Finally, it has started working both with Calico and Flannel, though of course not at the same time. 😄
However, I've tried so many options that I'm unable to tell for sure which one helped.
I'm inclined to think that proper time synchronization and tuning NetworkManager did the trick.

@HectorB-2020
Author

@manuelbuil, interestingly, today we hit this issue again with cni: calico.
At first sight the cluster seemed to deploy normally, but then we noticed that kubectl get no reported all nodes as 'Not Ready'.
That cluster had initially been deployed with Flannel, but then we decided to switch to Calico.
Before switching to Calico we'd gone through rke remove and docker system prune --all, followed by reboots etc. (I described that in #2632).
Still, exactly the same error messages were printed when the cluster finally started.

13943 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
13943 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

However, upon investigation we didn't find any cali* interfaces. Is there anything we should look at more closely?
Weird. The only difference was cni: calico.
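A couple of quick checks for this (a sketch; the interface names and the pod label assume the standard Calico manifests):

ip -br link | grep -E 'cali|tunl|vxlan'                        # empty output confirms no Calico interfaces exist
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide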

@HectorB-2020
Author

This looks like a problem with the Calico deployment.
Like I wrote above, there are no cali* interfaces and all nodes are 'Not Ready'.
However, kubectl somehow works and it shows this error:

NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Taints:             node.kubernetes.io/not-ready:NoSchedule

Unschedulable:      false

Conditions:
  Type             Status  Reason                       Message
  ----             ------  ------                       -------
  MemoryPressure   False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Capacity:
  cpu:                8
  ephemeral-storage:  48370388Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65807988Ki
  pods:               110

Allocatable:
  cpu:                8
  ephemeral-storage:  44578149507
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65705588Ki
  pods:               110

System Info:

  Kernel Version:             3.10.0-1160.59.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.12
  Kubelet Version:            v1.24.15
  Kube-Proxy Version:         v1.24.15

PodCIDR:                      10.42.10.0/24
PodCIDRs:                     10.42.10.0/24

Non-terminated Pods:          (2 in total)
  Namespace                   Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                              ------------  ----------  ---------------  -------------  ---
  ingress-nginx               nginx-ingress-controller-k6vf6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 calico-node-mnkgh                 250m (3%)     0 (0%)      0 (0%)           0 (0%)         14h

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                250m (3%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:              <none>

One thing concerns me. AFAIK the file /etc/cni/net.d/10-calico.conflist has to be present on all nodes, am I right? Which process/container should create it?
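A plain check on the nodes (nothing Calico-specific assumed) shows whether the kubelet has any CNI config at all:

ls -l /etc/cni/net.d/ /opt/cni/bin/    # an absent 10-calico.conflist here matches the 'cni config uninitialized' message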

@HectorB-2020
Author

Let me provide another piece of detail related to the issues with Calico.

kubectl get po -A reports pods as either ContainerCreating or CrashLoopBackOff:
E0810 11:12:41.393691   18642 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0810 11:12:41.399338   18642 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0810 11:12:41.402085   18642 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0810 11:12:41.403904   18642 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAMESPACE       NAME                                       READY   STATUS                  RESTARTS          AGE
ingress-nginx   ingress-nginx-admission-create-mtt8g       0/1     ContainerCreating       0                 14h
ingress-nginx   ingress-nginx-admission-patch-2h7l2        0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-2gnq8             0/1     ContainerCreating       0                 14h
...
ingress-nginx   nginx-ingress-controller-xn976             0/1     ContainerCreating       0                 14h
ingress-nginx   nginx-ingress-controller-z5h8b             0/1     ContainerCreating       0                 14h
kube-system     calico-kube-controllers-5fdcc56bb7-b86z9   0/1     ContainerCreating       0                 14h
kube-system     calico-node-2j5gx                          0/1     Init:CrashLoopBackOff   173 (3m28s ago)   14h
kube-system     calico-node-4ft7p                          0/1     Init:CrashLoopBackOff   173 (4m8s ago)    14h
kube-system     calico-node-7lzjs                          0/1     Init:CrashLoopBackOff   173 (5m2s ago)    14h
kube-system     calico-node-9msc8                          0/1     Init:Error              174 (5m13s ago)   14h
kube-system     calico-node-9nv7x                          0/1     Init:CrashLoopBackOff   173 (3m37s ago)   14h
...
kube-system     calico-node-v79b9                          0/1     Init:CrashLoopBackOff   173 (2m40s ago)   14h
kube-system     coredns-55d59c776b-4wxz7                   0/1     ContainerCreating       0                 14h
kube-system     coredns-autoscaler-6fc6b5cc8c-2btn6        0/1     ContainerCreating       0                 14h
kube-system     metrics-server-75dbbd96bc-25ntp            0/1     ContainerCreating       0                 14h
kube-system     rke-coredns-addon-deploy-job-7r4hh         0/1     Completed               0                 14h
kube-system     rke-ingress-controller-deploy-job-mjx4l    0/1     Completed               0                 14h
kube-system     rke-metrics-addon-deploy-job-vtqnn         0/1     Completed               0                 14h
kube-system     rke-network-plugin-deploy-job-8jrlv        0/1     Completed               0                 14h

As I've learnt so far, 10-calico.conflist is created by one of the init containers, called install-cni.
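So the next logs to look at are those of that init container; a sketch, using one of the pod names from the listing above:

kubectl -n kube-system logs calico-node-2j5gx -c install-cni
kubectl -n kube-system logs calico-node-2j5gx -c install-cni --previous    # output of the last crashed attempt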

Here is what kubectl describe shows for the install-cni init container on one of the nodes:
  install-cni:
    Container ID:  docker://0ee3b7552eadf53e01b3cd8c428e47a23ab914746c2880e500bcf1804f7b5125
    Image:         srv64.company.net:5000/rancher/calico-cni:v3.25.0-rancher1
    Image ID:      docker-pullable://srv64.company.net:5000/rancher/calico-cni@sha256:afb7ab68e52d4bf4d401179f62749701704ae8eed987116d65150325db81fabb
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
    Ready:          False
    Restart Count:  173
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qwtgv (ro)

The image is pullable from our private registry, so I decided to play a bit with that image and started it locally.

docker run -it -v $(pwd)/opt:/host/opt/cni/bin \
               -v $(pwd)/etc:/host/etc/cni/net.d \
               rancher/calico-cni:v3.25.0-rancher1
And here is what I got. At first sight, it looks like the image somehow works:
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/install
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
[INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
[INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

[INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.25.0

[INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
[INFO][1] cni-installer/<nil> <nil>: Created /host/etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "datastore_type": "kubernetes",
      "nodename": "9743a9c291e8",
      "mtu": 1500,
      "ipam": {"type": "calico-ipam"},
      "policy": {"type": "k8s"},
      "kubernetes": {"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"}
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}
[INFO][1] cni-installer/<nil> <nil>: Done configuring CNI.  Sleep= true
FATA[0001] no such file or directory                     source="install.go:294"

It copied binaries to /opt/cni/bin and created 10-calico.conflist. So what's wrong?

@manuelbuil
Contributor

Could you get the logs from the image that is crashing, please? I wonder if there is something like AppArmor or SELinux making it impossible for Calico to write binaries to /host/opt/cni/bin. We should see that in the logs.
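On CentOS/RHEL a quick way to check that (AppArmor is normally not present there, so SELinux is the usual suspect) would be:

getenforce    # prints Enforcing / Permissive / Disabled
sestatus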

@HectorB-2020
Author

@manuelbuil, this is CentOS 7.9 and I'm positive it has SELinux disabled.
My concern is that Calico is picky about the versions of the other images in the system_images: section. Once I removed that section, as advised by @superseb, the cluster started building successfully. Looked like a kind of magic to me.
Interestingly, Flannel is less capricious about the surrounding image versions: I deployed it numerous times with absolutely the same cluster.yml, including that system_images: section.
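For anyone hitting the same thing, the working cluster.yml looks roughly like this (a sketch: addresses and user are placeholders, and the whole system_images: section is simply left out so rke picks the image versions matching its own release):

nodes:
  - address: 10.0.0.11            # placeholder
    user: rancher                 # placeholder
    role: [controlplane, etcd]
  # two more controlplane/etcd nodes defined the same way
  - address: 10.0.0.14            # placeholder
    user: rancher
    role: [worker]
network:
  plugin: calico
# note: no system_images: section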

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@DeyvsonL

I'm still facing this issue. Do we have a fix?

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
