rke2 failed to start with Cilium when kubeProxyReplacement is set to strict or true #4862

Closed
gigi206 opened this issue Oct 6, 2023 · 10 comments


gigi206 commented Oct 6, 2023

Environmental Info:
RKE2 Version:

rke2 -v
rke2 version v1.25.13+rke2r1 (785512e7ae77d7471750834ee96c14382e2461ca)
go version go1.20.7 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

uname -a
Linux k8s-m1 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux

Cluster Configuration:
1 server

Describe the bug:
Cilium fails to start with the configuration described here: https://docs.rke2.io/install/network_options/

Steps To Reproduce:

  • Installed RKE2 with the following /etc/rancher/rke2/config.yaml:

disable:
  - rke2-ingress-nginx
  - rke2-canal # disabled in favor of cilium
disable-kube-proxy: true
cni:
  - cilium
tls-san:
  - k8s-api.gigix
etcd-expose-metrics: true
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0

/var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: strict
    cni:
      chainingMode: "none"

kubectl get pod -n kube-system
NAME                                                    READY   STATUS      RESTARTS        AGE
cilium-operator-5bfc85c656-7s9n5                        0/1     Pending     0               9m6s
cilium-operator-5bfc85c656-9z9lj                        0/1     Error       5 (2m38s ago)   9m6s
cilium-snwtx                                            0/1     Init:0/6    5 (96s ago)     9m6s
cloud-controller-manager-k8s-m1                         1/1     Running     0               9m13s
etcd-k8s-m1                                             1/1     Running     0               9m11s
helm-install-rke2-cilium-lsd5b                          0/1     Completed   0               9m53s
helm-install-rke2-coredns-7lnj7                         0/1     Completed   0               9m53s
helm-install-rke2-metrics-server-r9zwq                  0/1     Pending     0               9m53s
helm-install-rke2-snapshot-controller-crd-8xzh8         0/1     Pending     0               9m53s
helm-install-rke2-snapshot-controller-ppb62             0/1     Pending     0               9m53s
helm-install-rke2-snapshot-validation-webhook-8dsq5     0/1     Pending     0               9m53s
kube-apiserver-k8s-m1                                   1/1     Running     0               9m22s
kube-controller-manager-k8s-m1                          1/1     Running     0               9m11s
kube-scheduler-k8s-m1                                   1/1     Running     0               9m5s
rke2-coredns-rke2-coredns-546587f99c-zf8fd              0/1     Pending     0               9m6s
rke2-coredns-rke2-coredns-autoscaler-797c865dbd-zc5rq   0/1     Pending     0               9m6s

Additional context / logs:

kubectl logs cilium-operator-5bfc85c656-9z9lj -n kube-system
level=info msg=Starting subsys=hive
level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
level=info msg="Start hook executed" duration="481.896µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" ipAddr="https://10.43.0.1:443" subsys=k8s-client
level=error msg="Start hook failed" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" function="client.(*compositeClientset).onStart" subsys=hive
level=info msg=Stopping subsys=hive
level=info msg="Stopped gops server" address="127.0.0.1:9891" subsys=gops
level=info msg="Stop hook executed" duration="252.876µs" function="gops.registerGopsHooks.func2 (cell.go:51)" subsys=hive
level=fatal msg="failed to start: Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" subsys=cilium-operator-generic
brandond (Member) commented Oct 7, 2023

IIRC, without kube-proxy, the operator can't talk to the in-cluster apiserver endpoint to deploy the kube-proxy replacement. You need to customize the apiserver address in the cilium chart config to point at localhost.
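For example, a minimal sketch of the relevant chart values (the host and port here are assumptions based on RKE2 defaults, not taken from this issue; adjust to your environment):

kubeProxyReplacement: strict
k8sServiceHost: 127.0.0.1  # local apiserver instead of the unreachable in-cluster VIP
k8sServicePort: 6443       # default RKE2 apiserver port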

@manuelbuil should we cover this in our docs?

gigi206 (Author) commented Oct 7, 2023

Thank you, I added k8sServiceHost and k8sServicePort:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: true
    k8sServiceHost: 192.168.121.201
    k8sServicePort: 6443
    cni:
      chainingMode: "none"

And it works:

kubectl get no
NAME     STATUS   ROLES                       AGE   VERSION
k8s-m1   Ready    control-plane,etcd,master   26m   v1.25.13+rke2r1
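As a quick cross-check (a sketch, not from the original thread), the Cilium agent's own CLI reports whether the replacement is active:

kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement
# should print something like: KubeProxyReplacement:   True   [eth0 ...]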

But could you explain why, if I start rke2 the first time with the following settings (k8sServiceHost: kubernetes.default.svc.cluster.local), it fails, but if I start rke2 the first time with the host IP (k8sServiceHost: 192.168.121.201), then edit it to k8sServiceHost: kubernetes.default.svc.cluster.local and restart, it works?

    k8sServiceHost: kubernetes.default.svc.cluster.local
    k8sServicePort: 443

brandond (Member) commented Oct 7, 2023

Kube-proxy is the component that makes cluster service endpoints work - you can't access kubernetes.default.svc.cluster.local without it. That is what I meant when I said

> without kube-proxy, the operator can't talk to the in-cluster apiserver endpoint

After you've started it once, access to that in-cluster endpoint works - until the next time you reboot.
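Concretely, that in-cluster endpoint is just the ClusterIP of the kubernetes Service (the 10.43.0.1 seen in the operator logs above); without kube-proxy, nothing translates that VIP to the real apiserver address until a Cilium agent is running, which is the chicken-and-egg problem at first boot:

kubectl get svc kubernetes
# illustrative output:
# NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
# kubernetes   ClusterIP   10.43.0.1    <none>        443/TCP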

gigi206 (Author) commented Oct 7, 2023

Thank you for your reply :)
I tested with only one master, but what if I have 3 masters? Is putting 127.0.0.1 a good idea or not?

0xAlcibiades commented

> Thank you for your reply :) I tested with only one master, but what if I have 3 masters? Is putting 127.0.0.1 a good idea or not?

Same question here.

manuelbuil (Contributor) commented

> Thank you for your reply :) I tested with only one master, but what if I have 3 masters? Is putting 127.0.0.1 a good idea or not?

Same question here.

This is more of a Cilium question, but I think 127.0.0.1 should work.
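One way to sanity-check that on a given node before relying on it (a sketch; -k skips TLS verification since we only care whether anything answers locally):

curl -ks https://127.0.0.1:6443/version
# any response, even a 401 Unauthorized, means a local apiserver (or proxy) is listening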

manuelbuil (Contributor) commented

> IIRC, without kube-proxy, the operator can't talk to the in-cluster apiserver endpoint to deploy the kube-proxy replacement. You need to customize the apiserver address in the cilium chart config to point at localhost.
>
> @manuelbuil should we cover this in our docs?

It's already part of our docs, with a link to the Cilium upstream docs where things are explained in detail.

brandond (Member) commented Jan 2, 2024

Ah OK. Well in that case, I'm not sure we need to do anything. Can't help people if they don't read the docs.

@brandond brandond closed this as completed Jan 2, 2024
0xAlcibiades commented

I'm not sure I've seen it in the Cilium docs; it's definitely helpful to know that the RKE2 API server listens on 127.0.0.1 on each host.
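Putting the pieces from this thread together, a multi-server variant might look like the following (a sketch, assuming each node can reach the apiserver locally as described above; not an official recommendation):

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: true
    # each node talks to its own local listener, so Cilium does not
    # depend on any single server's IP being reachable
    k8sServiceHost: 127.0.0.1
    k8sServicePort: 6443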

dcorbe commented Aug 15, 2024

This should ABSOLUTELY be in the cilium docs, because I ran into this issue on vanilla k3s (not RKE).
