Flaky VIP after CNI replacement #9593
-
Hello, I ran the steps to replace the Talos CNI with Cilium and remove kube-proxy. It seems to work, however the master nodes' VIP has become flaky, constantly being removed and added again, so the control plane keeps losing connectivity. etcd works and sees that all members are up and running; everything is healthy except kubelet, which has health check errors. The kube-proxy and flannel manifests are not present and all nodes have been restarted. Is there anything left to do so the VIP stays up?
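In case it helps, the relevant fragments of my machine config look roughly like the sketch below (the interface name and VIP address are placeholders for my real values):

```yaml
cluster:
  network:
    cni:
      name: none        # disable the bundled Flannel CNI so Cilium can take over
  proxy:
    disabled: true      # do not deploy kube-proxy; Cilium replaces it
machine:
  network:
    interfaces:
      - interface: eth0           # placeholder interface name
        dhcp: true
        vip:
          ip: 192.168.1.100       # placeholder shared control-plane VIP
```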
Replies: 2 comments
-
There are no logs in your question, so I can't help at all. Replacing CNIs is a complicated process, so do it at your own risk (it is much easier to redeploy the cluster from scratch).
https://www.talos.dev/v1.8/introduction/troubleshooting/
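For example, a few commands along the lines of that guide that help pinpoint VIP flapping (node IPs are placeholders; the grep assumes the VIP acquire/release events appear in the controller runtime logs, which they normally do):

```sh
# Check etcd membership and per-member status across the control plane
talosctl -n 10.0.0.1,10.0.0.2,10.0.0.3 etcd members
talosctl -n 10.0.0.1,10.0.0.2,10.0.0.3 etcd status

# VIP acquire/release events are logged by the controller runtime
talosctl -n 10.0.0.1 logs controller-runtime | grep -i vip

# Overall service state and cluster health checks
talosctl -n 10.0.0.1 services
talosctl -n 10.0.0.1 health
```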
-
Well, I came as close to deploying from scratch as possible with a full disaster recovery reset :) Unfortunately there are only logs about context timeouts because the VIP disappeared. It looks like the etcd cluster went out of sync while the CNI was being replaced and did not want to recover. Weirdly, all the commands I know showed it as healthy... In any case, I did a disaster recovery, then reset the one node that went out of sync again afterwards, and now it seems stable. Thanks for the commands though, they helped.
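For anyone who finds this later, the recovery flow I used was roughly the documented one (node IPs and the snapshot name are placeholders):

```sh
# 1. Take an etcd snapshot from a still-healthy control-plane node
talosctl -n 10.0.0.1 etcd snapshot db.snapshot

# 2. After wiping etcd state on the control-plane nodes, re-bootstrap from the snapshot
talosctl -n 10.0.0.1 bootstrap --recover-from=./db.snapshot

# 3. Reset the one node that kept going out of sync so it rejoins cleanly
talosctl -n 10.0.0.3 reset --graceful=false --reboot
```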