After upgrading my two Talos clusters to Cilium 1.16.5, I immediately started having external DNS resolution issues on one cluster. CoreDNS started throwing these errors, and things quickly started going sideways:
Reverting to 1.16.4 made the problem go away. I posted this on the Cilium issues board as #36737, where other Talos users started chiming in with similar stories.
There, sfackler noted that the Talos dns-resolve-cache logs show it receiving the requests and resolving them successfully, so it seems the response just isn't making it back to the CoreDNS pod.
I did some digging around the Talos DNS docs and noticed the cluster with issues was created with Talos 1.8.0 or higher, while the other one was created long before 1.8.0. As such, forwardKubeDNSToHost was enabled by default on the problem cluster, while the other does not have it enabled.
After patching the problem cluster to disable forwardKubeDNSToHost and restarting CoreDNS, the problem immediately went away.
Since forwardKubeDNSToHost is a default option now, I suspect others may come across this issue, so it's probably best to get to the bottom of it. Unsure if it's a Talos problem or a Cilium one.
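The exact patch isn't reproduced in this copy of the report. As a rough sketch only, disabling the option would probably look something like the machine config patch below, assuming the setting lives under `machine.features.hostDNS` as described in the Talos DNS docs. The `enabled: true` line is an assumption that host DNS itself should stay on.

```yaml
# Hypothetical Talos machine config patch (not the original from this report).
# Assumes the option sits under machine.features.hostDNS.
machine:
  features:
    hostDNS:
      enabled: true                 # assumption: keep host DNS caching itself on
      forwardKubeDNSToHost: false   # stop forwarding CoreDNS queries to the host resolver
```

Something along the lines of `talosctl patch machineconfig --patch @patch.yaml` against each node, followed by `kubectl -n kube-system rollout restart deployment coredns`, should apply it. The file name and the CoreDNS deployment name are assumptions here.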
Environment
Talos version: 1.9.0
Kubernetes version: 1.32.0
Platform: ARM64 and AMD64
As per cilium/cilium#36737 (comment), Cilium now uses BPF Host Routing in 1.16.5, which conflicts with forwardKubeDNSToHost in Talos. Setting bpf.hostLegacyRouting=true in your Cilium values.yaml reverts to the behaviour used in 1.16.4 and earlier, so there is no need to disable forwardKubeDNSToHost in Talos.
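For anyone landing here, the setting mentioned above would look roughly like this in a Helm values file. This is a minimal fragment showing only that one key; the rest of your values are assumed to stay unchanged.

```yaml
# Cilium Helm values.yaml fragment: fall back to legacy host routing,
# i.e. the behaviour described above for 1.16.4 and earlier.
bpf:
  hostLegacyRouting: true
```

The Cilium agent pods will most likely need a restart after the upgrade for the change to take effect.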
Not sure who's really at fault here or what should be done next.