Cilium 1.16.5 breaks external DNS resolution with forwardKubeDNSToHost enabled #10002

Open
kenlasko opened this issue Dec 20, 2024 · 2 comments

Comments


kenlasko commented Dec 20, 2024

Bug Report

After upgrading my two Talos clusters to Cilium 1.16.5, I immediately started having external DNS resolution issues on one cluster. CoreDNS started throwing these errors, and things quickly started going sideways:

[INFO] 10.244.0.38:41485 - 15314 "A IN hooks.slack.com. udp 33 false 512" - - 0 2.000258485s
[ERROR] plugin/errors: 2 hooks.slack.com. A: read udp 10.244.0.27:54684->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.38:41485 - 48588 "AAAA IN hooks.slack.com. udp 33 false 512" - - 0 2.000263845s
[ERROR] plugin/errors: 2 hooks.slack.com. AAAA: read udp 10.244.0.27:34832->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.38:41485 - 15314 "A IN hooks.slack.com. udp 33 false 512" - - 0 2.001438855s 

Reverting to 1.16.4 made the problem go away. I posted this on the Cilium issues board as #36737, where other Talos users started chiming in with similar stories.

sfackler noted:

The Talos dns-resolve-cache logs show that it is receiving the requests and resolving them successfully, so it seems like the response just isn't making it back to the CoreDNS pod.

I did some digging around the Talos DNS docs and noticed the cluster with issues was created with Talos 1.8.0 or higher, while the other one was created long before 1.8.0. As such, forwardKubeDNSToHost was enabled by default on the problem cluster, while the other does not have it enabled.

I patched the problem cluster with:

machine:
  features:
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: false

After restarting CoreDNS, the problem immediately went away.

Since forwardKubeDNSToHost is now enabled by default, I suspect others may come across this issue, so it's probably best to get to the bottom of it. Unsure if it's a Talos problem or a Cilium problem.

Environment

  • Talos version: 1.9.0
  • Kubernetes version: 1.32.0
  • Platform: ARM64 and AMD64
Member

smira commented Dec 20, 2024

It is certainly a Cilium issue: Cilium decides not to deliver a packet which is perfectly valid.

@kenlasko
Author

As per cilium/cilium#36737 (comment), Cilium 1.16.5 now uses BPF Host Routing, which conflicts with forwardKubeDNSToHost in Talos. Setting bpf.hostLegacyRouting=true in your Cilium values.yaml reverts to the behaviour of 1.16.4 and earlier, which removes the need to disable forwardKubeDNSToHost in Talos.
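
For reference, a minimal sketch of what that setting might look like as a values.yaml fragment, assuming Cilium is installed from its Helm chart. Only the bpf.hostLegacyRouting key comes from this thread; the rest of the values file is omitted here.

```yaml
# Fragment of Cilium Helm values (values.yaml); merge into your existing values.
# Assumption: Cilium is deployed via the Helm chart.
bpf:
  # Fall back to legacy host routing (the 1.16.4 and earlier behaviour)
  # instead of BPF Host Routing.
  hostLegacyRouting: true
```

With this in place, the Talos hostDNS patch that sets forwardKubeDNSToHost to false should no longer be needed, per the linked Cilium comment.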

Not sure who's really at fault here or what should be done next.
