Windows: networking cannot route inside the cluster (so no DNS) #4320
Comments
Hi, which pod are you trying to contact? Are you running ping on the Windows node or inside the pod shell?
Hi there. I'm running ping inside the pod shell. It cannot contact anything on the Linux node in the cluster, including kube-dns, so it cannot do name resolution, but it can contact other pods on the same node. I've disabled the firewall on both nodes.
How did you configure RKE2? I presume that you are using Calico as the CNI.
Are you sure that you are executing those commands inside a pod? Could you check the status of the deployed pod on Windows?
Yes, it's configured with Calico as the CNI.
I have just tested and things seem to work in my environment. Could you also verify that from a Linux node you can ping a Windows node (or the other way around)?
My concern is related to how Rancher exposes the Windows CLI and whether the pod is created correctly.
I tested a cluster deployed with Rancher, with a Linux master node and a Windows worker node.
I have the same issue: a GitHub runner on the Windows node of a two-node cluster. It can ping 8.8.8.8 but not my Linux node, which in turn means DNS resolution doesn't work. Also running WS 2022, RKE2 (v1.25.10), and Calico.
And yes, I can ping the Windows node from my Linux node, but I can't ping the pod IP of the Windows pod from a Linux pod.
Not sure if it's related, but I do have two CoreDNS pods: rke2-coredns-rke2-coredns-6b9548f79f-4x4pf is running on the Linux node; the second is pending because it is set to run only on Linux nodes but has anti-affinity enabled. Is this supposed to run on each node, including Windows?
It shouldn't try to run the CoreDNS pod on Windows. On your setup, the pods on Windows can reach the internet but they can't ping other pods on the Linux node?
@rbrtbnfgl Correct. Same issue as the OP. And the Windows machine is a fresh install of WS 2022 and RKE2.
I checked: the CoreDNS pod will be pending on the Windows node, so what happens to you is correct. The DNS query should be forwarded to the Linux control-plane node.
Seeing these logs repeating for calico-node:
Thanks, these could be useful. I'll check whether they could be the reason for the issue.
Are you by any chance using jumbo frames on the Windows nodes but not on the Linux nodes, or vice versa?
@rbrtbnfgl Yes, the logs are from calico-node on Linux. Not sure what 10.42.61.x is; my internal host network is 192.168.1.x.
@brandond Linux:
I am not that familiar with Windows, but as you can see above my NIC isn't connected while the Hyper-V one is. It isn't using jumbo frames either. I tried enabling jumbo frames on both interfaces but it didn't make any difference. It looks like it may be a known issue with Calico and Hyper-V.
I ran Get-HNSEndpoint and it shows a bunch of Ethernet items, and this is the pod I am testing with. It's a GitHub Actions runner that fails to access api.github.com.
Hi @nadenf, the logs seem to be unrelated to your issue because I get the same logs and the network works fine for me. How do you deploy the pod: is it a Deployment, or is it GitHub that is directly accessing your cluster? Which container image is used?
I'll check whether the Hyper-V issue could be related somehow.
This is the image.
I tried your pod and the network is working fine.
I disabled the firewall and related security features, but otherwise I just followed the instructions on the Quick Start page.
Are the Windows nodes on the same network as the other nodes, or are they hosted remotely? What sort of connectivity do the different node types have?
Connectivity between the kubelet and apiserver is not the same as connectivity between pods. Kubelet-apiserver traffic is a simple HTTPS request. CNI traffic uses the VXLAN overlay network, which is much more fragile in terms of fragmentation and potential for blocking by intermediate network security devices. It is not uncommon for traffic to the apiserver to work fine while VXLAN traffic gets blocked, dropped, or mangled by intermediate network devices.
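One quick way to test the fragmentation angle is to check the VXLAN overhead arithmetic and probe the path MTU between nodes. A sketch, assuming a standard 1500-byte link; NODE_IP and the interface are placeholders:

```shell
# VXLAN encapsulation adds roughly 50 bytes (outer Ethernet + IP + UDP +
# VXLAN headers), so on a standard 1500-byte link the inner packet must
# fit in about 1450 bytes.
OUTER_MTU=1500
VXLAN_OVERHEAD=50
INNER_MTU=$((OUTER_MTU - VXLAN_OVERHEAD))
echo "inner MTU: $INNER_MTU"

# From a Linux node, probe the path MTU toward the peer node with the
# don't-fragment bit set (28 bytes covers the IP and ICMP headers):
#   ping -M do -c 3 -s $((INNER_MTU - 28)) NODE_IP
# If this times out while a small ping succeeds, something on the path
# is dropping or fragmenting VXLAN-sized packets.
```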
Added a ConnectX-6 card to each node (which supports VXLAN), configured it for Ethernet mode, directly connected the nodes via QSFP (i.e. no switch), and reinstalled RKE2. Same issue.
You could try to capture the traffic on the Windows physical interface to check whether the VXLAN traffic is present or is dropped inside the Windows node.
Yes, directly connected; no VMs other than Hyper-V on the Windows side. Will look into sniffing the traffic.
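A minimal capture sketch for both sides; the interface name and the exact pktmon invocation are assumptions (pktmon ships with recent Windows Server releases):

```shell
# VXLAN travels as UDP on port 4789 by default.
VXLAN_PORT=4789
echo "capturing UDP port $VXLAN_PORT"

# On the Linux node (replace eth0 with the real interface):
#   tcpdump -ni eth0 udp port 4789
# On the Windows node, in an elevated shell:
#   pktmon filter add vxlan -p 4789
#   pktmon start --capture
#   ... reproduce the failing ping ...
#   pktmon stop
# If packets leave Linux but never appear on the Windows side (or vice
# versa), the drop is on the wire; if they arrive but get no reply, the
# drop is inside the Windows networking stack.
```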
Sorry for the delay, I have just seen this reply. One node is using ... This is how the interface is picked in rke2-windows: https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L277-L281
I think that Linux is using the wrong interface, not Windows. You should use the ...
@manuelbuil Great. It looks like the issue is actually on my Linux node, which is running rke2-server. Looking at the rke2-calico chart, there is no way for me to configure the Calico IP autodetection behaviour. It would also be good to document this, since having multiple interfaces is pretty common and right now it just picks the first one it finds. Even better would be if the install script could detect multiple interfaces and ask the user which one to use. Should I raise a separate issue, or can I submit a PR to update the chart to add support for it?
Are you using the Rancher UI, or are you running the binary directly? The chart used is the same as the upstream Calico chart with patches, so you should be able to configure it in the same way as the upstream chart. You could add your configuration on the path ...
I followed the Quick Start guide, and then I created the following file at rke2-calico-config.yaml to force it to use the right interface:
I then did a kubectl apply and nothing happened. The documentation on this page really isn't clear about what exactly I am supposed to do.
You don't need to apply it through kubectl; it's enough to add that file to the path I mentioned in my previous comment and start RKE2.
Doing this is dangerous because you are specifying that Calico should look for ...
Blocked on #4403 |
/etc/rancher/rke2/config.yaml:
And after restarting/killing etc. it still shows the old value:
I am assuming from the details above that the issue is the Calico autodetection.
So this is the file used to configure Calico. What I need is to be able to configure the behaviour described here, i.e. use the node IP instead of autodetection. Is it worth making node IP the default?
If you start RKE2 with the following Calico Helm configuration, it should work in your case:
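For reference, the usual way to override rke2-calico values is a HelmChartConfig placed in /var/lib/rancher/rke2/server/manifests/ before starting RKE2. A sketch along those lines; the interface name eth1 and the exact values layout are assumptions to adapt to your setup:

```yaml
# /var/lib/rancher/rke2/server/manifests/rke2-calico-config.yaml
# Sketch only: pins Calico's node IP autodetection to a specific
# interface instead of letting it pick the first one it finds.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-calico
  namespace: kube-system
spec:
  valuesContent: |-
    installation:
      calicoNetwork:
        nodeAddressAutodetectionV4:
          interface: eth1   # placeholder: the interface on the node network
```

RKE2 applies manifests from that directory automatically at startup, which is why a manual kubectl apply is not needed.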
After applying the above:
See #1353 (comment) - working solution. |
Closing, as this issue has a known working workaround.
Environmental Info:
rke2 version v1.24.13+rke2r1 (05a2e96)
(control plane)
Linux 5.10.0-18-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
(worker)
Windows 2022 Standard
Cluster Configuration:
One Linux node to hold the control plane, one Windows worker node
Describe the bug:
When starting a container on the Windows node, it can access external IP addresses but it cannot access other IPs in the cluster. This means DNS doesn't work, so I can't access external things except by IP.
Steps To Reproduce:
I installed RKE2 using the deployment from Rancher, which gave instructions to deploy the various node types on the control-plane node and the Windows worker node. For the Windows node this is done by running:
curl.exe -fL https://rancher.apama.com/wins-agent-install.ps1 -o install.ps1; Set-ExecutionPolicy Bypass -Scope Process -Force; ./install.ps1 -Server https://rancher.apama.com -Label 'cattle.io/os=windows' -Token (redacted) -Worker -CaChecksum (redacted)
I then deployed a pod through the Rancher UI using mcr.microsoft.com/windows/servercore:ltsc2022 as the image and got a shell to run the commands below.
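For anyone reproducing this without the Rancher UI, a minimal pod manifest along these lines (the pod name and sleep command are placeholders) schedules onto the Windows node with the same image:

```yaml
# Sketch: a Windows Server Core pod pinned to Windows nodes via nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: win-test        # placeholder name
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: shell
    image: mcr.microsoft.com/windows/servercore:ltsc2022
    command: ["powershell", "-Command", "Start-Sleep -Seconds 3600"]
```

Then `kubectl exec -it win-test -- cmd` gives the shell used for the ping tests below.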
Expected behavior:
DNS lookups to work and networking to be able to route between the pods
Actual behavior:
(another container on the same Windows node)
C:\>ping 10.42.215.77
Pinging 10.42.215.77 with 32 bytes of data:
Reply from 10.42.215.77: bytes=32 time<1ms TTL=128
(an external IP address)
C:\>ping 8.8.8.8
Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=1ms TTL=114
(a container on the Linux node in the cluster)
C:\>ping 10.42.212.149
Pinging 10.42.212.149 with 32 bytes of data:
Request timed out.
Doing a DNS lookup:
C:\>nslookup teamcity.apama.com 10.43.0.10
DNS request timed out.
timeout was 2 seconds.
Server: UnKnown
Address: 10.43.0.10
Additional context / logs:
The control plane has a lot of pods running in various -system namespaces, but the worker doesn't have any. I was expecting at least one to handle the networking.