Issue with AMI Release v20240817, AWS CNI not working #1936

Closed
jma-amboss opened this issue Aug 26, 2024 · 6 comments
@jma-amboss

What happened:

Upgrading the EKS AMI from release v20240807 to release v20240817 breaks the AWS CNI. Reverting to the previous AMI resolves the issue.
CNI version: v1.18.3

What you expected to happen:
A minor AMI release upgrade should not break cluster networking.

How to reproduce it (as minimally and precisely as possible):
Upgrade EKS worker nodes to AMI Release v20240817
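
For a managed node group, that upgrade looks roughly like the following (a sketch; the cluster name, node group name, and exact release version are placeholders and must match your cluster):

    # Hypothetical example: move a managed node group to the v20240817 AMI release.
    # The release version must match the cluster's Kubernetes minor version,
    # e.g. 1.28.<patch>-20240817 for a 1.28 cluster.
    aws eks update-nodegroup-version \
      --cluster-name <cluster-name> \
      --nodegroup-name <nodegroup-name> \
      --release-version 1.28.<patch>-20240817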

Anything else we need to know?:
IP addresses are not being assigned correctly, and kube-proxy logs errors such as:
E0826 15:52:11.209793 1 reflector.go:147] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://<redacted>.<redacted>.us-east-1.eks.amazonaws.com/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": dial tcp: lookup <redacted>.<redacted>.us-east-1.eks.amazonaws.com on [::1]:53: dial udp [::1]:53: connect: cannot assign requested address
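
The dial udp [::1]:53 part of that error suggests kube-proxy is falling back to a loopback resolver when looking up the cluster endpoint. A minimal check of the node's DNS setup, assuming you can still reach the node over SSH or SSM (the endpoint hostname below is a placeholder):

    # The node resolver should point at the VPC DNS server (VPC CIDR base + 2,
    # or 169.254.169.253), not ::1 or 127.0.0.1.
    cat /etc/resolv.conf

    # Confirm the cluster endpoint resolves from the node.
    nslookup <redacted>.<redacted>.us-east-1.eks.amazonaws.com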

There are also error messages in the node-level logs:

{"level":"error","ts":"2024-08-26T14:34:42.760Z","logger":"controller-runtime.source.EventHandler","caller":"source/kind.go:68","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"[https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1](https://172.20.0.1/apis/networking.k8s.aws/v1alpha1)\": dial tcp 172.20.0.1:443: connect: connection timed out"} {"level":"error","ts":"2024-08-26T14:34:52.745Z","logger":"controller-runtime.source.EventHandler","caller":"source/kind.go:68","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"[https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1](https://172.20.0.1/apis/networking.k8s.aws/v1alpha1)\": dial tcp 172.20.0.1:443: connect: connection timed out"}

We also noted that there is no connectivity from the EC2 instance to the resolved IP address of the EKS cluster endpoint.
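
A minimal sketch for confirming that from the node (the service IP is the one from the logs above; the endpoint hostname is a placeholder):

    # TCP check against the in-cluster kubernetes service IP seen in the logs above.
    timeout 5 bash -c 'cat < /dev/null > /dev/tcp/172.20.0.1/443' && echo reachable || echo unreachable

    # If the network path is healthy, an unauthenticated request to the EKS endpoint
    # should still get an HTTP response (a version payload or 401/403) rather than timing out.
    curl -vk -m 5 https://<redacted>.<redacted>.us-east-1.eks.amazonaws.com/version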

Environment:

  • AWS Region: us-east-1
  • CNI version: v1.18.3
  • Instance Type(s): t3.2xlarge, t3.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.16
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.28
  • AMI Version: ami-0d364e0801521b622
  • Kernel (e.g. uname -a): unknown, already reverted the AMI
  • Release information (run cat /etc/eks/release on a node): unknown, already reverted the AMI
jma-amboss changed the title from "Issue with AMI Release v20240807, AWS CNI not working" to "Issue with AMI Release v20240817, AWS CNI not working" on Aug 26, 2024
@cartermckinnon
Member

@jma-amboss can you open a case with AWS support so we can get some more information? We're not able to reproduce anything like this.

@cartermckinnon
Member

cartermckinnon commented Aug 27, 2024

@ing-ash, @shubha-shyam, @asri-badlah if any of you have seen this issue and can open an AWS support case, it'd be really helpful. We need logs from an instance having this problem to determine a root cause.
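
For anyone who can reproduce this, node logs for a support case can be gathered with the EKS log collector script; a sketch, assuming the script path on this repository's default branch:

    # Run on an affected node; writes a tarball under /var/log to attach to the case.
    curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh
    sudo bash eks-log-collector.sh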

@jma-amboss
Author

@cartermckinnon Unfortunately, we don't have an AWS support plan. We can only contact our reseller, which provides us with basic support.

@cartermckinnon
Member

Understood; do you see any evidence in the containerd logs that this could be related to #1933?
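
A sketch of how to pull those, assuming systemd-managed containerd and kubelet as on the stock AMI:

    # Look for containerd/kubelet errors around the time the CNI failures start.
    sudo journalctl -u containerd --since "2024-08-26 14:00"
    sudo journalctl -u kubelet --since "2024-08-26 14:00"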

@apenney

apenney commented Aug 31, 2024

I don't know if my issue is exactly the same, but I was just moving from AL2 to AL2023 and hit something similar. I see CoreDNS issues (it can't talk to the Kubernetes API), but if I kill the pod several times in a row it eventually stops complaining.

The same goes for other pods that talk to the Kubernetes API: if I restarted them enough times they would eventually report healthy, but things were still glitchy and I had ongoing DNS issues. Reverting to AL2 immediately fixed things.
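
For reference, the restart workaround above amounts to something like this (a sketch; the deployment name and label assume the standard EKS kube-system add-on):

    # Bounce CoreDNS and watch whether fresh pods can reach the API server.
    kubectl -n kube-system rollout restart deployment coredns
    kubectl -n kube-system get pods -l k8s-app=kube-dns -w
    kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50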

I opened a case (172513694900840), so if there's anything I can gather for you, I'm happy to help troubleshoot.

I dug around in the logs on the nodes themselves (just looking with journalctl -ef), but I didn't see anything that leapt out at me as obviously broken.

I have a cluster I can switch back over to these nodes for testing purposes if you need me to run anything specific (I rolled everything back for now and did not try to downgrade the CNI). This was EKS 1.29, for reference.

@cartermckinnon
Member

@apenney that sounds like a different issue; the original report was on AL2.

I don't see a smoking gun here. Timeouts to the API server can happen for many reasons, and without more information we can't really narrow down the cause. I'm not aware of any issues in us-east-1 at the time this was occurring. If you can provide more information, please @ mention me.

cartermckinnon closed this as not planned Sep 6, 2024