Issue with AMI Release v20240817, AWS CNI not working #1936

Closed
jma-amboss opened this issue Aug 26, 2024 · 6 comments
@jma-amboss

What happened:

Upgrading the EKS AMI from release v20240807 to release v20240817 breaks the AWS CNI. Reverting to the previous AMI resolves the issue.
CNI version: v1.18.3

What you expected to happen:
A minor AMI release upgrade should not break cluster networking.

How to reproduce it (as minimally and precisely as possible):
Upgrade EKS worker nodes to AMI Release v20240817
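
For a managed node group, that upgrade looks roughly like the following (a sketch; the cluster name, node group name, and exact release version are placeholders and must match your cluster):

    # Hypothetical example: move a managed node group to the v20240817 AMI release.
    # The release version must match the cluster's Kubernetes minor version,
    # e.g. 1.28.<patch>-20240817 for a 1.28 cluster.
    aws eks update-nodegroup-version \
      --cluster-name <cluster-name> \
      --nodegroup-name <nodegroup-name> \
      --release-version 1.28.<patch>-20240817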

Anything else we need to know?:
IP addresses are not being assigned correctly, and kube-proxy logs errors such as:
E0826 15:52:11.209793 1 reflector.go:147] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://<redacted>.<redacted>.us-east-1.eks.amazonaws.com/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": dial tcp: lookup <redacted>.<redacted>.us-east-1.eks.amazonaws.com on [::1]:53: dial udp [::1]:53: connect: cannot assign requested address
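
The dial udp [::1]:53 part of that error suggests kube-proxy is falling back to a loopback resolver when looking up the cluster endpoint. A minimal check of the node's DNS setup, assuming you can still reach the node over SSH or SSM (the endpoint hostname below is a placeholder):

    # The node resolver should point at the VPC DNS server (VPC CIDR base + 2,
    # or 169.254.169.253), not ::1 or 127.0.0.1.
    cat /etc/resolv.conf

    # Confirm the cluster endpoint resolves from the node.
    nslookup <redacted>.<redacted>.us-east-1.eks.amazonaws.com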

There are also error messages in the node-level logs:

{"level":"error","ts":"2024-08-26T14:34:42.760Z","logger":"controller-runtime.source.EventHandler","caller":"source/kind.go:68","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"[https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1](https://172.20.0.1/apis/networking.k8s.aws/v1alpha1)\": dial tcp 172.20.0.1:443: connect: connection timed out"} {"level":"error","ts":"2024-08-26T14:34:52.745Z","logger":"controller-runtime.source.EventHandler","caller":"source/kind.go:68","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"[https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1](https://172.20.0.1/apis/networking.k8s.aws/v1alpha1)\": dial tcp 172.20.0.1:443: connect: connection timed out"}

We also noted that there is no connectivity from the EC2 instance to the resolved IP address of the EKS cluster endpoint.
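
A minimal sketch for confirming that from the node (the service IP is the one from the logs above; the endpoint hostname is a placeholder):

    # TCP check against the in-cluster kubernetes service IP seen in the logs above.
    timeout 5 bash -c 'cat < /dev/null > /dev/tcp/172.20.0.1/443' && echo reachable || echo unreachable

    # If the network path is healthy, an unauthenticated request to the EKS endpoint
    # should still get an HTTP response (a version payload or 401/403) rather than timing out.
    curl -vk -m 5 https://<redacted>.<redacted>.us-east-1.eks.amazonaws.com/version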

Environment:

  • AWS Region: us-east-1
  • CNI version: v1.18.3
  • Instance Type(s): t3.2xlarge, t3.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.16
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.28
  • AMI Version: ami-0d364e0801521b622
  • Kernel (e.g. uname -a): unknown, already reverted the AMI
  • Release information (run cat /etc/eks/release on a node): unknown, already reverted the AMI
jma-amboss changed the title from "Issue with AMI Release v20240807, AWS CNI not working" to "Issue with AMI Release v20240817, AWS CNI not working" on Aug 26, 2024
@cartermckinnon
Member

@jma-amboss can you open a case with AWS support so we can get some more information? We're not able to reproduce anything like this.

@cartermckinnon
Member

cartermckinnon commented Aug 27, 2024

@ing-ash, @shubha-shyam, @asri-badlah if any of you have seen this issue and can open an AWS support case, it'd be really helpful. We need logs from an instance having this problem to determine a root cause.
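
For anyone who can reproduce this, node logs for a support case can be gathered with the EKS log collector script; a sketch, assuming the script path on this repository's default branch:

    # Run on an affected node; writes a tarball under /var/log to attach to the case.
    curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh
    sudo bash eks-log-collector.sh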

@jma-amboss
Author

@cartermckinnon Unfortunately, we don't have an AWS support plan. We can only contact our reseller, which provides us with basic support.

@cartermckinnon
Member

Understood; do you see any evidence in the containerd logs that this could be related to #1933?
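
A sketch of how to pull those, assuming systemd-managed containerd and kubelet as on the stock AMI:

    # Look for containerd/kubelet errors around the time the CNI failures start.
    sudo journalctl -u containerd --since "2024-08-26 14:00"
    sudo journalctl -u kubelet --since "2024-08-26 14:00"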

@apenney

apenney commented Aug 31, 2024

I don't know if my issue is exactly the same, but I was just moving from AL2 to AL2023 and hit something similar. I see CoreDNS issues (it can't talk to the Kubernetes API), but if I kill the pod several times in a row it eventually stops complaining.

The same goes for other pods that talk to the Kubernetes API: if I restarted them enough times they would eventually report healthy, but things were still glitchy and I had ongoing DNS issues. Reverting to AL2 immediately fixed things.
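
For reference, the restart workaround above amounts to something like this (a sketch; the deployment name and label assume the standard EKS kube-system add-on):

    # Bounce CoreDNS and watch whether fresh pods can reach the API server.
    kubectl -n kube-system rollout restart deployment coredns
    kubectl -n kube-system get pods -l k8s-app=kube-dns -w
    kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50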

I opened a case (172513694900840), so if there's anything I can gather for you, I'm happy to help troubleshoot.

I dug around in the logs on the nodes themselves (just looking with journalctl -ef), but I didn't see anything that leapt out at me as obviously broken.

I have a cluster I can switch back over to these nodes for testing purposes if you need me to run anything specific (I rolled everything back for now and did not try to downgrade the CNI). This was EKS 1.29, for reference.

@cartermckinnon
Member

@apenney that sounds like a different issue; the original report was on AL2.

I don't see a smoking gun here. Timeouts to the API server can happen for many reasons, and without more information we can't really narrow down the cause. I'm not aware of any issues in us-east-1 at the time this was occurring. If you can provide more information, please @ mention me.

cartermckinnon closed this as not planned Sep 6, 2024