[EKS NodeGroups] [VersionUpdate]: VersionUpdate needs more information on errors and runtime #2522

Open
leefinfor opened this issue Jan 20, 2025 · 0 comments
Labels: EKS Managed Nodes, EKS (Amazon Elastic Kubernetes Service), Proposed (Community submitted issue)

leefinfor commented Jan 20, 2025

Tell us about your request

VersionUpdate has a start time; it also needs an end time.

NodeCreationFailure and PodEvictionFailure are ambiguous, each with several potential root causes. There should be a more detailed message when a VersionUpdate goes into an error state, to aid diagnosis.

Errors in VersionUpdate should be timestamped.

Which service(s) is this request for?

AWS EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

VersionUpdate can fail for many reasons, as documented below. However, these all appear to fall into NodeCreationFailure or PodEvictionFailure.
https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html

It is very difficult to establish the root cause, the timing of errors, and the total runtime of a VersionUpdate.

VersionUpdate does have a start time, but the errors are not timestamped, and timestamps are critical when attempting to diagnose problems.

VersionUpdate does not have an end time, which makes it difficult to establish the runtime of updates. This is important when we are measuring maintenance windows.

According to the documentation, NodeCreationFailure and PodEvictionFailure each have several possible causes, but the Errors do not state which cause applied.

It is important for AWS EKS to report the reason for moving a VersionUpdate to an Error status with as much precision as possible, to assist troubleshooting efforts.

For instance, I spent a long time working with AWS support only to identify that the total runtime for each new node is capped at 15 minutes. This is not mentioned as a possible cause of NodeCreationFailure and is not recorded in the error as the root cause. On top of the 4 documented possibilities, node bootstrap taking more than 15 minutes is a 5th, undocumented root cause.

Couple this with the lack of timing information, and an issue caused by nodes taking more than 15 minutes to bootstrap is very difficult to diagnose.

Are you currently working around this issue?

If we begin to encounter errors, it is necessary to watch node creation carefully. The root cause can only be gleaned by catching it in real time or by conducting a lengthy and costly investigation using CloudTrail.

CloudTrail does not record the end time for VersionUpdate.
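
As a rough illustration of the "catch it in real time" workaround, here is a minimal sketch that polls the update and records when errors and the final status were first observed. It assumes a boto3 EKS client is passed in as `eks` and relies on its `describe_update` call, whose response carries `status`, `createdAt` and `errors`. The function name `watch_update`, its parameters, and the result keys are hypothetical, and the recorded times are only as accurate as the polling interval. They are client-side observations, not timestamps from EKS.

```python
import time
from datetime import datetime, timezone

# Statuses after which the update no longer changes.
TERMINAL_STATUSES = {"Successful", "Failed", "Cancelled"}


def watch_update(eks, cluster, nodegroup, update_id, poll_seconds=30,
                 sleep=time.sleep, now=lambda: datetime.now(timezone.utc)):
    """Poll an update until it finishes; return observed timing details.

    `eks` is assumed to behave like boto3.client("eks").
    `sleep` and `now` are injectable so the loop can be tested offline.
    """
    errors_first_seen = {}
    while True:
        update = eks.describe_update(
            name=cluster, nodegroupName=nodegroup, updateId=update_id
        )["update"]
        observed = now()
        # The API does not timestamp errors, so note when each first appears.
        for err in update.get("errors", []):
            key = (err.get("errorCode"), err.get("errorMessage"))
            errors_first_seen.setdefault(key, observed)
        if update["status"] in TERMINAL_STATUSES:
            return {
                "status": update["status"],
                "started": update["createdAt"],
                "observed_end": observed,  # approximate end time
                "runtime": observed - update["createdAt"],
                "errors_first_seen": errors_first_seen,
            }
        sleep(poll_seconds)
```

With real credentials this might be called as `watch_update(boto3.client("eks"), "my-cluster", "my-nodegroup", update_id)`, where the cluster and node group names are placeholders. It has to be running while the update is in progress, which is exactly the limitation this request is about.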

Additional context

Feedback was given to the authors of the documentation page below, asking them to add the 15-minute node creation time limit as another root cause for NodeCreationFailure. The feedback also asked them to make it clear that the 15 minutes appears to include the time for EKS to initiate scale-out, EC2 instance bootstrap, custom UserData, and the time taken to run the script which connects the EC2 instance as an EKS node.

https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html
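
To make the time budget concrete, here is a small sketch of the arithmetic, assuming you have already collected two timestamps yourself: when scale-out began and when the node became Ready. The 15-minute value is the limit reported in this issue, not a figure taken from the public documentation. The function name `bootstrap_report` and its result keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Limit as reported in this issue; an assumption, not a documented constant.
NODE_CREATION_LIMIT = timedelta(minutes=15)


def bootstrap_report(scale_out_started, node_ready, limit=NODE_CREATION_LIMIT):
    """Compare the time a node took to join against the assumed limit."""
    elapsed = node_ready - scale_out_started
    return {
        "elapsed": elapsed,
        "over_limit": elapsed > limit,
        "headroom": limit - elapsed,  # negative when the limit was exceeded
    }


start = datetime(2025, 1, 20, 10, 0, tzinfo=timezone.utc)
report = bootstrap_report(start, start + timedelta(minutes=17))
print(report["over_limit"], report["headroom"])
```

Because the window appears to cover everything from scale-out to the node joining, measuring only from EC2 launch time would understate the elapsed time.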

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@leefinfor leefinfor added the Proposed Community submitted issue label Jan 20, 2025
@mikestef9 mikestef9 added EKS Amazon Elastic Kubernetes Service EKS Managed Nodes EKS Managed Nodes labels Jan 20, 2025