[EKS NodeGroups] [VersionUpdate]: VersionUpdate needs more information on errors and runtime #2522
Labels: EKS Managed Nodes; EKS (Amazon Elastic Kubernetes Service); Proposed (Community submitted issue)
Tell us about your request
VersionUpdate has a start time; it also needs an end time.
NodeCreationFailure and PodEvictionFailure are ambiguous, with several potential root causes. There should be a more detailed message when VersionUpdate goes to an error state to aid diagnosis.
Errors in VersionUpdate should be timestamped.
Which service(s) is this request for?
AWS EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
VersionUpdate can fail for many reasons, as documented at the link below; however, these all appear to be reported as either NodeCreationFailure or PodEvictionFailure.
https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html
It is very difficult to establish the root cause, the timing of errors, and the total runtime of a VersionUpdate.
VersionUpdate does have a start time, but its errors are not timestamped, and timestamps are critical when attempting to diagnose problems.
VersionUpdate does not have an end time, which makes it difficult to establish the runtime of an update, something that is important when we are measuring maintenance windows.
According to the documentation, NodeCreationFailure and PodEvictionFailure each have several possible causes, and the specific cause is not explicitly mentioned in the errors.
It is important for AWS EKS to report the decision for moving VersionUpdate to an Error status with as much precision as possible to assist troubleshooting efforts.
For instance, I spent a long time working with AWS support only to identify that the total runtime for each new node is capped at 15 minutes. This is not mentioned as a possible cause of NodeCreationFailure and is not recorded in the error as the root cause. On top of the 4 documented possible errors, node bootstrap taking more than 15 minutes is a 5th, undocumented root cause.
Couple this with the lack of timing information, and an issue caused by nodes taking more than 15 minutes to bootstrap is very difficult to diagnose.
Are you currently working around this issue?
If we begin to encounter errors, it is necessary to watch node creation carefully: the root cause can only be gleaned by catching it in real time or by conducting a lengthy and costly investigation using CloudTrail.
CloudTrail does not record the end time for VersionUpdate.
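As an illustration of the "catch it in real time" workaround, here is a minimal sketch that polls the update and stamps what it sees with local time. It assumes a boto3 EKS client and its `describe_update` call; the function name `watch_update` and the shape of the returned dict are hypothetical. The end time and error times it produces are only as precise as the polling interval, so this is an approximation of what the request asks EKS to record, not a substitute for it.

```python
import time
from datetime import datetime, timezone


def watch_update(client, cluster, nodegroup, update_id,
                 poll_seconds=30, sleep=time.sleep):
    """Poll an EKS update until it leaves InProgress.

    `client` is assumed to behave like boto3.client("eks").
    Times other than `started_at` are observed locally by this poller,
    not reported by EKS.
    """
    first_seen = {}  # (errorCode, errorMessage) -> when this poller first saw it
    while True:
        update = client.describe_update(
            name=cluster, nodegroupName=nodegroup, updateId=update_id
        )["update"]
        now = datetime.now(timezone.utc)

        # Record the first poll at which each error appeared.
        for err in update.get("errors", []):
            key = (err.get("errorCode"), err.get("errorMessage"))
            first_seen.setdefault(key, now)

        if update["status"] != "InProgress":
            started = update.get("createdAt")  # start time reported by EKS
            return {
                "status": update["status"],
                "started_at": started,
                "observed_end": now,
                "observed_runtime": (now - started) if started else None,
                "errors": [
                    {"code": code, "message": msg, "first_seen": seen}
                    for (code, msg), seen in first_seen.items()
                ],
            }
        sleep(poll_seconds)


# Hypothetical usage (cluster, node group, and update ID are placeholders):
#   import boto3
#   report = watch_update(boto3.client("eks"), "my-cluster", "my-nodegroup", "<update-id>")
#   print(report["status"], report["observed_runtime"], report["errors"])
```

This only helps while someone remembers to start the poller alongside the update, which is the reason for asking that the timestamps be recorded by the service itself.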
Additional context
Feedback was given to the authors of the documentation page below, asking them to add the 15-minute node creation time limit as another root cause for NodeCreationFailure and to make it clear that the 15 minutes appears to include the time for EKS to initiate scale-out, EC2 instance bootstrap, custom UserData, and the time taken to run the script which connects the EC2 instance as an EKS node.
https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html
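To make the timing argument concrete, the sketch below checks one node's creation time against the limit. The 15-minute figure is the one reported in this issue from working with AWS support, not a documented constant, and the function name and both timestamps are hypothetical inputs you would have to collect yourself (for example, the time scale-out was requested and the time the node became Ready).

```python
from datetime import datetime, timedelta

# Assumption: limit as reported in this issue, not taken from AWS documentation.
NODE_CREATION_LIMIT = timedelta(minutes=15)


def node_creation_report(scale_out_requested, node_ready,
                         limit=NODE_CREATION_LIMIT):
    """Compare the end-to-end node creation time against the assumed limit.

    The elapsed time covers everything between the two timestamps:
    scale-out, EC2 bootstrap, custom UserData, and the join script.
    """
    elapsed = node_ready - scale_out_requested
    return {
        "elapsed": elapsed,
        "over_limit": elapsed > limit,
        "margin": limit - elapsed,  # negative when the limit was exceeded
    }


report = node_creation_report(
    datetime(2024, 1, 1, 10, 0, 0),    # scale-out requested (example value)
    datetime(2024, 1, 1, 10, 16, 30),  # node became Ready (example value)
)
print(report["elapsed"], report["over_limit"])
```

With timestamped errors and an end time on VersionUpdate, this comparison could be made directly from the update record instead of from values gathered by hand.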