[Bug] Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter #7448

Open
nathan-bowman opened this issue Jan 2, 2024 · 7 comments

What were you trying to accomplish?

eksctl upgrade nodegroup t2-medium-v1-28

What happened?

Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

How to reproduce it?

Simply run eksctl upgrade nodegroup ...

Logs

2024-01-02 16:04:27 [ℹ]  updating nodegroup stack to a newer format before upgrading nodegroup version
2024-01-02 16:04:27 [ℹ]  updating nodegroup stack
2024-01-02 16:04:28 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ]  upgrading nodegroup version
2024-01-02 16:05:30 [ℹ]  updating nodegroup stack
2024-01-02 16:05:30 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:01 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:32 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:07:15 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:00 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:45 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:11:31 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:12:34 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:14:26 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:23 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:54 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:24 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:57 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:18:44 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:20:27 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:22:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:23:38 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:24:20 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:25:40 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:26:25 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:27:19 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:28:52 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:30:44 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:31:55 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:32:41 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:33:43 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:35:35 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:46 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:38:47 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:39:54 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:40:39 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:41:20 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:42:41 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:43:12 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:44:45 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:46:27 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:48:24 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:49:28 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:29 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:32 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

Anything else we need to know?

Note that this problem is intermittent: it happens occasionally, but most of the time the nodegroup updates fine.

If I check CloudFormation, the stack shows UPDATE_COMPLETE, and even eksctl reports that the nodegroup is updated and active:

# eksctl get nodegroup --cluster backend-staging --name t2-medium-v1-28
CLUSTER          NODEGROUP        STATUS  CREATED               MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE  IMAGE ID    ASG NAME                                                  TYPE
backend-staging  t2-medium-v1-28  ACTIVE  2023-11-06T20:03:26Z  3         5         3                 t2.medium      AL2_x86_64  eks-t2-medium-v1-28-a4c5d31d-ca98-ae07-c328-035fff4b462c  managed

Meanwhile I'm left with:
waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"

OS:

# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

eksctl installed with:

ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
mv /tmp/eksctl /usr/local/bin

Versions

# eksctl info
eksctl version: 0.165.0
kubectl version: v1.28.4
OS: linux
Contributor

github-actions bot commented Jan 2, 2024

Hello nathan-bowman 👋 Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

Contributor

github-actions bot commented Feb 2, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Feb 2, 2024
@nathan-bowman
Author

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Not stale

@github-actions github-actions bot removed the stale label Feb 3, 2024
Contributor

github-actions bot commented Mar 4, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Mar 4, 2024
@nathan-bowman
Author

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Not stale

@yuxiang-zhang
Member

The problem appears to surface when AWS credentials expire while the upgrade is taking place.

@yuxiang-zhang
Member

yuxiang-zhang commented Mar 21, 2024

Yes, I tried to reproduce this, and a log line I added seems to confirm my theory:

2024-03-21 00:50:18 [ℹ]  waiting for CloudFormation stack "eksctl-x-nodegroup-ng-0"
2024-03-21 00:50:18 [!]  err: operation error CloudFormation: DescribeStacks, https response error StatusCode: 403, RequestID: -, api error ExpiredToken: The security token included in the request is expired

The SDK configures 403 errors as retryable.
Similar issues were reported to the SDK team, e.g. aws/aws-sdk-go#2389 and aws/aws-sdk-go#4983 (comment)
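
To illustrate why a retryable 403 ends in a timeout rather than an auth error, here is a minimal simulation in Go. It is not the SDK's code: waitUntilDone, alwaysExpired, and errExpired are hypothetical names, and the loop simply assumes that every error is treated as retryable.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errExpired stands in for the 403 ExpiredToken response seen in the log above.
var errExpired = errors.New("ExpiredToken: the security token included in the request is expired")

// waitUntilDone polls describe until it succeeds or maxWait elapses.
// Assumption: every error is retryable, so a permanent error is never surfaced.
func waitUntilDone(describe func() error, maxWait, delay time.Duration) (attempts int, err error) {
	deadline := time.Now().Add(maxWait)
	for {
		attempts++
		if describe() == nil {
			return attempts, nil
		}
		if time.Now().Add(delay).After(deadline) {
			// The underlying error is dropped here; only the timeout is reported.
			return attempts, errors.New("exceeded max wait time for waiter")
		}
		time.Sleep(delay)
	}
}

// alwaysExpired simulates credentials that never recover.
func alwaysExpired() error { return errExpired }

func main() {
	attempts, err := waitUntilDone(alwaysExpired, 200*time.Millisecond, 10*time.Millisecond)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

The caller only ever sees the timeout message, which would match the symptom in the original report: the stack reaches UPDATE_COMPLETE, but the poller can no longer observe it.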

Edit: there was a "fix" for STS in hashicorp/aws-sdk-go-base#362, but here we are using the default retryer stackDeleteCompleteStateRetryable from CloudFormation instead of the standard retryer:

// Wrap the waiter's default retryer so that every poll is logged.
defaultRetryer := o.Retryable
o.Retryable = func(ctx context.Context, in *cloudformation.DescribeStacksInput, out *cloudformation.DescribeStacksOutput, err error) (bool, error) {
	logger.Info("waiting for CloudFormation stack %q", *i.StackName)
	return defaultRetryer(ctx, in, out, err)
}

I'm inclined to just catch the ExpiredToken error and abort the waiter.
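
A rough sketch of what that could look like, written against the standard library only so it runs without the AWS SDK. Everything here is an assumption for illustration: codedError is a local stand-in that mirrors the ErrorCode method of smithy-go's APIError interface, fakeAPIError and alwaysRetry are hypothetical test helpers, and the retryable type drops the DescribeStacks input/output parameters of the real Retryable option for brevity.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// codedError is a local stand-in mirroring the ErrorCode method of
// smithy-go's APIError interface (assumption: not imported from the SDK).
type codedError interface {
	error
	ErrorCode() string
}

// fakeAPIError is a hypothetical error type used only to exercise the sketch.
type fakeAPIError struct{ code, msg string }

func (e fakeAPIError) Error() string     { return "api error " + e.code + ": " + e.msg }
func (e fakeAPIError) ErrorCode() string { return e.code }

// isExpiredToken reports whether err, or any error it wraps, has the ExpiredToken code.
func isExpiredToken(err error) bool {
	var ce codedError
	return errors.As(err, &ce) && ce.ErrorCode() == "ExpiredToken"
}

// retryable has the return shape of the waiter's Retryable option:
// (should the waiter keep polling?, terminal error to return).
type retryable func(ctx context.Context, err error) (bool, error)

// abortOnExpiredToken wraps next so that expired credentials stop the waiter
// immediately; all other errors are delegated to next unchanged.
func abortOnExpiredToken(next retryable) retryable {
	return func(ctx context.Context, err error) (bool, error) {
		if isExpiredToken(err) {
			return false, fmt.Errorf("credentials expired while waiting for stack: %w", err)
		}
		return next(ctx, err)
	}
}

// alwaysRetry stands in for a default retryer that treats every error as retryable.
func alwaysRetry(ctx context.Context, err error) (bool, error) { return true, nil }

func main() {
	check := abortOnExpiredToken(alwaysRetry)

	// A wrapped ExpiredToken error should abort the wait.
	expired := fmt.Errorf("DescribeStacks: %w", fakeAPIError{"ExpiredToken", "token expired"})
	retry, err := check(context.Background(), expired)
	fmt.Println("expired:", retry, err)

	// Any other error is still handled by the wrapped retryer.
	retry, err = check(context.Background(), fakeAPIError{"Throttling", "rate exceeded"})
	fmt.Println("throttled:", retry, err)
}
```

Matching on the error code via errors.As rather than on the message string keeps the check working when the SDK wraps the API error in an operation error, as in the log line above.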
