[Bug] Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter #7448

nathan-bowman · 2024-01-02T17:11:56Z

What were you trying to accomplish?

eksctl upgrade nodegroup t2-medium-v1-28

What happened?

Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

How to reproduce it?

Simply run eksctl upgrade nodegroup ...

Logs

2024-01-02 16:04:27 [ℹ]  updating nodegroup stack to a newer format before upgrading nodegroup version
2024-01-02 16:04:27 [ℹ]  updating nodegroup stack
2024-01-02 16:04:28 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ]  upgrading nodegroup version
2024-01-02 16:05:30 [ℹ]  updating nodegroup stack
2024-01-02 16:05:30 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:01 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:32 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:07:15 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:00 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:45 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:11:31 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:12:34 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:14:26 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:23 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:54 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:24 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:57 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:18:44 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:20:27 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:22:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:23:38 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:24:20 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:25:40 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:26:25 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:27:19 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:28:52 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:30:44 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:31:55 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:32:41 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:33:43 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:35:35 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:46 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:38:47 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:39:54 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:40:39 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:41:20 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:42:41 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:43:12 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:44:45 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:46:27 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:48:24 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:49:28 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:29 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:32 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

Anything else we need to know?

Note that this problem is intermittent: it happens occasionally, but most of the time the nodegroup updates fine.

If I check CloudFormation, it will say UPDATE_COMPLETE and even eksctl reports that the nodegroup is updated and active...

# eksctl get nodegroup --cluster backend-staging --name t2-medium-v1-28
CLUSTER          NODEGROUP        STATUS  CREATED               MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE  IMAGE ID    ASG NAME                                                  TYPE
backend-staging  t2-medium-v1-28  ACTIVE  2023-11-06T20:03:26Z  3         5         3                 t2.medium      AL2_x86_64  eks-t2-medium-v1-28-a4c5d31d-ca98-ae07-c328-035fff4b462c  managed

Meanwhile I'm left with:
waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"

OS:

# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

eksctl installed with:

ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
mv /tmp/eksctl /usr/local/bin

Versions

# eksctl info
eksctl version: 0.165.0
kubectl version: v1.28.4
OS: linux

github-actions · 2024-01-02T17:12:21Z

Hello nathan-bowman 👋 Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

github-actions · 2024-02-02T01:45:58Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

nathan-bowman · 2024-02-02T16:25:56Z

> This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Not stale

github-actions · 2024-03-04T01:48:59Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

nathan-bowman · 2024-03-05T14:30:32Z

> This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Not stale

yuxiang-zhang · 2024-03-20T23:37:01Z

The problem appears to surface when AWS credentials expire while the upgrade is taking place.

yuxiang-zhang · 2024-03-21T00:58:00Z

Yes, I tried to reproduce this, and a log line I added seems to confirm my theory:

2024-03-21 00:50:18 [ℹ]  waiting for CloudFormation stack "eksctl-x-nodegroup-ng-0"
2024-03-21 00:50:18 [!]  err: operation error CloudFormation: DescribeStacks, https response error StatusCode: 403, RequestID: -, api error ExpiredToken: The security token included in the request is expired

The SDK configures 403 errors as retryable.
Similar issues were reported to the SDK team, e.g. aws/aws-sdk-go#2389 and aws/aws-sdk-go#4983 (comment)

Edit: there was a "fix" for STS hashicorp/aws-sdk-go-base#362, but here we are using the default retryer stackDeleteCompleteStateRetryable from CloudFormation instead of the standard retryer

eksctl/pkg/cfn/manager/waiters.go

Lines 137 to 141 in 76902cd

    
    defaultRetryer := o.Retryable
    o.Retryable = func(ctx context.Context, in *cloudformation.DescribeStacksInput, out *cloudformation.DescribeStacksOutput, err error) (bool, error) {
        logger.Info("waiting for CloudFormation stack %q", *i.StackName)
        return defaultRetryer(ctx, in, out, err)
    }

I'm inclined to just catch the ExpiredToken error and abort the waiter.

nathan-bowman added the kind/bug label Jan 2, 2024

github-actions bot added the stale label Feb 2, 2024

github-actions bot removed the stale label Feb 3, 2024

github-actions bot added the stale label Mar 4, 2024

github-actions bot removed the stale label Mar 6, 2024

yuxiang-zhang added the needs-investigation label Mar 20, 2024
