🌱 test: e2e: make managed suite more robust to errors with Eventually() #5215

Open: wants to merge 2 commits into base: main

damdo
Member

damdo commented Nov 13, 2024

What type of PR is this?
/kind flake

What this PR does / why we need it:
Makes the managed e2e suite more robust to errors by using Eventually().

Special notes for your reviewer:
Trying to address issues like the ones seen here: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5211/pull-cluster-api-provider-aws-e2e-eks/1856371404925046784

NONE

k8s-ci-robot added labels on Nov 13, 2024:
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • needs-priority
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
@damdo
Member Author

damdo commented Nov 13, 2024

/assign @richardcase @nrb

@damdo
Member Author

damdo commented Nov 13, 2024

/test ?

@k8s-ci-robot
Contributor

@damdo: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@damdo
Member Author

damdo commented Nov 13, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 13, 2024

The failure is unrelated (it was caused by the AWS CloudFormation stack).

/test pull-cluster-api-provider-aws-e2e-eks

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nrb. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo
Member Author

damdo commented Nov 13, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 13, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 14, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 14, 2024

/test pull-cluster-api-provider-aws-test

@damdo
Member Author

damdo commented Nov 14, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 14, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 14, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 15, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 15, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 15, 2024

The AWS CloudFormation stack timed out.

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 15, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 15, 2024

@richardcase do you have any idea why waiting for addons fails so often?

@damdo
Member Author

damdo commented Nov 20, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 21, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@@ -149,7 +149,7 @@ intervals:
default/wait-machine-status: ["20m", "10s"]
Contributor

I looked into the gathered artifacts and found this:

- lastTransitionTime: "2024-11-21T20:06:09Z"
    message: |-
      addon_update: updating eks addon coredns: ResourceInUseException: Addon coredns cannot be updated as it is currently in UPDATING state
      {
        RespMetadata: {
          StatusCode: 409,
          RequestID: "776223f5-f6d1-4a1b-83bf-6455dbfc09f6"
        },
        AddonName: "coredns",
        ClusterName: "eks-nodes-kapm37_eks-nodes-8g7yso-control-plane",
        Message_: "Addon coredns cannot be updated as it is currently in UPDATING state"
      }

from https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5215/pull-cluster-api-provider-aws-e2e-eks/1859655615442325504/artifacts/clusters/bootstrap/resources/eks-nodes-kapm37/AWSManagedControlPlane/eks-nodes-8g7yso-control-plane.yaml

So I see it's updating CoreDNS.

In this file, it's set to v1.11.1-eksbuild.8.

Amazon has versions for a given Kube version here: https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html#coredns-add-on-update

For Kube 1.30, it should be v1.11.3-eksbuild.2.

I'm going to add a commit to this to see if incrementing the coredns version will help.

Contributor

Looks like this didn't help as I had hoped. For historical reference, https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5215/pull-cluster-api-provider-aws-e2e-eks/1860077544150142976 is the test run that used CoreDNS v1.11.3-eksbuild.2.

@damdo
Member Author

damdo commented Nov 22, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@Ankitasw
Member

@damdo since Eventually() failed for the addons test in a few runs, do you think we should increase the timeout from 2 minutes? Or is there some other issue?

@damdo
Member Author

damdo commented Nov 26, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 26, 2024

/test pull-cluster-api-provider-aws-e2e-eks

Expect(err).ToNot(HaveOccurred())
Eventually(func() error {
	return mgmtClient.Get(ctx, crclient.ObjectKey{Namespace: input.Namespace.Name, Name: controlPlaneName}, controlPlane)
}, 20*time.Minute, 5*time.Second).Should(Succeed(), "eventually failed trying to get the AWSManagedControlPlane")
Contributor

Do you know how this timeout relates to the one set in the e2e_eks_conf.yaml file? Are they added together? Or do they have no relation?

@damdo
Member Author

damdo commented Nov 26, 2024

It still fails at "Should've eventually succeeded creating an AWS CloudFormation stack".

@damdo
Member Author

damdo commented Nov 26, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@damdo
Member Author

damdo commented Nov 26, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@k8s-ci-robot
Contributor

@damdo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-provider-aws-test | 697e555 | link | true | /test pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-e2e-eks | 697e555 | link | false | /test pull-cluster-api-provider-aws-e2e-eks

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • needs-priority
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants