Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing test(s): TestAccContainerCluster_resourceManagerTags #20252

Closed

Comments

@melinath
Copy link
Collaborator

melinath commented Nov 8, 2024

Impacted tests

  • TestAccContainerCluster_resourceManagerTags

Affected Resource(s)

  • google_container_cluster

Failure rates

  • 100% since 2024-10-25

Message(s)

bootstrap_iam_test_utils.go:74: Added the following bindings to the test project's IAM policy:
        Members: ["serviceAccount:service-594424405950@container-engine-robot.iam.gserviceaccount.com"], Role: "roles/resourcemanager.tagHoldAdmin"
        Retry the test in a few minutes.
    resource_container_cluster_test.go:74: Stopping the test because a role was added to the policy.

The test is deleting the role it's trying to bootstrap. The fix is to stop the test configs from managing container-engine-robot service agent IAM members and make sure they're all bootstrapped instead.

Nightly build test history

https://hashicorp.teamcity.com/test/6020505809058767047?currentProjectId=TerraformProviders_GoogleCloud_GOOGLE_NIGHTLYTESTS&expandTestHistoryChartSection=true&expandedTest=build%3A%28id%3A255899%29%2Cid%3A2000000000

b/378146451

@KatrinaHoffert
Copy link

I did some sleuthing and suspect GoogleCloudPlatform/magic-modules#12014 to be the cause. The timing matches. That commit added bootstrapping for the role binding but kept the role binding in the test's TF config. Not sure if that PR should just be rolled back or what. I don't really understand the issue exactly and the bootstrapping is a bit confusing (the test fails when the bootstrapping makes changes?).

My best guess is that there's supposed to be only the bootstrapping and that it's WAI to fail the first time (because IAM isn't safe to change with how the tests are run in parallel). But because the TF for the test still exists, when it does teardown, TF tries to reverse the config and remove the role binding, which makes the bootstrapping fail next time. So we'd want to remove https://github.com/GoogleCloudPlatform/magic-modules/blob/0616a35f4dddbca1857fbc0c1330bc5ea93cc7e5/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl#L11452-L11456. Though there's actually some other IAM changes there that presumably should have been bootstrapped too. And then the sleep also removed.

That's just my guess. I didn't make any change because I'm not sure if my understanding is correct and if we want to try to fix it as opposed to rolling back the presumably breaking PR. I haven't yet looked into what it would take to get the tests running locally.

@melinath
Copy link
Collaborator Author

@KatrinaHoffert yes, that's exactly correct.

@wyardley
Copy link

wyardley commented Jan 10, 2025

Since it seems like GoogleCloudPlatform/magic-modules#12376 was stalled, I threw up GoogleCloudPlatform/magic-modules#12728 (with attribution to the author for the earlier changes), and adjusting some similar code in the autopilot resource manager tags test. Hope that's Ok.

Both tests pass for me locally in both recording and replaying modes now (after the initial bootstrapping).

I think I only adjusted formatting around those lines, but it's been a little while, so not totally clear on whether my changes are the main issue or not... but sorry for not noticing this sooner either way.

@melinath
Copy link
Collaborator Author

no worries, sorry we didn't catch it on the PR & thanks for looking into it!

@wyardley
Copy link

@melinath I had an initial fix for TestAccContainerCluster_withAutopilotResourceManagerTags (which doesn't have a ticket, but seems to have the same issues) that was working in both recording / replaying mode, but later bombed in CI... I fixed some more stuff, but then found some quirky stuff that I'm curious about. Bugging you about it here too, since I think you might have some context on a couple of these:

At the risk of duplicating some of the comments I'm making in the PR, to summarize:

  1. In some cases, at least for autopilot, the cluster seems to report that it's running before some of the underlying instances are ready to have their tags updated?
  2. Doing a sleep before the first update test maybe is helping, but adding a hard sleep even when VCR is used seems messy, not sure if I should just do an inline function, but maybe this affects other uses of the sleep function? Opened this issue to ask that.
  3. There's still one permission that has to be granted in the tf code; if the two tag update tests (autoilot and non-autopilot) could run potentially at the same time, in recording mode, in the same project, since acctest.BootstrapPSARole() can't (I don't think) handle accounts like serviceAccount:${data.google_project.project.number}@cloudservices.gserviceaccount.com, there could be an issue with the role getting removed. Not sure how likely a scenario this is, though might explain the issue in one of the previous CI runs?

@melinath
Copy link
Collaborator Author

  1. I don't have any specific knowledge about this.
  2. Looks like the linked issue has some resolution on it.
  3. responded on Support API service agents in BootstrapPSARole() utils GoogleCloudPlatform/magic-modules#12784

@wyardley
Copy link

Thanks as always for your responsiveness! Sorry to bug you on these ones!

For 1. I do seem to reliably encounter that issue, at least against my test project. I opened https://issuetracker.google.com/u/1/issues/390456348, since I don't think it's a Terraform level issue.

Peripherally related: GoogleCloudPlatform/magic-modules#12785

If there's a ticket for the autopilot variant of this anywhere, feel free to link it -- I couldn't find one, but it's surprising to me if it wasn't flaky.

@wyardley
Copy link

This was a much bigger can of worms than originally anticipated, but I think I've got a more durable fix now in the linked PR (GoogleCloudPlatform/magic-modules#12728). That should the autopilot test as well, and a node pool one from #19997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment