
🌱 fix(e2e): wait for leader election #1676

Closed
11 changes: 11 additions & 0 deletions test/upgrade-e2e/post_upgrade_test.go
@@ -40,6 +40,17 @@ func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
t.Log("Wait for operator-controller deployment to be ready")
managerPod := waitForDeployment(t, ctx, "operator-controller-controller-manager")

t.Log("Wait for acquired leader election")
// Average case is under 1 minute, but in the worst case (previous leader crashed)
// we could wait LeaseDuration (137s) + RetryPeriod (26s) = 163s
leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
Member:

I am assuming 3 minutes is the worst-case scenario. I am not familiar with context.WithTimeout; does it return if we acquire the lease before 163s?

Contributor:

context.WithTimeout just gives you a context that times out (gets cancelled) after the timeout period.
This means that the call to watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...) will return with an error after 3 minutes if it hasn't already returned.

Member:

OK, looks like it is a straightforward timeout method.
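
A minimal standalone sketch (illustrative, not code from this PR) of the behaviour described above: the context is cancelled once the timeout elapses, but a caller that finishes earlier simply returns first.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Mirrors the test's 3-minute budget; the context is cancelled
	// automatically at the deadline, or earlier via cancel().
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		// Stand-in for the lease being acquired well before the deadline.
		time.Sleep(2 * time.Second)
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("lease acquired before the deadline")
	case <-ctx.Done():
		fmt.Println("timed out:", ctx.Err()) // context.DeadlineExceeded
	}
}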

defer leaderCancel()

leaderSubstrings := []string{"successfully acquired lease"}
leaderElected, err := watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...)
Contributor:

Scraping the logs seems brittle.

Would it be better to use a Watch on the leader election? We could use Leases from the CoordinationV1Client in "k8s.io/client-go/kubernetes/typed/coordination/v1".

Contributor (@bentito), Jan 31, 2025:

I realize it's also longer and more code, but the upside is that it reacts right away, like watching the pod log, without depending on log strings that could change at some point and break our tests for reasons outside our control.

Contributor:

That's a good idea! If this work is blocking CI, I'd say merge it as is, then follow up with the watch ^^

Contributor (author):

Yes, I agree that we could do something fancier, but we check the logs in many places, including just below. We can look at improving this afterwards; there is no reason for us to keep facing the pain of the flake in the meantime.
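
For reference, a hedged sketch of the suggested Lease-based alternative. The helper name, namespace argument, and lease name are hypothetical placeholders, not values taken from the repository.

package e2e

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForLeaderLease blocks until the named Lease has a holder or ctx expires.
func waitForLeaderLease(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	w, err := cs.CoordinationV1().Leases(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before a leader was elected")
			}
			if lease, ok := ev.Object.(*coordinationv1.Lease); ok &&
				lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != "" {
				// A non-empty holder identity means a leader holds the lease.
				return nil
			}
		}
	}
}

This reacts as soon as the Lease gains a holder, without depending on log wording.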

require.NoError(t, err)
require.True(t, leaderElected)

t.Log("Reading logs to make sure that ClusterExtension was reconciled by operator-controller before we update it")
// Make sure that after we upgrade OLM itself we can still reconcile old objects without any changes
logCtx, cancel := context.WithTimeout(ctx, time.Minute)
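
As an aside, the 137s and 26s figures cited in the new test comment are leader-election tuning knobs. In a controller-runtime based manager they would be configured roughly as below; whether operator-controller sets exactly these values is an assumption here, and the election ID is a placeholder.

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	// Durations copied from the test comment; the exact values used by
	// operator-controller are an assumption, and the ID is a placeholder.
	leaseDuration := 137 * time.Second
	retryPeriod := 26 * time.Second

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "example.operatorframework.io", // placeholder
		LeaseDuration:    &leaseDuration,
		RetryPeriod:      &retryPeriod,
	})
}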