fix(e2e): wait for leader election & measure timing for better monitoring

TestClusterExtensionAfterOLMUpgrade was failing due to increased leader
election timeouts, causing reconciliation checks to run before leadership
was acquired.

This fix ensures the test explicitly waits for leader election logs
(`"successfully acquired lease"`) before verifying reconciliation.

Additionally, the test now measures and logs the leader election duration
to help monitor election timing.
camilamacedo86 committed Jan 31, 2025
1 parent c3a4406 commit 0ee61e2
Showing 3 changed files with 21 additions and 6 deletions.
6 changes: 3 additions & 3 deletions catalogd/cmd/catalogd/main.go
@@ -235,9 +235,9 @@ func main() {
LeaderElectionID: "catalogd-operator-lock",
// Recommended Leader Election values
// https://github.com/openshift/enhancements/blob/61581dcd985130357d6e4b0e72b87ee35394bf6e/CONVENTIONS.md#handling-kube-apiserver-disruption
-LeaseDuration: ptr.To(137 * time.Second),
-RenewDeadline: ptr.To(107 * time.Second),
-RetryPeriod: ptr.To(26 * time.Second),
+LeaseDuration: ptr.To(137 * time.Second), // Default: 15s
+RenewDeadline: ptr.To(107 * time.Second), // Default: 10s
+RetryPeriod: ptr.To(26 * time.Second), // Default: 2s

WebhookServer: webhookServer,
Cache: cacheOptions,
6 changes: 3 additions & 3 deletions cmd/operator-controller/main.go
@@ -235,9 +235,9 @@ func main() {
LeaderElectionID: "9c4404e7.operatorframework.io",
// Recommended Leader Election values
// https://github.com/openshift/enhancements/blob/61581dcd985130357d6e4b0e72b87ee35394bf6e/CONVENTIONS.md#handling-kube-apiserver-disruption
-LeaseDuration: ptr.To(137 * time.Second),
-RenewDeadline: ptr.To(107 * time.Second),
-RetryPeriod: ptr.To(26 * time.Second),
+LeaseDuration: ptr.To(137 * time.Second), // Default: 15s
+RenewDeadline: ptr.To(107 * time.Second), // Default: 10s
+RetryPeriod: ptr.To(26 * time.Second), // Default: 2s

Cache: cacheOptions,
// LeaderElectionReleaseOnCancel defines if the leader should step down voluntarily
15 changes: 15 additions & 0 deletions test/upgrade-e2e/post_upgrade_test.go
@@ -40,6 +40,21 @@ func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
t.Log("Wait for operator-controller deployment to be ready")
managerPod := waitForDeployment(t, ctx, "operator-controller-controller-manager")

+t.Log("Start measuring leader election time")
+// - Best case (new pod starts): ~1–5 seconds
+// - Average case (leader exists, but lease not expired): ~26–52 seconds
+// - Worst case (leader was there but crashed): LeaseDuration (137s) + RetryPeriod (26s) = ~163 seconds
+leaderStartTime := time.Now()
+leaderElectionCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
+defer leaderCancel()
+
+leaderSubstrings := []string{"successfully acquired lease"}
+leaderElected, err := watchPodLogsForSubstring(leaderElectionCtx, managerPod, "manager", leaderSubstrings...)
+require.NoError(t, err)
+require.True(t, leaderElected)
+leaderElectionDuration := time.Since(leaderStartTime)
+t.Logf("Leader election took %v seconds", leaderElectionDuration.Seconds())
+
t.Log("Reading logs to make sure that ClusterExtension was reconciled by operator-controller before we update it")
// Make sure that after we upgrade OLM itself we can still reconcile old objects without any changes
logCtx, cancel := context.WithTimeout(ctx, time.Minute)