
OCPBUGS-36301: parallelize member health checks #1286

Merged

Conversation

AlexVulaj
Contributor

Currently, member health is checked serially with a 30s timeout per member. Three of the four GetMemberHealth callers also set their own default 30s timeout for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth call, causing later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth and instead letting the caller set the timeout. The code also now checks the health of all members in parallel, so a single slow member can no longer affect the health reporting of the others.

I also added a timeout to the context used in IsMemberHealthy, which calls GetMemberHealth. Neither Trevor nor I was sure why a default timeout wasn't present there, though one was present at all other call sites.

https://issues.redhat.com/browse/OCPBUGS-36301
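
For illustration only, here is a minimal sketch of the approach described above (not the actual diff). GetMemberHealth and checkSingleMemberHealth are the names discussed in this PR, but the Member and healthCheck shapes and the allMembersHealthy caller below are simplified, hypothetical stand-ins:

```go
package health

import (
	"context"
	"time"
)

// Member and healthCheck are simplified stand-ins for the operator's real types.
type Member struct{ Name string }

type healthCheck struct {
	Member  Member
	Healthy bool
	Err     error
}

// checkSingleMemberHealth is assumed to respect ctx's deadline on its own.
func checkSingleMemberHealth(ctx context.Context, m Member) healthCheck {
	// The real implementation probes the member's endpoint here.
	return healthCheck{Member: m, Healthy: true}
}

// GetMemberHealth checks every member in parallel and blocks until each
// goroutine has reported. The caller controls the overall timeout via ctx;
// there is no internal 30s deadline.
func GetMemberHealth(ctx context.Context, members []Member) []healthCheck {
	resChan := make(chan healthCheck, len(members)) // buffered so senders never block
	for _, m := range members {
		m := m
		go func() {
			resChan <- checkSingleMemberHealth(ctx, m)
		}()
	}

	memberHealth := make([]healthCheck, 0, len(members))
	for i := 0; i < len(members); i++ {
		memberHealth = append(memberHealth, <-resChan)
	}
	return memberHealth
}

// A hypothetical caller sets its own timeout, as the other call sites already did.
func allMembersHealthy(parent context.Context, members []Member) bool {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()
	for _, h := range GetMemberHealth(ctx, members) {
		if !h.Healthy {
			return false
		}
	}
	return true
}
```

Because the channel is buffered to len(members) and the receiver counts exactly one result per goroutine, no explicit close is needed, and a slow member only delays its own result up to the caller's deadline.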

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jun 28, 2024
@openshift-ci-robot

@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Currently, member health is checked in serial with a 30s timeout per member. 3 out of 4 GetMemberHealth callers had their own default 30s timeout as well for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth function, and thus cause later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth, and instead letting the caller set the timeout. Also, the code now checks the health of all members in parallel. This will prevent a single slow member from affecting the health reporting of other members.

I also added a timeout to the context used in IsMemberHealthy which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present in all other call sites.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 28, 2024
@openshift-ci openshift-ci bot requested review from dusk125 and tjungblu June 28, 2024 18:13
@AlexVulaj
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 28, 2024
@openshift-ci-robot

@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from geliu2016 June 28, 2024 18:13

@geliu2016 left a comment


/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jul 1, 2024
resChan <- checkSingleMemberHealth(ctxTimeout, member)
// closing here to avoid late replies to panic on resChan,
// the result will be considered a timeout anyway
close(resChan)
Contributor

we kinda still have to close this channel, don't we?

fyi, that's a panic I've fixed recently:
https://issues.redhat.com//browse/OCPBUGS-27959

#1190

Member

Looking at the pre-#1190 code, the panics were from timelines like:

  1. checkSingleMemberHealth goroutine launched with its own 30s Context duration.
  2. select waited on a result from resChan or a new 30s time.After.
  3. When the select time.After won, it appended a "30s timeout waiting for member..." error to memberHealth, and closed resChan.
  4. A millisecond or two later, checkSingleMemberHealth would hit its 30s Context timeout in the Get call, create its own "health check failed: ..." result, and push it into resChan.
  5. But resChan was closed in step 3! Panic!

With #1190, you dropped the close from step 3, and moved it to step 4, so no more panic.

But from Go's Range and Close tour:

Channels aren't like files; you don't usually need to close them. Closing is only necessary when the receiver must be told there are no more values coming, such as to terminate a range loop.

And with this pull, we no longer have the receiver-side select or timeout: the receiver blocks until it has a result back from each launched checkSingleMemberHealth goroutine, and it's up to those goroutines to respect the Context timeout. So there is no chance of GetMemberHealth closing the channel before a checkSingleMemberHealth goroutine writes, because we no longer have an explicit close at all; the channel is simply garbage-collected once it goes out of scope, like any other local Go variable.
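
To make that timeline concrete, here is a small self-contained toy (not code from this repository; the string results and sleep durations are arbitrary stand-ins) contrasting the old close-on-timeout shape with the count-the-receives shape this pull adopts:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Pre-#1190 shape: the receiver times out, closes the channel, and a
	// late sender then panics with "send on closed channel".
	ch := make(chan string, 1)
	go func() {
		time.Sleep(20 * time.Millisecond) // a slow health check
		ch <- "late result"               // would panic if ch had been closed
	}()
	select {
	case res := <-ch:
		fmt.Println(res)
	case <-time.After(10 * time.Millisecond):
		fmt.Println("timeout")
		// close(ch) // closing here is what made the late send panic
	}

	// Shape used by this pull: receive exactly one result per goroutine and
	// never close the channel; it is garbage-collected once unreachable.
	n := 3
	results := make(chan string, n)
	for i := 0; i < n; i++ {
		i := i
		go func() { results <- fmt.Sprintf("member %d ok", i) }()
	}
	for i := 0; i < n; i++ {
		fmt.Println(<-results)
	}
}
```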

Contributor

Thanks for the explanation, indeed kicking the select out and just looping over all the values is enough :)

@tjungblu
Contributor

tjungblu commented Jul 1, 2024

/lgtm

@tjungblu
Contributor

tjungblu commented Jul 1, 2024

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

@openshift-cherrypick-robot

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2024
Contributor

openshift-ci bot commented Jul 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexVulaj, geliu2016, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2024
@tjungblu
Contributor

tjungblu commented Jul 1, 2024

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d82a13d and 2 for PR HEAD 6558a88 in total

@tjungblu
Contributor

tjungblu commented Jul 2, 2024

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 9d7b786 and 1 for PR HEAD 6558a88 in total

@wking
Member

wking commented Jul 2, 2024

I think the e2e-aws-ovn-etcd-scaling failures are unrelated to this change, and are instead a combination of OCPBUGS-36462 (which I've just opened) and a need to refactor the "etcd is able to vertically scale up and down with a single node" test case to stop assuming the ControlPlaneMachineSet status.readyReplicas will hit 4, and instead do something else to check that the roll/recovery completed.

@wking
Member

wking commented Jul 2, 2024

...to stop assuming the ControlPlaneMachineSet status.readyReplicas will hit 4...

This is now tracked in ETCD-637. In the meantime, possibly worth an /override ci/prow/e2e-aws-ovn-etcd-scaling here? Or keep launching retests until we get lucky? Or wait for the CPMS and etcd work to green up the test?

@tjungblu
Contributor

tjungblu commented Jul 3, 2024

/override ci/prow/e2e-aws-ovn-etcd-scaling

no doubt :)

Contributor

openshift-ci bot commented Jul 3, 2024

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-etcd-scaling

In response to this:

/override ci/prow/e2e-aws-ovn-etcd-scaling

no doubt :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Jul 3, 2024

@AlexVulaj: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-operator-fips 6558a88 link false /test e2e-operator-fips
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 6558a88 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-etcd-recovery 6558a88 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-aws-etcd-certrotation 6558a88 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 6558a88 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-gcp-qe-no-capabilities 6558a88 link false /test e2e-gcp-qe-no-capabilities

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tjungblu
Contributor

tjungblu commented Jul 3, 2024

unrelated failure

/override ci/prow/e2e-aws-ovn-serial

Contributor

openshift-ci bot commented Jul 3, 2024

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial

In response to this:

unrelated failure

/override ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit aabb6d6 into openshift:master Jul 3, 2024
12 of 17 checks passed
@openshift-ci-robot

@AlexVulaj: Jira Issue OCPBUGS-36301: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-36301 has been moved to the MODIFIED state.

In response to this:

Currently, member health is checked in serial with a 30s timeout per member. 3 out of 4 GetMemberHealth callers had their own default 30s timeout as well for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth function, and thus cause later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth, and instead letting the caller set the timeout. Also, the code now checks the health of all members in parallel. This will prevent a single slow member from affecting the health reporting of other members.

I also added a timeout to the context used in IsMemberHealthy which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present in all other call sites.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@tjungblu: new pull request created: #1290

In response to this:

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407031527.p0.gaabb6d6.assembly.stream.el9 for distgit cluster-etcd-operator.
All builds following this will include this PR.

@AlexVulaj AlexVulaj deleted the parallel-member-health-check branch July 3, 2024 17:36