
OCPBUGS-36301: parallelize member health checks #1286

Merged

Conversation

AlexVulaj
Contributor

Currently, member health is checked serially with a 30s timeout per member. Three of the four GetMemberHealth callers also set their own default 30s timeout for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth call, causing later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth and instead letting the caller set the timeout. The code also now checks the health of all members in parallel, so a single slow member can no longer affect the health reporting of the others.

I also added a timeout to the context used in IsMemberHealthy, which calls GetMemberHealth. Neither Trevor nor I was sure why a default timeout wasn't present there, though one was present at all other call sites.

https://issues.redhat.com/browse/OCPBUGS-36301
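
For illustration only, here is a minimal sketch of the approach described above (not the actual diff). GetMemberHealth and checkSingleMemberHealth are the names discussed in this PR, but the Member and healthCheck shapes and the allMembersHealthy caller below are simplified, hypothetical stand-ins:

```go
package health

import (
	"context"
	"time"
)

// Member and healthCheck are simplified stand-ins for the operator's real types.
type Member struct{ Name string }

type healthCheck struct {
	Member  Member
	Healthy bool
	Err     error
}

// checkSingleMemberHealth is assumed to respect ctx's deadline on its own.
func checkSingleMemberHealth(ctx context.Context, m Member) healthCheck {
	// The real implementation probes the member's endpoint here.
	return healthCheck{Member: m, Healthy: true}
}

// GetMemberHealth checks every member in parallel and blocks until each
// goroutine has reported. The caller controls the overall timeout via ctx;
// there is no internal 30s deadline.
func GetMemberHealth(ctx context.Context, members []Member) []healthCheck {
	resChan := make(chan healthCheck, len(members)) // buffered so senders never block
	for _, m := range members {
		m := m
		go func() {
			resChan <- checkSingleMemberHealth(ctx, m)
		}()
	}

	memberHealth := make([]healthCheck, 0, len(members))
	for i := 0; i < len(members); i++ {
		memberHealth = append(memberHealth, <-resChan)
	}
	return memberHealth
}

// A hypothetical caller sets its own timeout, as the other call sites already did.
func allMembersHealthy(parent context.Context, members []Member) bool {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()
	for _, h := range GetMemberHealth(ctx, members) {
		if !h.Healthy {
			return false
		}
	}
	return true
}
```

Because the channel is buffered to len(members) and the receiver counts exactly one result per goroutine, no explicit close is needed, and a slow member only delays its own result up to the caller's deadline.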

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jun 28, 2024
@openshift-ci-robot

@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Currently, member health is checked in serial with a 30s timeout per member. 3 out of 4 GetMemberHealth callers had their own default 30s timeout as well for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth function, and thus cause later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth, and instead letting the caller set the timeout. Also, the code now checks the health of all members in parallel. This will prevent a single slow member from affecting the health reporting of other members.

I also added a timeout to the context used in IsMemberHealthy which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present in all other call sites.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 28, 2024
@openshift-ci openshift-ci bot requested review from dusk125 and tjungblu June 28, 2024 18:13
@AlexVulaj
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 28, 2024
@openshift-ci-robot

@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from geliu2016 June 28, 2024 18:13

@geliu2016 left a comment


/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jul 1, 2024
resChan <- checkSingleMemberHealth(ctxTimeout, member)
// closing here to avoid late replies to panic on resChan,
// the result will be considered a timeout anyway
close(resChan)
Contributor

we kinda still have to close this channel, don't we?

fyi, that's a panic I've fixed recently:
https://issues.redhat.com//browse/OCPBUGS-27959

#1190

Member

Looking at the pre-#1190 code, the panics were from timelines like:

  1. checkSingleMemberHealth goroutine launched with its own 30s Context duration.
  2. select waited on a result from resChan or a new 30s time.After.
  3. When the select time.After won, it appended a "30s timeout waiting for member..." error to memberHealth, and closed resChan.
  4. A millisecond or two later, checkSingleMemberHealth would hit its 30s Context timeout in the Get call, create its own "health check failed: ..." result, and push it into resChan.
  5. But resChan was closed in step 3! Panic!

With #1190, you dropped the close from step 3, and moved it to step 4, so no more panic.

But from Go's Range and Close tour:

Channels aren't like files; you don't usually need to close them. Closing is only necessary when the receiver must be told there are no more values coming, such as to terminate a range loop.

And with this pull, we no longer have the receiver-side select or timeout: the receiver blocks until it has a result back from each launched checkSingleMemberHealth goroutine, and it's up to those goroutines to respect the Context timeout. So there is no chance of GetMemberHealth closing the channel before a checkSingleMemberHealth goroutine writes, because we no longer have an explicit close at all; the channel is simply garbage-collected once it goes out of scope, like any other local Go variable.
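
To make that timeline concrete, here is a small self-contained toy (not code from this repository; the string results and sleep durations are arbitrary stand-ins) contrasting the old close-on-timeout shape with the count-the-receives shape this pull adopts:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Pre-#1190 shape: the receiver times out, closes the channel, and a
	// late sender then panics with "send on closed channel".
	ch := make(chan string, 1)
	go func() {
		time.Sleep(20 * time.Millisecond) // a slow health check
		ch <- "late result"               // would panic if ch had been closed
	}()
	select {
	case res := <-ch:
		fmt.Println(res)
	case <-time.After(10 * time.Millisecond):
		fmt.Println("timeout")
		// close(ch) // closing here is what made the late send panic
	}

	// Shape used by this pull: receive exactly one result per goroutine and
	// never close the channel; it is garbage-collected once unreachable.
	n := 3
	results := make(chan string, n)
	for i := 0; i < n; i++ {
		i := i
		go func() { results <- fmt.Sprintf("member %d ok", i) }()
	}
	for i := 0; i < n; i++ {
		fmt.Println(<-results)
	}
}
```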

Contributor

Thanks for the explanation, indeed kicking the select out and just looping over all the values is enough :)

@tjungblu
Contributor

tjungblu commented Jul 1, 2024

/lgtm

@tjungblu
Contributor

tjungblu commented Jul 1, 2024

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

@openshift-cherrypick-robot

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2024
Contributor

openshift-ci bot commented Jul 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexVulaj, geliu2016, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2024
@tjungblu
Contributor

tjungblu commented Jul 1, 2024

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d82a13d and 2 for PR HEAD 6558a88 in total

@tjungblu
Contributor

tjungblu commented Jul 2, 2024

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 9d7b786 and 1 for PR HEAD 6558a88 in total

@wking
Member

wking commented Jul 2, 2024

I think the e2e-aws-ovn-etcd-scaling failures are unrelated to this change, and are instead a combination of OCPBUGS-36462 (which I've just opened) and a need to refactor the "etcd is able to vertically scale up and down with a single node" test case to stop assuming the ControlPlaneMachineSet status.readyReplicas will hit 4, and instead do something else to check that the roll/recovery completed.

@wking
Member

wking commented Jul 2, 2024

...to stop assuming the ControlPlaneMachineSet status.readyReplicas will hit 4...

This is now tracked in ETCD-637. In the meantime, possibly worth an /override ci/prow/e2e-aws-ovn-etcd-scaling here? Or keep launching retests until we get lucky? Or wait for the CPMS and etcd work to green up the test?

@tjungblu
Contributor

tjungblu commented Jul 3, 2024

/override ci/prow/e2e-aws-ovn-etcd-scaling

no doubt :)

Contributor

openshift-ci bot commented Jul 3, 2024

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-etcd-scaling

In response to this:

/override ci/prow/e2e-aws-ovn-etcd-scaling

no doubt :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Jul 3, 2024

@AlexVulaj: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-operator-fips 6558a88 link false /test e2e-operator-fips
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 6558a88 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-etcd-recovery 6558a88 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-aws-etcd-certrotation 6558a88 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 6558a88 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-gcp-qe-no-capabilities 6558a88 link false /test e2e-gcp-qe-no-capabilities

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tjungblu
Contributor

tjungblu commented Jul 3, 2024

unrelated failure

/override ci/prow/e2e-aws-ovn-serial

Contributor

openshift-ci bot commented Jul 3, 2024

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial

In response to this:

unrelated failure

/override ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit aabb6d6 into openshift:master Jul 3, 2024
12 of 17 checks passed
@openshift-ci-robot

@AlexVulaj: Jira Issue OCPBUGS-36301: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-36301 has been moved to the MODIFIED state.

In response to this:

Currently, member health is checked in serial with a 30s timeout per member. 3 out of 4 GetMemberHealth callers had their own default 30s timeout as well for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth function, and thus cause later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth, and instead letting the caller set the timeout. Also, the code now checks the health of all members in parallel. This will prevent a single slow member from affecting the health reporting of other members.

I also added a timeout to the context used in IsMemberHealthy which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present in all other call sites.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@tjungblu: new pull request created: #1290

In response to this:

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407031527.p0.gaabb6d6.assembly.stream.el9 for distgit cluster-etcd-operator.
All builds following this will include this PR.

@AlexVulaj AlexVulaj deleted the parallel-member-health-check branch July 3, 2024 17:36