Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: ETCD-612: Added a function to check if the quorum is safe #1278

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jubittajohn
Copy link

@jubittajohn jubittajohn commented Jun 14, 2024

This PR is blocked by openshift/library-go#1749

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 14, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 14, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2024
@openshift-ci openshift-ci bot requested review from dusk125 and tjungblu June 14, 2024 19:20
Copy link
Contributor

openshift-ci bot commented Jun 14, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn
Once this PR has been reviewed and has the lgtm label, please assign tjungblu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 14, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This PR uses openshift/library-go#1749

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 14, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This PR is blocked by openshift/library-go#1749

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@lance5890
Copy link
Contributor

+1

@@ -301,6 +302,20 @@ func RunOperator(ctx context.Context, controllerContext *controllercmd.Controlle
return !isSNO, precheckSucceeded, err
}

// checks if quorum is valid
quorumSafe := func() (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great start! could you also add another commit with the removal of all existing checks? That way we can maybe have an easier time to test whether it works correctly on its own.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjungblu I have now added a commit removing the existing checks(with commenting the corresponding test routine changes for now). There are two required cases that are failing now.

pkg/operator/starter.go Outdated Show resolved Hide resolved
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2024
@tjungblu
Copy link
Contributor

tjungblu commented Jun 20, 2024

@jubittajohn once the CI here is mostly green, you can run the payload jobs using /payload.

For example /payload 4.17 nightly blocking will run all 4.17 nightly jobs that are a must-have to generate a release ("blocking") with this PR. This tests several clouds and form factors of OpenShift.

You can check those out here as an example for the usual nightly runs: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.17.0-0.nightly/release/4.17.0-0.nightly-2024-06-20-005211

be aware, those tests are somewhat expensive to run, so use them sparingly when you feel like you're close to be done and you just want to have additional assurance that you don't break openshift as a whole somehow.

@jubittajohn jubittajohn force-pushed the skip-static-pod-rollouts branch 2 times, most recently from 50bae80 to 8645fa3 Compare June 22, 2024 22:50
} else if err != nil {
err = fmt.Errorf("unable to evaluate if the quorum is safe: %w", err)
} else {
err = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least from the current implementation of the quorumChecker, if it's safe to scale, then the error is nil and if it's not safe then the error is not nil. So this check won't ever happen.

From the other calls to the quorumChecker, they usually check !safe and err!=nil in separate if-blocks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dusk125 if the err is not nil when it isn't safe to scale, why would seperate if-blocks be required with return statements like in the existing code here ? will the return statement in !safe if-block be ever hit? Is there a case I am missing out here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not required, just a code style suggestion; functionally, it's the same, but it's a little easier to parse at a glance the individual code paths.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me it would also just be

WithShouldRevisionInstall(quorumChecker.IsSafeToUpdateRevision).

no need for the additional code

@@ -269,7 +268,7 @@ func TestBootstrapAnnotationRemoval(t *testing.T) {
}
},
},
{
/* {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to remove those testcases, we should trust what we get from library-go

@dusk125
Copy link
Contributor

dusk125 commented Jun 28, 2024

/retest-required

2 similar comments
@dusk125
Copy link
Contributor

dusk125 commented Jul 2, 2024

/retest-required

@dusk125
Copy link
Contributor

dusk125 commented Jul 2, 2024

/retest-required

Signed-off-by: jubittajohn <[email protected]>

Removed test case related to exisiting quorum checks; Updated the starter.go to remove additional code to call the quorum checker

Signed-off-by: jubittajohn <[email protected]>
Copy link
Contributor

openshift-ci bot commented Jul 9, 2024

@jubittajohn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-qe-no-capabilities 2be2710 link false /test e2e-gcp-qe-no-capabilities
ci/prow/e2e-aws-etcd-recovery add43ce link false /test e2e-aws-etcd-recovery
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown add43ce link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown add43ce link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/verify-deps add43ce link true /test verify-deps
ci/prow/e2e-aws-ovn-etcd-scaling add43ce link true /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-aws-etcd-certrotation add43ce link false /test e2e-aws-etcd-certrotation

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants