ETCD-612: Added a callback to provide additional pre-conditions for installation #1749
base: master
Conversation
@jubittajohn: This pull request references ETCD-612 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Diff at `@@ -442,6 +449,16 @@ func (c *InstallerController) manageInstallationPods(ctx context.Context, operat`:

```go
	return true, requeueAfter, nil
}

// Check if revision should be installed based on if the quorum is about to be violated
if c.shouldRevisionInstall != nil {
```
As far as I understand, this blocks the entire controller, which we don't want; we want the current revisions to continue to roll out. We can be more granular here and just block the `ensureInstallerPod` routine from spawning the installer pod. That way it wouldn't block existing rollouts. Having said that, I think there are generally very few tests around this, so we need to run a couple of payload tests in cluster-etcd-operator (CEO) as well.
```go
// checks if new revision should be rolled out
if c.shouldRevisionInstall != nil {
	shouldInstall, err := c.shouldRevisionInstall()
	if !shouldInstall {
```
shouldn't that be

```go
if err != nil {
	return err
}
if !shouldInstall {
	return nil
}
```
maybe also add an info logging statement here, so we know if it was skipped
@tjungblu But as per the current logic here, the requeue only happens if there is an error: https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/controller/installer/installer_controller.go#L481.
then we need to rethink the `ensureInstallerPod` signature potentially. I don't think we should trigger an event for this, for example:

```go
c.eventRecorder.Warningf("InstallerPodFailed", "Failed to create installer pod for revision %d count %d on node %q: %v",
	currNodeState.TargetRevision, currNodeState.LastFailedCount, currNodeState.NodeName, err)
```

that's misleading, because there was no actual error that failed the installation. We just skipped it.
```go
kubeClient:    fake.NewSimpleClientset(),
eventRecorder: eventstesting.NewTestingEventRecorder(t),
shouldRevisionInstall: func() (bool, error) {
	return false, fmt.Errorf("revision shouldn't be installed")
```
ah okay, now I get it. If you want to use an error for control flow, that's fine. But then we don't really need the boolean to denote whether to skip it or not.
I (personally) would use the error for real errors though.
```go
},
originalOperatorStatus: &operatorv1.StaticPodOperatorStatus{
	LatestAvailableRevision: 1,
	NodeStatuses: []operatorv1.NodeStatus{
```
I only very superficially looked through your tests, but I'm generally missing one with multiple NodeStatuses. Maybe more meaningful for etcd would be a test with three nodes at those revisions:

- master-0 = 4
- master-1 = 3 (would be next to get to 4)
- master-2 = 3

Now our quorum guard says master-0 went away (e.g., the machine was turned off). What happens to the remaining revisions? I would expect this to block the installation of rev 4 on master-1 and master-2.

Maybe also think of a couple of such tests, maybe with a more structured permutation approach. We can discuss this also next Tuesday.
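The quorum scenario above reduces to simple arithmetic. The following is a hedged illustration (the function `allowNewRevision` is hypothetical, not CEO code) of why losing master-0 should block the rollout of rev 4 to master-1 and master-2:

```go
package main

import "fmt"

// allowNewRevision is a hypothetical stand-in for an etcd quorum precondition
// (not code from this PR): block rolling out a new revision when restarting one
// more healthy member would drop the cluster below quorum.
func allowNewRevision(totalMembers, unhealthyMembers int) bool {
	quorum := totalMembers/2 + 1
	healthy := totalMembers - unhealthyMembers
	// a rollout takes one healthy member down while its static pod restarts
	return healthy-1 >= quorum
}

func main() {
	// three healthy members: rolling rev 4 onto master-1 is safe
	fmt.Println(allowNewRevision(3, 0)) // true
	// master-0 is gone: restarting master-1 would leave 1 of 3, below quorum
	fmt.Println(allowNewRevision(3, 1)) // false
}
```

A permutation-style test table could then sweep (totalMembers, unhealthyMembers) pairs against this kind of oracle.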
/assign @dusk125
Diff at `@@ -98,6 +98,9 @@ type InstallerController struct {`:

```go
clock            clock.Clock
installerBackOff func(count int) time.Duration
fallbackBackOff  func(count int) time.Duration

// shouldRevisionInstall is a callback function that determines whether a new revision should be installed
shouldRevisionInstall func() (bool, error)
```
I think we should follow the same precondition pattern that we have in the static resource controller: https://github.com/openshift/library-go/blob/master/pkg/operator/staticresourcecontroller/static_resource_controller.go#L55-L59. Meaning that we would allow for any condition to be defined to gate the revision process.

Also, we should pass a context into that method and propagate it all the way down in case we need to cancel it for some reason. Especially in https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/ceohelpers/bootstrap.go#L139 and maybe some other functions that don't take it yet.
good idea, @jubittajohn can we create it slightly differently, like this?

```go
type StaticPodInstallerPreconditionsFuncType func(ctx context.Context) (bool, error)
```

and then the struct value could look like this:

```go
installPrecondition StaticPodInstallerPreconditionsFuncType
```

that way it's more consistent with the other packages.
```go
// returns whether or not requeue and if an error happened while creating installer pod
func (c *InstallerController) ensureInstallerPod(ctx context.Context, operatorSpec *operatorv1.StaticPodOperatorSpec, ns *operatorv1.NodeStatus) (bool, error) {
	// checks if new revision should be rolled out
	if c.shouldRevisionInstall != nil {
```
I think this check should be moved to `manageInstallationPods`, because this function is supposed to be only about creating the installer pod.
we could maybe even move it inside the `sync` function before the call to `manageInstallationPods`, since we want to skip the installation process entirely when the preconditions are not fulfilled.
yeah we thought so, but it's better to continue the current rollout. See #1749 (comment)
good point. it made me notice that we might be subject to the following issue: #1749 (comment)
lgtm, but I'll let Damien have the last word
```go
	return true, err
}
if !shouldInstall {
	return true, nil
```
Let's imagine the following scenario:

- t0: sync for a new revision occurs, preconditions are true, the installer pod is created
- t1: preconditions became false while installation is still in progress
- t2: sync happens again, `ensureInstallerPod` returns `true` for requeuing because the installer pod can't be installed due to the preconditions not being met

When that happens, we return early in pkg/operator/staticpod/controller/installer/installer_controller.go, lines 500 to 503 at f152bd5:

```go
if requeue {
	klog.V(4).Infof("Requeuing the creation of installer pod for revision %d on node %q", currNodeState.TargetRevision, currNodeState.NodeName)
	return true, 0, nil
}
```
To prevent the whole flow from being dependent on the preconditions, we should make sure that the preconditions are only evaluated when the installer pod isn't already present.
maybe I'm off, but wouldn't t2 not go through the `ensureInstallerPod` branch? the previous if condition should be true?

```go
if operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision {
	// no backoff if new revision is pending
}
```

@jubittajohn can we add a unit test for the condition that @dgrisonnet mentioned? I think he has a valid point that we definitely must update the node status for ongoing installations, even when the precondition is false.
@dgrisonnet As pointed out, `ensureInstallerPod` returns `true` at t2.

@tjungblu The `if` condition mentioned would evaluate to `false` at t2, allowing the installer creation process to continue. The mentioned condition only evaluates to `true` when a later revision is pending compared to the one currently being rolled out. Therefore, `ensureInstallerPod` is invoked.

To address this, I have added a check as suggested to ensure that preconditions are only evaluated when the installer pod isn't already present. Additionally, I have added a unit test for this scenario to confirm the behavior. Could you please review the new changes?
looks good, thanks for the test case :)
Looks good, thank you for testing this edge case! :)
```go
	return true, err
}
if !shouldInstall {
	return true, nil
```
can we leave a log statement here, too, please?
Suggested change:

```go
klog.Infof("Skipping the creation of installer pod [%s] because of precondition check", installerPodName)
return true, nil
```
just as an example, feel free to find a more fitting log statement
```go
// checks if a new installer pod should be created based on the preconditions being met
// preconditions are only evaluated when the installer pod isn't already present
installerPodName := getInstallerPodName(ns)
_, err := c.podsGetter.Pods(c.targetNamespace).Get(ctx, installerPodName, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
	if c.installPrecondition != nil {
		shouldInstall, err := c.installPrecondition(ctx)
		if err != nil {
			return true, err
		}
		if !shouldInstall {
			klog.Infof("Preconditions not met, skipping the creation of installer pod %s", installerPodName)
			return true, nil
		}
	}
} else if err != nil {
	return true, err
}
```
I think we should move this code to its own function, and maybe move the call next to pkg/operator/staticpod/controller/installer/installer_controller.go, lines 490 to 491 at 6b3b95e:

```go
requeue, err := c.ensureInstallerPod(ctx, operatorSpec, currNodeState)
```
Signed-off-by: jubittajohn <[email protected]>

Added unit test
Signed-off-by: jubittajohn <[email protected]>

Preconditions are only evaluated when the installer pod isn't already present
Signed-off-by: jubittajohn <[email protected]>
@jubittajohn: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@dgrisonnet any progress here? I would like to add this before feature freeze so we have some time to soak this in CEO.

I synced on slack with @jubittajohn, but there were some concerns from my team with adding new logic to the installer controller, so we were exploring some other options. The first one was the revision controller, but this was not possible because rollouts can still occur even if a revision is gated. That said, if the investigation doesn't conclude, I am fine with going with this approach.
@dgrisonnet we had a quorum check implicitly as part of the readyz implementation that was removed to fix https://issues.redhat.com/browse/OCPBUGS-19987 and PR openshift/cluster-etcd-operator#1134 - the implicit check here is that a linearizable read needs quorum where a serializable one does not. The gist is that a node can still be 'ready' while quorum is down - being able to still serve serializable requests (instead of linearizable requests). IMO, this isn't adding that much complexity to the installer controller, just a "should we continue installing or not"; the complexity of that decision lies with the actual user of the installer controller.
Thanks @dgrisonnet
@soltysh / @deads2k can we assume that static pods in library-go are deprecated then? This is already the second change we could not make to bugfix etcd operator because of concerns from the team in the last four weeks. |
@dusk125 etcd seems to have readyz checks for both serializable and linearizable requests: https://github.com/etcd-io/etcd/blob/8b1b69b1e26621e1038f78a3fe84e0919b00034d/server/etcdserver/api/etcdhttp/health.go#L252-L254. As far as I understand it, if any of these failed, the node would report not ready. So if quorum was down, the node wouldn't be able to serve linearizable requests and would report not ready. I looked at the links you shared, but I am failing to see where etcd-operator readyz comes into the picture. https://github.com/openshift/cluster-etcd-operator/blob/37ac49d0f8695be21ba8a75b0fab161c8a083121/pkg/operator/starter.go#L311-L324 should be pointing to etcd members' readyz endpoints. So there might be something that I don't understand here, and I am also wondering if the change is still relevant with the new http probes in the picture. I am not against the change in the installer controller, but from my perspective a solution relying on guard pods, which protects pods from rollouts, is better than a solution that prevents the installation of new revisions. And looking at the new probes from a high level, it seems quite promising to me.

@dgrisonnet it's true that etcd can do both; that bug makes it so that we only use serializable reads. The operator controls the pod spec for the etcd pods, and thus the readyz command for etcd comes from the operator: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L289
Added the callback `shouldRevisionInstall` to provide additional conditions for installation.
These changes are consumed by openshift/cluster-etcd-operator#1278.