add support for fast operator lock on unclean restart #824
base: master
Conversation
Signed-off-by: Sebastian Sch <[email protected]>
Thanks for your PR,
To skip the vendor CIs, maintainers can use one of:
Pull Request Test Coverage Report for Build 12559468654 (Details)
💛 - Coveralls
Hi @adrianchiris @ykulazhenkov, when you have time please take a look at this PR. The failing lane is because of docker pull limits for flannel... I am working on a workaround to switch the images.
// we remove the lease to speed up the restart process
if len(pods.Items) == 1 {
	setupLog.Info("only one operator pod exists, removing lease to speed up controller lock")
	err = kubeClient.DeleteAllOf(listContext, &coordinationv1.Lease{}, client.InNamespace(vars.Namespace))
Between this and the above list, there may be an additional pod spawned (I know chances are slim...).
I wonder if we really need to do something about it or just let the 250 seconds pass.
I think we need to; we have had many reports that restarting the node + creating a policy takes a huge amount of time.
We can also see that in the RDMA configuration functest, for example: on single node it failed before I applied this PR to the tests, as WaitForSriovToStable
or the SriovNetwork creation was failing, complaining the net-attach-def didn't exist, and that is because the operator was still waiting on the lock.
@@ -267,6 +270,34 @@ func main() {
	}
	// +kubebuilder:scaffold:builder

// To speed up the leader election process we check if the currently running code is the only operator we remove the
any chance to move the leader election logic to a separate func? :) main is getting long
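For illustration, a minimal sketch of what such a helper could look like, assembled from the fragments quoted in this review. The function name removeStaleLeaderLease and the 30-second timeout are assumptions; kubeClient, setupLog, vars.Namespace, and the label selector are the names used in the diff, and the usual imports in main.go (context, time, corev1, coordinationv1, metav1, controller-runtime client) are assumed to be present:

```go
// removeStaleLeaderLease is a hypothetical helper extracted from main(): if the
// current pod is the only operator pod, the leader-election Lease left behind by
// a previous, uncleanly stopped instance is deleted so the new instance does not
// have to wait for the old lease to expire.
func removeStaleLeaderLease(kubeClient client.Client) error {
	// the 30-second timeout is an assumed value, not taken from the PR
	listContext, listContextCancel := context.WithTimeout(context.Background(), 30*time.Second)

	pods := &corev1.PodList{}
	err := kubeClient.List(listContext, pods, &client.ListOptions{
		Namespace: vars.Namespace,
		Raw: &metav1.ListOptions{
			LabelSelector:   "name=sriov-network-operator",
			ResourceVersion: "0"}})
	if err != nil {
		listContextCancel()
		return err
	}

	// we remove the lease to speed up the restart process
	if len(pods.Items) == 1 {
		setupLog.Info("only one operator pod exists, removing lease to speed up controller lock")
		err = kubeClient.DeleteAllOf(listContext, &coordinationv1.Lease{}, client.InNamespace(vars.Namespace))
		if err != nil {
			listContextCancel()
			return err
		}
	}

	listContextCancel()
	return nil
}
```

main() would then only need a single call to removeStaleLeaderLease(kubeClient) before starting the manager.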
Namespace: vars.Namespace,
Raw: &metav1.ListOptions{
	LabelSelector:   "name=sriov-network-operator",
	ResourceVersion: "0"}})
+1 on the resourceVersion: 0
	LabelSelector:   "name=sriov-network-operator",
	ResourceVersion: "0"}})
if err != nil {
	listContextCancel()
nit: you could just defer listContextCancel()
that said, I'm fine either way, so up to you
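A short sketch of the suggested pattern, assuming the logic lives in a function that returns an error (for example the helper sketched above) rather than directly in main(), and again using a made-up 30-second timeout:

```go
listContext, listContextCancel := context.WithTimeout(context.Background(), 30*time.Second)
// a single deferred cancel runs on every return path, so the explicit
// listContextCancel() call in the error branch is no longer needed
defer listContextCancel()

pods := &corev1.PodList{}
err := kubeClient.List(listContext, pods, &client.ListOptions{Namespace: vars.Namespace})
if err != nil {
	return err
}
```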
To speed up the leader election process, we check if the currently running operator is the only operator pod. If that is the case we remove the lock CR. This handles cases where the node running the operator gets rebooted without a clean exit.
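For background, the lock being removed here is the coordination.k8s.io Lease that controller-runtime's leader election maintains when the manager is configured with leader election enabled. A rough sketch using controller-runtime's ctrl.Options fields; the LeaderElectionID value is hypothetical and not taken from this repository:

```go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	LeaderElection:          true,
	LeaderElectionID:        "sriov-network-operator-lock", // hypothetical lock name
	LeaderElectionNamespace: vars.Namespace,
	// The election state is stored in a coordination.k8s.io/v1 Lease object.
	// If the leader pod dies without releasing it, a replacement pod has to
	// wait for the lease to expire before it can start reconciling, unless
	// the stale lease is removed first as this PR does.
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}
// controllers are then registered on mgr and mgr.Start() is called
```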