-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improvement] Remove severity levels from controller logic #589
Conversation
… linodecluster_controller.go
if !reconciler.HasConditionSeverity(clusterScope.LinodeCluster, clusterv1.ReadyCondition, clusterv1.ConditionSeverityError) { | ||
if !reconciler.HasStaleCondition(clusterScope.LinodeCluster, | ||
clusterv1.ReadyCondition, | ||
reconciler.DefaultTimeout(r.ReconcileTimeout, reconciler.DefaultClusterControllerReconcileTimeout)) { | ||
logger.Info("re-queuing cluster/nb deletion") | ||
return ctrl.Result{RequeueAfter: reconciler.DefaultClusterControllerReconcileDelay}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what our approach should be here. Currently we are doing:
if err != nil {
hasStaleCondition() {
# try delete every 5 seconds
return {5 sec}, nil
}
# try delete with exponential backoff
return {...}, err
}
I wonder for delete, we should just have delete every 5 or 10 seconds and not have exponential backoff. With exponential backoffs, deletes could get stuck for 1000 secs (16+ mins) max if something goes wrong. Looking for thoughts from others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this yesterday. Do we know how many times does it have to fail to get stuck for 1000+ sec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I have seen in past for exponential backoff:
0.004881143569946289
0.022043943405151367
0.04145383834838867
0.08641910552978516
0.1618349552154541
0.3228580951690674
0.6422019004821777
1.2825469970703125
2.561889171600342
5.122460842132568
10.242242097854614
20.482059001922607
40.962461948394775
81.92205500602722
0.0189058780670166
163.84187817573547
655.36194896698
1000.0018429756165
1000.0022790431976
1000.0020320415497
1000.0018348693848
So mostly failures 16-17 times. I see in most of the places we are doing exponential first and then serial when its a stale condition, I wonder if its a fixed pattern we want to use here as well or we want to have fixed for delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be okay to do 10 or more seconds but I'm worried that a bad delete request could slow down the controllers from processing other requests Or could the controller handle that well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since controllers pick from queue, a bad request would mean that request will keep on getting requeued every 10 secs if its using fixed requeue logic or exponentially if its using exponential logic. Since default concurrency is 10, I don't think it will slow down things much as its just for delete and we don't have too many deletes all the time. With exponential delete, I just see that someone might see their resources not getting cleaned up on UI fast if its stuck in exponential backoff. But that also is a rare case. I am ok with current approach, just wanted some extra eyes and thoughts here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with doing 10 sec delayed requeue. Its a rare case but probably will happen :)
reconciler.DefaultTimeout(r.ReconcileTimeout, reconciler.DefaultClusterControllerReconcileTimeout)) { | ||
conditions.MarkFalse(machineScope.LinodeMachine, ConditionPreflightLinodeVPCReady, string(cerrs.CreateClusterError), "", "%s", err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this lead to lastTransitionTime
getting updated for the condition? If yes, then in next iteration HasStaleCondition()
will return false as it only looks at lastTransitionTime now. Previously, if it was already stale (by checking error severity), we were marking it as stale even if lastTransitionTime changed. Not sure here how it will behave.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch! Currently MarkFalse() will update the last transition time but I won't be using this func anymore in the follow up. I'll use Set() func which give more granular control over how to set conditions and I can just not set it then.
But is this something to look out for in a lot of different places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see it being used in couple of other places in the PR.
Looking at the current main branch's RecordDecayingCondition()
method, I do see it running conditions.MarkFalse()
everytime. So if previously it was finding the condition as stale, its possible conditions.MarkFalse()
wasn't changing the lastTransitionTime
if it matched with the existing condition.
If thats the case, then with the latest code in this PR, there will be no change in condition even if we do conditions.MarkFalse()
inside or outside of HasStaleCondition() block. In that case, both below examples look same to me:
if reconciler.HasStaleCondition(...) {
conditions.MarkFalse(X, Y, Z, "", "%s", err.Error())
return {...}, nil
}
conditions.MarkFalse(X, Y, Z, "", "%s", err.Error())
return {}, err
Is same as:
conditions.MarkFalse(X, Y, Z, "", "%s", err.Error())
if reconciler.HasStaleCondition(...) {
return {...}, nil
}
return {}, err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup I confirmed that lastTransitionTime
is not changed or update if the condition stays the same. MarkFalse() uses Set() func which has this comment over it:
// NOTE: If a condition already exists, the LastTransitionTime is updated only if a change is detected
// in any of the following fields: Status, Reason, Severity and Message.
Other than couple of comments I have to test and confirm, the PR looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
TODOs: