
[improvement] Remove severity levels from controller logic #589

Merged · komer3 merged 10 commits into main from remove-severity-levels on Dec 13, 2024

Conversation

@komer3 (Contributor) commented Dec 9, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@komer3 komer3 changed the title Remove severity levels from controller logic [improvement] Remove severity levels from controller logic Dec 9, 2024
// Diff hunk under review: the severity-based check on the first line is replaced by the stale-condition check that follows.
if !reconciler.HasConditionSeverity(clusterScope.LinodeCluster, clusterv1.ReadyCondition, clusterv1.ConditionSeverityError) {
if !reconciler.HasStaleCondition(clusterScope.LinodeCluster,
	clusterv1.ReadyCondition,
	reconciler.DefaultTimeout(r.ReconcileTimeout, reconciler.DefaultClusterControllerReconcileTimeout)) {
	logger.Info("re-queuing cluster/nb deletion")
	return ctrl.Result{RequeueAfter: reconciler.DefaultClusterControllerReconcileDelay}, nil
@rahulait (Collaborator) commented Dec 10, 2024
I wonder what our approach should be here. Currently we are doing:

if err != nil {
    if hasStaleCondition() {
        # try delete every 5 seconds
        return {5 sec}, nil
    }
    # try delete with exponential backoff
    return {...}, err
}

I wonder whether, for delete, we should just retry every 5 or 10 seconds and not have exponential backoff at all. With exponential backoff, deletes could get stuck for up to 1000 secs (16+ mins) if something goes wrong. Looking for thoughts from others.
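
For context on the two retry styles above: in controller-runtime, returning a non-nil error from Reconcile sends the request back through the workqueue's rate limiter (exponential backoff by default), while returning a Result with RequeueAfter set and a nil error requeues after a fixed delay. A minimal sketch of the pattern being discussed, not the project's actual code:

package controllers

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileDelete illustrates the two requeue styles: a fixed delay via
// RequeueAfter versus the workqueue's exponential backoff via a returned error.
func reconcileDelete(deleteErr error, stale bool) (ctrl.Result, error) {
	if deleteErr != nil {
		if stale {
			// Fixed delay: retry the delete every 5 seconds.
			return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
		}
		// Exponential backoff: returning the error lets the workqueue's
		// rate limiter pick the (growing) requeue delay.
		return ctrl.Result{}, deleteErr
	}
	return ctrl.Result{}, nil
}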

@komer3 (Contributor, Author)

I was thinking about this yesterday. Do we know how many times it has to fail before it gets stuck at 1000+ sec?

@rahulait (Collaborator) commented Dec 10, 2024

This is what I have seen in the past for exponential backoff (requeue delays in seconds):

   0.004881143569946289
   0.022043943405151367
   0.04145383834838867
   0.08641910552978516
   0.1618349552154541
   0.3228580951690674
   0.6422019004821777
   1.2825469970703125
   2.561889171600342
   5.122460842132568
   10.242242097854614
   20.482059001922607
   40.962461948394775
   81.92205500602722
   0.0189058780670166
   163.84187817573547
   655.36194896698
   1000.0018429756165
   1000.0022790431976
   1000.0020320415497
   1000.0018348693848

So roughly 16-17 failures before it hits the cap. I see that in most places we do exponential backoff first and then switch to a fixed interval once the condition is stale; I wonder if that's the pattern we want to use here as well, or whether we want a fixed interval for delete.
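
Those delays line up with client-go's default per-item exponential failure rate limiter, which starts around 5ms and doubles on every failure up to a 1000s cap (assuming the controller uses the default rate limiter). A small standalone sketch of that schedule shows the cap is reached after roughly 17-18 consecutive failures:

package main

import (
	"fmt"
	"time"
)

// Prints the default exponential requeue schedule: 5ms doubling per failure,
// capped at 1000s, which matches the delays observed above.
func main() {
	base, maxDelay := 5*time.Millisecond, 1000*time.Second
	delay := base
	for i := 1; delay < maxDelay; i++ {
		fmt.Printf("failure %2d: requeue after %v\n", i, delay)
		delay *= 2
	}
	fmt.Printf("later failures: requeue after %v (cap)\n", maxDelay)
}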

@komer3 (Contributor, Author)

I think it would be okay to do 10 or more seconds, but I'm worried that a bad delete request could slow the controllers down from processing other requests. Or would the controller handle that well?

@rahulait (Collaborator)

Since controllers pick items off a queue, a bad request would just keep getting requeued: every 10 secs with fixed requeue logic, or with increasing delays with exponential logic. Since the default concurrency is 10, I don't think it will slow things down much, as it's only for delete and we don't have too many deletes at any given time. With exponential delete, the downside is that someone might not see their resources getting cleaned up in the UI quickly if the request is stuck in exponential backoff. But that is also a rare case. I am ok with the current approach, just wanted some extra eyes and thoughts here.
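
For reference, the concurrency mentioned here is controller-runtime's MaxConcurrentReconciles. A hedged sketch of how a controller sets it; the reconciler stub and the infrav1 import path are placeholders, not copied from this repo:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	infrav1 "github.com/linode/cluster-api-provider-linode/api/v1alpha2" // assumed import path
)

// LinodeClusterReconciler is a stub standing in for the real reconciler.
type LinodeClusterReconciler struct{}

func (r *LinodeClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *LinodeClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&infrav1.LinodeCluster{}).
		WithOptions(controller.Options{
			// Reconcile up to 10 objects in parallel, so one object stuck in
			// requeue/backoff does not block the rest of the queue.
			MaxConcurrentReconciles: 10,
		}).
		Complete(r)
}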

@komer3 (Contributor, Author)

I'm fine with doing a 10 sec delayed requeue. It's a rare case but will probably happen :)

// Diff hunk under review (partial context):
reconciler.DefaultTimeout(r.ReconcileTimeout, reconciler.DefaultClusterControllerReconcileTimeout)) {
conditions.MarkFalse(machineScope.LinodeMachine, ConditionPreflightLinodeVPCReady, string(cerrs.CreateClusterError), "", "%s", err.Error())
@rahulait (Collaborator)

Will this lead to lastTransitionTime getting updated for the condition? If yes, then in the next iteration HasStaleCondition() will return false, since it only looks at lastTransitionTime now. Previously, if it was already stale (determined by checking error severity), we kept marking it as stale even if lastTransitionTime changed. Not sure how it will behave here.
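
For readers outside the repo: the check under discussion treats a condition as stale when its lastTransitionTime is older than a timeout. A hypothetical sketch of such a check using cluster-api's conditions helpers; the real HasStaleCondition lives in this project's reconciler package and may differ:

package reconciler

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// hasStaleCondition reports whether the given condition has not transitioned
// for longer than timeout. Because it only inspects lastTransitionTime,
// anything that bumps that timestamp (e.g. MarkFalse with a changed message)
// resets the staleness clock.
func hasStaleCondition(obj conditions.Getter, condType clusterv1.ConditionType, timeout time.Duration) bool {
	cond := conditions.Get(obj, condType)
	if cond == nil {
		return false
	}
	return time.Since(cond.LastTransitionTime.Time) > timeout
}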

@komer3 (Contributor, Author)

Ah, good catch! Currently MarkFalse() will update the lastTransitionTime, but I won't be using this func anymore in the follow-up. I'll use the Set() func, which gives more granular control over how conditions are set, and I can just not update it then.

But is this something to look out for in a lot of different places?

@rahulait (Collaborator) commented Dec 10, 2024

Yeah, I see it being used in a couple of other places in the PR.

Looking at the current main branch's RecordDecayingCondition() method, I do see it calling conditions.MarkFalse() every time. So if it was previously finding the condition to be stale, it's possible conditions.MarkFalse() wasn't changing the lastTransitionTime as long as it matched the existing condition.

If that's the case, then with the latest code in this PR there will be no change to the condition whether we call conditions.MarkFalse() inside or outside of the HasStaleCondition() block. In that case, both of the examples below look the same to me:

if reconciler.HasStaleCondition(...) {
    conditions.MarkFalse(X, Y, Z, "", "%s", err.Error())
    return {...}, nil
}
conditions.MarkFalse(X, Y, Z, "", "%s", err.Error())
return {}, err

is the same as:

conditions.MarkFalse(X, Y, Z, "", "%s", err.Error())
if reconciler.HasStaleCondition(...) {
    return {...}, nil
}
return {}, err

@komer3 (Contributor, Author)

Yup, I confirmed that lastTransitionTime is not changed or updated if the condition stays the same. MarkFalse() uses the Set() func, which has this comment above it:

// NOTE: If a condition already exists, the LastTransitionTime is updated only if a change is detected
// in any of the following fields: Status, Reason, Severity and Message.
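
A hedged illustration of the Set()-based direction mentioned above, using cluster-api's v1beta1 conditions package; the function name and shape are hypothetical, not the PR's final code:

package controllers

import (
	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// markConditionFailed builds the condition explicitly and hands it to
// conditions.Set. Per the quoted comment, Set only bumps LastTransitionTime
// when Status, Reason, Severity, or Message actually change, so re-setting an
// identical condition keeps the existing transition time (and the staleness
// clock) intact.
func markConditionFailed(obj conditions.Setter, condType clusterv1.ConditionType, reason string, err error) {
	conditions.Set(obj, &clusterv1.Condition{
		Type:    condType,
		Status:  corev1.ConditionFalse,
		Reason:  reason,
		Message: err.Error(),
	})
}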

@rahulait (Collaborator)

Other than a couple of comments I have to test and confirm, the PR looks good to me.

@rahulait (Collaborator) left a comment

LGTM

@komer3 komer3 merged commit a431a05 into main Dec 13, 2024
10 checks passed
@komer3 komer3 deleted the remove-severity-levels branch December 13, 2024 18:47