Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARO Operator - Reduce unnecessary reconciles by limiting watched resources/changes #3292

Merged
merged 2 commits into from
Jul 9, 2024

Conversation

tsatam
Copy link
Collaborator

@tsatam tsatam commented Nov 27, 2023

Which issue this PR addresses:

Another proposed fix for ARO-4632
May help with ARO-5216 as well

What this PR does / why we need it:

Various changes to how the ARO operator watches resources:

  • Extract all common predicates used by controllers to shared package
  • Only watch for generation change in ARO cluster resource on most controllers (avoid Reconciles when just status changes)

Test plan for issue:

  • Unit tests continue to pass
  • E2E tests continue to pass
  • Operator functionality not covered by existing tests but likely to be impacted by the change have new tests (TODO)

Note - I am expecting some E2E tests to become more flaky because of this change, as we have many E2E tests that are dependent on the operator reconciling more often than is necessary, and expecting an operator controller to perform changes without actually modifying any of the resources that controller is actually watching. I am planning to fix up any observed occurrences of such test flakes with individual commits on this branch for each test/controller, and each commit will likely contain changes to both the tests as well as the watched resources for that controller.

Is there any documentation that needs to be updated for this PR?

N/A

Additional ideas not implemented in this PR (yet?)

  • For the majority of controllers - only watch for changes on a specific set of cluster flags on the ARO cluster resource, rather than the whole spec
  • Similar to above, implement more targeted Watches for other watched resources that only reconcile when the specific fields actually used by the controller change

@tsatam tsatam changed the title Aro 4632/operator predicates ARO Operator - Reduce unnecessary reconciles by limiting watched resources/changes Nov 27, 2023
@tsatam tsatam force-pushed the ARO-4632/operator-predicates branch from a015b39 to 6ee7bae Compare November 27, 2023 14:31
@github-actions github-actions bot added the needs-rebase branch needs a rebase label Jan 4, 2024
Copy link

github-actions bot commented Jan 4, 2024

Please rebase pull request.

@hawkowl
Copy link
Collaborator

hawkowl commented Feb 5, 2024

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tsatam tsatam force-pushed the ARO-4632/operator-predicates branch from 92dc9fc to 08bf84c Compare February 9, 2024 16:13
@tsatam tsatam added the hold Hold label Feb 9, 2024
@tsatam
Copy link
Collaborator Author

tsatam commented Feb 9, 2024

I'm validating any flake impact to e2e on this, and will include any required fixups on this PR. Adding hold until that is complete.

Copy link
Collaborator

@mociarain mociarain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. 2 thoughts that don't block this at all

  • A quick grep shows me that we have 29 instances of NewControllerManagedBy in the repo and this PR will make more than half use the GenerationChangedPredicate. I don't have a great sense of how big a problem all these extra calls were/are/will be but did we ever consider wrapping the controllerBuilder and changing the default from "load" to "quiet"?
  • I don't usually like this pattern but what would you say about wrapping GenerationChangedPredicate in something that lives with the other predicates for discoverability? Since the default with the controllers is to shout and we need to add this force them to only fire on "real" changes making that option very clear seems like it might be worth it. What do you think?

// Any changes will trigger reconcile, but only for that config.
return b.
Named(ControllerName).
// Controller adds ControllerManagedBy to KubeletConfit created by this controller.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Controller adds ControllerManagedBy to KubeletConfit created by this controller.
// Controller adds ControllerManagedBy to KubeletConfig created by this controller.

I assume

Comment on lines +117 to +118
Watches(&source.Kind{Type: &machinev1beta1.Machine{}}, &handler.EnqueueRequestForObject{}, builder.WithPredicates(predicates.MachineRoleMaster)). // to reconcile on master machine replacement
Watches(&source.Kind{Type: &machinev1beta1.MachineSet{}}, &handler.EnqueueRequestForObject{}, builder.WithPredicates(predicates.MachineRoleWorker)). // to reconcile on worker machines
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Borderline Rant - Not Blocking Anything 🙃 ]

These 2 lines are very long. I know Go doesn't have a max line length but we live in a physical world with fixed screens. A quick grep '.\{120\}' -r ./pkg/*/*.go | wc -l tells me we have > 1000 lines longer than 120.

  • Have we ever considered adding a max line length?
  • Has this ever bothered anyone else?
  • Do I need to stop worrying and learn to love some line wrapping plugin???

@tsatam
Copy link
Collaborator Author

tsatam commented Feb 13, 2024

  • A quick grep shows me that we have 29 instances of NewControllerManagedBy in the repo and this PR will make more than half use the GenerationChangedPredicate. I don't have a great sense of how big a problem all these extra calls were/are/will be but did we ever consider wrapping the controllerBuilder and changing the default from "load" to "quiet"?

I'm not sure if it's possible or even desirable to make this change a "default", since GenerationChangedPredicate here is being applied specifically to the ARO cluster resource, not all resources. I also personally prefer to keep the composition of watch rules within each controller, so you can refer to them at-a-glance when making changes to them, rather than having to refer to multiple layers of abstraction to put together the rules. I think your second point is probably more preferable for reusability and deduplication here.

  • I don't usually like this pattern but what would you say about wrapping GenerationChangedPredicate in something that lives with the other predicates for discoverability? Since the default with the controllers is to shout and we need to add this force them to only fire on "real" changes making that option very clear seems like it might be worth it. What do you think?

Along the same lines as my personal preference above, I think just extracting out the GenerationChangedPredicate wouldn't help discoverability here. I do have some ideas for extracting out this specific predicate, but I need to determine a path forward and choose one:

  • Extract predicate.And(predicates.AROCluster, predicate.GenerationChangedPredicate{})) into a shared/reusable token (simplest option)
  • Make more opinionated functions that return predicates (names can be workshopped, they're intentionally verbose for now):
    • ChangesInAROFeatureFlags(flags []string)
    • ChangesInSpecificAROSpecProperty() (for the few controllers that do act on specific fields in the CRD)
    • ChangesInAROStatus() (only used by ClusterOperatorARO controller to propagate controller statuses to operator status)

@mociarain
Copy link
Collaborator

I'm not sure if it's possible or even desirable to make this change a "default", since GenerationChangedPredicate here is being applied specifically to the ARO cluster resource, not all resources.

Sorry I wasn't clear. I was suggesting something like a NewAROControllerManagedBy that just adds GenerationChangedPredicate but I take your point all the same.

Along the same lines as my personal preference above, I think just extracting out the GenerationChangedPredicate wouldn't help discoverability here.

Yeah I buy this. I thought after this PR we just used predicates and predicate.GenerationChangedPredicate and could collapse them to one place but predicate is all over.

Thanks for explaining that and FWIW your suggestions make sense to me.

@mociarain
Copy link
Collaborator

This changes look great. What's stopping us merging it, since the E2E tests passed? Would we like to run the E2E pipeline against it to get a larger sample size? If so we have a new joiner and this would be a nice task for them to see some parts of the system.

What do you think @tsatam?

@tsatam
Copy link
Collaborator Author

tsatam commented Jul 9, 2024

This changes look great. What's stopping us merging it, since the E2E tests passed? Would we like to run the E2E pipeline against it to get a larger sample size? If so we have a new joiner and this would be a nice task for them to see some parts of the system.

What do you think @tsatam?

I think this change is mergeable, there are some future follow-up ideas in the description of the PR but they can be done later. This work was ultimately de-prioritized due to being fairly low on the impact/effort ratio, we ended up remediating the issue this PR seeked to reduce through more direct changes, but it is still a net positive IMO.

@mociarain mociarain merged commit ff69ffe into Azure:master Jul 9, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants