🐛 gracefully shutdown reconcilers and catalogd FBC server #1688
base: main
Conversation
✅ Deploy Preview for olmv1 ready!
I added the
Codecov Report

Attention: Patch coverage is
Additional details and impacted files

```diff
@@ Coverage Diff @@
##             main    #1688      +/-   ##
==========================================
+ Coverage   67.74%   68.21%   +0.46%
==========================================
  Files          57       58       +1
  Lines        4620     4697      +77
==========================================
+ Hits         3130     3204      +74
- Misses       1265     1268       +3
  Partials      225      225
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from 8c84738 to 0b8e7f5 (compare)
Signed-off-by: Joe Lanford <[email protected]>
Force-pushed from 0b8e7f5 to ad9f130 (compare)
I'm not sure about this; I worry that using those delayed contexts might make things harder to debug and understand in the future because of the additional waiting times. Alternatively, maybe we could solve this by adding (additional) cleanup logic, for the components that need it, that runs when we get the parent context cancellation, and then placing a single 'global' timeout (e.g. in main.go) so that this logic can fire when we get the interrupt signal? WDYT?
Yeah, things could get out of hand if these delayed contexts exist all over the place, I agree. But I'm not sure cleanup logic is the right concept either. For one, we would generally want cleanup logic related to a particular CR object to run as part of reconcile for better recovery mechanics. But also, in the case of an interrupted helm release, there is no safe cleanup after the helm release has been interrupted unless there is some signal from the chart author, the user, or both. So I'm trying to think through a mechanic where:
Also, I'm not at all attached to this PR, and I can definitely see how it has downsides. Another downside is that it could make issues due to interruptions even harder to find and reproduce. This PR is mainly a lever for getting a conversation going, so I'm interested to see where this leads!
Also of note: I just looked at our helm usage, and tracing through helm applier -> helm-operator-plugins -> helm, we are using the helm install/upgrade variants that use …

Maybe we should switch over to Helm's …?
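For context on the suggestion above: Helm v3's `action` package provides context-aware variants such as `(*action.Install).RunWithContext` and `(*action.Upgrade).RunWithContext`, which observe cancellation, unlike the plain `Run` methods. A rough sketch of what switching over could look like (exact signatures may differ across helm versions, and `install` here is a hypothetical wrapper):

```go
import (
	"context"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
)

// install runs a helm install that can observe cancellation. Pairing this
// with a grace-period context bounds the critical section instead of
// leaving an interrupted release in an intermediate state indefinitely.
func install(ctx context.Context, client *action.Install, chrt *chart.Chart, vals map[string]interface{}) error {
	_, err := client.RunWithContext(ctx, chrt, vals)
	return err
}
```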
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I think a good start might be to go through all of the reconciliation logic and identify the places where an interruption could mean entering this intermediate state. Once we have that, we should better understand the scope of the problem and can think about what we can do in each specific case to mitigate or clean up any side effects back to a working state, or whether there is anything we can actually do at all. Also, fully agree about taking into account any metrics we might have, or adding new ones to get a better understanding if need be.

Just thinking out loud, but one rough idea could be to keep track of how many reconciliations are in those critical places and communicate this outside of the reconciliation loop and the specific controller, so that upon receiving the interruption signal we could, for example, adjust the behavior/timeout at a more "global" level, which should be easier to reason about and debug.

Update:
The main issue here is that the context we pass to the manager at the start governs all of its operations, including the lifecycle of its controllers, so I'm not sure what options we would have for stopping new reconciliation events in this scenario (or generally, tbh) outside of this context. From what I've seen there are options like …
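One relevant knob here: controller-runtime's `manager.Options` exposes a `GracefulShutdownTimeout` field, which bounds how long runnables get to stop after the manager's context is cancelled. A sketch of setting it (a configuration fragment under the assumption that the project uses `sigs.k8s.io/controller-runtime`; field details may vary by version):

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func newManager() (manager.Manager, error) {
	timeout := 30 * time.Second
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		// After the manager's context is cancelled, runnables get this
		// long to finish before the manager forcibly returns.
		GracefulShutdownTimeout: &timeout,
	})
}
```

This doesn't stop *new* reconciliation events from starting, but it does put an upper bound on how long in-flight work can hold up shutdown.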
Description
This PR adds a new `context` utility package with a new kind of context that is cancelled some duration after its parent context is cancelled. This is useful for reconcilers and servers, where having some extra time to complete can decrease the chance of interruptions.

The particular motivation for this is a case we've seen in e2e tests where the operator-controller pod is cancelled while it is in the middle of an installation or upgrade. If this process is interrupted, it leaves the underlying installation in an intermediate state that requires human intervention to recover.
While this change does not specifically try to make it possible to progress from this intermediate state, it does make it less likely for the interruption to happen during this critical section.
How an unfinished installation/upgrade should be treated is an open question of the project's design philosophy. The current stance is "let a human decide what to do", but that is not a reasonable answer at scale, so I think we'll need to think about further improvements to help OLM know when a roll forward (or back) is safe.
Reviewer Checklist