🐛 gracefully shutdown reconcilers and catalogd FBC server #1688
base: main
Conversation
✅ Deploy Preview for olmv1 ready!
I added the
Codecov Report

Attention: Patch coverage is
Additional details and impacted files

```diff
@@ Coverage Diff @@
##             main    #1688      +/-   ##
==========================================
+ Coverage   67.74%   68.21%   +0.46%
==========================================
  Files          57       58       +1
  Lines        4620     4697      +77
==========================================
+ Hits         3130     3204      +74
- Misses       1265     1268       +3
  Partials      225      225
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from 8c84738 to 0b8e7f5 (compare)
Signed-off-by: Joe Lanford <[email protected]>
Force-pushed from 0b8e7f5 to ad9f130 (compare)
I'm not sure about this; I worry that using those delayed contexts might make things harder to debug and understand in the future because of the additional waiting times. Alternatively, maybe we could solve this by adding (additional) cleanup logic, for the components that need it, that runs when we get the parent context cancellation, and then placing a single 'global' timeout (e.g. in main.go) so that this logic can fire when we get the interrupt signal? WDYT?
Yeah, things could get out of hand if these delayed contexts exist all over the place, I agree. But I'm not sure cleanup logic is the right concept either. For one, we would generally want cleanup logic related to a particular CR object to run as part of reconcile for better recovery mechanics. But also, in the case of an interrupted helm release, there is no safe cleanup after the helm release has been interrupted unless there is some signal from the chart author, the user, or both. So I'm trying to think through a mechanic where:
Also, I'm not at all attached to this PR, and I can definitely see how it has downsides. Another downside is that it could make issues due to interruptions even harder to find and reproduce. This PR is mainly a lever for getting a conversation going, so I'm interested to see where this leads!
Also of note: I just looked at our helm usage, and tracing through helm applier -> helm-operator-plugins -> helm, we are using the helm install/upgrade variants that use …

Maybe we should switch over to Helm's …?
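For context on the suggestion above: Helm v3's `action` package provides context-aware variants such as `(*action.Install).RunWithContext` and `(*action.Upgrade).RunWithContext`, which observe cancellation, unlike the plain `Run` methods. A rough sketch of what switching over could look like (exact signatures may differ across helm versions, and `install` here is a hypothetical wrapper):

```go
import (
	"context"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
)

// install runs a helm install that can observe cancellation. Pairing this
// with a grace-period context bounds the critical section instead of
// leaving an interrupted release in an intermediate state indefinitely.
func install(ctx context.Context, client *action.Install, chrt *chart.Chart, vals map[string]interface{}) error {
	_, err := client.RunWithContext(ctx, chrt, vals)
	return err
}
```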
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I think a good start might be to go through all of the reconciliation logic and identify the places where an interruption could mean entering this intermediate state. Once we have that, we should better understand the scope of the problem and can think about what we can do in each specific case to mitigate or clean up any side effects back to a working state, or whether there is anything we can actually do at all. Also, fully agree about taking into account any metrics we might have, or adding new ones to get a better understanding if need be.

Just thinking out loud, but one rough idea could be to keep track of how many reconciliations are in those critical places and communicate this outside of the reconciliation loop and the specific controller, so that upon receiving the interruption signal we could, for example, adjust the behavior/timeout at a more "global" level, which should be easier to reason about and debug.

Update:
The main issue here is that the context we pass to the manager at the start governs all of its operations, including the lifecycle of its controllers, so I'm not sure what options we would have for stopping new reconciliation events in this scenario (or generally, tbh) outside of this context. From what I've seen there are options like …
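One relevant knob here: controller-runtime's `manager.Options` exposes a `GracefulShutdownTimeout` field, which bounds how long runnables get to stop after the manager's context is cancelled. A sketch of setting it (a configuration fragment under the assumption that the project uses `sigs.k8s.io/controller-runtime`; field details may vary by version):

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func newManager() (manager.Manager, error) {
	timeout := 30 * time.Second
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		// After the manager's context is cancelled, runnables get this
		// long to finish before the manager forcibly returns.
		GracefulShutdownTimeout: &timeout,
	})
}
```

This doesn't stop *new* reconciliation events from starting, but it does put an upper bound on how long in-flight work can hold up shutdown.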
Description
This PR adds a new `context` utility package with a new kind of context that is cancelled some duration after its parent context is cancelled. This is useful for reconcilers and servers, where having some extra time to complete can decrease the chance of interruptions.

The particular motivation for this is a case we've seen in e2e tests where the operator-controller pod is cancelled while it is in the middle of an installation or upgrade. If this process is interrupted, it leaves the underlying installation in an intermediate state that requires human intervention to recover.
While this change does not specifically try to make it possible to progress from this intermediate state, it does make it less likely for the interruption to happen during this critical section.
How an unfinished installation/upgrade should be treated is an open question of the project's design philosophy. The current stance is "let a human decide what to do", but that is not a reasonable answer at scale, so I think we'll need to think about further improvements to help OLM know when a roll forward (or back) is safe.
Reviewer Checklist