🐛 Handle interrupted helm releases in applier #1776
base: main
Conversation
@joelanford WDYT? Are there any additional considerations with this approach?
@azych, my main concern is whether helm allows any objects from the release manifest to be applied while still in the pending phase. If the answer is "yes", I'm concerned that rolling back could mean that some workload from the … Maybe the better option would be to: …
@joelanford I agree that this might be a valid concern in non-installation scenarios, though I also wonder: if we ever end up in an interrupted state that prevents a successful rollback, couldn't the same (or a similar) state issue prevent us from progressing with a smooth upgrade when we instead remove the release state info by deleting the secret(s)? That said, I followed your idea, and from some basic testing so far it seems able to resolve the interruption regardless of which specific action was interrupted (installation/upgrade).
(force-pushed from c073dfc to 6d62085)
```diff
@@ -220,3 +262,20 @@ func (p *postrenderer) Run(renderedManifests *bytes.Buffer) (*bytes.Buffer, error) {
 	}
 	return &buf, nil
 }
+
+func (h *Helm) deleteReleaseSecrets(ctx context.Context, releaseName string) error {
+	return h.secretsClientGetter.Secrets(h.systemNamespace).DeleteCollection(
```
This makes assumptions about the underlying storage driver that we should probably avoid. I'm not sure everything is plumbed through in a usable way from the `ActionClientGetter`, so we may need a change in `helm-operator-plugins`, but what would be ideal would be to find a way to get access to the `action.Configuration` created here: https://github.com/operator-framework/helm-operator-plugins/blob/3d11ac3f0107f9553dbb3c0060a54f9684c95b2d/pkg/client/actionclient.go#L167 and then use its `.Releases.Delete()` to delete the underlying release.
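A minimal sketch of that direction, assuming we can get hold of the `*action.Configuration` built inside helm-operator-plugins; the package and helper names and the way `cfg` would be obtained are assumptions, while `Releases.Last`, `Releases.Delete`, `Status.IsPending`, and `driver.ErrReleaseNotFound` are existing helm v3 APIs:

```go
package applier

import (
	"errors"
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// deletePendingRelease (hypothetical helper) removes the stored release record
// for releaseName if, and only if, its latest revision is stuck in a pending-*
// state. It goes through the release record store, so it works regardless of
// which storage driver (secrets, configmaps, memory, ...) backs it.
func deletePendingRelease(cfg *action.Configuration, releaseName string) error {
	rel, err := cfg.Releases.Last(releaseName)
	if errors.Is(err, driver.ErrReleaseNotFound) {
		return nil // no release records, nothing to clean up
	}
	if err != nil {
		return fmt.Errorf("getting last release for %q: %w", releaseName, err)
	}
	if !rel.Info.Status.IsPending() {
		return nil // not interrupted; leave history alone
	}
	if _, err := cfg.Releases.Delete(rel.Name, rel.Version); err != nil {
		return fmt.Errorf("deleting pending release %q (v%d): %w", rel.Name, rel.Version, err)
	}
	return nil
}
```

This only touches release metadata; objects already applied from the manifest are left in place, which is where the adoption-logic discussion below comes in.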
I agree that this would be preferable and less error-prone, since even though we control the storage driver we initialize the `ActionConfigGetter` with in the first place, we'd have to track those changes in two places if they ever happen.

As for getting access to `action.Configuration`: I think we should avoid exposing it via the `actionClient` as a whole, as it's currently private in both `helm-operator-plugins` and `helm`, making it immutable and predictable in the context of `ActionClientFor` and all the actual actions.

Because the `helm-operator-plugins` action client already handles `Release` objects and has an `Uninstall`, adding something like `DeleteMetadata(releaseName)` that calls `.Releases.Delete()` would make the most sense to me. WDYT?

Sidenote: theoretically we also have access to the `ActionConfigGetter`, which we initialize in `main` and then use to initialize the `actionClient`, but I don't think it's a viable option here, because it would also mean making assumptions that we do not use the `Memory` driver, for example, in which case our storages would be different each time we call `ActionConfigFor`, and that is the first call `ActionClientFor` makes.
> something like `DeleteMetadata(releaseName)` that calls `.Releases.Delete()`

I kinda feel like that particular interface method would put us on a slippery slope of stuffing more and more into the `ActionClient` interface. Maybe I'm wrong and this will be the last one, but we did also recently add the `History` method.

I'm kinda tempted to put a `Config() (*action.Configuration)` escape hatch method in instead, which would definitely prevent the slippery slope. But that does somewhat have the downside that it returns a mutable `action.Configuration`.

But! If you trace `ActionClientGetter -> ActionConfigGetter -> ActionConfigFor`, you'll see that we build a new `action.Configuration` struct during every reconcile anyway. So yes, we would return a mutable `action.Configuration`, but we throw that one away when we are finished with the reconcile.
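A rough illustration of that escape hatch idea, not the final helm-operator-plugins API; the interface and method names here are stand-ins:

```go
package client

import "helm.sh/helm/v3/pkg/action"

// ActionInterface is a trimmed-down stand-in for the helm-operator-plugins
// action client interface; only the part relevant to this discussion is shown.
type ActionInterface interface {
	// ...existing methods such as Get, History, Install, Upgrade, Uninstall...

	// Config exposes the underlying action.Configuration. Because a fresh
	// configuration is built for every reconcile, handing out a mutable
	// pointer only affects that reconcile's throwaway instance.
	Config() *action.Configuration
}
```

With something like this in place, the applier can reach `cfg.Releases` (the release record store) without assuming a particular storage driver, which is what the pending-release cleanup needs.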
To me this still has the drawback of exposing a whole mutable config, which comes from the fact that we'll be making it available via a library method. While I agree it doesn't really matter for us, it might for other users of this lib.

Having said that, I'll go with it. At this point every option has some drawback(s), and I don't think we are going to find a perfect solution; this probably comes as close to one as possible.
I think there is always the possibility of failing to move forward. The primary cases I can think of for that are: …
True, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it, there is a chance the same broken state would prevent moving forward when we use a different workaround to deal with the post-interruption situation. I'd differentiate between a 'broken state' caused by interruption and a 'failed' state, which can definitely happen.
I see what you mean. My unsubstantiated hunch is that failures during upgrades will largely fall into two buckets: …

IMO, (1) isn't really a failure. It's an indefinite pause in the progression of the upgrade that the helm client doesn't deal with well. I'm primarily concerned with recovering gracefully, where we are able to simply pick up where we left off, and what I'm hoping is that there's a way to convince helm to do that. My original proposal of "if the most recent release secret is pending, delete it" would rely on helm's adoption logic. What I would envision happening is that the next time through (after deleting the latest pending release), we would: …
This exposes action.Configuration of a specific actionClient instance.

General motivation behind it is for users to have 'escape hatch' access to the underlying dependency objects that action.Configuration holds, which include for example:
- release records store
- registry client
- Kubernetes API client

More specific motivation comes from a scenario where access to the underlying release store instance is needed in order to gracefully handle pending helm releases which were previously interrupted. For additional context on this, see: operator-framework/operator-controller#1776

Signed-off-by: Artur Zych <[email protected]>
One question that came to mind for me after we wrapped our discussion today: will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Another thing that came to mind (going back to the assumption I was talking about in the discussion): if we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle? Consider this scenario: …

Since we partially rolled out …, what if the scenario changes where …? For example: suppose there are two datasets managed by the operator's manifests, where: …

If v2 partially rolls out such that one dataset gets migrated, but the other doesn't, then we can't migrate to …. Therefore, it seems like the only way to do this safely is to complete the upgrade to …

Thoughts?
Looking at what actually happens within …. So while resuming the interrupted release does not mean it will succeed (or has succeeded already), it should mean that it did not fail up to the point of interruption, although theoretically the interruption could still happen just before the 'pending' state is replaced and recorded, in which case the failure will be 'masked' by the 'pending' state.
Theoretically, if we can (and want to) detect those new upgrade paths, I could also see attempting a rollback as a valid option. That way, if there were upgrade path changes, we can do a fresh start taking those into consideration. If not, and assuming a rollback will be successful and is properly set up (not sure how concerned we should be about whether it is), we simply retry.
My general thoughts about this go back to the "available" workarounds for this issue that are present in the community (e.g. in those threads I linked to in the doc comment). What's repeatedly being mentioned there is: …
At this point I don't really have a 'clear winner' here that I could say is the absolute best way to approach this. All of those are really 'hacks' (maybe uninstall/rollback less so) that just work around the problem, and the more we discuss it, the more scenarios we seem to come up with where each of those workarounds might not play out perfectly and we might not be able to resolve the situation to a 'sane' state. My question would be: do we really need to ensure it? Should we dig deeper and hope we find a bulletproof solution (which I don't think there is, given that the main problem isn't really on our side)? Or, given the effort so far, should we just consider picking one of those "established" ideas to provide a "best effort" workaround on our side?

BTW, there has been some activity around resolving this on the helm side (either in 3.x or 4), see: helm/helm#11863 and helm/community#354
Introduces a workaround for 'interrupted' helm releases which enter a 'pending' (-install/-upgrade/-rollback) state. If that happens, for example because of an immediate application exit while one of those operations is in flight, helm is not able to resolve it automatically, which means we end up in a permanent reconcile error state. One of the workarounds repeatedly suggested in the community is to remove the metadata of the pending release, which is the solution chosen here. For full context see: operator-framework#1776 and helm/helm#5595
(force-pushed from 6d62085 to 7a31a06)
(force-pushed from 7a31a06 to 44b5e31)
(force-pushed from 44b5e31 to 75be8c4)
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@           Coverage Diff           @@
## main #1776 +/- ##
==========================================
+ Coverage 68.34% 68.37% +0.03%
==========================================
Files 63 63
Lines 5117 5145 +28
==========================================
+ Hits 3497 3518 +21
- Misses 1390 1395 +5
- Partials 230 232 +2
Description
Introduces a workaround for 'interrupted' helm releases, which enter a 'pending' (-install/-upgrade/-rollback) state.
If that happens, for example because of an immediate application exit while one of those operations is in flight, helm is not able to resolve it automatically, which means we end up in a permanent reconcile error state.
One of the workarounds repeatedly suggested in the community is to remove the metadata of the pending release, which is the solution chosen here.
For full context, see the discussion in this PR and:
helm/helm#5595
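For illustration, a hedged sketch of the secrets-driver-specific variant of this workaround that the review discussion above started from (the helper name is hypothetical; the label keys are the ones helm v3's secrets storage driver sets on its release Secrets, and the review feedback ultimately favored going through the release record store instead):

```go
package applier

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePendingReleaseSecrets (hypothetical helper) removes only the Secrets
// that back pending-* revisions of the given release, leaving deployed and
// superseded history intact. It assumes helm's default secrets storage driver.
func deletePendingReleaseSecrets(ctx context.Context, cs kubernetes.Interface, namespace, releaseName string) error {
	for _, status := range []string{"pending-install", "pending-upgrade", "pending-rollback"} {
		selector := fmt.Sprintf("owner=helm,name=%s,status=%s", releaseName, status)
		if err := cs.CoreV1().Secrets(namespace).DeleteCollection(
			ctx,
			metav1.DeleteOptions{},
			metav1.ListOptions{LabelSelector: selector},
		); err != nil {
			return fmt.Errorf("deleting %s release secrets for %q: %w", status, releaseName, err)
		}
	}
	return nil
}
```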
Reviewer Checklist