
🐛 Handle interrupted helm releases in applier #1776

Open
wants to merge 1 commit into base: main from handle-helm-interrupt

Conversation

@azych (Contributor) commented Feb 14, 2025

Description

Introduces a workaround for 'interrupted' helm releases which enter a 'pending' (-install/uninstall/rollback) state.
If that happens, for example because of an immediate application exit while one of those operations is in flight, helm is not able to resolve it automatically, which means we end up in a permanent reconcile error state.

One of the workarounds for this that has been repeated in the community is to remove metadata of the pending release,
which is the solution chosen here.

For full context, see the discussion in this PR and:
helm/helm#5595
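For illustration, here is a minimal, hedged sketch of the metadata-removal idea using helm's Go SDK; the function name and wiring are hypothetical, not the code in this PR.

```go
// Hypothetical sketch (not the PR's actual code): detect a release stuck in a
// pending state and drop that revision's metadata so the next reconcile can
// retry the interrupted operation.
package applier

import (
	"errors"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// recoverPendingRelease checks whether the latest recorded revision of the
// release is stuck in pending-install/pending-upgrade/pending-rollback and,
// if so, deletes that release record. Objects already applied to the cluster
// are left untouched.
func recoverPendingRelease(cfg *action.Configuration, releaseName string) error {
	rel, err := cfg.Releases.Last(releaseName)
	if err != nil {
		if errors.Is(err, driver.ErrReleaseNotFound) {
			return nil // nothing recorded yet, nothing to recover
		}
		return err
	}
	if !rel.Info.Status.IsPending() {
		return nil // deployed/failed releases are handled by the normal flow
	}
	_, err = cfg.Releases.Delete(rel.Name, rel.Version)
	return err
}
```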

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 14, 2025

netlify bot commented Feb 14, 2025

Deploy Preview for olmv1 ready!

Name | Link
🔨 Latest commit | 75be8c4
🔍 Latest deploy log | https://app.netlify.com/sites/olmv1/deploys/67bc702c91c3a90008024542
😎 Deploy Preview | https://deploy-preview-1776--olmv1.netlify.app

@azych (Contributor, Author) commented Feb 14, 2025

@joelanford WDYT? Are there any additional considerations with this approach?

@joelanford (Member) commented:

@azych, my main concern is whether helm allows any objects from the release manifest to be applied while the release is still in the pending phase. If the answer is "yes", I'm concerned that rolling back could mean that some workload from the "to" version may have run some migration that makes it impossible to successfully go back to the "from" version.

Maybe the better option would be to:

  1. Leave the existing half-rolled-out release manifest objects that have already been applied in place on the cluster.
  2. Directly delete the release secret(s) that say the release is pending (not helm uninstall/rollback). (We have a custom helm release storage driver that shards a release across multiple secrets if necessary to overcome etcd size limits.) A rough sketch of this step follows the list.
  3. Try again, the idea being that helm is able to essentially pick up where it left off without rolling back to the previous revision.
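A rough, hedged sketch of step 2, assuming the label scheme of helm's default Secrets storage driver (owner=helm, name=<release>, status=pending-*); the custom sharding driver mentioned above may label its secrets differently, so treat this as illustrative only.

```go
// Hypothetical sketch: delete the pending release secret(s) directly with
// client-go. Assumes helm's default Secrets-driver labels (owner=helm,
// name=<release>, status=pending-*); a custom storage driver may use a
// different scheme.
package applier

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

func deletePendingReleaseSecrets(ctx context.Context, secrets corev1client.SecretInterface, releaseName string) error {
	// Set-based selector matching only the pending revision(s) of this release.
	selector := fmt.Sprintf(
		"owner=helm,name=%s,status in (pending-install,pending-upgrade,pending-rollback)",
		releaseName,
	)
	return secrets.DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```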

@azych (Contributor, Author) commented Feb 17, 2025

@joelanford I agree that this might be a valid concern in non-installation scenarios. However, I also wonder: if we ever ended up in an interrupted state that prevented a successful rollback, couldn't the same (or a similar) state issue also prevent us from progressing with a smooth upgrade when we remove the release state info via secret deletion instead?
I honestly don't feel knowledgeable enough about helm internals to tell one way or the other at this point.

However, I followed your idea, and so far, based on some basic testing, it seems able to resolve the interruption regardless of which specific action was interrupted (installation/upgrade).

@azych azych force-pushed the handle-helm-interrupt branch 3 times, most recently from c073dfc to 6d62085 Compare February 17, 2025 14:48
@@ -220,3 +262,20 @@ func (p *postrenderer) Run(renderedManifests *bytes.Buffer) (*bytes.Buffer, erro
}
return &buf, nil
}

func (h *Helm) deleteReleaseSecrets(ctx context.Context, releaseName string) error {
return h.secretsClientGetter.Secrets(h.systemNamespace).DeleteCollection(
Member


This makes assumptions about the underlying storage driver that we should probably avoid. I'm not sure everything is plumbed through in a usable way from the ActionClientGetter, so we may need a change in helm-operator-plugins, but what would be ideal is to find a way to get access to the action.Configuration created here: https://github.com/operator-framework/helm-operator-plugins/blob/3d11ac3f0107f9553dbb3c0060a54f9684c95b2d/pkg/client/actionclient.go#L167

And then use its .Releases.Delete() to delete the underlying release.

Contributor Author


I agree that this would be preferable and less error-prone: even though we control the storage driver that we initialize the ActionConfigGetter with, we'd otherwise have to track any changes to it in two places if they ever happen.

As for getting access to action.Configuration, I think we should avoid exposing it via the actionClient as a whole; it's currently private in both helm-operator-plugins and helm, which keeps it immutable and predictable in the context of ActionClientFor and all the actual actions.
Because the helm-operator-plugins action client already handles Release objects and has an Uninstall, adding something like DeleteMetadata(releaseName) that calls .Releases.Delete() would make the most sense to me.
WDYT?

Sidenote:
Theoretically, we also have access to the ActionConfigGetter, which we initialize in main and then use to initialize the actionClient, but I don't think it's a viable option here because it would also mean making assumptions that we do not use, for example, the Memory driver, in which case our storages would be different each time we call ActionConfigFor, and that is the first call ActionClientFor makes.

Member


something like DeleteMetadata(releaseName) that calls .Releases.Delete()

I kinda feel like that particular interface method would put us on a slippery slope of stuffing more and more into the ActionClient interface. Maybe I'm wrong and this will be the last one, but we did also recently add a History method.

I'm kinda tempted to put a Config() *action.Configuration escape hatch method in instead, which would definitely prevent the slippery slope. But that does somewhat have the downside that it returns a mutable action.Configuration.

But! If you trace ActionClientGetter -> ActionConfigGetter -> ActionConfigFor, you'll see that we build a new action.Configuration struct during every reconcile anyway. So yes, we would return a mutable action.Configuration, but we throw that one away when we are finished with reconcile.
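For reference, a rough sketch of what such an escape hatch could look like on the helm-operator-plugins side; the surrounding interface shape is assumed here for illustration, not quoted from the library.

```go
// Hypothetical sketch of the Config() escape hatch. The interface below is a
// stand-in for helm-operator-plugins' action client interface, reduced to the
// relevant addition; names are illustrative.
package client

import "helm.sh/helm/v3/pkg/action"

type ActionInterface interface {
	// ...existing Get/Install/Upgrade/Uninstall/Reconcile/History methods...

	// Config returns the action.Configuration backing this client. Because a
	// fresh configuration is built on every reconcile, handing out a mutable
	// pointer does not leak state across reconciles.
	Config() *action.Configuration
}
```

A caller that needs to clean up a pending release could then reach the release store via the returned configuration's Releases field without the applier knowing which storage driver is in use.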

@azych (Contributor, Author) commented Feb 19, 2025


To me this still has the drawback of exposing a whole mutable config, which comes from the fact that we'll be making it available via a library method. While I agree it doesn't really matter for us, it might for other users of this lib.

Having said that, I'll go with it. At this point every option has some drawback(s); I don't think we are going to find a perfect solution, and this probably comes as close to it as possible.

@joelanford (Member) commented:

Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?

I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:

  • lack of permissions (something for the admin to fix)
  • admission errors (generally not something OLMv1 can fix, but perhaps our current lack of using helm upgrade --force is an area where our behavior might affect the ability to move forward)
  • (not a concern yet, since we don't implement health checking yet, but...) issues where we successfully apply changes, but then the resulting resources don't actually "work".

@azych (Contributor, Author) commented Feb 18, 2025

Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?

I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:

True, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it, there is a chance the same broken state would also prevent moving forward when we use a different workaround to deal with the post-interruption situation. I'd differentiate between a 'broken state' caused by an interruption and a 'failed' state, which can definitely happen.

@joelanford (Member) commented:

true, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to rollback because of it

I see what you mean. My unsubstantiated hunch is that failures during upgrades will largely fall into two buckets:

  1. Failure because everything was going smoothly and the process was interrupted (e.g. context was cancelled, connection with API server was lost, etc.).
  2. Failure because of anything else

IMO, (1) isn't really a failure. It's an indefinite pause to the progression of the upgrade that the helm client doesn't deal with well. I'm primarily concerned with recovering gracefully where we are able to simply pick up where we left off, and what I'm hoping is that there's a way to convince helm to do that.

My original proposal of "if the most recent release secret is pending, delete it" would rely on helm's adoption logic. What I would envision happening is that the next time through (after deleting the latest pending release), we would:

  1. See the "installed" release
  2. Do a dry-run upgrade, and detect a diff because the previous upgrade attempt did not complete its updates/patches
  3. Start a real upgrade
    • Any objects with the same name/namespace in existing/upgrade releases that we applied successfully during the first attempt would likely see a no-op update/patch
    • Any objects that are new in the upgrade release that we applied successfully during the first attempt would have the correct labels that allow helm to "adopt" the object into the release (the adoption metadata is sketched after this list)
    • Any objects that we didn't get to before the interruption would be created/updated as usual.
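For context on the adoption point above, helm 3 adopts an existing object into a release only when it carries the standard release metadata; roughly, based on helm's documented convention and shown here as Go literals for illustration:

```go
// The metadata helm 3 checks before "adopting" an existing object into a
// release. Objects applied during the interrupted attempt already carry these,
// which is what the resume-and-adopt flow described above relies on.
// Values in angle brackets are placeholders.
package applier

var helmAdoptionMetadata = struct {
	Labels      map[string]string
	Annotations map[string]string
}{
	Labels: map[string]string{
		"app.kubernetes.io/managed-by": "Helm",
	},
	Annotations: map[string]string{
		"meta.helm.sh/release-name":      "<release name>",
		"meta.helm.sh/release-namespace": "<release namespace>",
	},
}
```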

azych added a commit to azych/helm-operator-plugins that referenced this pull request Feb 19, 2025
This exposes action.Configuration of a specific
actionClient instance.

General motivation behind it is for the users to have
an 'escape hatch' access to the underlying dependency
objects that action.Configuration holds, which include
for example:
- release records store
- registry client
- Kubernetes API client

More specific motivation comes from a scenario where
an access to the underlying release store instance
is needed in order to gracefully handle pending helm
releases which were previously interrupted.
For additional context on this, see:
operator-framework/operator-controller#1776

Signed-off-by: Artur Zych <[email protected]>
@joelanford (Member) commented:

One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Another thing that came to mind (going back to the assumption I was talking about in the discussion): if we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle?

Consider this scenario:

  • v1 is "deployed"
  • v2 upgrade started, but didn't finish
  • v3 is now available in the catalog, there is an upgrade edge from v1, but not v2

Since we partially rolled out v2 (or maybe we fully rolled it out, but our network connection failed right as we were about to set the release secret to "deployed"), it seems like we should not proceed to v3 because we have to assume that whatever it is that makes v2 not upgradeable to v3 was included in the partial rollout.

What if the scenario changes where v3 upgrades from both v1 and v2? Would it be safe then to "resume" the upgrade but instead go to v3? I feel like there are scenarios there where the answer is still no.

For example: suppose there are two datasets managed by the operator's manifests, where:

  1. v1 has two workloads that individually migrate their respective datasets to level 1
  2. v2 has two workloads that individually migrate their respective datasets to level 2
  3. v3 expects the datasets to be at the same migration level, either both datasets at level 1 or both datasets at level 2, but not a mix.

If v2 partially rolls out such that one dataset gets migrated, but the other doesn't, then we can't migrate to v3. It seems reasonable that a v2 -> v3 upgrade edge has the implicit pre-condition that v2 is healthy (by our best possible definition of healthy).

Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.
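As a rough illustration of that "resume from the bookkeeping" idea (a hypothetical helper, not code proposed in this PR): the pending release record carries the rendered chart and values, so the decision could look roughly like this.

```go
// Hypothetical sketch: if the most recent release record is pending, resume
// that exact revision (its record stores the rendered chart and the values it
// was applied with) instead of jumping to a newly resolved bundle.
package applier

import (
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/release"
)

func desiredChartFor(latest *release.Release, resolvedChart *chart.Chart, resolvedValues map[string]interface{}) (*chart.Chart, map[string]interface{}) {
	if latest != nil && latest.Info.Status.IsPending() {
		// Finish the interrupted upgrade (e.g. to v2) before considering v3.
		return latest.Chart, latest.Config
	}
	return resolvedChart, resolvedValues
}
```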

Thoughts?

@azych (Contributor, Author) commented Feb 20, 2025

One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Looking at what actually happens within helm/pkg/action/upgrade.go:Upgrade(): once the in-flight release is created in storage with a 'pending' state (end of performUpgrade()) and the actual upgrade logic starts, if there is an error at any step, that 'pending' state will be replaced by failRelease() and the error will also be returned to the caller.

So while resuming the interrupted release does not mean it will succeed (or that it has already succeeded), it should mean that it did not fail up to the point of interruption. Theoretically, though, the interruption could still happen just before the 'pending' state is replaced and recorded, in which case a failure will be 'masked' by the 'pending' state.
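A small, hedged sketch of the heuristic this implies (helper names are illustrative, not code from this PR):

```go
// Hypothetical heuristic: a release still marked pending-* most likely means
// the operation was interrupted (helm rewrites the status to "failed" via
// failRelease when an error is actually observed). As noted above, an
// interruption right before that rewrite can still mask a real failure.
package applier

import "helm.sh/helm/v3/pkg/release"

func isResumableInterruption(rel *release.Release) bool {
	return rel != nil && rel.Info.Status.IsPending()
}

func isRecordedFailure(rel *release.Release) bool {
	return rel != nil && rel.Info.Status == release.StatusFailed
}
```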

Another thing that came to mind (going back to the assumption I was talking about in the discussion). If we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle?
...
Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.

Theoretically, if we can (and want to) detect those new upgrade paths, I could also see attempting a rollback as a valid option. That way, if there were upgrade path changes, we can make a fresh start that takes those into consideration. If not, and assuming a rollback will be successful and is properly set up (not sure how concerned we should be about whether it is), we simply retry.

Thoughts?

My general thoughts about this go back to the "available" workarounds for this issue that are present in the community (e.g. those threads I linked to in the doc comment). What's repeatedly mentioned there is:

  • uninstall/rollback (my initial approach)
  • pending release secret deletion
  • manually editing the 'pending' status to 'failure' or 'success' (if we can detect it), which some people also bring up

At this point I don't really have a 'clear winner' that I could say is the absolute best way to approach this. All of those are really 'hacks' (maybe uninstall/rollback less so) that just work around the problem, and the more we discuss it, the more scenarios we seem to come up with where each of those workarounds might not play out perfectly and we might not be able to resolve the situation to a 'sane' state.

My question would be: do we really need to ensure that? Should we dig deeper and hope we find a bulletproof solution (which I don't think exists, given that the main problem isn't really on our side)? Or, given the effort so far, should we just consider picking one of those "established" ideas to provide a "best effort" workaround on our side?
To be fair, I thought we kind of did that already with secret deletion #1776 (comment)

BTW, there has been some activity around resolving this on the helm side (either in 3.x or 4); see: helm/helm#11863 and helm/community#354

azych added a commit to azych/helm-operator-plugins that referenced this pull request Feb 20, 2025
github-merge-queue bot pushed a commit to operator-framework/helm-operator-plugins that referenced this pull request Feb 22, 2025
azych added a commit to azych/operator-controller that referenced this pull request Feb 24, 2025
Introduces a workaround for 'interrupted' helm releases
which enter into a 'pending' (-install/uninstall/rollback) state.
If that happens, for example because of immediate application
exit with one of those operations being in flight, helm is not
able to resolve it automatically which means we end up in
a permanent reconcile error state.
One of the workarounds for this that has been repeated in the
community is to remove metadata of the pending release,
which is the solution chosen here.

For full context see:
operator-framework#1776
helm/helm#5595
@azych azych force-pushed the handle-helm-interrupt branch from 6d62085 to 7a31a06 Compare February 24, 2025 12:48
@azych azych changed the title [Draft] Handle interrupted helm releases in applier 🐛 Handle interrupted helm releases in applier Feb 24, 2025
@azych azych marked this pull request as ready for review February 24, 2025 12:50
@azych azych requested a review from a team as a code owner February 24, 2025 12:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2025
azych added a commit to azych/operator-controller that referenced this pull request Feb 24, 2025
@azych azych force-pushed the handle-helm-interrupt branch from 7a31a06 to 44b5e31 Compare February 24, 2025 12:56
@azych azych force-pushed the handle-helm-interrupt branch from 44b5e31 to 75be8c4 Compare February 24, 2025 13:12

codecov bot commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 6 lines in your changes missing coverage. Please review.

Project coverage is 68.37%. Comparing base (9b888eb) to head (75be8c4).

Files with missing lines | Patch % | Lines
cmd/operator-controller/main.go | 25.00% | 2 Missing and 1 partial ⚠️
internal/operator-controller/applier/helm.go | 90.62% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1776      +/-   ##
==========================================
+ Coverage   68.34%   68.37%   +0.03%     
==========================================
  Files          63       63              
  Lines        5117     5145      +28     
==========================================
+ Hits         3497     3518      +21     
- Misses       1390     1395       +5     
- Partials      230      232       +2     
Flag | Coverage Δ
e2e | 51.74% <69.44%> (+0.03%) ⬆️
unit | 56.09% <80.55%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.
