
🐛 Handle interrupted helm releases in applier #1776

Open
wants to merge 1 commit into base: main from handle-helm-interrupt

Conversation

@azych (Contributor) commented Feb 14, 2025

Description

Introduces a workaround for 'interrupted' helm releases which enter a 'pending' (-install/uninstall/rollback) state.
If that happens, for example because of an immediate application exit while one of those operations is in flight, helm is not able to resolve it automatically, which means we end up in a permanent reconcile error state.

One of the workarounds for this that has been repeated in the community is to remove metadata of the pending release,
which is the solution chosen here.

For full context, see the discussion in this PR and:
helm/helm#5595
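For illustration, here is a minimal, hedged sketch of the metadata-removal idea using helm's Go SDK; the function name and wiring are hypothetical, not the code in this PR.

```go
// Hypothetical sketch (not the PR's actual code): detect a release stuck in a
// pending state and drop that revision's metadata so the next reconcile can
// retry the interrupted operation.
package applier

import (
	"errors"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// recoverPendingRelease checks whether the latest recorded revision of the
// release is stuck in pending-install/pending-upgrade/pending-rollback and,
// if so, deletes that release record. Objects already applied to the cluster
// are left untouched.
func recoverPendingRelease(cfg *action.Configuration, releaseName string) error {
	rel, err := cfg.Releases.Last(releaseName)
	if err != nil {
		if errors.Is(err, driver.ErrReleaseNotFound) {
			return nil // nothing recorded yet, nothing to recover
		}
		return err
	}
	if !rel.Info.Status.IsPending() {
		return nil // deployed/failed releases are handled by the normal flow
	}
	_, err = cfg.Releases.Delete(rel.Name, rel.Version)
	return err
}
```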

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 14, 2025

netlify bot commented Feb 14, 2025

Deploy Preview for olmv1 ready!

Name | Link
🔨 Latest commit | 75be8c4
🔍 Latest deploy log | https://app.netlify.com/sites/olmv1/deploys/67bc702c91c3a90008024542
😎 Deploy Preview | https://deploy-preview-1776--olmv1.netlify.app

@azych (Contributor, Author) commented Feb 14, 2025

@joelanford WDYT? Are there any additional considerations with this approach?

@joelanford (Member) commented:

@azych, my main concern is whether helm allows any objects from the release manifest to be applied while the release is still in the pending phase. If the answer is "yes", I'm concerned that rolling back could mean that some workload from the "to" version may have run some migration that makes it impossible to successfully go back to the "from" version.

Maybe the better option would be to:

  1. Leave the existing half-rolled-out release manifest objects that have already been applied in place on the cluster.
  2. Directly delete the release secret(s) that say the release is pending (not helm uninstall/rollback). (We have a custom helm release storage driver that shards a release across multiple secrets if necessary to overcome etcd size limits.) A rough sketch of this step follows the list.
  3. Try again, the idea being that helm is able to essentially pick up where it left off without rolling back to the previous revision.
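A rough, hedged sketch of step 2, assuming the label scheme of helm's default Secrets storage driver (owner=helm, name=<release>, status=pending-*); the custom sharding driver mentioned above may label its secrets differently, so treat this as illustrative only.

```go
// Hypothetical sketch: delete the pending release secret(s) directly with
// client-go. Assumes helm's default Secrets-driver labels (owner=helm,
// name=<release>, status=pending-*); a custom storage driver may use a
// different scheme.
package applier

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

func deletePendingReleaseSecrets(ctx context.Context, secrets corev1client.SecretInterface, releaseName string) error {
	// Set-based selector matching only the pending revision(s) of this release.
	selector := fmt.Sprintf(
		"owner=helm,name=%s,status in (pending-install,pending-upgrade,pending-rollback)",
		releaseName,
	)
	return secrets.DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```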

@azych (Contributor, Author) commented Feb 17, 2025

@joelanford I agree that this might be a valid concern in non-installation scenarios. However, I also wonder: if we ever ended up in an interrupted state that prevented a successful rollback, couldn't the same (or a similar) state issue also prevent us from progressing with a smooth upgrade when we remove the release state info via secret deletion instead?
I honestly don't feel knowledgeable enough about helm internals to tell one way or the other at this point.

However, I followed your idea, and so far, based on some basic testing, it seems able to resolve the interruption regardless of which specific action was interrupted (installation/upgrade).

@azych azych force-pushed the handle-helm-interrupt branch 3 times, most recently from c073dfc to 6d62085 Compare February 17, 2025 14:48
@@ -220,3 +262,20 @@ func (p *postrenderer) Run(renderedManifests *bytes.Buffer) (*bytes.Buffer, erro
}
return &buf, nil
}

func (h *Helm) deleteReleaseSecrets(ctx context.Context, releaseName string) error {
return h.secretsClientGetter.Secrets(h.systemNamespace).DeleteCollection(
Member


This makes assumptions about the underlying storage driver that we should probably avoid. I'm not sure everything is plumbed through in a usable way from the ActionClientGetter, so we may need a change in helm-operator-plugins, but what would be ideal is to find a way to get access to the action.Configuration created here: https://github.com/operator-framework/helm-operator-plugins/blob/3d11ac3f0107f9553dbb3c0060a54f9684c95b2d/pkg/client/actionclient.go#L167

And then use its .Releases.Delete() to delete the underlying release.

Contributor Author


I agree that this would be preferable and less error-prone: even though we control the storage driver that we initialize the ActionConfigGetter with, we'd otherwise have to track any changes to it in two places if they ever happen.

As for getting access to action.Configuration, I think we should avoid exposing it via the actionClient as a whole; it's currently private in both helm-operator-plugins and helm, which keeps it immutable and predictable in the context of ActionClientFor and all the actual actions.
Because the helm-operator-plugins action client already handles Release objects and has an Uninstall, adding something like DeleteMetadata(releaseName) that calls .Releases.Delete() would make the most sense to me.
WDYT?

Sidenote:
Theoretically, we also have access to the ActionConfigGetter, which we initialize in main and then use to initialize the actionClient, but I don't think it's a viable option here because it would also mean making assumptions that we do not use, for example, the Memory driver, in which case our storages would be different each time we call ActionConfigFor, and that is the first call ActionClientFor makes.

Member


something like DeleteMetadata(releaseName) that calls .Releases.Delete()

I kinda feel like that particular interface method would put us on a slippery slope of stuffing more and more into the ActionClient interface. Maybe I'm wrong and this will be the last one, but we did also recently add a History method.

I'm kinda tempted to put a Config() *action.Configuration escape hatch method in instead, which would definitely prevent the slippery slope. But that does somewhat have the downside that it returns a mutable action.Configuration.

But! If you trace ActionClientGetter -> ActionConfigGetter -> ActionConfigFor, you'll see that we build a new action.Configuration struct during every reconcile anyway. So yes, we would return a mutable action.Configuration, but we throw that one away when we are finished with reconcile.
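For reference, a rough sketch of what such an escape hatch could look like on the helm-operator-plugins side; the surrounding interface shape is assumed here for illustration, not quoted from the library.

```go
// Hypothetical sketch of the Config() escape hatch. The interface below is a
// stand-in for helm-operator-plugins' action client interface, reduced to the
// relevant addition; names are illustrative.
package client

import "helm.sh/helm/v3/pkg/action"

type ActionInterface interface {
	// ...existing Get/Install/Upgrade/Uninstall/Reconcile/History methods...

	// Config returns the action.Configuration backing this client. Because a
	// fresh configuration is built on every reconcile, handing out a mutable
	// pointer does not leak state across reconciles.
	Config() *action.Configuration
}
```

A caller that needs to clean up a pending release could then reach the release store via the returned configuration's Releases field without the applier knowing which storage driver is in use.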

@azych (Contributor, Author) commented Feb 19, 2025


To me this still has the drawback of exposing a whole mutable config, which comes from the fact that we'll be making it available via a library method. While I agree it doesn't really matter for us, it might for other users of this lib.

Having said that, I'll go with it. At this point every option has some drawback(s); I don't think we are going to find a perfect solution, and this probably comes as close to it as possible.

@joelanford (Member) commented:

Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?

I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:

  • lack of permissions (something for the admin to fix)
  • admission errors (generally not something OLMv1 can fix, but perhaps our current lack of using helm upgrade --force is an area where our behavior might affect the ability to move forward)
  • (not a concern yet, since we don't implement health checking yet, but...) issues where we successfully apply changes, but then the resulting resources don't actually "work".

@azych (Contributor, Author) commented Feb 18, 2025

Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?

I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:

True, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it, there is a chance the same broken state would also prevent moving forward when we use a different workaround to deal with the post-interruption situation. I'd differentiate between a 'broken state' caused by an interruption and a 'failed' state, which can definitely happen.

@joelanford (Member) commented:

true, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to rollback because of it

I see what you mean. My unsubstantiated hunch is that failures during upgrades will largely fall into two buckets:

  1. Failure because everything was going smoothly and the process was interrupted (e.g. context was cancelled, connection with API server was lost, etc.).
  2. Failure because of anything else

IMO, (1) isn't really a failure. It's an indefinite pause to the progression of the upgrade that the helm client doesn't deal with well. I'm primarily concerned with recovering gracefully where we are able to simply pick up where we left off, and what I'm hoping is that there's a way to convince helm to do that.

My original proposal of "if the most recent release secret is pending, delete it" would rely on helm's adoption logic. What I would envision happening is that the next time through (after deleting the latest pending release), we would:

  1. See the "installed" release
  2. Do a dry-run upgrade, and detect a diff because the previous upgrade attempt did not complete its updates/patches
  3. Start a real upgrade
    • Any objects with the same name/namespace in existing/upgrade releases that we applied successfully during the first attempt would likely see a no-op update/patch
    • Any objects that are new in the upgrade release that we applied successfully during the first attempt would have the correct labels that allow helm to "adopt" the object into the release (the adoption metadata is sketched after this list)
    • Any objects that we didn't get to before the interruption would be created/updated as usual.
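For context on the adoption point above, helm 3 adopts an existing object into a release only when it carries the standard release metadata; roughly, based on helm's documented convention and shown here as Go literals for illustration:

```go
// The metadata helm 3 checks before "adopting" an existing object into a
// release. Objects applied during the interrupted attempt already carry these,
// which is what the resume-and-adopt flow described above relies on.
// Values in angle brackets are placeholders.
package applier

var helmAdoptionMetadata = struct {
	Labels      map[string]string
	Annotations map[string]string
}{
	Labels: map[string]string{
		"app.kubernetes.io/managed-by": "Helm",
	},
	Annotations: map[string]string{
		"meta.helm.sh/release-name":      "<release name>",
		"meta.helm.sh/release-namespace": "<release namespace>",
	},
}
```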

azych added a commit to azych/helm-operator-plugins that referenced this pull request Feb 19, 2025
This exposes action.Configuration of a specific
actionClient instance.

General motivation behind it is for the users to have
an 'escape hatch' access to the underlying dependency
objects that action.Configuration holds, which include
for example:
- release records store
- registry client
- Kubernetes API client

More specific motivation comes from a scenario where
an access to the underlying release store instance
is needed in order to gracefully handle pending helm
releases which were previously interrupted.
For additional context on this, see:
operator-framework/operator-controller#1776

Signed-off-by: Artur Zych <[email protected]>
@joelanford (Member) commented:

One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Another thing that came to mind (going back to the assumption I was talking about in the discussion): if we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle?

Consider this scenario:

  • v1 is "deployed"
  • v2 upgrade started, but didn't finish
  • v3 is now available in the catalog, there is an upgrade edge from v1, but not v2

Since we partially rolled out v2 (or maybe we fully rolled it out, but our network connection failed right as we were about to set the release secret to "deployed"), it seems like we should not proceed to v3 because we have to assume that whatever it is that makes v2 not upgradeable to v3 was included in the partial rollout.

What if the scenario changes where v3 upgrades from both v1 and v2? Would it be safe then to "resume" the upgrade but instead go to v3? I feel like there are scenarios there where the answer is still no.

For example: suppose there are two datasets managed by the operator's manifests, where:

  1. v1 has two workloads that individually migrate their respective datasets to level 1
  2. v2 has two workloads that individually migrate their respective datasets to level 2
  3. v3 expects the datasets to be at the same migration level, either both datasets at level 1 or both datasets at level 2, but not a mix.

If v2 partially rolls out such that one dataset gets migrated, but the other doesn't, then we can't migrate to v3. It seems reasonable that a v2 -> v3 upgrade edge has the implicit pre-condition that v2 is healthy (by our best possible definition of healthy).

Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.
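As a rough illustration of that "resume from the bookkeeping" idea (a hypothetical helper, not code proposed in this PR): the pending release record carries the rendered chart and values, so the decision could look roughly like this.

```go
// Hypothetical sketch: if the most recent release record is pending, resume
// that exact revision (its record stores the rendered chart and the values it
// was applied with) instead of jumping to a newly resolved bundle.
package applier

import (
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/release"
)

func desiredChartFor(latest *release.Release, resolvedChart *chart.Chart, resolvedValues map[string]interface{}) (*chart.Chart, map[string]interface{}) {
	if latest != nil && latest.Info.Status.IsPending() {
		// Finish the interrupted upgrade (e.g. to v2) before considering v3.
		return latest.Chart, latest.Config
	}
	return resolvedChart, resolvedValues
}
```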

Thoughts?

@azych (Contributor, Author) commented Feb 20, 2025

One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Looking at what actually happens within helm/pkg/action/upgrade.go:Upgrade(): once the in-flight release is created in storage with a 'pending' state (end of performUpgrade()) and the actual upgrade logic starts, if there is an error at any step, that 'pending' state will be replaced by failRelease() and the error will also be returned to the caller.

So while resuming the interrupted release does not mean it will succeed (or that it has already succeeded), it should mean that it did not fail up to the point of interruption. Theoretically, though, the interruption could still happen just before the 'pending' state is replaced and recorded, in which case a failure will be 'masked' by the 'pending' state.
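A small, hedged sketch of the heuristic this implies (helper names are illustrative, not code from this PR):

```go
// Hypothetical heuristic: a release still marked pending-* most likely means
// the operation was interrupted (helm rewrites the status to "failed" via
// failRelease when an error is actually observed). As noted above, an
// interruption right before that rewrite can still mask a real failure.
package applier

import "helm.sh/helm/v3/pkg/release"

func isResumableInterruption(rel *release.Release) bool {
	return rel != nil && rel.Info.Status.IsPending()
}

func isRecordedFailure(rel *release.Release) bool {
	return rel != nil && rel.Info.Status == release.StatusFailed
}
```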

Another thing that came to mind (going back to the assumption I was talking about in the discussion). If we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle?
...
Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.

Theoretically, if we can (and want to) detect those new upgrade paths, I could also see attempting a rollback as a valid option. That way, if there were upgrade path changes, we can make a fresh start that takes those into consideration. If not, and assuming a rollback will be successful and is properly set up (not sure how concerned we should be about whether it is), we simply retry.

Thoughts?

My general thoughts about this go back to the "available" workarounds for this issue that are present in the community (e.g. those threads I linked to in the doc comment). What's repeatedly mentioned there is:

  • uninstall/rollback (my initial approach)
  • pending release secret deletion
  • manually editing the 'pending' status to 'failure' or 'success' (if we can detect it), which some people also bring up

At this point I don't really have a 'clear winner' that I could say is the absolute best way to approach this. All of those are really 'hacks' (maybe uninstall/rollback less so) that just work around the problem, and the more we discuss it, the more scenarios we seem to come up with where each of those workarounds might not play out perfectly and we might not be able to resolve the situation to a 'sane' state.

My question would be: do we really need to ensure that? Should we dig deeper and hope we find a bulletproof solution (which I don't think exists, given that the main problem isn't really on our side)? Or, given the effort so far, should we just consider picking one of those "established" ideas to provide a "best effort" workaround on our side?
To be fair, I thought we kind of did that already with secret deletion #1776 (comment)

BTW, there has been some activity around resolving this on the helm side (either in 3.x or 4); see: helm/helm#11863 and helm/community#354

azych added a commit to azych/helm-operator-plugins that referenced this pull request Feb 20, 2025
github-merge-queue bot pushed a commit to operator-framework/helm-operator-plugins that referenced this pull request Feb 22, 2025
azych added a commit to azych/operator-controller that referenced this pull request Feb 24, 2025
Introduces a workaround for 'interrupted' helm releases
which enter into a 'pending' (-install/uninstall/rollback) state.
If that happens, for example because of immediate application
exit with one of those operations being in flight, helm is not
able to resolve it automatically which means we end up in
a permanent reconcile error state.
One of the workarounds for this that has been repeated in the
community is to remove metadata of the pending release,
which is the solution chosen here.

For full context see:
operator-framework#1776
helm/helm#5595
@azych azych force-pushed the handle-helm-interrupt branch from 6d62085 to 7a31a06 Compare February 24, 2025 12:48
@azych azych changed the title [Draft] Handle interrupted helm releases in applier 🐛 Handle interrupted helm releases in applier Feb 24, 2025
@azych azych marked this pull request as ready for review February 24, 2025 12:50
@azych azych requested a review from a team as a code owner February 24, 2025 12:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2025
azych added a commit to azych/operator-controller that referenced this pull request Feb 24, 2025
@azych azych force-pushed the handle-helm-interrupt branch from 7a31a06 to 44b5e31 Compare February 24, 2025 12:56
@azych azych force-pushed the handle-helm-interrupt branch from 44b5e31 to 75be8c4 Compare February 24, 2025 13:12

codecov bot commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 6 lines in your changes missing coverage. Please review.

Project coverage is 68.37%. Comparing base (9b888eb) to head (75be8c4).

Files with missing lines | Patch % | Lines
cmd/operator-controller/main.go | 25.00% | 2 Missing and 1 partial ⚠️
internal/operator-controller/applier/helm.go | 90.62% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1776      +/-   ##
==========================================
+ Coverage   68.34%   68.37%   +0.03%     
==========================================
  Files          63       63              
  Lines        5117     5145      +28     
==========================================
+ Hits         3497     3518      +21     
- Misses       1390     1395       +5     
- Partials      230      232       +2     
Flag | Coverage Δ
e2e | 51.74% <69.44%> (+0.03%) ⬆️
unit | 56.09% <80.55%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.
