@camilamacedo86 camilamacedo86 commented Sep 3, 2025

Problem

We use the crdify library (https://github.com/kubernetes-sigs/crdify), referred to below as crddiff, to check API differences and ensure CRD upgrade safety. See: CRD upgrade safety docs

When crddiff cannot handle a diff, it reports an unhandled scenario and returns the full JSON diff, which causes the problem:

  • The JSON is too large to store in the CR status.
  • The status update fails with the error: Too long: may not be more than 32768 bytes

As a result:

  • ClusterExtension is not upgraded.
  • Users see no information about why the upgrade failed.
  • The only details are hidden in confusing logs in the operator-controller deployment, which are not meaningful or helpful.

More info: OCPBUGS-59518

Solution

Instead of outputting the full JSON, we:

  • Extract only the meaningful information.
  • Return a short, clear message that explains the unhandled diff.

This allows:

  • OLMv1 to update the status conditions.
  • Users to understand why the CE upgrade could not happen.

Example (from unit tests):

require.ErrorContains(t, err, "unhandled changes found")  
require.ErrorContains(t, err, "Format \"\" -> \"email\"")  

This way:

  • Users get actionable feedback.
  • Status remains within size limits.
  • The CE status will be updated as expected.
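
A short message like the one asserted in the unit test above could be produced along the following lines. This is only a sketch under stated assumptions: `summarizeChanges` and the flat `map[string]string` view of a schema property are hypothetical, and the actual change works on the library's own diff output rather than on maps like these.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarizeChanges compares two flat views of a schema property and returns
// one short `Field "old" -> "new"` entry per changed field, sorted by field
// name. Hypothetical helper, not the implementation from this PR.
func summarizeChanges(oldProps, newProps map[string]string) string {
	// Collect the union of field names from both versions.
	fields := make(map[string]struct{})
	for k := range oldProps {
		fields[k] = struct{}{}
	}
	for k := range newProps {
		fields[k] = struct{}{}
	}

	names := make([]string, 0, len(fields))
	for k := range fields {
		names = append(names, k)
	}
	sort.Strings(names) // deterministic output

	var changes []string
	for _, name := range names {
		if oldProps[name] != newProps[name] {
			changes = append(changes, fmt.Sprintf("%s %q -> %q", name, oldProps[name], newProps[name]))
		}
	}
	return strings.Join(changes, "; ")
}

func main() {
	oldProps := map[string]string{"Type": "string"}
	newProps := map[string]string{"Type": "string", "Format": "email"}

	err := fmt.Errorf("unhandled changes found: %s", summarizeChanges(oldProps, newProps))
	fmt.Println(err) // unhandled changes found: Format "" -> "email"
}
```

Reporting only the fields that differ keeps the message short enough for a status condition while still telling the user what changed.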

Note

As discussed, any further improvement beyond this would need to be done directly in the crddiff library, to reduce the number of unhandled scenarios.

Reviewer Checklist

  • [N/A] API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • [N/A] Links to related GitHub Issue(s)

@camilamacedo86 camilamacedo86 requested a review from a team as a code owner September 3, 2025 07:41
netlify bot commented Sep 3, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 44cef57
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/68cc00e5e8bdef00086af15e
😎 Deploy Preview https://deploy-preview-2179--olmv1.netlify.app
@openshift-ci openshift-ci bot requested review from bentito and OchiengEd September 3, 2025 07:41
openshift-ci bot commented Sep 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thetechnick for approval. For more information see the Code Review Process.

@camilamacedo86 camilamacedo86 changed the title from "(fix): unhandle changes for crd upgrade safety" to "(fix): unhandle changes for crd upgrade safety ( OCPBUGS-59518 )" Sep 3, 2025
@camilamacedo86 camilamacedo86 requested review from everettraven, joelanford and perdasilva and removed request for a team, OchiengEd and bentito September 3, 2025 07:42
@camilamacedo86 camilamacedo86 changed the title from "(fix): unhandle changes for crd upgrade safety ( OCPBUGS-59518 )" to "🐛 (fix): unhandle changes for crd upgrade safety" Sep 3, 2025
@camilamacedo86 camilamacedo86 changed the title from "🐛 (fix): unhandle changes for crd upgrade safety" to "🐛 (fix): unhandle changes for crd upgrade safety ( OCPBUGS-59518 )" Sep 3, 2025
camilamacedo86 commented Sep 3, 2025

Hi @joelanford @perdasilva

This seems to be the real fix for OCPBUGS-59518.
I hope it helps.

Moreover, IMHO we should perform additional checks and carefully evaluate the errors in those unhandled scenarios. It might be worth verifying if a newer version of the library addresses this, or if there are other improvements we could apply. In short, we should confirm whether there’s truly nothing else we can do here to make the handling more robust.

codecov bot commented Sep 3, 2025

Codecov Report

❌ Patch coverage is 75.92593% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.31%. Comparing base (33fdce2) to head (44cef57).

Files with missing lines Patch % Lines
...ak/preflights/crdupgradesafety/crdupgradesafety.go 75.92% 17 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2179      +/-   ##
==========================================
+ Coverage   72.05%   72.31%   +0.26%     
==========================================
  Files          85       85              
  Lines        8405     8506     +101     
==========================================
+ Hits         6056     6151      +95     
  Misses       1948     1948              
- Partials      401      407       +6     
Flag Coverage Δ
e2e 38.70% <0.00%> (-0.48%) ⬇️
experimental-e2e 45.57% <0.00%> (-0.52%) ⬇️
unit 57.27% <75.92%> (+0.27%) ⬆️

if result.Name == "unhandled" {
	msg = "unhandled changes found"
}
errs = append(errs, fmt.Errorf("%s: %s: %s: %s", version, property, result.Name, msg))
Contributor

It feels like we might be occluding the error. Is it possible to format it in a better way? Then maybe at the end we limit the number of errors we print out?

Contributor Author

I am not sure if I could understand your suggestion here.
Could you please clarify?

Contributor

The main point is that, the way it is, it's kinda just an HTTP 500 without any explanation. As a user, I don't know what to do with that. Maybe this is an issue with the underlying library we are using and the fix needs to go there. But as a user, I would like to know: what, where, why, and how to fix it.

The second part of the comment can be ignored as a brain fart. I was in the context of the "we're generating too many errors" problem.

Contributor Author

As we discussed, it looks like an unexpected error. Maybe we could explore whether the library can be improved so that fewer scenarios are reported as unhandled. That said, you raise a great point: I was able to extract some information from the JSON, and I think this is the best we can do for now. What do you think?

When the error occurs, we can't update the CR with the status. This leaves the user without any visibility into why the upgrade didn't happen. Here's the bug for reference: https://issues.redhat.com/browse/OCPBUGS-59518, so this PR should fix it.
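
For context on the "what, where, why" question raised in this thread, the sketch below shows how the `version: property: name: message` format from the snippet under review composes into a single error. The `comparisonResult` type, the `describe` helper, and the sample values are stand-ins chosen for illustration; they are not the library's types or the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// comparisonResult is a hypothetical stand-in for the result type returned
// by the diff library for one validation.
type comparisonResult struct {
	Name   string   // name of the validation, e.g. "unhandled"
	Errors []string // messages reported by that validation
}

// describe builds one error per message using the same
// "version: property: name: message" layout as the snippet under review,
// then joins them into a single error (nil when there is nothing to report).
func describe(version, property string, results []comparisonResult) error {
	var errs []error
	for _, result := range results {
		for _, msg := range result.Errors {
			errs = append(errs, fmt.Errorf("%s: %s: %s: %s", version, property, result.Name, msg))
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := describe("v1", "spec.contact", []comparisonResult{
		{Name: "unhandled", Errors: []string{`unhandled changes found: Format "" -> "email"`}},
	})
	fmt.Println(err)
}
```

With this layout the version and property answer "where", and the extracted field change answers "what"; the "why" and "how to fix it" would still have to come from the library or the documentation.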

- Keep unhandled spec changes as errors; message: "unhandled changes found"

Assisted-by: Cursor
Contributor

@anik120 anik120 left a comment

@camilamacedo86 is it possible to get a

Problem :  __________________________ (eg before this PR, crd upgrade safety feature did not do _____ right)
Solution: _________________________ 

statement in the description?

Helps with reviewing the PR.

Contributor Author

@anik120 can you please check the description?

3 participants