Use patch instead of update for GroupSnapshots, VolumeSnapshots, PVCs #1019

Closed
wants to merge 5 commits into from

@kaovilai kaovilai commented Feb 22, 2024

Signed-off-by: Tiger Kaovilai [email protected]

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
This PR cleans up some of the unit test reactor code and eliminates the update calls I could find, fixing the unit tests to accommodate the changes.
Previously, update calls were required simply because the unit tests were broken: patch calls were modifying the original input, causing comparison failures.
Which issue(s) this PR fixes:

Fixes #748

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


This PR extends the work in #876 to update calls that were added later.

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 22, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kaovilai
Once this PR has been reviewed and has the lgtm label, please assign jsafrane for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 22, 2024
@k8s-ci-robot
Contributor

Hi @kaovilai. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 22, 2024
@kaovilai kaovilai marked this pull request as ready for review February 22, 2024 22:39
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 22, 2024
@kaovilai kaovilai force-pushed the removeUpdateCalls branch 2 times, most recently from 0ca6d08 to a877f42 Compare February 22, 2024 22:41
@kaovilai kaovilai marked this pull request as draft February 22, 2024 22:52
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 22, 2024
@kaovilai kaovilai force-pushed the removeUpdateCalls branch 2 times, most recently from 0c28974 to 903eb0c Compare February 23, 2024 01:39
@kaovilai kaovilai marked this pull request as ready for review February 23, 2024 08:08
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2024
@kaovilai kaovilai force-pushed the removeUpdateCalls branch 2 times, most recently from 6fe49d4 to 1261a31 Compare February 23, 2024 08:10
@kaovilai kaovilai changed the title Use patch instead of update for GroupSnapshots and VolumeSnapshots Use patch instead of update for GroupSnapshots, VolumeSnapshots, PVCs Feb 23, 2024
@kaovilai kaovilai force-pushed the removeUpdateCalls branch 3 times, most recently from 05f89ec to 06733fe Compare February 23, 2024 08:15
Signed-off-by: Tiger Kaovilai <[email protected]>

remove debugging code

Signed-off-by: Tiger Kaovilai <[email protected]>

remove more update calls

Signed-off-by: Tiger Kaovilai <[email protected]>

Fix patch json unmarshal unitTests comparison failures

Signed-off-by: Tiger Kaovilai <[email protected]>

Fix tests in reactor by not modifying original for patch

Signed-off-by: Tiger Kaovilai <[email protected]>
Comment on lines 166 to 171
	for _, i := range indexes {
		patches = append(patches, PatchOp{
			Op:   "remove",
			Path: "/metadata/finalizers/" + fmt.Sprint(i),
		})
	}
Contributor

This is racy. Consider another controller or user adding a finalizer to the object while this loop runs. Since the PATCH removes entries by index, it would remove the wrong item.

Is there a way to remove a value rather than an index via Patch()? If not, then please stick to Update().
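
To make the race concrete, here is a minimal, self-contained sketch. It is not code from this PR: `indexesOf`, `removeAt`, and the `example.com/...` finalizer names are hypothetical, and the slice operations only imitate what index-based JSON Patch `remove` ops do to `/metadata/finalizers`. The indexes are computed from the list the controller saw, then applied to a list that changed in the meantime.

```go
package main

import "fmt"

// indexesOf returns every position of target in list, in ascending order.
func indexesOf(list []string, target string) []int {
	var out []int
	for i, v := range list {
		if v == target {
			out = append(out, i)
		}
	}
	return out
}

// removeAt imitates index-based JSON Patch "remove" ops on a finalizer list.
// Indexes are applied from highest to lowest so that one removal does not
// shift the positions of the ones still pending.
func removeAt(list []string, indexes []int) []string {
	out := append([]string(nil), list...) // work on a copy
	for k := len(indexes) - 1; k >= 0; k-- {
		i := indexes[k]
		if i < 0 || i >= len(out) {
			continue // index no longer exists; the sketch simply skips it
		}
		out = append(out[:i], out[i+1:]...)
	}
	return out
}

func main() {
	const ours = "example.com/ours" // hypothetical finalizer name

	// The list the controller saw when it computed the indexes.
	stale := []string{"example.com/a", ours}
	// The list on the server when the patch lands: another actor removed
	// "example.com/a" and added "example.com/b" in the meantime.
	live := []string{ours, "example.com/b"}

	// Index 1 pointed at our finalizer in the stale list, but it points at
	// someone else's finalizer in the live list.
	fmt.Println(removeAt(live, indexesOf(stale, ours)))
}
```

In this run the controller's own finalizer survives and the other actor's finalizer is the one that disappears, which is the "wrong item" case described above.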

Contributor Author

I believe there is.
One option is to remove all finalizers and then add them all back in a single call. Let me try that.

Contributor Author

FYI, I already have it working; I'm just adding PVC patch support to the framework_test reactor.

Contributor Author

If this is still considered racy, then we will have to wait for json-patch/json-patch2#18, and I will move back to update. Let me know.
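
For reference, JSON Patch (RFC 6902) does define a `test` operation, and a patch whose `test` fails is not applied at all. A rough sketch of pairing every index-based `remove` with a `test` on the same path follows. This is an illustration under assumptions, not what this PR does: `jsonPatchOp` is a hypothetical stand-in for the PR's `PatchOp` type, `guardedRemove` is a made-up helper, and a failed `test` would still need the same retry handling as an update conflict.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// jsonPatchOp is a hypothetical stand-in for the PatchOp type used in the PR.
type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// guardedRemove builds a JSON Patch that removes name from the finalizer
// list observed by the caller. Every "remove" is preceded by a "test" on the
// same index, so the patch only applies if that slot still holds name.
func guardedRemove(observed []string, name string) ([]byte, error) {
	var idx []int
	for i, f := range observed {
		if f == name {
			idx = append(idx, i)
		}
	}
	// Highest index first: ops apply in order, and removing a lower index
	// would shift the higher ones.
	sort.Sort(sort.Reverse(sort.IntSlice(idx)))

	ops := []jsonPatchOp{}
	for _, i := range idx {
		path := fmt.Sprintf("/metadata/finalizers/%d", i)
		ops = append(ops,
			jsonPatchOp{Op: "test", Path: path, Value: name},
			jsonPatchOp{Op: "remove", Path: path},
		)
	}
	return json.Marshal(ops)
}

func main() {
	// Hypothetical finalizer names.
	patch, err := guardedRemove(
		[]string{"example.com/a", "example.com/ours"}, "example.com/ours")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

With this shape, a list that shifted underneath the controller makes the whole patch fail instead of silently removing another actor's entry.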

Contributor Author

It probably is still racy, hmm. But at least external-snapshotter won't hang. A user could find that the finalizer they added is gone, but at least this won't have the "wrong index" issue.

Another option would be to use update, but test that it can recover from the "out of date, please apply again" error.

Contributor

If I understand correctly, the current version does not fix the race. It blindly removes all finalizers and then adds back those that were known when the controller processed the VolumeSnapshot / VolumeSnapshotContent.
It will again erase any finalizers added in parallel with the controller.

> While it is a bandaid, it is better than nothing

No. You are breaking someone else's finalizers. This is not good behavior.
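
The effect of "remove all, then add back the known ones" can be sketched in a few lines. This is a simplified model, not the PR's code: `replaceFinalizers` and `without` are hypothetical helpers, and the overwrite stands in for a patch that replaces `/metadata/finalizers` wholesale.

```go
package main

import "fmt"

// replaceFinalizers imitates a patch that overwrites /metadata/finalizers
// with the list the controller computed. The live list is never consulted.
func replaceFinalizers(live, desired []string) []string {
	_ = live // deliberately unused: this is the problem being illustrated
	return append([]string(nil), desired...)
}

// without returns list with every occurrence of name dropped.
func without(list []string, name string) []string {
	out := []string{}
	for _, f := range list {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Hypothetical finalizer names. "seen" is what the controller read.
	seen := []string{"example.com/ours", "example.com/keep"}
	desired := without(seen, "example.com/ours")

	// Meanwhile another actor added its own finalizer on the server.
	live := append(append([]string(nil), seen...), "example.com/theirs")

	// "example.com/theirs" is gone after the overwrite.
	fmt.Println(replaceFinalizers(live, desired))
}
```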

I think the whole fear of the "the object has been modified" error is unjustified. It tells the controller that it has been working with stale data. The controller should check what has changed and try again if the change is still applicable, or it may discover that the work is no longer needed. In this case, the VolumeSnapshot / Content should be re-queued with exponential backoff.

> there needs to be tests that ensures update fail calls in external-snapshotter are recoverable within few seconds, not over 10 minutes which is what we have been seeing.

If that's true, then this is the part that needs to be fixed, rather than working around it by using patch blindly everywhere. Can you reliably reproduce the issue, e.g. in a unit test? It should then be easy to debug what's causing the delay.

Contributor Author

Yes, we've been hitting this issue everywhere in our CI, and other Velero users have been hitting it as well.

Contributor Author

		{
			name:             "2-4 - successful remove Snapshot finalizer after update conflict",
			initialSnapshots: newSnapshotArray("snap2-4", "snapuid2-4", "claim2-4", "", classSilver, "", &False, nil, nil, nil, false, true, nil),
			initialClaims:    newClaimArray("claim2-4", "pvc-uid2-4", "1Gi", "volume2-4", v1.ClaimBound, &classEmpty),
			test:             testRemoveSnapshotFinalizerAfterUpdateConflict,
			expectSuccess:    true,
			errors: []reactorError{
				{"update", "volumesnapshots", errors.NewConflict(crdv1.Resource("volumesnapshots"), "snap2-4", nil)},
			},
		},

Added this case here.
#1023 (review)

Contributor Author

@kaovilai kaovilai Feb 27, 2024

The requeue with new data shouldn't take more than a minute. We've seen the external-snapshotter controller stuck for 10+ minutes. Maybe the timeout/backoff in use needs changing.

Contributor Author

> reliably reproduce the issue

50% of the time prior to #876.
After #876 it's much improved, but OCP QE noted that the update calls added after that still cause issues sometimes.

@shubham-pampattiwar
Contributor

Thank you @kaovilai !!

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 26, 2024
Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 26, 2024
Signed-off-by: Tiger Kaovilai <[email protected]>
@kaovilai
Contributor Author

Closing due to other priorities, since the sense is that this isn't going to be approved as is, and to see whether #876, recently cherry-picked to our nightlies, is sufficient. Feel free to take over.

Test improvements have been moved to #1024.

@kaovilai kaovilai closed this Feb 27, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

snapshot-controller logs report failure frequently
5 participants