Ensure recoverability of backup/restore from WaitingForPluginOperations state during velero server restart #6727

Closed
anshulahuja98 opened this issue Aug 31, 2023 · 16 comments · Fixed by #7091

Comments

@anshulahuja98
Collaborator

Describe the problem/challenge you have

After the integration of BIAv2-based plugins, once a backup/restore's core flow is done, it is marked with the WaitingForPluginOperations phase, after which Velero asynchronously polls the plugin operations until they complete.

  1. We need a POC to confirm whether this polling resumes after a pod restart - i.e. whether Velero fetches the ongoing plugin operations from the object store and continues polling them (see the sketch below this list).
  2. If the above is not supported, identify the gaps in what is required to support it; given that plugin operations can now take a long time, Velero should be resilient in tracking them.
  3. Once we have clarity on the above, in the context of the CSI datamover implementation, we should ensure these assumptions are not broken and that we can recover tracking of DataUpload/DataDownload etc.
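
To make item 1 concrete, here is a minimal Go sketch of what "resume polling after a velero server restart" could look like. Every name in it (BackupPhase, ItemOperation, OperationStore, ResumeAsyncOperations, the enqueue callback) is a hypothetical stand-in, not Velero's actual API:

```go
// Illustrative only: hypothetical names, not Velero's actual API.
package sketch

import (
	"context"
	"fmt"
)

type BackupPhase string

const PhaseWaitingForPluginOperations BackupPhase = "WaitingForPluginOperations"

type Backup struct {
	Name  string
	Phase BackupPhase
}

// ItemOperation stands in for an async plugin operation persisted with the
// backup in the object store.
type ItemOperation struct {
	OperationID string
	Completed   bool
}

// OperationStore abstracts "fetch the persisted item operations for a backup".
type OperationStore interface {
	GetItemOperations(ctx context.Context, backupName string) ([]ItemOperation, error)
}

// ResumeAsyncOperations models the behavior item 1 asks to confirm: on server
// start, find backups in WaitingForPluginOperations, reload their operations
// from the object store, and hand them back to the poller instead of failing
// the backup.
func ResumeAsyncOperations(ctx context.Context, backups []Backup, store OperationStore, enqueue func(Backup, []ItemOperation)) error {
	for _, b := range backups {
		if b.Phase != PhaseWaitingForPluginOperations {
			continue // only WaitingForPluginOperations backups are recoverable this way
		}
		ops, err := store.GetItemOperations(ctx, b.Name)
		if err != nil {
			return fmt.Errorf("reload operations for backup %s: %w", b.Name, err)
		}
		enqueue(b, ops) // resume polling rather than marking the backup Failed
	}
	return nil
}
```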

Describe the solution you'd like

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@kaovilai
Contributor

Related: #6710

@anshulahuja98
Collaborator Author

In the 1.13 context, I'll drive towards completing these POCs and validating them. If the POC does not work, I will work towards identifying the action items needed here.
CC: @ywk253100, @reasonerjt

@Lyndon-Li
Contributor

@anshulahuja98 Thanks! Let me add this issue to the 1.13 milestone; if we see any problem later, we can move it out.

@Lyndon-Li Lyndon-Li added this to the v1.13 milestone Oct 11, 2023
@Lyndon-Li Lyndon-Li removed the 1.13-candidate label (issue/pr that should be considered to target v1.13 minor release) Oct 11, 2023
@sseago
Collaborator

sseago commented Oct 23, 2023

@anshulahuja98 Yes, I was going to create a bug on this today. Let me find the PR that added this -- that PR canceled data upload/download on node agent restart (which we need), but it also failed WaitingForPluginOperations backups and canceled data upload/download on velero pod restart, and we don't really want either of those.

@sseago
Collaborator

sseago commented Oct 23, 2023

@anshulahuja98 this was it: #6461
We want to keep the data upload/download controller and node agent changes, but I think we want to revert the changes to pkg/cmd/server/server.go.
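
For context, a rough sketch of the split proposed here: keep the node-agent restart handling from #6461 and drop the server-side part. The types and function below (DataUpload, CancelOrphanedUploadsOnNodeAgentRestart) are hypothetical stand-ins, not the actual code from that PR:

```go
// Illustrative only: hypothetical types, not the actual code from #6461.
package sketch

type DataUploadPhase string

const (
	DataUploadInProgress DataUploadPhase = "InProgress"
	DataUploadCanceled   DataUploadPhase = "Canceled"
)

type DataUpload struct {
	Name  string
	Node  string // node whose node-agent was running this upload
	Phase DataUploadPhase
}

// CancelOrphanedUploadsOnNodeAgentRestart models the part of #6461 the thread
// agrees to keep: when a node-agent restarts, the uploads it was running are
// orphaned and have to be canceled so they do not stay InProgress forever.
func CancelOrphanedUploadsOnNodeAgentRestart(restartedNode string, uploads []DataUpload) []DataUpload {
	for i := range uploads {
		if uploads[i].Node == restartedNode && uploads[i].Phase == DataUploadInProgress {
			uploads[i].Phase = DataUploadCanceled
		}
	}
	return uploads
}

// The server-side counterpart added by the same PR - failing
// WaitingForPluginOperations backups and canceling DataUpload/DataDownload on
// a velero pod restart - is the part this thread proposes to revert.
```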

@anshulahuja98
Collaborator Author

@sseago thanks for sharing this. I'll check these changes and raise a PR to fix this behavior.

@anshulahuja98
Collaborator Author

@qiuming-best since you made these changes - do you see any concern with us changing the behaviour so that WaitingForPluginOperations backups won't be failed on velero pod restart?

@qiuming-best
Contributor

@anshulahuja98 It's fine to revert the changes. I didn't know much about BIAv2 and RIAv2 at that time, so simply failing the backup when the velero pod restarted was too rough an approach.

If the velero pod that is doing the backup or restore doesn't restart, we really don't need to fail that backup or restore just because another velero pod restarts.

@anshulahuja98
Collaborator Author

Great, thanks for your input @qiuming-best.
I'll revert the velero server-related changes and test it out.

@Lyndon-Li
Contributor

With the current Velero server behavior, once the server restarts it marks all running backups as Failed. So besides reverting the changes that cancel the DUCR/DDCR, we also need a change to prevent the Velero server from failing those backups and to remap them to their DUCR/DDCR.
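
A minimal sketch of the "remap" idea, assuming (hypothetically) that the DUCR/DDCR belonging to a backup can be looked up by backup name, e.g. via a label; all names below are illustrative and not Velero's actual API:

```go
// Illustrative only: hypothetical names, not Velero's actual API.
package sketch

import "context"

// AsyncCR stands in for a DataUpload/DataDownload custom resource.
type AsyncCR struct {
	Name       string
	BackupName string
	Done       bool
}

// CRLister abstracts "find the DUCR/DDCR that belong to a backup",
// e.g. via a label selector on the backup name.
type CRLister interface {
	ListForBackup(ctx context.Context, backupName string) ([]AsyncCR, error)
}

// RemapBackupToAsyncCRs models the change described above: on server restart,
// instead of marking a WaitingForPluginOperations backup as Failed, find its
// still-running DUCR/DDCR and resume tracking them.
func RemapBackupToAsyncCRs(ctx context.Context, backupName string, lister CRLister, track func(AsyncCR)) error {
	crs, err := lister.ListForBackup(ctx, backupName)
	if err != nil {
		return err
	}
	for _, cr := range crs {
		if !cr.Done {
			track(cr) // resume tracking instead of failing the backup
		}
	}
	return nil
}
```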

@anshulahuja98
Collaborator Author

Yes, correct.
We will basically revert in full the changes to the file pkg/cmd/server/server.go in the PR https://github.com/vmware-tanzu/velero/pull/6461/files

That will take care of both of these things.
I hope this answers your concern.

@Lyndon-Li
Contributor

I am afraid that is not enough, because the existing Velero server code marks all of the running backup CRs as Failed.
Anyway, the first step is to revert the code as you mentioned; then you will see the remaining problems and can fix them accordingly.

@anshulahuja98
Collaborator Author

I understand now what you are saying. I will make the required changes for that as well.

@anshulahuja98
Collaborator Author

Thanks for your input @Lyndon-Li.

@anshulahuja98
Collaborator Author

I will prioritize this and try to complete it next week.

@sseago
Collaborator

sseago commented Nov 10, 2023

"the existing velero server code marks all the running backup CRs as Failed." -- for InProgress backups, this is still correct. If a backup has not progressed to WaitingForPluginOperations or Finalizing, then the only option is to fail it and start over with a new backup. Without failing InProgress backups, they will be listed as InProgress forever.
