Ensure recoverability of backup/restore from WaitingForPluginOperations state during velero server restart #6727

Closed
anshulahuja98 opened this issue Aug 31, 2023 · 16 comments · Fixed by #7091

Comments

@anshulahuja98
Collaborator

Describe the problem/challenge you have

After the integration of BIAv2-based plugins, once a backup/restore's core flow is done, it is marked with the WaitingForPluginOperations phase, after which Velero asynchronously polls the plugin operations until they complete.

  1. We need a POC to confirm whether this polling resumes after a pod restart - i.e. whether Velero fetches the ongoing plugin operations from the object store and continues polling them (see the sketch below this list).
  2. If the above is not supported, identify the gaps in what is required to support it; given that plugin operations can now take a long time, Velero should be resilient in tracking them.
  3. Once we have clarity on the above, in the context of the CSI datamover implementation, we should ensure these assumptions are not broken and that we can recover tracking of DataUpload/DataDownload etc.
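
To make item 1 concrete, here is a minimal Go sketch of what "resume polling after a velero server restart" could look like. Every name in it (BackupPhase, ItemOperation, OperationStore, ResumeAsyncOperations, the enqueue callback) is a hypothetical stand-in, not Velero's actual API:

```go
// Illustrative only: hypothetical names, not Velero's actual API.
package sketch

import (
	"context"
	"fmt"
)

type BackupPhase string

const PhaseWaitingForPluginOperations BackupPhase = "WaitingForPluginOperations"

type Backup struct {
	Name  string
	Phase BackupPhase
}

// ItemOperation stands in for an async plugin operation persisted with the
// backup in the object store.
type ItemOperation struct {
	OperationID string
	Completed   bool
}

// OperationStore abstracts "fetch the persisted item operations for a backup".
type OperationStore interface {
	GetItemOperations(ctx context.Context, backupName string) ([]ItemOperation, error)
}

// ResumeAsyncOperations models the behavior item 1 asks to confirm: on server
// start, find backups in WaitingForPluginOperations, reload their operations
// from the object store, and hand them back to the poller instead of failing
// the backup.
func ResumeAsyncOperations(ctx context.Context, backups []Backup, store OperationStore, enqueue func(Backup, []ItemOperation)) error {
	for _, b := range backups {
		if b.Phase != PhaseWaitingForPluginOperations {
			continue // only WaitingForPluginOperations backups are recoverable this way
		}
		ops, err := store.GetItemOperations(ctx, b.Name)
		if err != nil {
			return fmt.Errorf("reload operations for backup %s: %w", b.Name, err)
		}
		enqueue(b, ops) // resume polling rather than marking the backup Failed
	}
	return nil
}
```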

Describe the solution you'd like

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@kaovilai
Contributor

Related: #6710

@anshulahuja98
Collaborator Author

In the 1.13 context, I'll drive towards completing these POCs and validating them. If the POC does not work, I will work towards identifying the action items needed here.
CC: @ywk253100, @reasonerjt

@Lyndon-Li
Contributor

@anshulahuja98 Thanks! Let me add this issue to the 1.13 milestone; if we see any problem later, we can move it out.

@Lyndon-Li Lyndon-Li added this to the v1.13 milestone Oct 11, 2023
@Lyndon-Li Lyndon-Li removed the 1.13-candidate label (issue/pr that should be considered to target v1.13 minor release) Oct 11, 2023
@sseago
Collaborator

sseago commented Oct 23, 2023

@anshulahuja98 Yes, I was going to create a bug on this today. Let me find the PR that added this -- that PR canceled data upload/download on node agent restart (which we need), but it also failed WaitingForPluginOperations backups and canceled data upload/download on velero pod restart, and we don't really want either of those.

@sseago
Collaborator

sseago commented Oct 23, 2023

@anshulahuja98 this was it: #6461
We want to keep the data upload/download controller and node agent changes, but I think we want to revert the changes to pkg/cmd/server/server.go.
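
For context, a rough sketch of the split proposed here: keep the node-agent restart handling from #6461 and drop the server-side part. The types and function below (DataUpload, CancelOrphanedUploadsOnNodeAgentRestart) are hypothetical stand-ins, not the actual code from that PR:

```go
// Illustrative only: hypothetical types, not the actual code from #6461.
package sketch

type DataUploadPhase string

const (
	DataUploadInProgress DataUploadPhase = "InProgress"
	DataUploadCanceled   DataUploadPhase = "Canceled"
)

type DataUpload struct {
	Name  string
	Node  string // node whose node-agent was running this upload
	Phase DataUploadPhase
}

// CancelOrphanedUploadsOnNodeAgentRestart models the part of #6461 the thread
// agrees to keep: when a node-agent restarts, the uploads it was running are
// orphaned and have to be canceled so they do not stay InProgress forever.
func CancelOrphanedUploadsOnNodeAgentRestart(restartedNode string, uploads []DataUpload) []DataUpload {
	for i := range uploads {
		if uploads[i].Node == restartedNode && uploads[i].Phase == DataUploadInProgress {
			uploads[i].Phase = DataUploadCanceled
		}
	}
	return uploads
}

// The server-side counterpart added by the same PR - failing
// WaitingForPluginOperations backups and canceling DataUpload/DataDownload on
// a velero pod restart - is the part this thread proposes to revert.
```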

@anshulahuja98
Collaborator Author

@sseago thanks for sharing this. I'll check these changes and raise a PR to fix this behavior.

@anshulahuja98
Collaborator Author

@qiuming-best since you made these changes - do you see any concern with us changing the behaviour so that WaitingForPluginOperations backups won't be failed on velero pod restart?

@qiuming-best
Contributor

@anshulahuja98 It's fine to revert the changes. I didn't know much about BIAv2 and RIAv2 at that time, so simply failing the backup when the velero pod restarted was too rough an approach.

If the velero pod that is doing the backup or restore doesn't restart, we really don't need to fail that backup or restore just because another velero pod restarts.

@anshulahuja98
Collaborator Author

Great, thanks for your input @qiuming-best.
I'll revert the velero server-related changes and test it out.

@Lyndon-Li
Contributor

With the current Velero server behavior, once the server restarts it marks all running backups as Failed. So besides reverting the changes that cancel the DUCR/DDCR, we also need a change to prevent the Velero server from failing those backups and to remap them to their DUCR/DDCR.
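
A minimal sketch of the "remap" idea, assuming (hypothetically) that the DUCR/DDCR belonging to a backup can be looked up by backup name, e.g. via a label; all names below are illustrative and not Velero's actual API:

```go
// Illustrative only: hypothetical names, not Velero's actual API.
package sketch

import "context"

// AsyncCR stands in for a DataUpload/DataDownload custom resource.
type AsyncCR struct {
	Name       string
	BackupName string
	Done       bool
}

// CRLister abstracts "find the DUCR/DDCR that belong to a backup",
// e.g. via a label selector on the backup name.
type CRLister interface {
	ListForBackup(ctx context.Context, backupName string) ([]AsyncCR, error)
}

// RemapBackupToAsyncCRs models the change described above: on server restart,
// instead of marking a WaitingForPluginOperations backup as Failed, find its
// still-running DUCR/DDCR and resume tracking them.
func RemapBackupToAsyncCRs(ctx context.Context, backupName string, lister CRLister, track func(AsyncCR)) error {
	crs, err := lister.ListForBackup(ctx, backupName)
	if err != nil {
		return err
	}
	for _, cr := range crs {
		if !cr.Done {
			track(cr) // resume tracking instead of failing the backup
		}
	}
	return nil
}
```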

@anshulahuja98
Collaborator Author

Yes, correct.
We will basically revert in full the changes to the file pkg/cmd/server/server.go in the PR https://github.com/vmware-tanzu/velero/pull/6461/files

That will take care of both of these things.
I hope this answers your concern.

@Lyndon-Li
Contributor

I am afraid that is not enough, because the existing Velero server code marks all of the running backup CRs as Failed.
Anyway, the first step is to revert the code as you mentioned; then you will see the remaining problems and can fix them accordingly.

@anshulahuja98
Collaborator Author

I understand now what you are saying. I will make the required changes for that as well.

@anshulahuja98
Collaborator Author

Thanks for your input @Lyndon-Li.

@anshulahuja98
Collaborator Author

I will prioritize this and try to complete it next week.

@sseago
Collaborator

sseago commented Nov 10, 2023

"the existing velero server code marks all the running backup CRs as Failed." -- for InProgress backups, this is still correct. If a backup has not progressed to WaitingForPluginOperations or Finalizing, then the only option is to fail it and start over with a new backup. Without failing InProgress backups, they will be listed as InProgress forever.
