Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volumesnapshot getting deleted during backup (argocd) #7905

Open
mynktl opened this issue Jun 19, 2024 · 15 comments
Open

Volumesnapshot getting deleted during backup (argocd) #7905

mynktl opened this issue Jun 19, 2024 · 15 comments
Assignees
Labels
1.16-candidate Area/CSI Related to Container Storage Interface support Needs investigation Reviewed Q3 2024

Comments

@mynktl
Copy link
Contributor

mynktl commented Jun 19, 2024

What steps did you take and what happened:

We created a backup for cluster and observe that Volumesnapshot resource was getting deleted by argocd. This lead to failed backup.

argocd uses label for maintaining the inventory of the resources it created. If any other resource is having such label then it will prune/delete it.
While doing backup with velero, velero csi plugin copy the PVC label to volumesnapshot. Because of this, argocd assumes the ownership of newly created VS and prune it since it doesn't match with given manifest.

What did you expect to happen:
Volumesnapshot resource should not have been deleted and backup should succeed.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Ideally we should avoid copying label of PVC to VS as we don't know why those labels were used. If this is still required then there should be a flag to enable this functionality.

For linking PVC to VS, VS already have the information of PVC.

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@ywk253100 ywk253100 added the Area/CSI Related to Container Storage Interface support label Jun 23, 2024
@blackpiglet
Copy link
Contributor

https://argo-cd.readthedocs.io/en/stable/user-guide/resource_tracking/

I will check whether it's possible to avoid adding similar labels to the VolumeSnapshot created by the CSI backup.

@blackpiglet
Copy link
Contributor

@anshulahuja98
#6294 introduces the labels on the VolumeSnapshot.
If we set the VolumeSnapshot as the must-restore resource, could we remove the labels on the VolumeSnapshot.

var resourceMustHave = []string{
"datauploads.velero.io",
}

@anshulahuja98
Copy link
Collaborator

This is an interesting issue caused by argo CD

@blackpiglet, let me try to better understand the fix you are proposing, if we add it to MustHave list, then the pitfall I see is that during restore if user gives label selector, we will end up restoring all VS instead of VS which are actually required.

@blackpiglet
Copy link
Contributor

Yes, it's possible that unwanted VolumeSnapshots are restored when the label selector is used in restore resource filtering.
The better way to do it is something like the must-have annotation in the backup workflow.

@anshulahuja98
Copy link
Collaborator

hmm yeah
Basically given the impact of #7978 of leaving orphan snapshots etc

There could be bunch of edge cases which I don't want to enter into, if we restore all VS by default.

Overall the right approach to solve it is to maybe when PVC is being restored we restore VS from that code
instead on depending on VS to be present.
That way we can remove the label
Does that make sense?

@blackpiglet
Copy link
Contributor

blackpiglet commented Jul 11, 2024

Agreed.
Instead of depending on the VolumeSnapshot in the cluster, we can add the VolumeSnapshot as the AdditionalItem of the PVC RIA result, then the VolumeSnapshot is created from the backup too.

return &velero.RestoreItemActionExecuteOutput{
UpdatedItem: &unstructured.Unstructured{Object: pvcMap},
OperationID: operationID,
}, nil

@anshulahuja98 anshulahuja98 changed the title Volumesnapshot getting deleted during backup Volumesnapshot getting deleted during backup (argocd) Sep 6, 2024
@anshulahuja98
Copy link
Collaborator

anshulahuja98 commented Sep 6, 2024

@reasonerjt / @blackpiglet can we mark this as 1.15 candidate?

@blackpiglet
Copy link
Contributor

How about we put this issue in v1.15.1?
v1.15.0 is already approaching the final cut date.

@anshulahuja98
Copy link
Collaborator

ah my bad, I meant 1.16

@weshayutin
Copy link
Contributor

This would be a great thing to fix upstream, I've been taking some more detailed notes on how oadp/velero play w/ argo. At the moment, the suggestion is to WARN in the logs if the argo label is found on the namespace. Interested in hearing other ideas. I like what I see in the comments!

My notes - https://github.com/weshayutin/gitops-oadp
Note: these are WIP and not to be considered doc :)

@shashwattrivedi
Copy link

Instead of depending on the VolumeSnapshot in the cluster, we can add the VolumeSnapshot as the AdditionalItem of the PVC RIA result, then the VolumeSnapshot is created from the backup too.

@blackpiglet since VolumeSnapshot won't be present in the cluster, we will have to remove validations present in the code before
RestoreItemActionExecuteOutput, restoreFromVolumeSnapshot won't work, => AdditionalItems approach

err := p.crClient.Get(
context.TODO(),
crclient.ObjectKey{
Namespace: input.Restore.Namespace,
Name: input.Restore.Spec.BackupName,
},
backup,
)

@blackpiglet
Copy link
Contributor

You are right.
If we are going that way, this is not a small change.
The code you mentioned should be modified to suit the VolumeSnapshot and VolumeSnapshotContent deletion change.

@anshulahuja98
Copy link
Collaborator

but will that be sufficient @blackpiglet?
If we remove the GET on VS in this flow.
We assume it will be there.

Do you have any other suggestion of what can be done?

@reasonerjt
Copy link
Contributor

This will likely be solved once we do not sync VSC when backup is synced, see more discussions in #7979

@shubham-pampattiwar
Copy link
Collaborator

I have posted a PR to add a backup level warning for such scenarios: #8257
It would look something like:

Phase:  Completed                                                                                                                                                    
                                                                                                                                                                     
                                                                                                                                                                     
Warnings:                                                                                                                                                            
  Velero:    message: /backup operation may encounter complications and potentially produce undesirable results due to the inclusion of namespaces [minimal-8csivol] 
managed by ArgoCD in the backup. 

While this does not solve the problem altogether, the warning at the very least notifies the user of the presence of Velero and ArgoCD conflicting systems, which could in turn push the user to take the necessary steps to avoid the undesirable outcomes like updating the resource exclusions in ArgoCD to include PV's, PVC's, VS and VSCs etc (these can be documented).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.16-candidate Area/CSI Related to Container Storage Interface support Needs investigation Reviewed Q3 2024
Projects
None yet
Development

No branches or pull requests

8 participants