Expired Backup objects from BackupStorageLocations with accessMode=ReadOnly don't get cleaned up #7106

Closed
phoenix-bjoern opened this issue Nov 15, 2023 · 19 comments

@phoenix-bjoern

When a BackupStorageLocation is defined with accessMode: ReadOnly, Velero imports Backup objects into the cluster as usual.

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: my-backups
  namespace: velero
spec:
  accessMode: ReadOnly

Once the Backup objects expire, Velero would normally remove them both from the BackupStorageLocation and from the cluster when the accessMode is set to ReadWrite. However, for ReadOnly BackupStorageLocations Velero neither removes the entities in the storage location (which is expected), nor does it remove them from the K8s cluster it is installed in. Over time this behavior can lead to a vast number of Backup objects, which can slow down Velero's operations.

It is expected that Velero cleans up the Backup objects (and other related objects) in the cluster it is installed in, even if the accessMode of the BackupStorageLocation is ReadOnly.

@phoenix-bjoern phoenix-bjoern changed the title Backups from BackupStorageLocations in ReadOnly access mode don't get cleaned up Expired Backup objects from BackupStorageLocations with accessMode=ReadOnly don't get cleaned up Nov 15, 2023
@blackpiglet
Contributor

IMO, this is expected.
Please read this section of the Velero documentation: https://velero.io/docs/v1.12/how-velero-works/#restores
The read-only mode is intended for the scenario in which the user only wants to restore from existing backups and doesn't want to create or delete backups by accident.

@blackpiglet blackpiglet self-assigned this Nov 16, 2023
@blackpiglet blackpiglet added the Needs info, backlog, and Candidate for close labels Nov 16, 2023
@phoenix-bjoern
Author

As I mentioned before, the behavior with regard to the BackupStorageLocation is fine for me and works as documented:

"which disables backup creation and deletion for the storage location"

However, the Backup CRs in the cluster don't get removed, even if the Backups in the BackupStorageLocation are removed by another Velero instance which has ReadWrite access (= backup source).
So, IMHO there is a gap, as Velero neither cleans up expired Backup CRs in the local cluster by itself, nor does it reconcile the Backup CRs and remove them locally when they are removed from the storage location.

@blackpiglet
Contributor

blackpiglet commented Nov 16, 2023

Thanks for the clarification, I see.
What's the Velero version you are using?
I just tried with the main branch code in my development environment.
When the BSL is set to ReadOnly access mode, the orphaned backup CRs can be deleted.

@phoenix-bjoern
Author

We are on Velero 1.12.1.

@sseago
Collaborator

sseago commented Nov 16, 2023

@blackpiglet @phoenix-bjoern Hmm. So the use case is that the BSL is in 2 clusters, read-write in cluster1, read-only in cluster2. Cluster2 won't delete the backup on expiration since it's read-only, but cluster1 will delete it and clean out the BSL.

However, the backup sync controller has code to delete orphaned backups. Once the backup is removed from cluster1 and the BSL, it becomes an orphaned backup in cluster2, and the controller should delete this backup CR from cluster2 now, not because it's expired but because it's orphaned. It could be that there's a bug in the backup sync controller that's preventing this from happening -- that's where I'd look to see what's going on.
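
For illustration, a minimal sketch of the orphan-deletion idea described above (assumed names and signature, not the actual Velero implementation):

import (
    "context"

    "github.com/sirupsen/logrus"
    "k8s.io/apimachinery/pkg/util/sets"
    "sigs.k8s.io/controller-runtime/pkg/client"

    velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// deleteOrphanedBackups is a sketch, not the real controller method: it removes
// Backup CRs that no longer have a corresponding backup in object storage.
func deleteOrphanedBackups(ctx context.Context, kbClient client.Client,
    clusterBackups *velerov1api.BackupList, backupStoreBackups sets.Set[string],
    log logrus.FieldLogger) {
    for i := range clusterBackups.Items {
        backup := clusterBackups.Items[i]
        if backupStoreBackups.Has(backup.Name) {
            continue // still present in the BSL, not an orphan
        }
        if err := kbClient.Delete(ctx, &backup); err != nil {
            log.WithError(err).Errorf("error deleting orphaned backup %s", backup.Name)
        }
    }
}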

@blackpiglet
Contributor

@phoenix-bjoern
Tested on v1.12.1, and I couldn't reproduce this issue.
I also didn't find that the backup sync controller checks the BSL AccessMode when it deletes orphaned backups.

Could you collect the Velero logs from your environment to help debug?

@phoenix-bjoern
Author

@blackpiglet sure, will collect some logs for you.

@phoenix-bjoern
Author

The only log lines I've identified related to an expired Backup are these from the garbage collector:

time="2023-11-17T11:39:08Z" level=info msg="Backup:velero-applications-with-volumes-20231109210000 has expired" backup=velero/velero-applications-with-volumes-20231109210000 expiration="2023-11-16 21:00:00 +0000 UTC" logSource="pkg/controller/gc_controller.go:128"
time="2023-11-17T11:39:08Z" level=info msg="Backup cannot be garbage-collected because backup storage location develop is currently in read-only mode" backup=velero/velero-applications-with-volumes-20231109210000 expiration="2023-11-16 21:00:00 +0000 UTC" logSource="pkg/controller/gc_controller.go:152"

Is there another mechanism which should kick in here to remove the CR?
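
For context, the second log line above comes from the GC controller skipping expired backups whose storage location is read-only. Roughly (paraphrased, not the exact Velero source), the guard looks like this:

// Paraphrase of the read-only guard in pkg/controller/gc_controller.go:
// expired backups are not garbage-collected while their BSL is read-only.
if loc.Spec.AccessMode == velerov1api.BackupStorageLocationAccessModeReadOnly {
    log.Infof("Backup cannot be garbage-collected because backup storage location %s is currently in read-only mode", loc.Name)
    return ctrl.Result{}, nil
}

So for a ReadOnly BSL the expired backup never reaches the GC deletion flow, and only the orphaned-backup cleanup in the sync controller could remove the local CR.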

@phoenix-bjoern
Author

I found out that expired Backups actually do get deleted. Only Backups of CSI volumes that finished with errors (PartiallyFailed) don't get removed locally. Maybe the (partial) errors or the CSI plugin/mechanism is the reason why these CRs don't get removed properly.

@phoenix-bjoern
Author

@blackpiglet As mentioned before, I think only orphaned Backups with status PartiallyFailed are not deleted, while backups with status Completed actually get deleted. I missed that important point in my issue description, sorry.

I think the relevant code is https://github.com/vmware-tanzu/velero/blob/v1.12.1/pkg/controller/backup_sync_controller.go#L334C47:

if backup.Status.Phase != velerov1api.BackupPhaseCompleted || backupStoreBackups.Has(backup.Name) {
    continue
}

So the question is: Why is it necessary to keep this check for the backup's phase here? I tracked down when this code was introduced and found this commit: 8ce513a. It seems the author wanted to prevent the removal of "in-flight" backups.

So IMHO it should be safe to extend the phase check to other "final" phases like PartiallyFailed and Failed, which should resolve the described problem. What do you think?
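
A minimal sketch of that suggestion (not the actual fix that was merged), extending the quoted condition so that all terminal phases are eligible for orphan cleanup:

// Sketch: treat Completed, PartiallyFailed and Failed as terminal phases
// whose orphaned Backup CRs may be deleted; in-flight backups are still skipped.
terminalPhases := map[velerov1api.BackupPhase]bool{
    velerov1api.BackupPhaseCompleted:       true,
    velerov1api.BackupPhasePartiallyFailed: true,
    velerov1api.BackupPhaseFailed:          true,
}
if !terminalPhases[backup.Status.Phase] || backupStoreBackups.Has(backup.Name) {
    continue
}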

@blackpiglet blackpiglet removed the Candidate for close label Nov 20, 2023
@blackpiglet
Contributor

Thanks for the feedback.

There have been some discussions about how to handle backups with failures. The concern is that the Velero server cannot know what went wrong in the backup and may not be able to handle the error itself, for example a volume snapshot creation failure or a filesystem data upload failure. For some of these operations the Velero server cannot guarantee idempotency.
So leaving the failed backup there for users to handle may be the better choice here.

@phoenix-bjoern
Author

@blackpiglet I was thinking about the issue description in #705 too. But the Backup CR doesn't contain any logs or detailed information about the errors during the backup. So if the Backup entity in the BSL is gone, there is no value in keeping the Backup CR either, as you can't do any analysis with it.
IMHO if the Backup in the BSL gets garbage-collected, the orphaned Backup CR should be removed too, as it doesn't provide any useful data anymore.

@blackpiglet
Contributor

I agree that if the backup logs are gone, then Velero cannot find a way to figure out what goes wrong exactly.

@sseago @Lyndon-Li
What do you think about this case?

@Lyndon-Li
Contributor

if the backup logs are gone, then Velero cannot find a way to figure out what goes wrong exactly.

For this topic, here is a possible solution we've discussed but not settled on yet: #6606.

@phoenix-bjoern
Author

@blackpiglet one additional hint: the backups were not created in the same cluster. In our scenario the backup got imported from the ReadOnly BSL. The Backup CR on the original cluster has already been removed (I assume by the GC process), which also removed the entity in the BSL.
So I don't think it is worth keeping the Backup CR in this "secondary" cluster if the original CR and the entity in the BSL are gone. If logs of (partially) failed backups should be kept for analysis, then that has to be handled in the GC, not in the deleteOrphanedBackups() method.

@phoenix-bjoern
Author

@blackpiglet is there anything I can do to move this issue forward?

@blackpiglet
Contributor

@phoenix-bjoern
This issue will be discussed in the planning for the next release (v1.14).

@reasonerjt reasonerjt added this to the v1.14 milestone Feb 6, 2024
@reasonerjt reasonerjt added Bug and removed Needs info Waiting for information 1.14-candidate labels Feb 6, 2024
@reasonerjt
Contributor

Seems the issue is in the backup sync controller: when it's syncing the deletion of a backup, it should disregard both the status and the access mode of the BSL.

@allenxu404
Contributor

Closing it since it has already been fixed by PR #6649.

@danfengliu danfengliu added the Needs Testing label Mar 19, 2024
@danfengliu danfengliu added in-test-plan and removed the in-test-plan and Needs Testing labels Apr 16, 2024