Expired Backup objects from BackupStorageLocations with accessMode=ReadOnly don't get cleaned up #7106

Closed
phoenix-bjoern opened this issue Nov 15, 2023 · 19 comments

@phoenix-bjoern

When a BackupStorageLocation is defined with accessMode: ReadOnly, Velero imports Backup objects into the cluster as usual.

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: my-backups
  namespace: velero
spec:
  accessMode: ReadOnly

Once the Backup objects expire, Velero would normally remove them both from the BackupStorageLocation and from the cluster when the accessMode is set to ReadWrite. However, for ReadOnly BackupStorageLocations Velero neither removes the entities in the storage location (which is expected), nor does it remove them from the K8s cluster it is installed in. Over time this behavior can lead to a vast number of Backup objects, which can slow down Velero's operations.

It is expected that Velero cleans up the Backup objects (and other related objects) in the cluster it is installed in, even if the accessMode of the BackupStorageLocation is ReadOnly.

@phoenix-bjoern phoenix-bjoern changed the title Backups from BackupStorageLocations in ReadOnly access mode don't get cleaned up Expired Backup objects from BackupStorageLocations with accessMode=ReadOnly don't get cleaned up Nov 15, 2023
@blackpiglet
Contributor

IMO, this is expected.
Please read this section of the Velero documentation: https://velero.io/docs/v1.12/how-velero-works/#restores
The read-only mode is intended for the scenario in which the user only wants to restore from existing backups and doesn't want to create or delete backups by accident.

@blackpiglet blackpiglet self-assigned this Nov 16, 2023
@blackpiglet blackpiglet added the Needs info, backlog, and Candidate for close labels Nov 16, 2023
@phoenix-bjoern
Author

As I mentioned before, the behavior with regard to the BackupStorageLocation is fine for me and works as documented:

"which disables backup creation and deletion for the storage location"

However, the Backup CRs in the cluster don't get removed, even if the Backups in the BackupStorageLocation are removed by another Velero instance which has ReadWrite access (= backup source).
So, IMHO there is a gap, as Velero neither cleans up expired Backup CRs in the local cluster by itself, nor does it reconcile the Backup CRs and remove them locally when they are removed from the storage location.

@blackpiglet
Contributor

blackpiglet commented Nov 16, 2023

Thanks for the clarification, I see.
What's the Velero version you are using?
I just tried with the main branch code in my development environment.
When the BSL is set to ReadOnly access mode, the orphaned backup CRs can be deleted.

@phoenix-bjoern
Author

We are on Velero 1.12.1.

@sseago
Collaborator

sseago commented Nov 16, 2023

@blackpiglet @phoenix-bjoern Hmm. So the use case is that the BSL is in 2 clusters, read-write in cluster1, read-only in cluster2. Cluster2 won't delete the backup on expiration since it's read-only, but cluster1 will delete it and clean out the BSL.

However, the backup sync controller has code to delete orphaned backups. Once the backup is removed from cluster1 and the BSL, it becomes an orphaned backup in cluster2, and the controller should delete this backup CR from cluster2 now, not because it's expired but because it's orphaned. It could be that there's a bug in the backup sync controller that's preventing this from happening -- that's where I'd look to see what's going on.
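
For illustration, a minimal sketch of the orphan-deletion idea described above (assumed names and signature, not the actual Velero implementation):

import (
    "context"

    "github.com/sirupsen/logrus"
    "k8s.io/apimachinery/pkg/util/sets"
    "sigs.k8s.io/controller-runtime/pkg/client"

    velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// deleteOrphanedBackups is a sketch, not the real controller method: it removes
// Backup CRs that no longer have a corresponding backup in object storage.
func deleteOrphanedBackups(ctx context.Context, kbClient client.Client,
    clusterBackups *velerov1api.BackupList, backupStoreBackups sets.Set[string],
    log logrus.FieldLogger) {
    for i := range clusterBackups.Items {
        backup := clusterBackups.Items[i]
        if backupStoreBackups.Has(backup.Name) {
            continue // still present in the BSL, not an orphan
        }
        if err := kbClient.Delete(ctx, &backup); err != nil {
            log.WithError(err).Errorf("error deleting orphaned backup %s", backup.Name)
        }
    }
}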

@blackpiglet
Contributor

@phoenix-bjoern
Tested on v1.12.1, and I couldn't reproduce this issue.
I also didn't find that the backup sync controller checks the BSL AccessMode when it deletes orphaned backups.

Could you collect the Velero logs from your environment to help debug?

@phoenix-bjoern
Author

@blackpiglet sure, will collect some logs for you.

@phoenix-bjoern
Author

The only log lines I've identified related to an expired Backup are these from the garbage collector:

time="2023-11-17T11:39:08Z" level=info msg="Backup:velero-applications-with-volumes-20231109210000 has expired" backup=velero/velero-applications-with-volumes-20231109210000 expiration="2023-11-16 21:00:00 +0000 UTC" logSource="pkg/controller/gc_controller.go:128"
time="2023-11-17T11:39:08Z" level=info msg="Backup cannot be garbage-collected because backup storage location develop is currently in read-only mode" backup=velero/velero-applications-with-volumes-20231109210000 expiration="2023-11-16 21:00:00 +0000 UTC" logSource="pkg/controller/gc_controller.go:152"

Is there another mechanism which should kick in here to remove the CR?
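
For context, the second log line above comes from the GC controller skipping expired backups whose storage location is read-only. Roughly (paraphrased, not the exact Velero source), the guard looks like this:

// Paraphrase of the read-only guard in pkg/controller/gc_controller.go:
// expired backups are not garbage-collected while their BSL is read-only.
if loc.Spec.AccessMode == velerov1api.BackupStorageLocationAccessModeReadOnly {
    log.Infof("Backup cannot be garbage-collected because backup storage location %s is currently in read-only mode", loc.Name)
    return ctrl.Result{}, nil
}

So for a ReadOnly BSL the expired backup never reaches the GC deletion flow, and only the orphaned-backup cleanup in the sync controller could remove the local CR.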

@phoenix-bjoern
Author

I found out that expired Backups actually do get deleted. Only Backups of CSI volumes that finished with errors (PartiallyFailed) don't get removed locally. Maybe the (partial) errors or the CSI plugin/mechanism is the reason why these CRs don't get removed properly.

@phoenix-bjoern
Author

@blackpiglet As mentioned before, I think only orphaned Backups with status PartiallyFailed are not deleted, while backups with status Completed actually get deleted. I missed that important point in my issue description, sorry.

I think the relevant code is https://github.com/vmware-tanzu/velero/blob/v1.12.1/pkg/controller/backup_sync_controller.go#L334C47:

if backup.Status.Phase != velerov1api.BackupPhaseCompleted || backupStoreBackups.Has(backup.Name) {
    continue
}

So the question is: Why is it necessary to keep this check for the backup's phase here? I tracked down when this code was introduced and found this commit: 8ce513a. It seems the author wanted to prevent the removal of "in-flight" backups.

So IMHO it should be safe to extend the phase check to other "final" phases like PartiallyFailed and Failed, which should resolve the described problem. What do you think?
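
A minimal sketch of that suggestion (not the actual fix that was merged), extending the quoted condition so that all terminal phases are eligible for orphan cleanup:

// Sketch: treat Completed, PartiallyFailed and Failed as terminal phases
// whose orphaned Backup CRs may be deleted; in-flight backups are still skipped.
terminalPhases := map[velerov1api.BackupPhase]bool{
    velerov1api.BackupPhaseCompleted:       true,
    velerov1api.BackupPhasePartiallyFailed: true,
    velerov1api.BackupPhaseFailed:          true,
}
if !terminalPhases[backup.Status.Phase] || backupStoreBackups.Has(backup.Name) {
    continue
}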

@blackpiglet blackpiglet removed the Candidate for close label Nov 20, 2023
@blackpiglet
Contributor

Thanks for the feedback.

There have been some discussions about how to handle backups with failures. The concern is that the Velero server cannot know what went wrong in the backup and may not be able to handle the error itself, for example a volume snapshot creation failure or a filesystem data upload failure. For some of these operations the Velero server cannot guarantee idempotency.
So leaving the failed backup there for users to handle may be the better choice here.

@phoenix-bjoern
Author

@blackpiglet I was thinking about the issue description in #705 too. But the Backup CR doesn't contain any logs or detailed information about the errors during the backup. So if the Backup entity in the BSL is gone, there is no value in keeping the Backup CR either, as you can't do any analysis with it.
IMHO if the Backup in the BSL gets garbage-collected, the orphaned Backup CR should be removed too, as it doesn't provide any useful data anymore.

@blackpiglet
Contributor

I agree that if the backup logs are gone, then Velero cannot find a way to figure out what goes wrong exactly.

@sseago @Lyndon-Li
What do you think about this case?

@Lyndon-Li
Contributor

if the backup logs are gone, then Velero cannot find a way to figure out what goes wrong exactly.

For this topic, here is a possible solution we've discussed but not settled on yet: #6606.

@phoenix-bjoern
Author

@blackpiglet one additional hint: the backups were not created in the same cluster. In our scenario the backup got imported from the ReadOnly BSL. The Backup CR on the original cluster has already been removed (I assume by the GC process), which also removed the entity in the BSL.
So I don't think it is worth keeping the Backup CR in this "secondary" cluster if the original CR and the entity in the BSL are gone. If logs of (partially) failed backups should be kept for analysis, then that has to be handled in the GC, not in the deleteOrphanedBackups() method.

@phoenix-bjoern
Author

@blackpiglet is there anything I can do to move this issue forward?

@blackpiglet
Contributor

@phoenix-bjoern
This issue will be discussed in the planning for the next release (v1.14).

@reasonerjt reasonerjt added this to the v1.14 milestone Feb 6, 2024
@reasonerjt reasonerjt added Bug and removed Needs info Waiting for information 1.14-candidate labels Feb 6, 2024
@reasonerjt
Contributor

Seems the issue is in the backup sync controller: when it's syncing the deletion of a backup, it should disregard both the status and the access mode of the BSL.

@allenxu404
Contributor

Closing it since it has already been fixed by PR #6649.

@danfengliu danfengliu added the Needs Testing label Mar 19, 2024
@danfengliu danfengliu added in-test-plan and removed the in-test-plan and Needs Testing labels Apr 16, 2024