Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete Velero Backup when user sets the deleteVeleroBackup #67

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mpryc
Copy link
Collaborator

@mpryc mpryc commented May 14, 2024

Enhancement which will delete original VeleroBackup when the user sets the deleteVeleroBackup witin NonAdminBackup Spec.

Fixes #58

This allows NonAdminBackup object to have additional field within spec which will trigger deletion of the original VeleroBackup object:

apiVersion: nac.oadp.openshift.io/v1alpha1
kind: NonAdminBackup
metadata:
  name: example
  namespace: nacproject
spec:
  deleteVeleroBackup: true # <<<<<< NEW 
  [...]

To test:


Test 1:
deleteVeleroBackup on non existing Velero Backup object

  1. Create NonAdminBackup without properspec
apiVersion: nac.oadp.openshift.io/v1alpha1
kind: NonAdminBackup
metadata:
  name: backupwithoutspec
  namespace: nacproject
spec: {}
  1. Ensure NonAdminBackup is in:
  • Phase:
    BackingOff
  • Has the Condition:
    Type: Accepted
    Status: False
    Reason: InvalidBackupSpec
    Message: BackupSpec is not defined
  1. Update the NonAdminBackup object to include:
[...]
spec:
  deleteVeleroBackup: true
[...]
  1. Expected result:
    • NonAdminBackup object should be deleted by the controller.
    • Reconcile loop should exit with the following DEBUG log messages:
      DEBUG Deleted NonAdminBackup: backupwithoutspec
      DEBUG NonAdminBackupPredicate: Accepted Delete event
      DEBUG >>> Reconcile NonAdminBackup - loop start
      DEBUG Non existing NonAdminBackup CR

Test 2:
deleteVeleroBackup on existing Velero Backup object set to true

  1. Create NonAdminBackup without properspec
apiVersion: nac.oadp.openshift.io/v1alpha1
kind: NonAdminBackup
metadata:
  name: backupwithspec
  namespace: nacproject
spec:
  backupSpec: {}
  1. Ensure NonAdminBackup is in:
  • Phase:
    Created
  • Has the Conditions:
    • Type: Accepted
      Status: True
      Reason: BackupAccepted
      Message: Backup accepted
    • Type: Accepted
      Status: True
      Reason: BackupAccepted
      Message: Backup accepted
  1. Ensure VeleroBackup is created in the openshift-adp namespace:
    $ oc get Backup -n openshift-adp nab-nacproject-eeaa9cd35c77f4
    NAME AGE
    nab-nacproject-eeaa9cd35c77f4 12s

  2. Update the NonAdminBackup object to include:

[...]
spec:
  deleteVeleroBackup: true
[...]
  1. Expected result:
    • VeleroBackup should be deleted by the controller
    • NonAdminBackup object should be deleted by the controller.
    • Reconcile loop should exit with the following DEBUG log messages:
      DEBUG Deleted NonAdminBackup: backupwithoutspec
      DEBUG NonAdminBackupPredicate: Accepted Delete event
      DEBUG >>> Reconcile NonAdminBackup - loop start
      DEBUG Non existing NonAdminBackup CR

Test 3:
deleteVeleroBackup on existing Velero Backup object set to false

  1. Perform the same steps as 1,2,3 from the Test 2
  2. Update the NonAdminBackup object to include:
[...]
spec:
  deleteVeleroBackup: false
[...]
  1. Reconcile loop should have the DEBUG logs:
    DEBUG: NonAdminBackup BackupSpec and BackupStatus - nothing to update

Test 4:
deleteVeleroBackup on NON existing Velero Backup object set to true

  1. Perform the same steps as 1,2,3 from the Test 2
  2. Delete Velero Backup object:
    $ oc delete -n openshift-adp nab-nacproject-eeaa9cd35c77f4
  3. Update the NonAdminBackup object to include:
[...]
spec:
  deleteVeleroBackup: true
[...]
  1. Expected result:
    • NonAdminBackup object should be deleted by the controller.
    • Reconcile loop should exit with the following DEBUG log messages:
      DEBUG Deleted NonAdminBackup: backupwithoutspec
      DEBUG NonAdminBackupPredicate: Accepted Delete event
      DEBUG >>> Reconcile NonAdminBackup - loop start
      DEBUG Non existing NonAdminBackup CR

Copy link

openshift-ci bot commented May 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mpryc
Copy link
Collaborator Author

mpryc commented May 14, 2024

Depends on #65 as it's rebased on it.

@mpryc mpryc force-pushed the delete-velero-backup-issue_58 branch 3 times, most recently from 34133b5 to 72ef619 Compare May 14, 2024 13:45
@mateusoliveira43
Copy link
Contributor

@mpryc I am confused, how this fixes related issue?

This will delete Velero Backup on demand, right? But NAB will be still around

Should not be a mechanism that when NAB is deleted, Velero Backup is also deleted?

@mpryc
Copy link
Collaborator Author

mpryc commented May 14, 2024

@mpryc I am confused, how this fixes related issue?

This will delete Velero Backup on demand, right? But NAB will be still around

Should not be a mechanism that when NAB is deleted, Velero Backup is also deleted?

No, we discussed this with @shubham-pampattiwar and @shawn-hurley. We really can't do the above as we don't want end up in situation where there is namespace removed that triggers the NAB to be deleted and we also loose Bacups from which namespace could be restored. In k8s there isn't really separation of the events that triggered NAB deletion.

@mpryc mpryc self-assigned this May 14, 2024
@shubham-pampattiwar
Copy link
Member

@mpryc Thoughts on why we made this feature enablement via spec and not via annotation ?

@@ -280,6 +281,29 @@ func CheckVeleroBackupLabels(backup *velerov1api.Backup) bool {
return exists && value == constant.ManagedByLabelValue
}

// DeleteVeleroBackup deletes Velero Backup object based on the NonAdminBackup object name and namespace
func DeleteVeleroBackup(ctx context.Context, r client.Client, logger logr.Logger, nab *nacv1alpha1.NonAdminBackup) (bool, error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the Deletion logic flawed here ? Why are we using GenerateVeleroBackupName function to fetch the backup to be deleted ? Isnt there a high possibility that the Generated Name won't match the actual VeleroBackup Name that we want to delete ? Shouldn't we List the Velero backups and then check the annotations of these Velero backup objects and match the nab name with the annotation value of the key "openshift.io/oadp-nab-origin-name" to get the VeleroBackup to be deleted ? Let me know if I am missing something ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our velero backup name is always generated based on https://github.com/migtools/oadp-non-admin/blob/master/internal/common/function/function.go#L115-L145

fmt.Sprintf("%s-%s-%s", constant.VeleroBackupNamePrefix, namespace, nameHash)

So we have always nab-NAMESPACE-HASH where the HASH is sha256 from NonAdminBackup name.

VeleroBackup could get deleted only if one matches that entire 3 parts.

If we want to add extra protection we could, but then I would argue the UUID of the NonAdminBackup needs to be used from the VeleroBackup to match the UUID of the requestor.

It is currently possible, because we store this annotation within VeleroBackup:

openshift.io/oadp-nab-origin-uuid=<UUID_OF_NAB>

Let me know if I should add that protection on top of current name match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not to just use

VeleroBackupName string `json:"veleroBackupName,omitempty"`

?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that, but with the current implementation I can:

  1. Create NAB
  2. VeleroBackup will get created
  3. Remove NAB

At this stage I have still can remove VeleroBackup from 2 if I create new NAB with the same name as the previous one with flag deleteVeleroBackup: true.

Using status will disallow me to do so, because 2 will never get created as this object will be there forever until sysadmin removes it. It's because Status can only be updated by controller when 2 is created.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to have corner case where for some reason NAB did not got proper update of the Status.

To be honest we may want also add some safety feature to only delete VeleroBackup if it has the UID or other linkage to VeleroBackup from NAB object, this imo is separate design ?

There needs to be one source of truth here and I've chosen our function, where it could also be Status, so either way. Function is less check if status exists as we have name and only check if VeleroBackup exists.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NAB will not have a status only when the NAB controller is not running right ? or if the NAB controller is running and the status update call fails ? I think this makes it even more prudent that we check for the status of NAB first:

  • check if NAB has status
  • lookup of VeleroBackup object from NAB.status.VeleroBackupName
  • VeleroBackup is in terminal state
  • then delete the VeleroBackup object and then subsequently NAB
    Picking up the VeleroBackupName from NAB status at the very least makes sure that VeleroBackup exists for that NAB object and Delete Call makes sense from user perspective.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a time period when the NAB was created, but Backup was not yet created. If we are in that time window there is a problem with getting proper velero backup name from status. If that happens we won't reconcile again and that object will be out of sync. That is the only additional use case which I can think of when it comes to inconsistency.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See thats the thing, we are not allowing Delete for such cases, grabbing the name from status would not allow a delete for such cases(because there is no VeleroBackup object in terminal phase)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, more then half of myself tend to agree with you now :-) Thanks.

@@ -86,31 +87,91 @@ func (r *NonAdminBackupReconciler) Reconcile(ctx context.Context, req ctrl.Reque
return ctrl.Result{}, err
}

reconcileExit, reconcileErr := r.DeleteVeleroBackup(ctx, rLog, &nab)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't we have a case to requeue here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint job complained that the requeue was not used in the r.DeleteVeleroBackup function, so I removed it from the logic.

@mateusoliveira43
Copy link
Contributor

/hold
rebase on top of master (so PR will get smaller)
and rework deletion process as discussed in meeting Wednesday

@mpryc
Copy link
Collaborator Author

mpryc commented May 20, 2024

Rebase and small update to the PR to include design.

@mateusoliveira43 if you could check the design update and let me know if this is fine.

To be done before is ready to merge:

  • removal of NonAdminBackup object after VeleroBackup is removed.

@mateusoliveira43
Copy link
Contributor

@mpryc I though we would update design here #70. Could you push these changes there?

But can be here and I rebase there afterwards as well

(the design changes look correct to me)

@mpryc
Copy link
Collaborator Author

mpryc commented May 20, 2024

@mpryc I though we would update design here #70. Could you push these changes there?

But can be here and I rebase there afterwards as well

(the design changes look correct to me)

Sure will move that update there, but now I wonder if we should completely drop status updates while deleting?

Now the way it currently works is:
1 -> Adds Condition Deletion to True
2 -> Remove VeleroBackup
3 -> Once main Velero Backup is deleted adds phase: Deleted
4 -> Additions after Mateus idea: Removes NonAdminBackup
Now the step 3 requires requeue of reconcile loop, otherwise there is an error that the object can not be updated because it's not existing
Should we completely drop 1 and 3, because we are now removing NonAdminBackup and we don't care about phase of NAB ? What do you think ?

@shubham-pampattiwar
Copy link
Member

We need some sort of status update on NAB object once VeleroBackup is deleted because if there is an err for VeleroBackup deletion call we should be able to tell about that to the user via NAB status.

@mpryc
Copy link
Collaborator Author

mpryc commented May 20, 2024

We need some sort of status update on NAB object once VeleroBackup is deleted because if there is an err for VeleroBackup deletion call we should be able to tell about that to the user via NAB status.

If there is any error with VeleroBackup deletion, I think we should not try to delete NAB and only then update NAB status with:
Phase: BackingOff
Condition: Deletion -> with error what went wrong ?

@shubham-pampattiwar
Copy link
Member

We need some sort of status update on NAB object once VeleroBackup is deleted because if there is an err for VeleroBackup deletion call we should be able to tell about that to the user via NAB status.

If there is any error with VeleroBackup deletion, I think we should not try to delete NAB and only then update NAB status with: Phase: BackingOff Condition: Deletion -> with error what went wrong ?

yes correct that's exactly what I meant, deletion of NAB should be stopped and only status should be updated for NAB object.

Enhancement which will delete original VeleroBackup when
the user sets the deleteVeleroBackup witin NonAdminBackup Spec.

Fixes Issue migtools#58

Signed-off-by: Michal Pryc <[email protected]>
@mpryc mpryc force-pushed the delete-velero-backup-issue_58 branch from 3db9cc3 to c5fb526 Compare May 21, 2024 11:16
@mpryc
Copy link
Collaborator Author

mpryc commented May 21, 2024

Updated the logic to delete NonAdminBackup when there is no error while removing VeleroBackup and when the VeleroBackup was not in the cluster. Updated manual tests in the description of this PR that were performed.

@@ -39,6 +39,10 @@ type NonAdminBackupSpec struct {
// BackupSpec defines the specification for a Velero backup.
BackupSpec *velerov1api.BackupSpec `json:"backupSpec,omitempty"`

// DeleteVeleroBackup tells the controller to remove created Velero Backup.
// +optional
DeleteVeleroBackup *bool `json:"deleteVeleroBackup,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also deletes NAB itself

Would be better to inform this in doc comment and change field name to simply delete?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete isn't enough imo, we need to be more precise here, however I agree that current name isn't ideal as we also remove NAB so we need to find out better one.

How about DeleteBackup ?

Copy link
Member

@shubham-pampattiwar shubham-pampattiwar May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah DeleteBackup sounds good !

// if one exists in the s3 storage, and this will as well cause NonAdminBackup
// to be recreated by the NAC sync controller, but at this stage we have current
// proper state reflected.
logger.V(1).Info(fmt.Sprintf("Velero Backup not deleted for NAB: %s", nab.Name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be interesting to add reason here?

something like "Velero Backup %s, referenced in for %s, was not found in %s namespace and was not deleted"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this will cause NAB to be in phase Backing Off, and all the info about the referenced velero backup is in the status of that NAB, do we want to duplicate this info within debug logs?

logger := logrLogger.WithValues("DeleteVeleroBackup", nab.Namespace)

// Delete Velero Backup if the deleteVeleroBackup was set to true
if nab.Spec.DeleteVeleroBackup != nil && *nab.Spec.DeleteVeleroBackup {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check here NAB status?

for example, I create a NAB with delete: true, it will be immediately deleted, right? should we only delete finished NABs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, if someone want to abuse spec then it's free to remove object that has been just created. Similarly to:

mkdir /tmp/test && rmdir /tmp/test

Nobody stops anyone from doing above

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mateusoliveira43 we should only delete VeleroBackup that are in terminal phase. Currently, I don't think we can delete an in-progress VeleroBackup

}
} else if apierrors.IsNotFound(err) {
return false, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think implementation is flawed here

we ignore case that err is not nil and different than apierrors.IsNotFound, not?

that means that Velero Backup was not even retrieved, but we say it got deleted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a suggestion is to always check if err != nil when dealing with errors to avoid forgetting anything

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree on that, will add that case.

Copy link
Contributor

@mateusoliveira43 mateusoliveira43 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found a little error in code flow checking err

Small fix for the situation where fetching Backup object
gave error other then Non existing backup object.

Also reverte nab_status_update.md design.

Signed-off-by: Michal Pryc <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Facilitate deletion of VeleroBackup object via NAB object
4 participants