
restic-wait init container not inserted during restic restore #6471

Closed
ameerdtm opened this issue Jul 6, 2023 · 10 comments
Assignees
Labels
Needs investigation · PodVolume · Restic (Relates to the restic integration)

Comments

@ameerdtm

ameerdtm commented Jul 6, 2023

Describe the problem/challenge you have
We run a nightly backup and restore of a namespace. This was working without issue on Kubernetes 1.23.9 and Velero version v1.9.1. When we upgraded to Kubernetes 1.24.3, the jobs started failing because the data in the PVC was not restored via restic.

What did you expect to happen:
Data from the restic restore to be available to the container

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help
bundle-2023-07-06-15-38-35.tar.gz
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:
In the backup logs, the PVC is identified and successfully backed up via restic. When the restore runs, none of the restic information is listed in the restore describe output, there is no restic-wait init container in the restored deployment, and an empty volume is attached to the container. This happens consistently for this namespace, but other namespaces restored from the same backup do get their PVCs restored and the restic-wait init container does exist.
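
(For readers hitting the same symptom, here is a minimal sketch of how to confirm it; the names are placeholders, and the assumption, not stated in this report, is that Velero injects the restic-wait init container into the restored pods and labels PodVolumeRestore objects with velero.io/restore-name.)

# List init containers on a restored pod; restic-wait should appear if the injection happened
$ kubectl -n <namespace> get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'
# Check whether any PodVolumeRestore objects were created for the restore (label name assumed)
$ kubectl -n velero get podvolumerestores -l velero.io/restore-name=<restorename>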

Environment:

  • Velero version (use velero version): v1.9.1
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version): v1.24.3
  • Kubernetes installer & version: Custom installer
  • Cloud provider or hardware configuration: Onpremise VM
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.4 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@allenxu404
Contributor

allenxu404 commented Jul 7, 2023

This happens consistently for this namespace, but other namespaces restored from the same backup do get their PVCs restored and the restic-wait init container does exist.

@ameerdtm I'm a little confused by this statement. Can you provide further clarification?

@allenxu404
Contributor

Taking a quick look at the debug information you provided, I notice that the restores were created from a PartiallyFailed backup, which hit some issues when backing up items. Is that intended?

@ameerdtm
Author

ameerdtm commented Jul 7, 2023

Yes, a restore consistently fails for a specific namespace but seems to work for the rest. The PartiallyFailed backups are not desired and have also increased greatly since the 1.24.3 upgrade. Looking further, the nightly scheduled (schedules.velero.io) backup does not appear to pick up this PVC, and it should. Below are the errors from the PartiallyFailed backup:
$ velero backup logs px-snd2101-20230706080528 | grep -i error
level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=palette-controller-manager-7c758f9489-gjmp2
time="2023-07-06T12:06:25Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:192" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=palette-controller-manager-7c758f9489-gjmp2
time="2023-07-06T12:06:33Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=system-upgrade-controller-d964bb794-wmhmn
time="2023-07-06T12:06:33Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:192" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=system-upgrade-controller-d964bb794-wmhmn
time="2023-07-06T12:06:34Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=dex-dex-7cf45444fc-7f48x
time="2023-07-06T12:06:34Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="timed out waiting for restic repository to become ready" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_ensurer.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryEnsurer).EnsureRepo" logSource="pkg/backup/backup.go:417" name=dex-dex-7cf45444fc-7f48x
time="2023-07-06T12:06:34Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=dex-dex-7cf45444fc-mkzdl
time="2023-07-06T12:06:34Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="restic repository is not ready: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_ensurer.go:144" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryEnsurer).EnsureRepo" logSource="pkg/backup/backup.go:417" name=dex-dex-7cf45444fc-mkzdl
time="2023-07-06T12:06:34Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=dex-dex-7cf45444fc-xzddx
time="2023-07-06T12:06:34Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="restic repository is not ready: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_ensurer.go:144" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryEnsurer).EnsureRepo" logSource="pkg/backup/backup.go:417" name=dex-dex-7cf45444fc-xzddx
time="2023-07-06T12:06:34Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=helm-controller-7844d55644-4vptq
time="2023-07-06T12:06:34Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="timed out waiting for restic repository to become ready" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_ensurer.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryEnsurer).EnsureRepo" logSource="pkg/backup/backup.go:417" name=helm-controller-7844d55644-4vptq
time="2023-07-06T12:06:34Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=image-automation-controller-58bb6c5bf-nd8vm
time="2023-07-06T12:06:34Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="restic repository is not ready: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_ensurer.go:144" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryEnsurer).EnsureRepo" logSource="pkg/backup/backup.go:417" name=image-automation-controller-58bb6c5bf-nd8vm
time="2023-07-06T12:06:34Z" level=info msg="1 errors encountered backup up item" backup=velero/px-snd2101-20230706080528 logSource="pkg/backup/backup.go:413" name=image-reflector-controller-6f887bcdd9-qnndj
time="2023-07-06T12:06:34Z" level=error msg="Error backing up item" backup=velero/px-snd2101-20230706080528 error="restic repository is not ready: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_ensurer.go:144" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryEnsurer).EnsureRepo" logSource="pkg/backup/backup.go:417" name=image-reflector-controller-6f887bcdd9-qnndj

@ameerdtm
Author

I see a few "timed out waiting for all PodVolumeBackups to complete" issues reported in the past, and that seems to be a common thread here. I have turned on debug logging for my restic daemonset, but I get no errors regarding the timeout. Is there any way to get more detail on what is happening here?

@sseago
Collaborator

sseago commented Jul 10, 2023

@ameerdtm There won't be any errors on the daemonset related to the timeout. The timeout just means that during velero backup processing, the timeout was reached without a completed PVB. This means one of two things:

  1. The PVB stalled out and never completes (hopefully the restic logs will help; look at the restic pod on the same node as the pod that mounts the volume)
  2. The PVB wasn't done processing when the timeout was reached -- if this is the case, you'll see the PVB status eventually complete successfully.

Either way, PVB status is an important thing to check here. If the only problem is that restic is taking too long, then increasing the timeout may allow your backups to complete successfully.
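
As a rough sketch of that check (not from this thread; the velero.io/backup-name label and the --restic-timeout server flag are assumptions based on Velero v1.9 behavior, where the default timeout is 4 hours):

# Inspect PodVolumeBackup status for the backup; look at .status.phase and .status.progress
$ kubectl -n velero get podvolumebackups -l velero.io/backup-name=<backupname>
$ kubectl -n velero get podvolumebackup <pvb-name> -o yaml
# If restic simply needs more time, raise the timeout on the Velero server deployment (flag name assumed)
$ kubectl -n velero patch deployment velero --type json \
    -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--restic-timeout=8h"}]'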

@ameerdtm
Author

The amount of data for most of these PVs is small, a few GB at most. The backups seem to take four hours, which to me looks like it's stuck on a restic backup that isn't progressing and then eventually times out. That's the odd thing: even with both the velero and restic logs turned up to debug, I can't get any logging that tells me whether it can't access the bucket or something else. Any ideas on how I can see more in this situation?

@Lyndon-Li
Contributor

@ameerdtm
The time to finish a Restic backup depends not only on the data size but also on the number of files. Moreover, there are known issues where Restic backup in 1.9 has low performance generally, or even worse in some special cases.

Therefore, here are some suggestions:

  1. Upgrade to the latest version of Velero, which ships the latest version of Restic, so quite a few performance issues have been fixed.
  2. If it still doesn't work after the upgrade, you can switch to the Kopia path. There are two paths for file system backup (a.k.a. pod volume backup, formerly Restic backup), and the Kopia path is believed to have better performance (see the sketch after this list).
  3. If you have to stay with 1.9, you need to increase the timeout (the default value is 4 hours) and you also need to increase the memory limit for the Velero Restic pods.
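
A hedged sketch of what suggestions 1 and 2 could look like on the CLI; the flags --use-node-agent, --uploader-type, and --default-volumes-to-fs-backup are assumptions based on Velero v1.10+ install options and are not taken from this thread:

# Upgrade Velero and select the Kopia uploader for file system backup (flag names assumed)
$ velero install \
    --provider <provider> \
    --plugins <plugin-image> \
    --bucket <bucket> \
    --secret-file <credentials-file> \
    --use-node-agent \
    --uploader-type kopia \
    --default-volumes-to-fs-backup

As far as I understand, the uploader type only applies to new backups; backups already taken with restic are still restored through the restic path.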

@ameerdtm
Author

I upgraded to v1.11.0 and things seem to have improved. I will need a few days to confirm that the new version resolved all the issues. Thanks for the suggestions.

@ameerdtm
Author

This issue is resolved. For anyone upgrading Kubernetes from 1.23 to 1.24: Velero v1.9.1 went from working fine to timing out waiting for restic PodVolumeBackups. Even with debugging on, there was no indication as to why it wasn't progressing. Upgrading to v1.11.0 works much better, although I am still running into 8-10 hour backups for some larger clusters.

@ameerdtm
Author

closing
