
Merge pull request #7604 from Lyndon-Li/resource-consumption-in-doc
Add resource consumption in fs-backup and data mover doc
qiuming-best authored Apr 3, 2024
2 parents d974cd3 + 49cd345 commit c7c59db
Showing 3 changed files with 78 additions and 22 deletions.
55 changes: 50 additions & 5 deletions site/content/docs/main/csi-snapshot-data-movement.md
@@ -241,14 +241,11 @@ kubectl -n velero get datadownloads -l velero.io/restore-name=YOUR_RESTORE_NAME

## Limitations

- CSI and CSI snapshot support both file system volume mode and block volume mode. At present, block mode is only supported on non-Windows platforms, because the block mode code invokes some system calls that are not available on Windows.
- [Velero built-in data mover] At present, Velero uses a static, common encryption key for all backup repositories it creates. **This means
that anyone who has access to your backup storage can decrypt your backup data**. Make sure that you limit access
to the backup storage appropriately.
- [Velero built-in data mover] Even though backup data is preserved incrementally, for a single file the built-in data mover relies on deduplication to find the difference to be saved. This means that large files (such as ones storing a database) will take a long time to scan for data deduplication, even if the actual difference is small.

## Troubleshooting

@@ -387,6 +384,53 @@ However, Velero cancels the `DataUpload`/`DataDownload` in below scenarios autom

Customized data movers that support cancellation can cancel their ongoing tasks and clean up any intermediate resources. If you are using the Velero built-in data mover, cancellation is supported.
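
For reference, a minimal sketch of cancelling an in-progress `DataUpload` by hand — this assumes the `spec.cancel` field exposed by recent Velero CRDs, so verify the field exists in your cluster before relying on it:

```bash
# Request cancellation of a running DataUpload (the spec.cancel field is an
# assumption based on recent Velero CRD versions; check your cluster's CRD).
kubectl -n velero patch dataupload <DATAUPLOAD_NAME> \
    --type merge \
    --patch '{"spec":{"cancel":true}}'
```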

### Support ReadOnlyRootFilesystem setting
When the Velero server/node-agent pod's SecurityContext sets the `ReadOnlyRootFileSystem` parameter to true, the pod's filesystem runs in read-only mode. Backups/restores may then fail, because the uploader/repository needs to write cache and configuration data into the pod's root filesystem.

```
Errors: Velero: name: /mongodb-0 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-1 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-2 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system Cluster: <none>
```

The workaround is to mount those directories as ephemeral Kubernetes volumes, so that they are no longer part of the pod's root filesystem.
In the example below, `<user-name>` is the user name the Velero pod runs as; the default value is `cnb`.

``` yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
  namespace: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        ......
        volumeMounts:
        ......
        - mountPath: /home/<user-name>/udmrepo
          name: udmrepo
        - mountPath: /home/<user-name>/.cache
          name: cache
        ......
      volumes:
      ......
      - emptyDir: {}
        name: udmrepo
      - emptyDir: {}
        name: cache
      ......
```
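
The node-agent runs as a DaemonSet rather than a Deployment, and the same workaround applies to it. A minimal sketch using a strategic-merge patch, assuming the default `node-agent` DaemonSet and container names created by `velero install` and the default `cnb` user:

```bash
# Add emptyDir volumes for the repository config and cache directories to the
# node-agent DaemonSet (names assume the default velero install).
kubectl -n velero patch daemonset node-agent --type strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: node-agent
        volumeMounts:
        - mountPath: /home/cnb/udmrepo
          name: udmrepo
        - mountPath: /home/cnb/.cache
          name: cache
      volumes:
      - name: udmrepo
        emptyDir: {}
      - name: cache
        emptyDir: {}
'
```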

### Resource Consumption

Both the uploader and the repository consume significant CPU/memory during backup/restore, especially for cases with massive numbers of small files or large backup sizes.
Velero uses [BestEffort as the QoS][13] for node-agent pods (so no CPU/memory request/limit is set by default), so that backups/restores won't fail due to resource throttling.
If you want to constrain the CPU/memory usage, you need to [customize the resource limits][11]. CPU/memory consumption is always related to the scale of data being backed up/restored; refer to [Performance Guidance][12] for more details. It is highly recommended that you perform your own testing to find the best resource limits for your data.
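
As a sketch, the limits can be set at install time; the node-agent flag names below are taken from recent `velero install` releases, so verify them with `velero install --help` for your version:

```bash
# Constrain node-agent CPU/memory at install time (flag names assumed from
# recent Velero releases; verify with `velero install --help`).
velero install \
    --use-node-agent \
    --node-agent-pod-cpu-request 500m \
    --node-agent-pod-mem-request 512Mi \
    --node-agent-pod-cpu-limit 2 \
    --node-agent-pod-mem-limit 4Gi
```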

During a restore, the repository may also cache data/metadata to reduce the network footprint and speed up the restore. The repository uses its own policy to store and clean up the cache.
For the Kopia repository, the cache is stored in the node-agent pod's root filesystem, and cleanup is triggered for data/metadata older than 10 minutes (not configurable at present). Make sure enough disk space is available; otherwise, the node-agent pod may be evicted when the node runs out of ephemeral storage.


[1]: https://github.com/vmware-tanzu/velero/pull/5968
[2]: csi.md
@@ -400,3 +444,4 @@
[10]: restore-reference.md#changing-pv/pvc-Storage-Classes
[11]: customize-installation.md#customize-resource-requests-and-limits
[12]: performance-guidance.md
[13]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
22 changes: 12 additions & 10 deletions site/content/docs/main/customize-installation.md
@@ -25,7 +25,7 @@ If you've already run `velero install` without the `--use-node-agent` flag, you

## CSI Snapshot Data Movement

Velero node-agent is required by [CSI Snapshot Data Movement][12] when the Velero built-in data mover is used. By default, `velero install` does not install Velero's node-agent. To enable it, specify the `--use-node-agent` flag.

For some use cases, the Velero node-agent needs to run in privileged mode. For example, when backing up block volumes, the node-agent must be allowed to access the block device. To enable this, set the `velero install` flag `--privileged-node-agent`.
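
For example (the provider, bucket, and credential flags are placeholders here and must match your actual object storage configuration):

```bash
# Enable the node-agent DaemonSet and run it in privileged mode.
velero install \
    --provider <PROVIDER> \
    --bucket <BUCKET> \
    --use-node-agent \
    --privileged-node-agent
```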

@@ -95,20 +95,20 @@ the config file setting.

## Customize resource requests and limits

At installation, you can set resource requests and limits for the Velero pod and the node-agent pod, if you are using [File System Backup][3] or [CSI Snapshot Data Movement][12].

{{< table caption="Velero Customize resource requests and limits defaults" >}}
|Setting|Velero pod defaults|node-agent pod defaults|
|--- |--- |--- |
|CPU request|500m|N/A|
|Memory requests|128Mi|N/A|
|CPU limit|1000m (1 CPU)|N/A|
|Memory limit|512Mi|N/A|
{{< /table >}}

For the Velero pod, through testing, the Velero maintainers have found these defaults work well when backing up and restoring 1000 or fewer resources.
For the node-agent pods, there is no CPU/memory request/limit by default, so that backups/restores won't break due to resource throttling. The Velero maintainers have also run [Performance Tests][13] to show the relationship between CPU/memory usage and the scale of data being backed up/restored.
You don't have to change the defaults, but if you need to, it's recommended that you perform your own testing to find the best resource limits for your clusters and resources.
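
As a quick check, assuming the default `name=node-agent` pod label applied by `velero install`, you can confirm that the node-agent pods get the BestEffort QoS class:

```bash
# Print the QoS class of the node-agent pods; expect "BestEffort" when no
# requests/limits are set (label assumed from the default velero install).
kubectl -n velero get pods -l name=node-agent \
    -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
```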

### Install with custom resource requests and limits

@@ -421,3 +421,5 @@ If you get an error like `complete:13: command not found: compdef`, then add the
[9]: self-signed-certificates.md
[10]: csi.md
[11]: https://github.com/vmware-tanzu/velero/blob/main/pkg/apis/velero/v1/constants.go
[12]: csi-snapshot-data-movement.md
[13]: performance-guidance.md
23 changes: 16 additions & 7 deletions site/content/docs/main/file-system-backup.md
@@ -346,9 +346,6 @@ to be defined by its pod.
- Even though the backup data could be incrementally preserved, for a single file data, FSB leverages on deduplication
to find the difference to be saved. This means that large files (such as ones storing a database) will take a long time
to scan for data deduplication, even if the actual difference is small.
- You may need to [customize the resource limits](customize-installation/#customize-resource-requests-and-limits)
to make sure backups complete successfully for massive small files or large backup size cases, for more details refer to
[Velero File System Backup Performance Guide](/docs/main/performance-guidance).
- Velero's File System Backup reads/writes data from volumes by accessing the node's filesystem, on which the pod is running.
For this reason, FSB can only backup volumes that are mounted by a pod and not directly from the PVC. For orphan PVC/PV pairs
(without running pods), some Velero users overcame this limitation running a staging pod (i.e. a busybox or alpine container
@@ -586,10 +583,10 @@ Velero does not provide a mechanism to detect persistent volume claims that are

To solve this, a controller was written by Thomann Bits&Beats: [velero-pvc-watcher][7]

## Support ReadOnlyRootFilesystem setting
### Kopia
When the Velero server/node-agent pod's SecurityContext sets the `ReadOnlyRootFileSystem` parameter to true, the pod's filesystem runs in read-only mode.
If the user creates a backup with Kopia as the uploader, the backup will fail, because Kopia needs to write some cache and configuration data into the pod's filesystem.

```
Errors: Velero: name: /mongodb-0 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-1 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-2 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system Cluster: <none>
@@ -624,7 +621,16 @@ spec:
      - emptyDir: {}
        name: cache
      ......
```

## Resource Consumption

Both the uploader and the repository consume significant CPU/memory during backup/restore, especially for cases with massive numbers of small files or large backup sizes.
Velero uses [BestEffort as the QoS][14] for node-agent pods (so no CPU/memory request/limit is set by default), so that backups/restores won't fail due to resource throttling.
If you want to constrain the CPU/memory usage, you need to [customize the resource limits][15]. CPU/memory consumption is always related to the scale of data being backed up/restored; refer to [Performance Guidance][16] for more details. It is highly recommended that you perform your own testing to find the best resource limits for your data.

During a restore, the repository may also cache data/metadata to reduce the network footprint and speed up the restore. The repository uses its own policy to store and clean up the cache.
For the Kopia repository, the cache is stored in the node-agent pod's root filesystem, and cleanup is triggered for data/metadata older than 10 minutes (not configurable at present). Make sure enough disk space is available; otherwise, the node-agent pod may be evicted when the node runs out of ephemeral storage.
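
One way to reduce the eviction risk is to reserve ephemeral storage for the node-agent pods so they are scheduled onto nodes with enough free local disk. A sketch assuming the default `node-agent` DaemonSet and container names; size the request to your expected cache usage:

```bash
# Reserve local disk for the repository cache on each node-agent pod
# (names assume the default velero install; adjust the size to your data).
kubectl -n velero patch daemonset node-agent --type strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: node-agent
        resources:
          requests:
            ephemeral-storage: 10Gi
'
```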


[1]: https://github.com/restic/restic
@@ -640,3 +646,6 @@
[11]: https://www.vcluster.com/
[12]: csi.md
[13]: csi-snapshot-data-movement.md
[14]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
[15]: customize-installation.md#customize-resource-requests-and-limits
[16]: performance-guidance.md
