
Merge pull request #7604 from Lyndon-Li/resource-consumption-in-doc
Add resource consumption in fs-backup and data mover doc
qiuming-best authored Apr 3, 2024
2 parents d974cd3 + 49cd345 commit c7c59db
Showing 3 changed files with 78 additions and 22 deletions.
55 changes: 50 additions & 5 deletions site/content/docs/main/csi-snapshot-data-movement.md
@@ -241,14 +241,11 @@ kubectl -n velero get datadownloads -l velero.io/restore-name=YOUR_RESTORE_NAME

## Limitations

- CSI and CSI snapshot support both file system volume mode and block volume mode. At present, block mode is only supported on non-Windows platforms, because the block mode code invokes some system calls that are not available on Windows.
- [Velero built-in data mover] At present, Velero uses a static, common encryption key for all backup repositories it creates. **This means
that anyone who has access to your backup storage can decrypt your backup data**. Make sure that you limit access
to the backup storage appropriately.
- [Velero built-in data mover] Even though backup data is preserved incrementally, for a single file the built-in data mover relies on deduplication to find the difference to be saved. This means that large files (such as ones storing a database) will take a long time to scan for data deduplication, even if the actual difference is small.

## Troubleshooting

@@ -387,6 +384,53 @@ However, Velero cancels the `DataUpload`/`DataDownload` in below scenarios autom

Customized data movers that support cancellation can cancel their ongoing tasks and clean up any intermediate resources. If you are using the Velero built-in data mover, cancellation is supported.
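
For reference, a minimal sketch of cancelling an in-progress `DataUpload` by hand — this assumes the `spec.cancel` field exposed by recent Velero CRDs, so verify the field exists in your cluster before relying on it:

```bash
# Request cancellation of a running DataUpload (the spec.cancel field is an
# assumption based on recent Velero CRD versions; check your cluster's CRD).
kubectl -n velero patch dataupload <DATAUPLOAD_NAME> \
    --type merge \
    --patch '{"spec":{"cancel":true}}'
```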

### Support ReadOnlyRootFilesystem setting
When the Velero server/node-agent pod's SecurityContext sets the `ReadOnlyRootFileSystem` parameter to true, the pod's filesystem runs in read-only mode. Backups/restores may then fail, because the uploader/repository needs to write cache and configuration data into the pod's root filesystem.

```
Errors: Velero: name: /mongodb-0 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-1 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-2 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system Cluster: <none>
```

The workaround is to mount those directories as ephemeral Kubernetes volumes, so that they are no longer part of the pod's root filesystem.
In the example below, `<user-name>` is the user name the Velero pod runs as; the default value is `cnb`.

``` yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
  namespace: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        ......
        volumeMounts:
        ......
        - mountPath: /home/<user-name>/udmrepo
          name: udmrepo
        - mountPath: /home/<user-name>/.cache
          name: cache
        ......
      volumes:
      ......
      - emptyDir: {}
        name: udmrepo
      - emptyDir: {}
        name: cache
      ......
```
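
The node-agent runs as a DaemonSet rather than a Deployment, and the same workaround applies to it. A minimal sketch using a strategic-merge patch, assuming the default `node-agent` DaemonSet and container names created by `velero install` and the default `cnb` user:

```bash
# Add emptyDir volumes for the repository config and cache directories to the
# node-agent DaemonSet (names assume the default velero install).
kubectl -n velero patch daemonset node-agent --type strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: node-agent
        volumeMounts:
        - mountPath: /home/cnb/udmrepo
          name: udmrepo
        - mountPath: /home/cnb/.cache
          name: cache
      volumes:
      - name: udmrepo
        emptyDir: {}
      - name: cache
        emptyDir: {}
'
```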

### Resource Consumption

Both the uploader and the repository consume significant CPU/memory during backup/restore, especially for cases with massive numbers of small files or large backup sizes.
Velero uses [BestEffort as the QoS][13] for node-agent pods (so no CPU/memory request/limit is set by default), so that backups/restores won't fail due to resource throttling.
If you want to constrain the CPU/memory usage, you need to [customize the resource limits][11]. CPU/memory consumption is always related to the scale of data being backed up/restored; refer to [Performance Guidance][12] for more details. It is highly recommended that you perform your own testing to find the best resource limits for your data.
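
As a sketch, the limits can be set at install time; the node-agent flag names below are taken from recent `velero install` releases, so verify them with `velero install --help` for your version:

```bash
# Constrain node-agent CPU/memory at install time (flag names assumed from
# recent Velero releases; verify with `velero install --help`).
velero install \
    --use-node-agent \
    --node-agent-pod-cpu-request 500m \
    --node-agent-pod-mem-request 512Mi \
    --node-agent-pod-cpu-limit 2 \
    --node-agent-pod-mem-limit 4Gi
```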

During a restore, the repository may also cache data/metadata to reduce the network footprint and speed up the restore. The repository uses its own policy to store and clean up the cache.
For the Kopia repository, the cache is stored in the node-agent pod's root filesystem, and cleanup is triggered for data/metadata older than 10 minutes (not configurable at present). Make sure enough disk space is available; otherwise, the node-agent pod may be evicted when the node runs out of ephemeral storage.


[1]: https://github.com/vmware-tanzu/velero/pull/5968
[2]: csi.md
@@ -400,3 +444,4 @@
[10]: restore-reference.md#changing-pv/pvc-Storage-Classes
[11]: customize-installation.md#customize-resource-requests-and-limits
[12]: performance-guidance.md
[13]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
22 changes: 12 additions & 10 deletions site/content/docs/main/customize-installation.md
@@ -25,7 +25,7 @@ If you've already run `velero install` without the `--use-node-agent` flag, you

## CSI Snapshot Data Movement

Velero node-agent is required by [CSI Snapshot Data Movement][12] when the Velero built-in data mover is used. By default, `velero install` does not install Velero's node-agent. To enable it, specify the `--use-node-agent` flag.

For some use cases, the Velero node-agent needs to run in privileged mode. For example, when backing up block volumes, the node-agent must be allowed to access the block device. To enable this, set the `velero install` flag `--privileged-node-agent`.
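
For example (the provider, bucket, and credential flags are placeholders here and must match your actual object storage configuration):

```bash
# Enable the node-agent DaemonSet and run it in privileged mode.
velero install \
    --provider <PROVIDER> \
    --bucket <BUCKET> \
    --use-node-agent \
    --privileged-node-agent
```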

@@ -95,20 +95,20 @@ the config file setting.

## Customize resource requests and limits

At installation, you can set resource requests and limits for the Velero pod and the node-agent pod, if you are using [File System Backup][3] or [CSI Snapshot Data Movement][12].

{{< table caption="Velero Customize resource requests and limits defaults" >}}
|Setting|Velero pod defaults|node-agent pod defaults|
|--- |--- |--- |
|CPU request|500m|N/A|
|Memory requests|128Mi|N/A|
|CPU limit|1000m (1 CPU)|N/A|
|Memory limit|512Mi|N/A|
{{< /table >}}

For the Velero pod, through testing, the Velero maintainers have found these defaults work well when backing up and restoring 1000 or fewer resources.
For the node-agent pods, there is no CPU/memory request/limit by default, so that backups/restores won't break due to resource throttling. The Velero maintainers have also run [Performance Tests][13] to show the relationship between CPU/memory usage and the scale of data being backed up/restored.
You don't have to change the defaults, but if you need to, it's recommended that you perform your own testing to find the best resource limits for your clusters and resources.
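
As a quick check, assuming the default `name=node-agent` pod label applied by `velero install`, you can confirm that the node-agent pods get the BestEffort QoS class:

```bash
# Print the QoS class of the node-agent pods; expect "BestEffort" when no
# requests/limits are set (label assumed from the default velero install).
kubectl -n velero get pods -l name=node-agent \
    -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
```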

### Install with custom resource requests and limits

@@ -421,3 +421,5 @@ If you get an error like `complete:13: command not found: compdef`, then add the
[9]: self-signed-certificates.md
[10]: csi.md
[11]: https://github.com/vmware-tanzu/velero/blob/main/pkg/apis/velero/v1/constants.go
[12]: csi-snapshot-data-movement.md
[13]: performance-guidance.md
23 changes: 16 additions & 7 deletions site/content/docs/main/file-system-backup.md
@@ -346,9 +346,6 @@ to be defined by its pod.
- Even though the backup data could be incrementally preserved, for a single file data, FSB leverages on deduplication
to find the difference to be saved. This means that large files (such as ones storing a database) will take a long time
to scan for data deduplication, even if the actual difference is small.
- You may need to [customize the resource limits](customize-installation/#customize-resource-requests-and-limits)
to make sure backups complete successfully for massive small files or large backup size cases, for more details refer to
[Velero File System Backup Performance Guide](/docs/main/performance-guidance).
- Velero's File System Backup reads/writes data from volumes by accessing the node's filesystem, on which the pod is running.
For this reason, FSB can only backup volumes that are mounted by a pod and not directly from the PVC. For orphan PVC/PV pairs
(without running pods), some Velero users overcame this limitation running a staging pod (i.e. a busybox or alpine container
@@ -586,10 +583,10 @@ Velero does not provide a mechanism to detect persistent volume claims that are

To solve this, a controller was written by Thomann Bits&Beats: [velero-pvc-watcher][7]

## Support ReadOnlyRootFilesystem setting
### Kopia
When the Velero server/node-agent pod's SecurityContext sets the `ReadOnlyRootFileSystem` parameter to true, the pod's filesystem runs in read-only mode.
If the user creates a backup with Kopia as the uploader, the backup will fail, because Kopia needs to write some cache and configuration data into the pod's filesystem.

```
Errors: Velero: name: /mongodb-0 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-1 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system name: /mongodb-2 message: /Error backing up item error: /failed to wait BackupRepository: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system Cluster: <none>
@@ -624,7 +621,16 @@ spec:
      - emptyDir: {}
        name: cache
      ......
```

## Resource Consumption

Both the uploader and the repository consume significant CPU/memory during backup/restore, especially for cases with massive numbers of small files or large backup sizes.
Velero uses [BestEffort as the QoS][14] for node-agent pods (so no CPU/memory request/limit is set by default), so that backups/restores won't fail due to resource throttling.
If you want to constrain the CPU/memory usage, you need to [customize the resource limits][15]. CPU/memory consumption is always related to the scale of data being backed up/restored; refer to [Performance Guidance][16] for more details. It is highly recommended that you perform your own testing to find the best resource limits for your data.

During a restore, the repository may also cache data/metadata to reduce the network footprint and speed up the restore. The repository uses its own policy to store and clean up the cache.
For the Kopia repository, the cache is stored in the node-agent pod's root filesystem, and cleanup is triggered for data/metadata older than 10 minutes (not configurable at present). Make sure enough disk space is available; otherwise, the node-agent pod may be evicted when the node runs out of ephemeral storage.
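
One way to reduce the eviction risk is to reserve ephemeral storage for the node-agent pods so they are scheduled onto nodes with enough free local disk. A sketch assuming the default `node-agent` DaemonSet and container names; size the request to your expected cache usage:

```bash
# Reserve local disk for the repository cache on each node-agent pod
# (names assume the default velero install; adjust the size to your data).
kubectl -n velero patch daemonset node-agent --type strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: node-agent
        resources:
          requests:
            ephemeral-storage: 10Gi
'
```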


[1]: https://github.com/restic/restic
@@ -640,3 +646,6 @@
[11]: https://www.vcluster.com/
[12]: csi.md
[13]: csi-snapshot-data-movement.md
[14]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
[15]: customize-installation.md#customize-resource-requests-and-limits
[16]: performance-guidance.md
