v1.14.0+: Memory leaks on backups jobs (CSI backups + cloud plugin) #7925

Closed
jkroepke opened this issue Jun 25, 2024 · 58 comments · Fixed by #8012

@jkroepke

jkroepke commented Jun 25, 2024

What steps did you take and what happened:

We set up scheduled backups via the CSI driver. No restic, kopia, or other file-based copy actions are in use.

Memory leak.

(screenshot: Grafana memory usage graph for the Velero pod)

In addition to the failing daily schedule, we also have an hourly schedule. In Grafana, I can see a memory increase after each backup.

What did you expect to happen:

No memory leak

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help

velero debug does not work from a pod with in-cluster authentication:

velero debug  --backup opsstack-velero-daily-opsstack-20240625003049 --namespace opsstack
2024/06/25 15:38:09 Collecting velero resources in namespace: opsstack
An error occurred: exec failed: Traceback (most recent call last):
  velero-debug-collector:23:13: in <toplevel>
  <builtin>: in kube_get
Error: could not initialize search client: stat /tmp/.kube/config: no such file or directory

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

velero.zip

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features): <NOT SET>
  • Kubernetes version (use kubectl version): v1.29.4
  • Kubernetes installer & version: AKS
  • Cloud provider or hardware configuration: AKS
  • OS (e.g. from /etc/os-release): Azure Linux

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet blackpiglet added the Area/CSI and Performance labels Jun 26, 2024
@blackpiglet
Contributor

Could you give more information about the two metrics, velero REQUESTS and velero LIMITS?
Are those metrics for the Velero server pods?

They seem to have the default values.

	// default values for Velero server pod resource request/limit
	DefaultVeleroPodCPURequest = "500m"
	DefaultVeleroPodMemRequest = "128Mi"
	DefaultVeleroPodCPULimit   = "1000m"
	DefaultVeleroPodMemLimit   = "512Mi"

@blackpiglet blackpiglet self-assigned this Jun 26, 2024
@jkroepke
Author

Yes, that's the Velero pod:

Mem Request: 512.00 MiB
Mem Limit: 1.00 GiB

CPU Req: 0.1000
CPU Limit: 1.0000

Memory Usage:

(screenshot)

After each peak, Velero was OOM-killed.

Historical memory usage (we perform updates each week):

(screenshot)

Last 3 hours (note that we have one hourly schedule):

(screenshot)

@blackpiglet
Contributor

blackpiglet commented Jun 26, 2024

Could you share more details on the backup workload?
How many volumes are involved?

@jkroepke
Author

jkroepke commented Jun 26, 2024

10 volumes via the CSI driver.

A working backup typically saves 1513 Kubernetes resources.

We experience this issue in approximately 10 Kubernetes clusters, all configured identically (10 volumes, ~1500 resources).

@jkroepke
Author

I think I found the root cause.

For each backup, the Azure plugin is spawned but never terminated.

root     3353897  0.0  0.0 1236380 11452 ?       Sl   Jun19   3:56 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 98762dde1ae7a3f1a11e0894ba14c8b4c1f596017280b388b97a104816907683 -address /run/containerd/containerd.sock
65527    3353931  0.0  0.0    972     4 ?        Ss   Jun19   0:00  \_ /pause
65527    4167441  0.3  0.6 5561796 107432 ?      Ssl  Jun25   6:59  \_ /velero server --uploader-type=kopia --default-backup-storage-location=opsstack --default-volume-snapshot-locations=velero.io/azure:opsstack --log-format=json --features=EnableCSI --namespace=opsstack
65527       9334  0.0  0.2 1395916 38048 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527      78788  0.0  0.2 1395916 37500 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     148224  0.0  0.2 1395916 36780 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     217993  0.0  0.2 1395916 38040 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     287368  0.0  0.2 1395916 36944 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     356756  0.0  0.2 1395916 36800 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     426057  0.0  0.2 1395916 37744 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     495363  0.0  0.2 1395916 37848 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     564793  0.0  0.2 1395916 36780 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     633070  0.0  0.2 1395916 37848 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     702396  0.0  0.2 1395916 38216 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     771675  0.0  0.2 1395916 37144 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     840974  0.0  0.2 1395916 36844 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     910286  0.0  0.2 1395916 36400 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     979589  0.0  0.2 1395916 37984 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1048695  0.0  0.2 1395916 37004 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1118222  0.0  0.2 1395916 36428 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1187685  0.0  0.2 1395916 37888 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1257170  0.0  0.2 1395916 37488 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1326865  0.0  0.2 1395916 36652 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1397481  0.0  0.2 1395916 38296 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1465597  0.0  0.2 1395916 36980 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1535111  0.0  0.2 1395916 37020 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1604535  0.0  0.2 1395916 37452 ?       Sl   00:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1638944  0.0  0.2 1395916 37176 ?       Sl   00:31   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1674221  0.0  0.2 1395916 40172 ?       Sl   01:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1743626  0.0  0.2 1395916 35636 ?       Sl   02:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1812856  0.0  0.2 1395916 37392 ?       Sl   03:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1882257  0.0  0.2 1396172 37320 ?       Sl   04:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1951690  0.0  0.2 1395916 36572 ?       Sl   05:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    2021143  0.0  0.2 1396172 38152 ?       Sl   06:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    2090248  0.0  0.2 1395916 37876 ?       Sl   07:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    2159473  0.0  0.2 1395916 37324 ?       Sl   08:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI

@blackpiglet
Contributor

I see.
The Azure plugin allocates 100M on startup, so if enough Azure plugin instances pile up, it could cause the OOM.
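
As a rough sanity check against the numbers already posted in this thread: the RSS column of the ps output above shows roughly 35-40 MiB per leaked plugin process, so 25-30 accumulated instances alone account for ~1 GiB on top of the ~100 MiB velero server process. That is already enough to push the pod past the 1 GiB limit reported earlier once the next backup's working set is added.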

@jkroepke
Author

In our case, we will remove the Azure plugin for now. It was a leftover from the migration to CSI-based backups, and all old Azure-plugin-based backups are out of the rotation.

@blackpiglet
Contributor

velero-plugin-for-microsoft-azure is needed for communicating with Azure Object Storage to store the backup and restore metadata.
So it probably cannot be removed in an AKS environment.

@jkroepke
Author

Azure Object Storage to store the backup and restore metadata.

I thought that part was bundled into Velero?

https://github.com/vmware-tanzu/velero/blob/main/pkg/util/azure/storage.go

@blackpiglet
Contributor

That part of the code is used by filesystem backup to interact with cloud providers.

@jkroepke
Author

Okay, thanks for the confirmation. I guess we may be stuck then.

@sseago
Collaborator

sseago commented Jun 26, 2024

@blackpiglet Hmm. In my experience, the plugin process typically gets terminated after each backup -- or more specifically, after the controller reconcile exits, since the GRPC connection is terminated. Maybe we should investigate whether 1) something has changed in 1.14 or 2) some issue with one or more plugins is preventing their exit?

@jkroepke
Author

We only have the Azure plugin active. We previously also used the CSI plugin, but it is now merged into Velero in 1.14.

@blackpiglet
Contributor

@sseago
Agreed. I will investigate whether something has changed in the plugin workflow, but we have met a similar situation in the E2E tests.
https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/pull/210/files
This PR was created to mitigate the OOM issue seen in the AKS environment, but it's still possible to hit this issue when the object storage consumes a significant amount of memory.

So it seems the plugins do not always exit right after a backup completes, and that behavior was not introduced recently.

@jkroepke
Author

Our current workaround is a cronjob that restarts Velero every 2 hours.

@jkroepke
Author

So it seems the plugins do not always exit right after a backup completes, and that behavior was not introduced recently

We checked a cluster running Velero 1.13 and observed that no plugin instances are left over.

@sseago
Collaborator

sseago commented Jun 27, 2024

Our current workaround is a cronjob that restarts Velero every 2 hours.

@jkroepke The risk here is that if you restart velero while a backup or restore is running, that backup/restore will be marked failed since an in-progress action can't be resumed after restart/crash.

@jkroepke
Author

Yep, but without that cronjob, it's guaranteed that Velero gets OOM-killed on backups.

Does Velero not support graceful shutdown? If a backup/restore is running, I would expect SIGTERM handling to be delayed until the backup/restore finishes.

@blackpiglet
Contributor

@jkroepke
Could you please check the version of the Velero Azure plugin?

@jkroepke
Author

jkroepke commented Jul 3, 2024

@blackpiglet velero/velero-plugin-for-microsoft-azure:v1.10.0 -- for the initial report we used 1.9, but we have since upgraded to 1.10 without any difference.

@anshulahuja98
Collaborator

We have been seeing frequent OOMKilled events on restores, where we have the limit set to 1Gi.
Tracking this to see if we can figure out a pattern.
We use 1.13 downstream.

@jkroepke
Author

jkroepke commented Jul 5, 2024

We have been seeing frequent OOMKilled events on restores, where we have the limit set to 1Gi. Tracking this to see if we can figure out a pattern. We use 1.13 downstream.

Please consider creating a dedicated issue for that.

@jkroepke jkroepke changed the title v1.14.0: Memory leaks v1.14.0: Memory leaks on backups jobs (CSI backups + azure plugin) Jul 5, 2024
@jkroepke jkroepke changed the title v1.14.0: Memory leaks on backups jobs (CSI backups + azure plugin) v1.14.0+: Memory leaks on backups jobs (CSI backups + azure plugin) Jul 5, 2024
@anshulahuja98
Collaborator

Sure, will avoid any confusion here.

@duduvaa

duduvaa commented Jul 9, 2024

The issue isn't related only to the Azure plugin.
I ran CSI backups using Velero scheduling. All backups completed OK.

Total backups: 21
Total plugin instances: 21

$ ps -ef | grep velero
root      137860       1  0 09:02 ?        00:00:03 /usr/bin/conmon -b /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -c 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-adp_velero-77b95849d5-44cjw_2cf677d9-b902-47ae-9c06-42971ac78ab3/velero/0.log --log-level info -n k8s_velero_velero-77b95849d5-44cjw_openshift-adp_2cf677d9-b902-47ae-9c06-42971ac78ab3_0 -P /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/conmon-pidfile -p /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio --syslog -u 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b -s
1000830+  137872  137860  1 09:02 ?        00:01:13 /velero server -features=EnableCSI --uploader-type=kopia --fs-backup-timeout=2880m --restore-resource-priorities=securitycontextconstraints,customresourcedefinitions,klusterletconfigs.config.open-cluster-management.io,managedcluster.cluster.open-cluster-management.io,namespaces,roles,rolebindings,clusterrolebindings,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io --log-level debug --disable-informer-cache=false
1000830+  180946  137872  0 09:21 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  192363  137872  0 09:26 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  203868  137872  0 09:31 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  215300  137872  0 09:36 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  226758  137872  0 09:41 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  238226  137872  0 09:46 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  249688  137872  0 09:51 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  261103  137872  0 09:56 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  272539  137872  0 10:01 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  283993  137872  0 10:06 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  295429  137872  0 10:11 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  306872  137872  0 10:16 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  318247  137872  0 10:21 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  329665  137872  0 10:26 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  341128  137872  0 10:31 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  352609  137872  0 10:36 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  364052  137872  0 10:41 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  375590  137872  0 10:46 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  387024  137872  0 10:51 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  398526  137872  0 10:56 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  409917  137872  0 11:01 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
core      418429  139617  0 11:05 pts/0    00:00:00 grep --color=auto velero
root      751578  751549  0 Jun24 ?        01:21:51 /velero node-agent server

@blackpiglet
Contributor

The difference is caused by how the velero server and the velero debug CLI work.
The velero server calls client-go functions to read the Kubernetes configuration and connect to the kube-apiserver.
The velero debug CLI calls a tool called crashd to collect the related Kubernetes resources, so it cannot use the same in-cluster mechanism as the velero server.
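
For reference, this is the general client-go mechanism a server running in a pod relies on for in-cluster authentication (a minimal standalone sketch of the client-go API, not velero's or crashd's actual code):

	package main

	import (
		"fmt"

		"k8s.io/client-go/rest"
		"k8s.io/client-go/tools/clientcmd"
	)

	// loadConfig prefers in-cluster credentials (the service account token plus
	// the KUBERNETES_SERVICE_HOST/PORT env vars) and falls back to the usual
	// kubeconfig lookup -- the file that crashd expects and that is missing in
	// the /tmp/.kube/config error shown earlier.
	func loadConfig() (*rest.Config, error) {
		if cfg, err := rest.InClusterConfig(); err == nil {
			return cfg, nil
		}
		return clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
			clientcmd.NewDefaultClientConfigLoadingRules(),
			&clientcmd.ConfigOverrides{},
		).ClientConfig()
	}

	func main() {
		cfg, err := loadConfig()
		if err != nil {
			panic(err)
		}
		fmt.Println("API server:", cfg.Host)
	}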

@jkroepke
Author

Did you see that I attached a zip file with all the debug logs? Following the issue template, I manually did everything that velero debug does.

@duduvaa

duduvaa commented Jul 11, 2024

@jkroepke,

Can you please share the commands?

thanks

@jkroepke
Author

_If you are using earlier versions:_
Please provide the output of the following commands (Pasting long output into a [GitHub gist](https://gist.github.com) or other pastebin is fine.)
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`

@sseago
Collaborator

sseago commented Jul 11, 2024

@jkroepke @duduvaa Since both of you have seen this -- are there specific circumstances that reproduce it? Only with datamover? Or CSI-no-datamover? Or fs-backup? With all workloads or only some?

@duduvaa

duduvaa commented Jul 11, 2024

It reproduced with:
CSI
CSI + data mover

Kopia backup works fine.

@jkroepke
Author

We have 2 schedules with only CSI backups enabled, plus Azure Blob Storage for the manifests.

@sseago
Collaborator

sseago commented Jul 11, 2024

@jkroepke @duduvaa That's good information. If kopia backup doesn't do this, this suggests that we're closing the object store plugin processes properly in the regular backup controller. The backup operation controller only reconciles when we have async operations -- which only happens for CSI and Datamover backups, so it may be in that code where this is occurring. It gives us a more focused area to dig into this.

@jkroepke
Author

In our case, we take Azure Files snapshots via CSI. That might be special.

@sseago
Collaborator

sseago commented Jul 11, 2024

@jkroepke I'm seeing it in my dev cluster with aws object store plugin and CSI snapshots (no datamover). If I use fs-backup, I don't see it.

@sseago
Collaborator

sseago commented Jul 11, 2024

Some debug progress, but still more to do. We know this happens only for CSI or CSI+datamover backups. I've isolated this to the backup finalizer controller. From the added logging, we see that in the other controllers we initialize the needed plugins and then, at the end of reconcile, the relevant plugins are killed. In the finalizer controller case, though, it's a bit different. In my cluster, 3 plugins were initialized (internal, object store, and the OADP openshift BIA/RIA plugin) -- then all 3 of these were killed. However, immediately after killing the aws plugin (again, this is due to the defer pluginManager.CleanupClients() call at the end of the finalizer reconcile), a new plugin process starts up. This is triggered by a restartableProcess.ResetIfNeeded() call -- I still need to dig into why this is happening at the end of the reconcile. Maybe some embedded defer call somewhere I'm not seeing? I'm not sure yet. I'll dig into it again tomorrow.

@jkroepke
Author

If it helps, you can provide a build with extended logging and I will run it on my cluster.

@xlgao-zju

(screenshot)

We met the same issue here.

@yuanqijing

yuanqijing commented Jul 12, 2024

@jkroepke @sseago Hi, I have the same issue. I suspect the code here introduces the process leak.

velero/pkg/backup/backup.go

Lines 729 to 743 in a91d2cb

backupStore, volumeInfos, err := kb.getVolumeInfos(*backupRequest.Backup, log)
if err != nil {
	log.WithError(err).Errorf("fail to get the backup VolumeInfos for backup %s", backupRequest.Name)
	return err
}
if err := updateVolumeInfos(volumeInfos, unstructuredDataUploads, asyncBIAOperations, log); err != nil {
	log.WithError(err).Errorf("fail to update VolumeInfos for backup %s", backupRequest.Name)
	return err
}
if err := putVolumeInfos(backupRequest.Name, volumeInfos, backupStore); err != nil {
	log.WithError(err).Errorf("fail to put the VolumeInfos for backup %s", backupRequest.Name)
	return err
}

On my machine, I deleted this section of the code and the leak did not occur, but the exact reason why this code was causing a process leak was unclear.

@yuanqijing

I think I found the reason. The issue appears to be that each FinalizeBackup call invokes getVolumeInfos to obtain a storage client and refresh volume information. The getVolumeInfos function creates a new pluginManager and uses defer CleanupClients to clean up the process. However, after obtaining the client, the caller proceeds to upload the backup data, during which a new process is created but not cleaned up. Consequently, this newly created process remains persistent.

@anshulahuja98
Collaborator

I think I found the reason. The issue appears to be that each FinalizeBackup call invokes getVolumeInfos to obtain a storage client and refresh volume information. The getVolumeInfos function creates a new pluginManager and uses defer CleanupClients to clean up the process. However, after obtaining the client, the caller proceeds to upload the backup data, during which a new process is created but not cleaned up. Consequently, this newly created process remains persistent.

Which versions are impacted by this?

@yuanqijing

I think I found the reason. The issue appears to be that each FinalizeBackup call invokes getVolumeInfos to obtain a storage client and refresh volume information. The getVolumeInfos function creates a new pluginManager and uses defer CleanupClients to clean up the process. However, after obtaining the client, the caller proceeds to upload the backup data, during which a new process is created but not cleaned up. Consequently, this newly created process remains persistent.

Which versions are impacted by this?

The problem was introduced in #7554; I think all versions after that commit are impacted.

@sseago
Collaborator

sseago commented Jul 12, 2024

@anshulahuja98 @yuanqijing Yes, I see the problem. The pattern of getting a plugin manager and deferring cleanup clients is used elsewhere to guarantee that the plugin process exits. However, in this case we're killing the plugin process while at the same time returning the backup store -- so the next time that someone uses that backup store, it re-initializes the plugin, but at this point it's no longer connected to a plugin manager, so it's never cleaned up.

	pluginManager := kb.pluginManager(log)
	defer pluginManager.CleanupClients()

	backupStore, storeErr := kb.backupStoreGetter.Get(location, pluginManager, log)
	if storeErr != nil {
		return nil, nil, storeErr
	}

	volumeInfos, err := backupStore.GetBackupVolumeInfos(backup.Name)
	if err != nil {
		return nil, nil, err
	}

	return backupStore, volumeInfos, nil
}

We need to get the plugin manager at a higher level in the call stack so we clean up after we're done with the backup store.
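
To make that lifecycle concrete, here is a small self-contained sketch of the pattern being described (toy types and names, not velero's actual interfaces): the helper owns its own plugin manager and cleans it up on return, but the store it hands back re-spawns a plugin afterwards that nothing ever reaps. Passing the caller's plugin manager down keeps the re-spawned process under the reconcile's deferred cleanup, which is what #8012 does.

	package main

	import "fmt"

	// Toy stand-ins for the plugin manager and backup store.
	type pluginManager struct{ procs []string }

	func (m *pluginManager) spawn(name string) { m.procs = append(m.procs, name) }

	func (m *pluginManager) CleanupClients() {
		fmt.Println("killed:", m.procs)
		m.procs = nil
	}

	type backupStore struct{ mgr *pluginManager }

	// Any later call on the store re-spawns the plugin via the manager it holds.
	func (s *backupStore) PutBackupVolumeInfos() { s.mgr.spawn("object-store-plugin") }

	// Leaky shape: the helper's deferred cleanup runs before the caller ever
	// uses the returned store, so the later re-spawned process is orphaned.
	func getVolumeInfosLeaky() *backupStore {
		mgr := &pluginManager{}
		defer mgr.CleanupClients()
		mgr.spawn("object-store-plugin")
		return &backupStore{mgr: mgr}
	}

	// Fixed shape: the caller's manager owns every spawned process, and the
	// caller's deferred cleanup reaps them after the store is done.
	func getVolumeInfosFixed(mgr *pluginManager) *backupStore {
		mgr.spawn("object-store-plugin")
		return &backupStore{mgr: mgr}
	}

	func main() {
		store := getVolumeInfosLeaky()
		store.PutBackupVolumeInfos()
		fmt.Println("leaked:", store.mgr.procs) // process nobody cleans up

		mgr := &pluginManager{}
		defer mgr.CleanupClients() // reaps everything spawned below
		getVolumeInfosFixed(mgr).PutBackupVolumeInfos()
	}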

@sseago
Collaborator

sseago commented Jul 12, 2024

I've submitted #8012 which passes in the already-initialized plugin manager from the finalizer controller, so we don't need to spawn a new plugin process for get/put volume info. I've tested this locally and it resolves the leak.

@sseago
Collaborator

sseago commented Jul 12, 2024

Once #8012 merges, we should cherry-pick it to release-1.14 for 1.14.1.

@Lyndon-Li Lyndon-Li added this to the v1.15 milestone Jul 15, 2024
@Lyndon-Li
Contributor

Reopen for cherry-pick.

@Lyndon-Li
Contributor

Close as completed.
