stargz-snapshotter uses up all available disk space #1349

bodgit · 2023-08-21T15:26:17Z

I have version 0.14.3 of the snapshotter installed on some EKS nodes, some of which have been running for around 16 days. They have started to run out of disk space, and most of the usage appears to be under /var/lib/containerd-stargz-grpc/snapshotter/snapshots.

Is there a way to prune/clean this up automatically?

ktock · 2023-08-21T16:08:22Z

@bodgit Thanks for reporting this. Snapshots are automatically cleaned up when the image is removed. You can also manually remove images (using ctr image rm) and snapshots (using ctr snapshot rm).

What is consuming the space under /var/lib/containerd-stargz-grpc/? (Something like du -hxd 2 /var/lib/containerd-stargz-grpc/ should show it.)

bodgit · 2023-08-21T16:23:14Z

Hi @ktock

Here's the output from du -hxd 2 /var/lib/containerd-stargz-grpc/:

[root@ip-10-202-107-137 ~]# du -hxd 2 /var/lib/containerd-stargz-grpc/
34G     /var/lib/containerd-stargz-grpc/snapshotter/snapshots
34G     /var/lib/containerd-stargz-grpc/snapshotter
0       /var/lib/containerd-stargz-grpc/stargz/httpcache
0       /var/lib/containerd-stargz-grpc/stargz/fscache
0       /var/lib/containerd-stargz-grpc/stargz
34G     /var/lib/containerd-stargz-grpc/

The nodes have a 50 GB disk, 12 GB of that is consumed by /var/lib/containerd, so that and the 34 GB above accounts for most of the disk.

I tried running ctr -n k8s.io snapshot ls and there are no snapshots. There are about 800 images returned by ctr -n k8s.io images ls but historically we haven't had to worry about this.

ktock · 2023-08-22T00:07:21Z

@bodgit Thanks for the info.

34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots

What is consuming the space under this directory? Are there many snapshot dirs, or is there one large snapshot dir (or file)?

ctr -n k8s.io snapshot ls and there are no snapshots.

You need --snapshotter=stargz to get the list of snapshots (i.e. ctr-remote snapshot --snapshotter=stargz ls).

800 images returned by ctr -n k8s.io images ls

Are there active snapshot mounts (mount | grep stargz) on the node?

bodgit · 2023-08-22T08:38:23Z

@bodgit Thanks for the info.

34G /var/lib/containerd-stargz-grpc/snapshotter/snapshots

What is consuming the space under this directory? Are there many snapshot dirs, or is there one large snapshot dir (or file)?

Lots of snapshot directories. All of them are under 1 GB but there are about 6-700 of them.
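For anyone wanting to answer the same "many dirs or one big dir" question on their own node, here is a rough sketch of one way to summarise it. The function name snapshot_usage is made up for illustration, and the default path is simply the one from this thread.

```shell
#!/bin/sh
# Summarise per-snapshot disk usage under a stargz snapshots directory.
# snapshot_usage is a hypothetical helper; the default path is the one
# reported in this thread. Pass a different directory as $1 if needed.
snapshot_usage() {
    dir="${1:-/var/lib/containerd-stargz-grpc/snapshotter/snapshots}"
    # -x: stay on one filesystem, -k: sizes in KiB, -d 1: one line per snapshot dir.
    du -xk -d 1 "$dir" 2>/dev/null | awk -v root="$dir" '
        # Skip the summary line for the root itself; track count, total and largest.
        $2 != root { n++; total += $1; if ($1 > max) { max = $1; big = $2 } }
        END { printf "dirs=%d total_kib=%d largest_kib=%d largest=%s\n", n, total, max, big }'
}
```

Calling snapshot_usage with no argument inspects the default path. Many directories with a modest largest_kib points to accumulation rather than a single oversized snapshot.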

ctr -n k8s.io snapshot ls and there are no snapshots.

You need --snapshotter=stargz to get the list of snapshots (i.e. ctr-remote snapshot --snapshotter=stargz ls).

Ah, that worked. Running ctr-remote -n k8s.io snapshot --snapshotter=stargz ls returns the same number of entries as there are directories above.

On this particular host, there are 612 snapshots: 117 of them are "Active" and 495 are "Committed". Some of the committed snapshots don't have a parent SHA256.
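A minimal sketch of how the Active/Committed breakdown can be tallied from the listing. It assumes the listing starts with a header line and that the kind is the last column on every row (which also covers rows with an empty parent column); count_by_kind is a hypothetical helper name.

```shell
#!/bin/sh
# Tally snapshots by kind from a `snapshot ls` listing read on stdin.
# Assumptions: the first line is a header, and the kind (Active, Committed, ...)
# is the last whitespace-separated field on each row.
count_by_kind() {
    awk 'NR > 1 && NF > 0 { kinds[$NF]++ }
         END { for (k in kinds) printf "%s %d\n", k, kinds[k] }' | LC_ALL=C sort
}
```

Usage would be something like: ctr-remote -n k8s.io snapshot --snapshotter=stargz ls | count_by_kind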

800 images returned by ctr -n k8s.io images ls

Are there active snapshot mounts (mount | grep stargz) on the node?

That's picking up any mount that has /var/lib/containerd-stargz-grpc/snapshotter/... in the output rather than a particular mount type? There are 109 matching entries, and they all look similar to this:

overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/15ef6c38b6cac6dffc8dfece99257066d85ab7eb23fe8ffb1ea96fb7e33cfe92/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/14256/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/10277/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/131/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/79/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/77/fs:/var/lib/containerd-stargz-grpc/snapshotter/snapshots/60/fs,upperdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/17211/fs,workdir=/var/lib/containerd-stargz-grpc/snapshotter/snapshots/17211/work)

They're all "overlay" mounts and they seem to vary by the number of lowerdir entries.
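To compare the mounts against the snapshot list, something like the following sketch can count the overlay mounts and the distinct snapshot directories they reference. stargz_overlay_summary is a hypothetical helper, and it assumes snapshot directories are numeric, as in the mount line above.

```shell
#!/bin/sh
# Read `mount` output on stdin and report how many overlay mounts reference
# the stargz snapshots directory, and how many distinct snapshot IDs they use.
# Assumption: snapshot directory names are numeric, as in this thread.
stargz_overlay_summary() {
    root="${1:-/var/lib/containerd-stargz-grpc/snapshotter/snapshots}"
    input=$(cat)
    # Lines of type overlay that mention the snapshots root anywhere in their options.
    mounts=$(printf '%s\n' "$input" | grep -c " type overlay .*$root/")
    # Every "<root>/<id>/" occurrence (lowerdir, upperdir, workdir), de-duplicated.
    ids=$(printf '%s\n' "$input" | grep -o "$root/[0-9][0-9]*/" | sort -u | wc -l | tr -d ' ')
    echo "overlay_mounts=$mounts distinct_snapshots=$ids"
}
```

Usage would be: mount | stargz_overlay_summary. Snapshots that never show up in any mount are the ones image garbage collection would be expected to reclaim once their images are removed.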

Is it a case of cleaning up the committed snapshots and keeping the active ones, given that the number of active snapshots (117) is roughly the same as the number of overlay mounts (109)?

To be clear, we're not (yet) trying to use any stargz images, I just installed the snapshotter on the EKS nodes to make sure everything still worked as before with our existing workloads.

Everything seems to be working fine, but it's now using more disk space, and the usage seems to grow with how long the node has been running. Eventually the node runs out of disk space and needs to be recycled, which isn't ideal.

bodgit · 2023-08-22T13:50:07Z

I think I've found the problem. I noticed we were getting this message logged often:

kubelet: E0820 03:07:11.954394    3800 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.stargz\": stat failed on /var/lib/containerd/io.containerd.snapshotter.v1.stargz with error: no such file or directory" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.stargz"

Every five minutes I was also seeing this:

kubelet: E0820 03:11:39.556258    3800 kubelet.go:1386] "Image garbage collection failed multiple times in a row" err="invalid capacity 0 on image filesystem"
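For others checking whether they are hitting the same thing, here is a small sketch that counts both messages in a stream of kubelet logs. The helper name kubelet_gc_errors is hypothetical, and where the logs come from (journald, /var/log/messages, etc.) depends on the node setup.

```shell
#!/bin/sh
# Count the two kubelet errors quoted above in log lines read on stdin.
kubelet_gc_errors() {
    awk '
        /Failed to get the info of the filesystem with mountpoint/ { fs++ }
        /Image garbage collection failed/                         { gc++ }
        END { printf "fsinfo_errors=%d gc_failures=%d\n", fs, gc }'
}
```

Assuming kubelet runs as a systemd unit that logs to journald, usage could be: journalctl -u kubelet --since "1 hour ago" | kubelet_gc_errors. Non-zero counts for both suggest image garbage collection is not running on that node.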

On a hunch I manually created the /var/lib/containerd/io.containerd.snapshotter.v1.stargz directory, and the first error message stopped repeating. Within five minutes there was a flurry of logs and then I saw:

kubelet: I0822 13:31:40.606893    3800 kubelet.go:1400] "Image garbage collection succeeded"

The disk usage had gone from 94% down to 40%.
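In script form, the workaround amounts to roughly the sketch below. This is only the manual workaround described here, not an official fix; ensure_stargz_dir is a hypothetical helper, and the 700 mode is an assumption chosen to match the other snapshotter directories under /var/lib/containerd.

```shell
#!/bin/sh
# Create the snapshotter directory that kubelet was failing to stat, if missing.
# ensure_stargz_dir is a hypothetical helper; the containerd root defaults to
# /var/lib/containerd as in this thread and can be overridden via $1.
ensure_stargz_dir() {
    containerd_root="${1:-/var/lib/containerd}"
    dir="$containerd_root/io.containerd.snapshotter.v1.stargz"
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        # Mode 700 is an assumption, mirroring the sibling snapshotter directories.
        mkdir -p "$dir" && chmod 700 "$dir" && echo "created: $dir"
    fi
}
```

Running ensure_stargz_dir as root on the node creates the directory if it is absent; after that, the kubelet logs should show whether image garbage collection starts succeeding again.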

I've gone through the install documentation and I can't see any mention of having to create this missing directory, but it seems critical that it exists, otherwise image garbage collection stops working. Is it just a case of manually creating it, or should it be created automatically?

Here's the contents of /var/lib/containerd:

# ls -l /var/lib/containerd/
total 0
drwxr-xr-x 4 root root 33 Jul 11 14:51 io.containerd.content.v1.content
drwxr-xr-x 4 root root 41 Aug  4 14:51 io.containerd.grpc.v1.cri
drwx------ 2 root root 18 Aug  4 14:52 io.containerd.grpc.v1.introspection
drwx--x--x 2 root root 21 Jul 11 14:51 io.containerd.metadata.v1.bolt
drwx--x--x 2 root root  6 Jul 11 14:51 io.containerd.runtime.v1.linux
drwx--x--x 3 root root 20 Aug  4 14:51 io.containerd.runtime.v2.task
drwx------ 2 root root  6 Jul 11 14:51 io.containerd.snapshotter.v1.btrfs
drwx------ 3 root root 23 Jul 11 14:51 io.containerd.snapshotter.v1.native
drwx------ 3 root root 23 Jul 11 14:51 io.containerd.snapshotter.v1.overlayfs
drwx------ 2 root root  6 Aug 22 13:29 io.containerd.snapshotter.v1.stargz
drwx------ 2 root root  6 Aug 22 13:44 tmpmounts

The other *snapshotter* directories already existed and are either empty or just have an empty snapshots directory within them, nothing else.

ktock · 2023-08-23T15:39:38Z

Thanks for finding the root cause and the workaround. That directory should be handled by containerd (or the CRI plugin), so I think we need to fix containerd to completely fix this issue.

maxpain · 2023-09-08T06:39:22Z

The same problem.
Any updates on this?

jonathanbeber · 2024-09-24T14:14:08Z

Are there any other issues where this problem is being tracked? I'm seeing the same problem.

ktock added the needs-more-info label Aug 21, 2023

rsmitty mentioned this issue May 2, 2024: "stargz-snapshotter doesn't work" siderolabs/extensions#245 (Open)

daper mentioned this issue Aug 15, 2024: "fix(stargz-snapshotter): set default root path" siderolabs/extensions#452 (Merged)

ktock linked a pull request Dec 10, 2024 that will close this issue: "Fix GC failure of CRI plugin" #1893 (Draft)
