Description
Yesterday I learned about the BuildKit garbage collector, unfortunately because it was removing layers immediately after they were built.
Take the following Dockerfile, which is based on an unpolished, experimental Dockerfile I was iterating on:
FROM debian:trixie
RUN apt-get update
RUN apt-get install -y ca-certificates curl git git-lfs
RUN apt-get install -y python3 python3-pip
RUN pip3 install --break-system-packages torch
# Assume this is a real run command, but it has an error
RUN something with a mistake
Normally I'd expect the build to cache up to and including the pip3 torch install, so when I fix the error and re-run docker build,
it will not rebuild the torch layer or any previous layers. However, due to layer size, I found that the newly built cache was immediately being discarded. Note the layer sizes:
IMAGE CREATED CREATED BY SIZE COMMENT
6f0d75294b62 18 minutes ago RUN /bin/sh -c pip3 install --break-system-p… 10.7GB buildkit.dockerfile.v0
<missing> 15 hours ago RUN /bin/sh -c apt-get install -y python3 py… 398MB buildkit.dockerfile.v0
<missing> 20 hours ago RUN /bin/sh -c apt-get install -y ca-certifi… 149MB buildkit.dockerfile.v0
<missing> 20 hours ago RUN /bin/sh -c apt-get update # buildkit 21.1MB buildkit.dockerfile.v0
<missing> 9 days ago # debian.sh --arch 'amd64' out/ 'trixie' '@1… 120MB debuerreotype 0.16
Because this build has a layer above the default 10GB cache size, I noticed it was being GC'd immediately. I saw this behavior both with this big pip
install and with another Dockerfile that was downloading an LLM model containing multiple 8GB safetensor files.
This 10GB threshold caught me by surprise. As I have relatively slow internet, some of these layers took hours to download, and having them immediately discarded was frustrating. I think this behavior could use some improvement, such as:
- Do not enforce the build cache size, unless there is disk pressure.
- Do not discard layers completed within the last hour.
Basically, as a user, if there's no disk pressure I don't want to see the GC taking action against fresh cache layers. I had over 100GB free in /var/lib/docker when my layers were GC'd. These points are soft suggestions; I know each of them comes with caveats. However, under certain circumstances the default 10GB GC threshold is a hidden sharp edge, so it would be nice to dull it.
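For anyone else hitting this edge: as a workaround, the threshold can be raised in the daemon configuration. A minimal sketch of an /etc/docker/daemon.json that bumps the build-cache budget for the built-in BuildKit builder (the 50GB value is only an example; pick whatever suits your disk):

```json
{
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "50GB"
    }
  }
}
```

After editing daemon.json, restart the daemon (e.g. systemctl restart docker) for the new limit to take effect. For separate builders created with docker buildx create, the analogous knob lives in buildkitd.toml instead, and docker buildx du can be used to inspect how much cache each builder is currently holding.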