Summary
We have run into cases where Bazel, in conjunction with bazel-remote, reports that files are missing from the remote cache, even though we can see they were present in the S3 bucket bazel-remote is configured to use at the time of the build. We are not certain of the cause, but suspect it happens when the local bazel-remote disk cache is full and cannot free enough space through garbage collection because of reservations.
It would be great if:
- These were reported as a server error rather than a file-not-found, to reduce confusion (see the sketch after this list).
- bazel-remote printed more debugging information to its logs in these situations, so we could be confident in the underlying cause.
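To illustrate the first point, here is a minimal, hypothetical Go sketch, not bazel-remote's actual code: the readBlob/serveBlob functions and the errNoSpace error are made up for illustration. It shows one way a read path could map an internal failure (such as exhausted local cache space) to a server-side gRPC status instead of NOT_FOUND, so that clients like Bazel do not conclude the blob is missing.

package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errNoSpace stands in for an internal "could not reserve disk space" failure.
var errNoSpace = errors.New("insufficient local cache space")

// readBlob is a stand-in for the cache lookup; found == false with err == nil
// means the blob genuinely does not exist in any backend.
func readBlob(hash string) (data []byte, found bool, err error) {
	// Placeholder: pretend the lookup failed because space could not be reserved.
	return nil, false, errNoSpace
}

func serveBlob(hash string) ([]byte, error) {
	data, found, err := readBlob(hash)
	if err != nil {
		// Internal failures are reported as server-side errors,
		// not as missing blobs.
		return nil, status.Errorf(codes.ResourceExhausted,
			"unable to serve %s: %v", hash, err)
	}
	if !found {
		return nil, status.Errorf(codes.NotFound, "blob %s not found", hash)
	}
	return data, nil
}

func main() {
	_, err := serveBlob("1af370e5...")
	fmt.Println(err) // rpc error: code = ResourceExhausted desc = ...
}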
Details of what we observed
We had a recent bazel build fail with the following error:
ERROR: [...]/BUILD:54:13: scala [...] failed: Exec failed due to IOException: 59 errors during bulk transfer:
java.io.IOException: Failed to fetch file with hash '1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609' because it does not exist remotely. --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '6da9ed4d305424f7a35c4d0492307e287098a31ac44bfb4625d3691f706afde9' because it does not exist remotely. --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.
... many more missing files ...
The S3 console verified that 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 existed in the CAS, and had for several days.
The bazel-remote logs we keep showed:
> grep 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 bazel-remote.log
2023/03/08 21:14:42 S3 CONTAINS asana-sandbox-testville-bazel-cache-us-west-2 cas.v2/1a/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 21:14:42 GRPC CAS HEAD 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 21:50:23 GRPC BYTESTREAM READ BLOB NOT FOUND: 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609
2023/03/08 22:07:14 S3 CONTAINS asana-sandbox-testville-bazel-cache-us-west-2 cas.v2/1a/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 22:07:14 GRPC CAS HEAD 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 22:07:14 S3 DOWNLOAD asana-sandbox-testville-bazel-cache-us-west-2 cas.v2/1a/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 22:07:14 GRPC BYTESTREAM READ COMPLETED blobs/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609/1166808
We observed:
- a BYTESTREAM READ BLOB NOT FOUND error
- bazel-remote was later able to download the same file
Further investigation showed many other instances of BYTESTREAM READ BLOB NOT FOUND for different blobs at the same time.
Hypothesis
The relevant blob not found error comes from here (bazel-remote/server/grpc_bytestream.go, lines 125 to 126 at c5bf6e1):
msg := fmt.Sprintf("GRPC BYTESTREAM READ BLOB NOT FOUND: %s", hash)
It looks like this happens if we don't find the file but also haven't otherwise encountered an error. Based on the logs, it doesn't get as far as checking S3 for the file. We suspect, but cannot verify, that it is hitting the disk space check in bazel-remote/cache/disk/lru.go, lines 222 to 227 at c5bf6e1.
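To make the suspicion concrete, here is a minimal sketch of the mechanism we have in mind, assuming a simplified reservation-aware LRU; the type, field, and function names are ours, not the ones in lru.go. It shows how a cache that counts reserved bytes against its size limit can refuse to make room for a blob that does exist in S3, and why a log line at that point would make the failure much easier to diagnose.

package main

import (
	"fmt"
	"log"
)

type lru struct {
	maxSize      int64 // configured cache size limit
	currentSize  int64 // bytes used by committed entries
	reservedSize int64 // bytes reserved for in-flight transfers
}

// reserve tries to set aside space for a blob that is about to be staged
// locally (e.g. while proxying it from S3). Reserved bytes cannot be
// reclaimed by garbage collection, so a cache full of reservations fails here.
func (c *lru) reserve(size int64) bool {
	if c.currentSize+c.reservedSize+size > c.maxSize {
		// In the behaviour described in this issue nothing was logged here,
		// so the later "BLOB NOT FOUND" gave no hint about the real cause.
		log.Printf("cannot reserve %d bytes: %d used + %d reserved of %d max",
			size, c.currentSize, c.reservedSize, c.maxSize)
		return false
	}
	c.reservedSize += size
	return true
}

func main() {
	c := &lru{maxSize: 100, currentSize: 40, reservedSize: 55}
	if !c.reserve(10) {
		// In the hypothesised failure mode this surfaces to the client as
		// GRPC BYTESTREAM READ BLOB NOT FOUND rather than a server error.
		fmt.Println("blob could not be served from the local cache")
	}
}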
Thanks for the bug report - I think your diagnosis is correct.
This PR makes bazel-remote log something in this situation: #650
But you might need to increase your cache size to avoid this problem with the amount of load you're placing on the server.
mostynb added a commit to mostynb/bazel-remote that referenced this issue on Mar 11, 2023.