containertool: Use same gzip headers on Linux and macOS #37

Merged
merged 1 commit into from
Dec 4, 2024

euanh (Collaborator) commented Dec 4, 2024

Motivation

Packaging the same binary using the same version of containertool produces different application image layers on macOS and Linux:

linux% swift run containertool --verbose --repository registry.test:5000/hello hello-world --from scratch
...
Uploading application layer
application layer: sha256:54a282d5cd082320d2d4976e7d9a952da46e3bc4bab3ce1e0b3931ccf945b849 (80394382 bytes)
image configuration: sha256:fdcb887ef6e27a09456419b03b1d8353b15d68d088b8ea023f38af892fca69be (462 bytes)
...

macos% swift run containertool --verbose --repository registry.test:5000/hello hello-world --from scratch
...
Uploading application layer
application layer: sha256:08a21093e79423c17b58325decc48d7196481ed55276c2d168de23a75d38727e (80394382 bytes)
image configuration: sha256:2648cd8cca1cad7ec5b386e8433e36ca77a40e31859e5994260b2ef1d07f0753 (462 bytes)
...

The application layer hashes are different, even though they
contain the same binary. The image configuration metadata blob
hashes also differ, but they contain timestamps so this will continue
to happen even after this PR is merged. A future change could
make these timestamps default to the epoch, allowing identical
metadata blobs to be created on Linux and macOS as well.

The image layer is a gzipped TAR archive containing the executable. Saving the intermediate steps shows that the TAR archives are identical and the gzipped streams are different, but only by one byte:

% diff <(hexdump -X linux-image.tar.gz) <(hexdump -X darwin-image.tar.gz)
1c1
< 0000000  1f  8b  08  00  00  00  00  00  00  03  ed  57  eb  6e  1c  b7
---
> 0000000  1f  8b  08  00  00  00  00  00  00  13  ed  57  eb  6e  1c  b7

The difference is in the 10th byte of the gzip header: the OS field (https://datatracker.ietf.org/doc/html/rfc1952#page-5). RFC 1952 defines a list of known operating systems (https://datatracker.ietf.org/doc/html/rfc1952#page-8): 0x03 is the OS code for Unix. However, the RFC was written in 1996, so its Macintosh entry refers to the classic Mac OS. Zlib uses an updated operating system list (madler/zlib@ce12c5c) which defines 19 / 0x13 as the OS code for Darwin.
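
To make the header layout concrete, here is a minimal Python sketch that reads the OS field from the fixed 10-byte gzip header. It is an illustration only and not part of containertool (which is written in Swift); the helper name `gzip_os_code` and the partial `OS_NAMES` table are assumptions made for this example. The two sample headers are the first ten bytes of each stream, copied from the hexdump above.

```python
# Illustrative sketch: inspect the OS field of a gzip header (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"
OS_FIELD_OFFSET = 9  # the 10th byte, counting from 1

# Partial table (assumption: only the codes discussed here are listed).
# 19 (Darwin) comes from zlib's updated list, not from the RFC.
OS_NAMES = {3: "Unix", 7: "Macintosh", 19: "Darwin", 255: "unknown"}


def gzip_os_code(stream: bytes) -> int:
    """Return the OS code stored in the fixed 10-byte gzip header."""
    if len(stream) < 10 or stream[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip stream")
    return stream[OS_FIELD_OFFSET]


# First ten bytes of each image layer, taken from the hexdump above.
linux_header = bytes.fromhex("1f8b0800000000000003")
darwin_header = bytes.fromhex("1f8b0800000000000013")

for label, header in (("linux", linux_header), ("darwin", darwin_header)):
    code = gzip_os_code(header)
    print(f"{label}: 0x{code:02x} ({OS_NAMES.get(code, 'other')})")
```

The remaining header bytes (magic number, compression method, flags, zeroed timestamp and extra flags) are identical in both streams, so the OS field is the only byte that changes the layer hash.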

Interestingly, using gzip to compress a file directly produces identical results on macOS and Linux (-n is needed to prevent gzip from including the current timestamp on macOS):

linux% cat hello-world | gzip -n | md5sum
ef64adbee9e89e78114000442a804e0e  -

macos% cat hello-world | gzip -n | md5sum
ef64adbee9e89e78114000442a804e0e  -

Modifications

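By default, Zlib uses the value of OS_CODE set at compile time (https://github.com/madler/zlib/blob/ef24c4c7502169f016dcd2a26923dbaf3216748c/deflate.c#L1054). This commit uses deflateSetHeader() (https://github.com/madler/zlib/blob/ef24c4c7502169f016dcd2a26923dbaf3216748c/deflate.c#L705) to override the default gzip header, forcing the OS code to be 0x03 (Unix) on both Linux and macOS.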
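
The effect of pinning the header can be sketched in Python. This is not the code in this commit: as far as I know, Python's `zlib` module does not expose `deflateSetHeader()`, so the sketch assembles the gzip framing by hand (fixed header, raw deflate body, CRC-32 and length trailer) to reach the same result. The function name `gzip_with_fixed_header` is hypothetical.

```python
import gzip
import struct
import zlib

GZIP_OS_UNIX = 0x03  # RFC 1952 OS code for Unix


def gzip_with_fixed_header(data: bytes, os_code: int = GZIP_OS_UNIX) -> bytes:
    """Compress data into a gzip stream whose header does not depend on the host OS."""
    # ID1, ID2, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0, 0, 0, os_code)
    # Negative window bits request a raw deflate stream with no zlib/gzip wrapper.
    compressor = zlib.compressobj(
        zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS
    )
    body = compressor.compress(data) + compressor.flush()
    # Trailer: CRC-32 of the uncompressed data, then its length modulo 2**32.
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF)
    return header + body + trailer


payload = b"hello-world binary contents"
stream = gzip_with_fixed_header(payload)
assert stream[9] == GZIP_OS_UNIX            # OS field is pinned to Unix
assert gzip.decompress(stream) == payload   # still a valid gzip stream
```

Because the header has no header CRC (FLG is 0), the OS byte is not covered by any checksum in the stream, which is why changing it affects the layer digest but not the decompressed contents.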

Result

After this change, image layers containing the same binary will use identical gzip headers and should have the same hash whether they
are built on Linux or macOS. It is still possible that different
versions of Zlib might produce different compressed data, causing
the overall hashes to change.

Test Plan

Tested manually on macOS and Linux, verifying that image layers containing identical binaries have identical hashes.
Added a test for containertool's gzip function.

@euanh euanh force-pushed the reproducible-images branch 2 times, most recently from 93f8d15 to 47ff86b Compare December 4, 2024 11:49
@euanh euanh added kind/enhancement New feature or request semver/minor Adds new public API. area/interoperability Improvements to compatibility with other systems. labels Dec 4, 2024
@euanh euanh force-pushed the reproducible-images branch 3 times, most recently from 5f4b9d9 to b7abac8 Compare December 4, 2024 11:58
@euanh euanh force-pushed the reproducible-images branch from b7abac8 to 9fe0113 Compare December 4, 2024 12:04
euanh (Collaborator, Author) commented Dec 4, 2024

Discovered while investigating #34. However, I don't think this difference is the cause of the reported problem: in both cases the checksum which is sent is a valid SHA256 checksum, just of a slightly different stream of data. When the registry recomputes the checksum over the data it has received, it should produce the same result.

I think the problem in #34 is more likely to come from the serialisation of the JSON blobs, where reordering or reformatting of the JSON can cause checksum differences. Even there, though, the plugin should calculate the checksum after serialising the JSON data and the registry should check the checksum against the serialised data without parsing it, so there are few opportunities for inconsistencies.
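
A small Python sketch of the JSON point: two serialisations of the same object can have different digests, so the digest has to be computed over the exact bytes that are uploaded. The `config` contents and the `digest` helper are made up for illustration and are not taken from containertool or from #34.

```python
import hashlib
import json


def digest(blob: bytes) -> str:
    """Return a digest string for the given bytes, in sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


# Hypothetical metadata object, used only for this example.
config = {"os": "linux", "architecture": "amd64"}

# Same data, two serialisations: default formatting vs. sorted keys, compact separators.
pretty = json.dumps(config).encode("utf-8")
compact = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")

assert json.loads(pretty) == json.loads(compact)  # identical content
assert digest(pretty) != digest(compact)          # different bytes, different digests

# Serialise once, then hash and upload the same bytes.
blob = compact
print(digest(blob), len(blob), "bytes")
```

If the checksum is always taken over the serialised bytes, and the receiver checks those bytes without re-parsing them, reordering or reformatting cannot cause a mismatch.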

@euanh euanh marked this pull request as ready for review December 4, 2024 12:06
@euanh euanh merged commit 403e0fe into apple:main Dec 4, 2024
19 checks passed
@euanh euanh deleted the reproducible-images branch December 4, 2024 12:07