How should future extensions to the manifest be handled? Extensions will likely take the form of additional manifest keys, either at the top level, or in manifest entries. Currently, unrecognized keys are a hard error. Ideally it would be possible to partially relax this, such that unrecognized keys could be categorized as required or optional, and only required unrecognized keys would trigger an error.
With serialization formats using integer keys, it is common to designate even keys as required and odd keys as optional.
It's a bit odd, but we could designate lowercase keys as required and uppercase keys as optional.
Currently, empty directories are an error for both filepack create
and
filepack verify
. Should they be ignored, or recorded in the manifest by
filepack create
and required to be present by filepack verify
?
Currently, each manifest entry the complete relative path to the file. This means that when a directory contains more than one file, the directory path will be repeated.
{
"files": {
"foo/bar": {
…
},
"foo/baz": {
…
}
}
}
We could instead opt to nest entries to avoid repetition:
{
"files": {
"foo": {
"bar": {
…
},
"baz": {
…
}
}
}
}
This would save space in cases when a single long path component is repeated
many times, and provide a way to represent empty directories, but would make
manipulation with tools like jq
more difficult.
Currently, filepack
uses the BLAKE3 hash function, a fast, parallelizable,
general purpose cryptographic hash function.
There are other reasonable choices for hash functions:
-
SHA-2 is used in many other systems, so by using SHA-2, those systems could adopt
filepack
without making any additional assumptions about the security of other hash functions. SHA-2 however is susceptible to length-extension attacks, which, although not a problem forfilepack
, does require additional caution in potential downstream applications. -
SHA-3 is less common than SHA-2, but hardware acceleration is more common for SHA-3 than BLAKE3.
-
KangarooTwelve is a tree hash based on SHA-3, and so benefits from SHA-3 hardware acceleration.
Another choice would be to allow users to select the hash function that they wish to use, and record the hash function used in the manifest. I am very much inclined not to do this. I would much rather support a single, all-purpose hash function, to simplify implementations, and avoid additional security considerations.
Overall, I'm happy with the current choice of BLAKE3. BLAKE3 is not incredible common, but its membership in the BLAKE family has made it relatively well reviewed and scrutinized.
A tree hash like BLAKE3 also allows for incremental streaming or random access verification, which was actually the impetus for the creation of BLAKE3, and is implemented in the bao crate.
One slightly annoying aspect of BLAKE3 is that the chunk offset is used as an input to produce the leaf hash, so it is not possible to deduplicate leaf chunks.
filepack
manifests are currently JSON, but it would be worth considering
other serialization formats.
JSON was chosen because it is plain-text, more-or-less human readable, and
widely supported, with many libraries and tools like jq
available.
Some alternative serialization formats are:
-
TSV: the traditional format for
shasum
and.sfv
. TSV is plain-text and even more human readable, but cannot be easily extended. Fields are identified by position, so it would not be possible to add additional fields in a backwards compatible fashion, and additional top-level keys are not supported. -
CBOR: a binary serialization format more-or-less morally equivalent to JSON. CBOR is less common and is not human readable, however, it is more compact than JSON. I suspect, however, that compactness is less of a benefit than human readability and ubiquitous support.
CBOR would allow for easily embedding large amounts of additional binary data, for example, parity data or hash tiers.
-
YAML: Easier for a human to read and write than JSON, but less widely supported. And, YAML canonicalization would be much harder, in the event that the manifest is canonicalized for hashing and signing.
Filepack manifests may optionally include semantic metadata describing the package. For example, its title and author.
Metadata is currently included in the manifest itself, but could alternatively be placed in a file alongside the manifest.
Including it in the manifest would have the advantage of being a single file.
Including it alongside the manifest would have the advantage of being easier to author, since it separates the manually created metadata and the automatically generated manifest, and possibly allow the metadata to be written in a more human-friendly format, such as YAML.
A planned feature is the addition of signing and signature verification.
When signing, signatures would be written to files in a signatures
directory
alongside the manifest with the filename KEY.SCHEME
, where KEY
the public
key the signature was made with, and SCHEME
is a short string identifying the
signature scheme, for example, pgp
or ssh
.
filepack create
would exclude the signatures
directory from hashing, since
signatures must be made over the hash of the manifest, filepack
would provide
helpers for creating signatures and writing them to the appropriate file in the
signatures
directory.
filepack verify
would exclude the signatures
directory from extraneous file
checks, since it would not be present in the manifest, and for each
KEY.SCHEME
file, verify that the signature contained in the file is valid
according to SCHEME
and was made with the private key corresponding to the
public key KEY
.
Initially, existing schemes will be used, instead of trying to create a new
signing scheme, so filepack create
, or the helper used to create signatures
will call to the appropriate binary to create signatures, and filepack verify
will call the appropriate binary to verify signatures.
There are a number of open questions related to signatures:
-
Should signatures be made over the manifest or over the BLAKE3 hash of the manifest?
-
Should the manifest be required to be in a canonical format? If not, there will be multiple valid manifests that describe the same file contents, but which have different hashes. Requiring the manifest to be canonical would require a canonical serialization and deserialization scheme.
-
Should signatures be made over a message which attempts to ensure domain separation? For example, signatures could be made over the string
filepack:
prepended to the manifest hash. Decent signature schemes will already provide for domain separation, but if we ever want to introduce different kinds of signatures that, for example, represent different intents, then domain separation would be desirable.