-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Codec metadata #572
Comments
Thanks for raising this issue, I think this point is pretty important. I wonder if it's helpful to first solve this problem without thinking about Zarr, then figure out what changes we would need to make to that solution, or to zarr, to make this work. Suppose To reverse this process, If Now lets introduce a second function, But this breaks if Bringing this to zarr, I think the last point is key: we can allow codecs to return arrays + metadata, as long as every codec appends, or every codec prepends, that metadata. If we took this approach, then I think we would need to clarify the language of the spec -- instead of stating that "x -> array codecs must return an N-dimensional array", we would state that "x -> array codecs must return bytestreams that contain an N-dimensional array in the last array_size * dtype_size bytes". Someone should check my assumptions here. Without chunk headers, I don't think there's any other place to put this information. We don't want to store |
Thank you for your detailed reply! While such a byte-level embedding of optional metadata alongside the array might work, I think that it should only be framed as such once you get to a storage layer. On the API level, metadata should be fully separate from the array or byte buffer data so that no codec accidentally transforms or lossily compresses metadata. Conceptually, each codec should have a metadata type (None by default, it should also have some binary embedding) take data and return a tuple[data, meta] on encoding, where tuple[data, None] is special-cased to equal data for compatibility, and take tuple[data, meta] on decoding and return data. A compression pipeline would then store a stack of metas and push on encoding and pop on decoding, which it could embed as byte stream headers or footers. |
Maybe with some typing and class wrapping shenanigans the API could also be written so that passing tuple[data, metaA] into encode would result in tuple[data, metaB, metaA] so that method calls could be chained easily (with special handling for None's again such that all existing implementations would have no metadata) |
Codecs sometimes need to save metadata that is not part of their config and not part of the array shape/type. For instance, a standardisation codec would need to save the mean and standard deviation of the data.
Byte codecs can get away with simply adding a byte header that includes such information, since the general expectation is that the bytes are opaque and should only be losslessly compressed. However, a codec that performs a transformation should ideally retain the data shape and dtype (unless it’s part of what the codec does), so adding a header becomes really awkward.
Is there an established practice for how to handle such metadata?
If not, how possible would a future API evolution be (that is compatible with usage in Zarr) where each codec can add some metadata (that is JSON-serialisable) on encoding and receives it back on decoding (and for compatibility, no metadata would remain the default)?
The text was updated successfully, but these errors were encountered: