initial files from: #1
Conversation
kwargs = copy.deepcopy(obj_dict)
args = []
type_string = kwargs.pop('type_string')
if not hasattr(zarr.storage, type_string):
    raise NotImplementedError(f"zarr.storage.Store subclass {type_string} not supported")
if 'fs' in kwargs and type_string == 'FSStore':
    kwargs['fs'] = fsspec.AbstractFileSystem.from_json(kwargs['fs'])
args.append(kwargs.pop('path'))
return getattr(zarr.storage, type_string)(*args, **kwargs)
---
There is code like this in astropy that we've regretted because it couples the ASDF format to a Python package implementation. Not that it matters for a prototype...
---
I would be curious to know more, and to hear if you have a suggestion on how to change this to avoid a similar issue. I initially wrote this to be quite generic (to accept all zarr storage types that we can support). One alternative that comes to mind would be to explicitly list the ones we support, but I don't think that addresses your concern.
---
Now that I know we're considering this "non-archival" I'm not worried about it. Could store some kind of enum value and maintain a mapping to the storage class in the converter code, but idk if it's worth the inconvenience.
---
Using a mapping sounds like a good way to add some flexibility in case zarr adds new storage, etc. We should decide on this before making a '1.0' of this package.
I opened an issue for this and added it to a new '1.0' milestone:
#2
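The enum/mapping approach discussed above could be sketched as follows. This is a hypothetical illustration, not the PR's code: the class names here are stand-ins rather than zarr's actual storage classes, and `store_from_dict` and `SUPPORTED_STORES` are invented names.

```python
# Sketch (assumption): an explicit registry maps a stored enum-like string
# to a storage class, decoupling the serialized ASDF contents from any
# Python class name. Stand-in classes are used instead of zarr's.
class DirectoryStore:
    def __init__(self, path):
        self.path = path

class NestedDirectoryStore(DirectoryStore):
    pass

# mapping maintained in the converter code, not derived via getattr()
SUPPORTED_STORES = {
    "directory": DirectoryStore,
    "nested_directory": NestedDirectoryStore,
}

def store_from_dict(obj_dict):
    kwargs = dict(obj_dict)
    key = kwargs.pop("type_string")
    if key not in SUPPORTED_STORES:
        raise NotImplementedError(f"store type {key!r} not supported")
    return SUPPORTED_STORES[key](**kwargs)

store = store_from_dict({"type_string": "directory", "path": "/tmp/data"})
```

The upside over `getattr(zarr.storage, type_string)` is that the format only ever records the registry keys, so a rename or deprecation in zarr can be absorbed by updating the mapping.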
obj_dict = {}

# include the meta data in the tree
obj_dict['.zarray'] = meta
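For illustration, the `.zarray` metadata document stored under this key is a small JSON-style mapping defined by the zarr v2 format. The values below are example placeholders, not taken from the PR:

```python
# Illustrative .zarray metadata (zarr v2 format); values are placeholders.
meta = {
    "zarr_format": 2,
    "shape": [4, 4],       # overall array shape
    "chunks": [2, 2],      # shape of each chunk
    "dtype": "<i8",        # little-endian 64-bit integer
    "compressor": None,
    "fill_value": 0,
    "filters": None,
    "order": "C",
}

obj_dict = {}
# include the metadata in the tree, as the converter code does
obj_dict[".zarray"] = meta
```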
---
This implementation is more along the lines of "let's store the zarr format within an ASDF file" and not "let's use zarr as the Python API for ASDF's own chunked array format", right? Is that now the direction that y'all are heading?
---
That's correct. Given the widespread support for zarr, and that it appears to meet most of our 'chunking' needs, it seems worthwhile to have an extension that supports the format. It's currently unclear to me whether additional options for 'chunking' (in the form of fetching sub-arrays from single blocks) will also be worth implementing for some use cases. I don't currently see a need to create a new chunked array format specifically for ASDF, but I'm happy to discuss this.
---
To address the last issue first (fetching sub-arrays from single blocks), is this something that is necessary? Why won't the general slice mechanism handle this? I find it difficult to understand why someone wants to know the details of chunking if they already know what region they want. If it happens to fall within one block, the API should handle that. I wonder if I am missing something.
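The point about slicing can be illustrated with a simplified 1-D sketch (not zarr's implementation; the function name is invented): a requested slice maps deterministically to chunk indices, so a region that falls within a single chunk touches only that chunk, without the caller needing to know the chunking details.

```python
def chunks_for_slice(start, stop, chunk_size):
    """Chunk indices touched by the half-open 1-D slice [start, stop)."""
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))

# A region inside one chunk touches exactly that chunk, so the generic
# slice mechanism only fetches what it needs...
single = chunks_for_slice(10, 20, 100)     # -> [0]
# ...while a region spanning a chunk boundary touches both chunks.
spanning = chunks_for_slice(90, 110, 100)  # -> [0, 1]
```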
---
To address the first point, this raises two relevant issues (both of which are general in nature). 1) Is the zarr format being referred to suitable as an archive format? E.g., is it possible that they will change it and deprecate previous versions? If so, it isn't archival. There are lots of possibilities for ASDF that don't have to be archival (e.g., lots of wild compression schemes). It makes me think we should be explicit about what we consider archival (and have a public table of archival features). Perhaps chunking in general shouldn't be archival. If you want the data to be archival, save it in a vanilla format. You can always convert into a chunked format if that is useful for your needs. The point is we need to be explicit about this particular issue, and perhaps I should write up something to explain it (unless someone else wants to do it :-)
---
To address the second relevant issue, does the zarr format have an existing API for other languages? This issue is somewhat orthogonal to the previous one, in that language support presumably can be added later. But whatever other ASDF language libraries appear will probably be constrained by the availability of language support for the zarr format (unless someone takes it upon themselves to supply that as well). So we need to also be aware of the consequences of this.
---
I opened an issue for us to discuss if files that use this extension are suitable for an archive:
#3
As far as zarr support for other languages, this repository lists implementations:
https://github.com/zarr-developers/zarr_implementations
including C++, Java, JavaScript, and Rust (as well as Python).
def _generate_chunk_data_callback(zarray, chunk_key):
    def chunk_data_callback(zarray=zarray, chunk_key=chunk_key):
        return numpy.frombuffer(zarray.chunk_store.get(chunk_key), dtype='uint8')
---
Does chunk_store.get return None when the chunk hasn't been initialized? Can that happen in this context?
---
In this context chunk_store.get should never return None, as _generate_chunk_data_callback is only called for initialized chunks:
https://github.com/asdf-format/asdf-zarr/pull/1/files#diff-0f30ea15c813b0aa2b51c29065c2e499c3e930b015edd590a7be571007733b74R34-R35
def __len__(self):
    return len(set(self.__iter__()))

def listdir(self, path):
---
The Store base class already has a reasonable implementation of listdir that seems like it would work for our purposes:
https://github.com/zarr-developers/zarr-python/blob/main/zarr/_storage/store.py#L172-L174
---
The _listdir_from_keys function called in the default implementation of listdir you linked doesn't play well with keys that contain / (like those for NestedDirectoryStore) and will result in listdir only returning the contents of the base directory.
https://github.com/zarr-developers/zarr-python/blob/8f11656959c920099d8a6dec5c0abf4663a862b5/zarr/_storage/store.py#L660-L669
For example, calling the default listdir on a Store that has 4 chunks (0/0, 0/1, 1/0, 1/1) will return ['.zarray', '0', '1'].
I'm not sure if we should instead subclass BaseStore (as I think listdir is optional), but testing that gives a warning:
UserWarning: Store <asdf_zarr.storage.ReadInternalStore object at 0x28801afe0> has no `listdir` method. From zarr 2.9 onwards may want to inherit from `Store`.
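For reference, a prefix-aware directory listing that handles nested keys could be sketched like this. It is a standalone illustration over a plain key list (the function name is invented, and this is not the PR's implementation):

```python
def listdir_nested(keys, path=""):
    # Sketch: list the immediate children under `path` for stores whose
    # keys may contain '/', e.g. NestedDirectoryStore-style chunk keys
    # like '0/0'. Everything past the first '/' under the prefix is
    # collapsed into its parent entry.
    prefix = path + "/" if path else ""
    children = set()
    for key in keys:
        if key.startswith(prefix):
            children.add(key[len(prefix):].split("/")[0])
    return sorted(children)

keys = [".zarray", "0/0", "0/1", "1/0", "1/1"]
top = listdir_nested(keys)        # -> ['.zarray', '0', '1']
row0 = listdir_nested(keys, "0")  # -> ['0', '1']
```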
---
I say merge away!
braingram/asdf_zarr@0c75926