initial files from: #1

braingram · 2023-04-21T21:46:45Z

braingram/asdf_zarr@0c75926

eslavich · 2023-04-23T23:44:22Z

src/asdf_zarr/util.py

+    kwargs = copy.deepcopy(obj_dict)
+    args = []
+    type_string = kwargs.pop('type_string')
+    if not hasattr(zarr.storage, type_string):
+        raise NotImplementedError(f"zarr.storage.Store subclass {type_string} not supported")
+    if 'fs' in kwargs and type_string == 'FSStore':
+        kwargs['fs'] = fsspec.AbstractFileSystem.from_json(kwargs['fs'])
+        args.append(kwargs.pop('path'))
+    return getattr(zarr.storage, type_string)(*args, **kwargs)


There is code like this in astropy that we've regretted because it couples the ASDF format to a Python package implementation. Not that it matters for a prototype...

I would be curious to know more and to hear if you have a suggestion on how to change this to avoid a similar issue. I initially wrote this to hopefully be quite generic (to accept all zarr storage types that we can support). One alternative that comes to mind would be to explicitly list the ones we support but I don't think that's addressing your concern.

Now that I know we're considering this "non-archival" I'm not worried about it. Could store some kind of enum value and maintain a mapping to the storage class in the converter code, but idk if it's worth the inconvenience.

Using a mapping sounds like a good way to add some flexibility in case zarr adds new storage, etc. We should decide on this before making a '1.0' of this package.
I opened an issue for this and added it to a new '1.0' milestone:
#2

eslavich · 2023-04-23T23:47:28Z

src/asdf_zarr/converter.py

+            obj_dict = {}
+
+            # include the meta data in the tree
+            obj_dict['.zarray'] = meta


This implementation is more along the lines of "let's store the zarr format within an ASDF file" and not "let's use zarr as the Python API for ASDF's own chunked array format", right? Is that now the direction that y'all are heading?

That's correct. Given the widespread support for zarr and that it appears to meet most of our 'chunking' needs it seems worthwhile to have an extension that supports the format. It's currently unclear to me if additional options for 'chunking' (in the form of fetching sub-arrays from single blocks) will also be worth implementing for some use cases. I don't currently see a need to create a new chunked array format specifically for ASDF but I'm happy to discuss this.

To address the last issue first (fetching sub-arrays from single blocks), is this something that is necessary? Why won't the general slice mechanism handle this? I find it difficult to understand why someone wants to know the details of chunking if they already know what region they want. If it happens to fall within one block, the API should handle that. I wonder if I am missing something.

To address the first point, this raises two relevant issues (both of which are general in nature). 1) is the zarr format being referred to suitable as an archive format? E.g., is it possible that they will change it and deprecate previous versions? If so, it isn't archival. There are lots of possibilities for ASDF that don't have to be archival (e.g., lots of wild compression schemes). It makes me think we should be explicit about what we consider archival (and have a public table of archival features). Perhaps chunking in general shouldn't be archival. You want the data archival, save it in a vanilla format. You can always convert into a chunked format if that is useful for your needs. The point is we need to be explicit about this particular issue and perhaps I should write up something to explain it (unless someone else wants to do it :-)

To address the second relevant issue, does the zarr format have an existing API for other languages? This issue is somewhat orthogonal the the previous one in that language support presumably can be added later. But whatever other ASDF language libraries that appear probably will be constrained by the availability of language support for the zarr format (unless someone else takes upon themself to supply that as well). So we need to also be aware of what the consequences of this are too.

I opened an issue for us to discuss if files that use this extension are suitable for an archive:
#3

As far as zarr support for other languages, this repository lists implementation:
https://github.com/zarr-developers/zarr_implementations
including c++ java javascript rust (and python).

eslavich · 2023-04-23T23:52:01Z

src/asdf_zarr/storage.py

+
+def _generate_chunk_data_callback(zarray, chunk_key):
+    def chunk_data_callback(zarray=zarray, chunk_key=chunk_key):
+        return numpy.frombuffer(zarray.chunk_store.get(chunk_key), dtype='uint8')


Does chunk_store.get return None when the chunk hasn't been initialized? Can that happen in this context?

In this context chunk_store.get should never return None as _generate_chunk_data_callback is only called for initialized chunks:
https://github.com/asdf-format/asdf-zarr/pull/1/files#diff-0f30ea15c813b0aa2b51c29065c2e499c3e930b015edd590a7be571007733b74R34-R35

eslavich · 2023-04-24T00:12:43Z

src/asdf_zarr/storage.py

+    def __len__(self):
+        return len(set(self.__iter__()))
+
+    def listdir(self, path):


The Store base class already has a reasonable implementation of listdir that seems like it would work for our purposes:

https://github.com/zarr-developers/zarr-python/blob/main/zarr/_storage/store.py#L172-L174

The _listdir_from_keys function called in the default implementation of listdir you linked doesn't play well with keys that contain / (like for NestedDirectoryStore) and will result in listdir only returning the contents of the base directory.
https://github.com/zarr-developers/zarr-python/blob/8f11656959c920099d8a6dec5c0abf4663a862b5/zarr/_storage/store.py#L660-L669

For example, calling the default listdir on a Store that has 4 chunks (0/0, 0/1, 1/0, 1/1) will return ['.zarry', '0' '1'].

I'm not sure if we should instead subclass BaseStore (as I think listdir is optional) but testing that gives a warning:

UserWarning: Store <asdf_zarr.storage.ReadInternalStore object at 0x28801afe0> has no `listdir` method. From zarr 2.9 onwards may want to inherit from `Store````

eslavich

I say merge away!

initial files from:

6ee5315

braingram/asdf_zarr@0c75926

braingram requested a review from eslavich April 21, 2023 21:46

eslavich reviewed Apr 24, 2023

View reviewed changes

eslavich approved these changes May 7, 2023

View reviewed changes

This was referenced May 8, 2023

Add mapping between ASDF storage classes and zarr storage classes #2

Open

Add description of 'archivability' of files that use this extension #3

Open

braingram merged commit ef0ccf2 into asdf-format:main May 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

initial files from: #1

initial files from: #1

braingram commented Apr 21, 2023

eslavich Apr 23, 2023

braingram Apr 24, 2023

eslavich May 7, 2023

braingram May 8, 2023

eslavich Apr 23, 2023

braingram Apr 24, 2023

perrygreenfield Apr 25, 2023

perrygreenfield Apr 25, 2023

perrygreenfield Apr 25, 2023

braingram May 8, 2023

eslavich Apr 23, 2023

braingram Apr 24, 2023

eslavich Apr 24, 2023

braingram Apr 24, 2023

eslavich left a comment

initial files from: #1

initial files from: #1

Conversation

braingram commented Apr 21, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eslavich left a comment

Choose a reason for hiding this comment