Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial files from: #1

Merged
merged 1 commit into from
May 8, 2023
Merged

initial files from: #1

merged 1 commit into from
May 8, 2023

Conversation

braingram
Copy link
Contributor

@braingram braingram requested a review from eslavich April 21, 2023 21:46
Comment on lines +54 to +62
kwargs = copy.deepcopy(obj_dict)
args = []
type_string = kwargs.pop('type_string')
if not hasattr(zarr.storage, type_string):
raise NotImplementedError(f"zarr.storage.Store subclass {type_string} not supported")
if 'fs' in kwargs and type_string == 'FSStore':
kwargs['fs'] = fsspec.AbstractFileSystem.from_json(kwargs['fs'])
args.append(kwargs.pop('path'))
return getattr(zarr.storage, type_string)(*args, **kwargs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is code like this in astropy that we've regretted because it couples the ASDF format to a Python package implementation. Not that it matters for a prototype...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be curious to know more and to hear if you have a suggestion on how to change this to avoid a similar issue. I initially wrote this to hopefully be quite generic (to accept all zarr storage types that we can support). One alternative that comes to mind would be to explicitly list the ones we support but I don't think that's addressing your concern.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I know we're considering this "non-archival" I'm not worried about it. Could store some kind of enum value and maintain a mapping to the storage class in the converter code, but idk if it's worth the inconvenience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a mapping sounds like a good way to add some flexibility in case zarr adds new storage, etc. We should decide on this before making a '1.0' of this package.
I opened an issue for this and added it to a new '1.0' milestone:
#2

obj_dict = {}

# include the meta data in the tree
obj_dict['.zarray'] = meta

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is more along the lines of "let's store the zarr format within an ASDF file" and not "let's use zarr as the Python API for ASDF's own chunked array format", right? Is that now the direction that y'all are heading?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Given the widespread support for zarr and that it appears to meet most of our 'chunking' needs it seems worthwhile to have an extension that supports the format. It's currently unclear to me if additional options for 'chunking' (in the form of fetching sub-arrays from single blocks) will also be worth implementing for some use cases. I don't currently see a need to create a new chunked array format specifically for ASDF but I'm happy to discuss this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address the last issue first (fetching sub-arrays from single blocks), is this something that is necessary? Why won't the general slice mechanism handle this? I find it difficult to understand why someone wants to know the details of chunking if they already know what region they want. If it happens to fall within one block, the API should handle that. I wonder if I am missing something.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address the first point, this raises two relevant issues (both of which are general in nature). 1) is the zarr format being referred to suitable as an archive format? E.g., is it possible that they will change it and deprecate previous versions? If so, it isn't archival. There are lots of possibilities for ASDF that don't have to be archival (e.g., lots of wild compression schemes). It makes me think we should be explicit about what we consider archival (and have a public table of archival features). Perhaps chunking in general shouldn't be archival. You want the data archival, save it in a vanilla format. You can always convert into a chunked format if that is useful for your needs. The point is we need to be explicit about this particular issue and perhaps I should write up something to explain it (unless someone else wants to do it :-)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address the second relevant issue, does the zarr format have an existing API for other languages? This issue is somewhat orthogonal the the previous one in that language support presumably can be added later. But whatever other ASDF language libraries that appear probably will be constrained by the availability of language support for the zarr format (unless someone else takes upon themself to supply that as well). So we need to also be aware of what the consequences of this are too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue for us to discuss if files that use this extension are suitable for an archive:
#3

As far as zarr support for other languages, this repository lists implementation:
https://github.com/zarr-developers/zarr_implementations
including c++ java javascript rust (and python).


def _generate_chunk_data_callback(zarray, chunk_key):
def chunk_data_callback(zarray=zarray, chunk_key=chunk_key):
return numpy.frombuffer(zarray.chunk_store.get(chunk_key), dtype='uint8')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does chunk_store.get return None when the chunk hasn't been initialized? Can that happen in this context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context chunk_store.get should never return None as _generate_chunk_data_callback is only called for initialized chunks:
https://github.com/asdf-format/asdf-zarr/pull/1/files#diff-0f30ea15c813b0aa2b51c29065c2e499c3e930b015edd590a7be571007733b74R34-R35

def __len__(self):
return len(set(self.__iter__()))

def listdir(self, path):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Store base class already has a reasonable implementation of listdir that seems like it would work for our purposes:

https://github.com/zarr-developers/zarr-python/blob/main/zarr/_storage/store.py#L172-L174

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _listdir_from_keys function called in the default implementation of listdir you linked doesn't play well with keys that contain / (like for NestedDirectoryStore) and will result in listdir only returning the contents of the base directory.
https://github.com/zarr-developers/zarr-python/blob/8f11656959c920099d8a6dec5c0abf4663a862b5/zarr/_storage/store.py#L660-L669

For example, calling the default listdir on a Store that has 4 chunks (0/0, 0/1, 1/0, 1/1) will return ['.zarry', '0' '1'].

I'm not sure if we should instead subclass BaseStore (as I think listdir is optional) but testing that gives a warning:

UserWarning: Store <asdf_zarr.storage.ReadInternalStore object at 0x28801afe0> has no `listdir` method. From zarr 2.9 onwards may want to inherit from `Store```` 

Copy link

@eslavich eslavich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say merge away!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants