
[Feature]: Add support for overriding backend configuration in HDF5 datasets #1170

Closed
pauladkisson opened this issue Aug 14, 2024 · 3 comments · Fixed by #1172
Comments

@pauladkisson (Contributor)

What would you like to see added to HDMF?

I am working on a new helper feature for neuroconv, in which users can repack an NWB file with new backend configurations (catalystneuro/neuroconv#1003). However, when I try to export the NWB file with the new backend configurations, I get a user warning and the new backend configuration is ignored.

/opt/anaconda3/envs/neuroconv_tdtfp_env/lib/python3.12/site-packages/hdmf/utils.py:668: UserWarning: chunks in H5DataIO will be ignored with H5DataIO.data being an HDF5 dataset
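
For concreteness, here is a rough sketch of the repack pattern that hits this (the file names and the dataset being rewrapped are illustrative, assuming the file is read and exported with pynwb):

  from hdmf.backends.hdf5 import H5DataIO
  from pynwb import NWBHDF5IO

  with NWBHDF5IO("source.nwb", mode="r") as read_io:
      nwbfile = read_io.read()
      ts = nwbfile.acquisition["my_series"]  # hypothetical TimeSeries
      # ts.data is still an open h5py.Dataset, so wrapping it with new settings
      # triggers the UserWarning above and the settings are dropped.
      ts.set_data_io("data", H5DataIO, data_io_kwargs=dict(chunks=(1000,), compression="gzip"))
      with NWBHDF5IO("repacked.nwb", mode="w") as export_io:
          # link_data=False so datasets are copied rather than linked; the copy
          # keeps the source layout, ignoring the requested backend configuration.
          export_io.export(src_io=read_io, nwbfile=nwbfile, write_args=dict(link_data=False))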

What solution would you like?

I was able to solve this problem by simply converting the HDF5 dataset to a numpy array like so:

  # hdmf.container.Container (requires: import numpy as np; from warnings import warn;
  # from typing import Type; from hdmf.data_utils import DataIO)
  # The line marked "+" is the proposed change.
  def set_data_io(self, dataset_name: str, data_io_class: Type[DataIO], data_io_kwargs: dict = None, **kwargs):
      """
      Apply DataIO object to a dataset field of the Container.

      Parameters
      ----------
      dataset_name: str
          Name of dataset to wrap in DataIO
      data_io_class: Type[DataIO]
          Class to use for DataIO, e.g. H5DataIO or ZarrDataIO
      data_io_kwargs: dict
          keyword arguments passed to the constructor of the DataIO class.
      **kwargs:
          DEPRECATED. Use data_io_kwargs instead.
          kwargs are passed to the constructor of the DataIO class.
      """
      if kwargs or (data_io_kwargs is None):
          warn(
              "Use of **kwargs in Container.set_data_io() is deprecated. Please pass the DataIO kwargs as a "
              "dictionary to the `data_io_kwargs` parameter instead.",
              DeprecationWarning,
              stacklevel=2
          )
          data_io_kwargs = kwargs
      data = self.fields.get(dataset_name)
      if data is None:
          raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
+     data = np.array(data)  # proposed: load the h5py dataset into memory so the new DataIO settings take effect
      self.fields[dataset_name] = data_io_class(data=data, **data_io_kwargs)

I would appreciate some kind of alternative set_data_io() function that supports overriding the backend configuration of HDF5 datasets in this manner (or something similar).

Do you have any interest in helping implement the feature?

Yes.

@oruebel (Contributor) commented Aug 15, 2024

To copy datasets on export, the HDF5IO backend uses the copy method from h5py:

  else:
      # TODO add option for case where there are multiple links to the same dataset within a file:
      # instead of copying the dset N times, copy it once and create soft links to it within the file
      self.logger.debug(" Copying data from '%s://%s' to '%s/%s'"
                        % (data.file.filename, data.name, parent.name, name))
      parent.copy(source=data,
                  dest=parent,
                  name=name,
                  expand_soft=False,
                  expand_external=False,
                  expand_refs=False,
                  without_attrs=True)
      dset = parent[name]

which does not support changing chunking, compression, etc. Converting to np.array is not ideal because it loads the entire dataset into memory, which is problematic for large arrays. Instead, to avoid loading all the data at once, you could wrap the dataset with some variant of AbstractDataChunkIterator so that the data is loaded and written in larger blocks rather than all at once. However, if the chunk shapes of the source dataset A and the target dataset B are not well aligned, then copying the data iteratively can become quite expensive.
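
As a rough illustration of that approach (a sketch only; it uses hdmf's concrete DataChunkIterator, which streams the source in blocks along the first axis, and the chunking/compression values are made up):

  from hdmf.backends.hdf5 import H5DataIO
  from hdmf.data_utils import DataChunkIterator

  # `dset` is the open h5py.Dataset read from the source file (illustrative)
  wrapped = H5DataIO(
      data=DataChunkIterator(data=dset, buffer_size=10000),  # read/write ~10000 rows at a time
      chunks=(1000,),       # new chunk shape for the copy
      compression="gzip",   # new compression for the copy
  )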

A possible option may be to modify set_data_io to support both wrapping with DataIO and AbstractDataChunkIterator, i.e.:

  1. Add dci_cls: Type[AbstractDataChunkIterator] = hdmf.data_utils.GenericDataChunkIterator as a parameter so that a user can specify what type of iterator to use, with GenericDataChunkIterator as a sensible default
  2. Add dci_kwargs: dict = None so that a user can optionally provide the parameters for the iterator
  3. If dci_kwargs is not None, wrap the dataset with the DataChunkIterator first, before wrapping it with DataIO (see the sketch below)
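
A minimal sketch of what that modified set_data_io could look like (this is not the implementation merged for this issue; the parameter names follow the suggestion above, the deprecated **kwargs handling is omitted for brevity, and the default iterator class is an assumption, since GenericDataChunkIterator is abstract and would need a concrete subclass that binds the HDF5 dataset):

  from typing import Type
  from hdmf.data_utils import AbstractDataChunkIterator, DataChunkIterator, DataIO

  def set_data_io(self, dataset_name: str, data_io_class: Type[DataIO], data_io_kwargs: dict = None,
                  dci_cls: Type[AbstractDataChunkIterator] = DataChunkIterator,
                  dci_kwargs: dict = None):
      data = self.fields.get(dataset_name)
      if data is None:
          raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
      if dci_kwargs is not None:
          # Wrap the existing dataset in an iterator so it is read and rewritten in
          # blocks instead of being loaded into memory all at once.
          data = dci_cls(data=data, **dci_kwargs)
      self.fields[dataset_name] = data_io_class(data=data, **(data_io_kwargs or {}))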

@oruebel (Contributor) commented Aug 15, 2024

@pauladkisson would you want to take a stab at making a PR for this?

@pauladkisson (Contributor, Author)

@oruebel, thanks for the detailed explanation! I figured the np.array solution wouldn't be ideal, but I wasn't totally sure why.

would you want to take a stab at making a PR for this?

Yeah, I can give it a go.

@mavaylon1 added the labels "category: enhancement (improvements of code or code behavior)" and "priority: medium (non-critical problem and/or affecting only a small set of users)" on Aug 21, 2024
@rly closed this as completed in #1172 on Sep 16, 2024