
[Feature]: Add support for overriding backend configuration in HDF5 datasets #1170

Closed
pauladkisson opened this issue Aug 14, 2024 · 3 comments · Fixed by #1172
Comments

@pauladkisson (Contributor)

What would you like to see added to HDMF?

I am working on a new helper feature for neuroconv, in which users can repack an NWB file with new backend configurations (catalystneuro/neuroconv#1003). However, when I try to export the NWB file with the new backend configurations, I get a user warning and the new backend configuration is ignored.

/opt/anaconda3/envs/neuroconv_tdtfp_env/lib/python3.12/site-packages/hdmf/utils.py:668: UserWarning: chunks in H5DataIO will be ignored with H5DataIO.data being an HDF5 dataset
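
For concreteness, here is a rough sketch of the repack pattern that hits this (the file names and the dataset being rewrapped are illustrative, assuming the file is read and exported with pynwb):

  from hdmf.backends.hdf5 import H5DataIO
  from pynwb import NWBHDF5IO

  with NWBHDF5IO("source.nwb", mode="r") as read_io:
      nwbfile = read_io.read()
      ts = nwbfile.acquisition["my_series"]  # hypothetical TimeSeries
      # ts.data is still an open h5py.Dataset, so wrapping it with new settings
      # triggers the UserWarning above and the settings are dropped.
      ts.set_data_io("data", H5DataIO, data_io_kwargs=dict(chunks=(1000,), compression="gzip"))
      with NWBHDF5IO("repacked.nwb", mode="w") as export_io:
          # link_data=False so datasets are copied rather than linked; the copy
          # keeps the source layout, ignoring the requested backend configuration.
          export_io.export(src_io=read_io, nwbfile=nwbfile, write_args=dict(link_data=False))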

What solution would you like?

I was able to solve this problem by simply converting the HDF5 dataset to a numpy array like so:

  # hdmf.container.Container (requires: import numpy as np; from warnings import warn;
  # from typing import Type; from hdmf.data_utils import DataIO)
  # The line marked "+" is the proposed change.
  def set_data_io(self, dataset_name: str, data_io_class: Type[DataIO], data_io_kwargs: dict = None, **kwargs):
      """
      Apply DataIO object to a dataset field of the Container.

      Parameters
      ----------
      dataset_name: str
          Name of dataset to wrap in DataIO
      data_io_class: Type[DataIO]
          Class to use for DataIO, e.g. H5DataIO or ZarrDataIO
      data_io_kwargs: dict
          keyword arguments passed to the constructor of the DataIO class.
      **kwargs:
          DEPRECATED. Use data_io_kwargs instead.
          kwargs are passed to the constructor of the DataIO class.
      """
      if kwargs or (data_io_kwargs is None):
          warn(
              "Use of **kwargs in Container.set_data_io() is deprecated. Please pass the DataIO kwargs as a "
              "dictionary to the `data_io_kwargs` parameter instead.",
              DeprecationWarning,
              stacklevel=2
          )
          data_io_kwargs = kwargs
      data = self.fields.get(dataset_name)
      if data is None:
          raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
+     data = np.array(data)  # proposed: load the h5py dataset into memory so the new DataIO settings take effect
      self.fields[dataset_name] = data_io_class(data=data, **data_io_kwargs)

I would appreciate some kind of alternative set_data_io() function that supports overriding the backend configuration of HDF5 datasets in this manner (or something similar).

Do you have any interest in helping implement the feature?

Yes.

@oruebel (Contributor) commented Aug 15, 2024

To copy datasets on export, the HDF5IO backend uses the copy method from h5py:

  else:
      # TODO add option for case where there are multiple links to the same dataset within a file:
      # instead of copying the dset N times, copy it once and create soft links to it within the file
      self.logger.debug(" Copying data from '%s://%s' to '%s/%s'"
                        % (data.file.filename, data.name, parent.name, name))
      parent.copy(source=data,
                  dest=parent,
                  name=name,
                  expand_soft=False,
                  expand_external=False,
                  expand_refs=False,
                  without_attrs=True)
      dset = parent[name]

which does not support changing chunking, compression, etc. Converting to np.array is not ideal because it loads the entire dataset into memory, which is problematic for large arrays. Instead, to avoid loading all the data at once, you could wrap the dataset with some variant of AbstractDataChunkIterator so that the data is loaded and written in larger blocks rather than all at once. However, if the chunk shapes of the source dataset A and the target dataset B are not well aligned, then copying the data iteratively can become quite expensive.
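
As a rough illustration of that approach (a sketch only; it uses hdmf's concrete DataChunkIterator, which streams the source in blocks along the first axis, and the chunking/compression values are made up):

  from hdmf.backends.hdf5 import H5DataIO
  from hdmf.data_utils import DataChunkIterator

  # `dset` is the open h5py.Dataset read from the source file (illustrative)
  wrapped = H5DataIO(
      data=DataChunkIterator(data=dset, buffer_size=10000),  # read/write ~10000 rows at a time
      chunks=(1000,),       # new chunk shape for the copy
      compression="gzip",   # new compression for the copy
  )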

A possible option may be to modify set_data_io to support both wrapping with DataIO and AbstractDataChunkIterator, i.e.:

  1. Add dci_cls: Type[AbstractDataChunkIterator] = hdmf.data_utils.GenericDataChunkIterator as a parameter so that a user can specify what type of iterator to use, with GenericDataChunkIterator as a sensible default
  2. Add dci_kwargs: dict = None so that a user can optionally provide the parameters for the iterator
  3. If dci_kwargs is not None, wrap the dataset with the DataChunkIterator first, before wrapping it with DataIO (see the sketch below)
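
A minimal sketch of what that modified set_data_io could look like (this is not the implementation merged for this issue; the parameter names follow the suggestion above, the deprecated **kwargs handling is omitted for brevity, and the default iterator class is an assumption, since GenericDataChunkIterator is abstract and would need a concrete subclass that binds the HDF5 dataset):

  from typing import Type
  from hdmf.data_utils import AbstractDataChunkIterator, DataChunkIterator, DataIO

  def set_data_io(self, dataset_name: str, data_io_class: Type[DataIO], data_io_kwargs: dict = None,
                  dci_cls: Type[AbstractDataChunkIterator] = DataChunkIterator,
                  dci_kwargs: dict = None):
      data = self.fields.get(dataset_name)
      if data is None:
          raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
      if dci_kwargs is not None:
          # Wrap the existing dataset in an iterator so it is read and rewritten in
          # blocks instead of being loaded into memory all at once.
          data = dci_cls(data=data, **dci_kwargs)
      self.fields[dataset_name] = data_io_class(data=data, **(data_io_kwargs or {}))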

@oruebel (Contributor) commented Aug 15, 2024

@pauladkisson would you want to take a stab at making a PR for this?

@pauladkisson (Contributor, Author)

@oruebel, thanks for the detailed explanation! I figured the np.array solution wouldn't be ideal, but I wasn't totally sure why.

would you want to take a stab at making a PR for this?

Yeah, I can give it a go.

@mavaylon1 added the labels "category: enhancement (improvements of code or code behavior)" and "priority: medium (non-critical problem and/or affecting only a small set of users)" on Aug 21, 2024
@rly closed this as completed in #1172 on Sep 16, 2024