DataTree #350
To summarise my thoughts:
"can you remind me of the reason why this needs to be present on the DataArray?" This is to allow for multiple versions of the VISIBILITY/SPECTRUM that have been phase-shifted. The VISIBILITY/SPECTRUM versions are tracked using `data_groups`. When writing to disk, the data group name is appended to the related `field_and_source_xds`. For example:
To access a field_and_source_xds:
or
On disk, this would be:
Note that on disk the `data_group` name is appended to the `field_and_source_xds`, but once loaded it is just `field_and_source_xds`, since it is accessed via the VISIBILITY data variable.
My thoughts on the topic: https://confluence.skatelescope.org/display/SEC/Datatree+proposal
As discussed in the Thursday meetings, `xarray.DataTree` was proposed as a way of grouping datasets (processing sets in the MSv4 nomenclature). Quoting from the following article, which details a collaboration between xarray and NASA devs:
The benefits of this approach are that:
As I understand the conversation, there are broadly two issues preventing DataTree adoption: the DataTree structure, and `Datasets` stored in `Dataset` and `DataArray` attributes.

1. DataTree Structure
The current directory structure used by xradio might be sufficient as is. The `test_alma` test case, for example, produces the following directory tree (cut short for brevity).
In a DataTree, this might look as follows
and is accessed via, for example
Possible optimisations might involve storing the `antenna_xds` and `field_and_source_xds_base` datasets as first-level tree nodes (similar to the way subtables in MSv2 are organised).

2. Datasets stored in Dataset and DataArray attributes
When xradio loads the above directory structure into a series of xarray Datasets, the `antenna_xds` and `field_and_source_xds_base` Datasets are stored as attributes of the `correlated_xds` Dataset, or its DataArray members. While convenient from a UI perspective, xarray `Datasets` and `DataArrays` stored in attributes cannot be written or read by xarray's native `to_zarr`/`open_zarr` methods. This follows from the documentation:
https://docs.xarray.dev/en/latest/user-guide/data-structures.html#dataset-contents
https://docs.xarray.dev/en/latest/getting-started-guide/faq.html#what-is-your-approach-to-metadata
and has been previously discussed here:
xarray.dataset.to_zarr
#2161. Solutions to Datasets as Attributes
Use the DataTree structure as is
Given a familiar structure, users can just refer to the related subtable. I prefer `OBS.ps/OBS_split_00/ANTENNA` over `OBS.ps/OBS_split_00/antenna_xds` (less typing, similar to MSv2).

2. Use xarray accessors
xarray implements an extension mechanism for DataArrays, Datasets and DataTrees called accessors:
https://docs.xarray.dev/en/latest/internals/extending-xarray.html
One suggestion was to instead store a relative link to, for example, `antenna_xds` on the `correlated_xds`.
Then the following can be used to refer to the associated antenna table
Note that this works as long as `dt` is a `DataTree` node (but not if it's a Dataset or DataArray). However, this is only a problem for DataArrays, as DataTree nodes transparently proxy Datasets.

The only case of a DataArray with a Dataset as an attribute is the `field_and_source_xds` Dataset on the `VISIBILITY/SPECTRUM` DataArray in `correlated_xds`. @Jan-Willem, can you remind me of the reason why this needs to be present on the DataArray?

/cc @scpmw