DataTree #350

Open
sjperkins opened this issue Jan 16, 2025 · 3 comments


@sjperkins

sjperkins commented Jan 16, 2025

As discussed in the Thursday meetings, xarray.DataTree was proposed as a way of grouping datasets (processing sets in the MSv4 nomenclature). Quoting from an article detailing a collaboration between xarray and NASA devs:

The DataTree concept allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates.

The benefits of this approach are that:

  1. The grouping of multiple datasets is offloaded to the now core xarray DataTree structure.
  2. Downstream libraries such as xradio can use xarray's native I/O routines for reading/writing DataTrees to disk, or to cloud stores such as AWS S3/Google Cloud Platform etc.
  3. Following on from (2), once a DataTree is written to zarr format, only xarray is needed to read the DataTree again, further reducing the number of dependencies for the end user.

As I understand the conversation, there are broadly two issues preventing DataTree adoption:

  1. The lack of a defined hierarchical structure to which to assign the various MSv4 datasets
  2. The presence of xarray Datasets in Dataset and DataArray attributes

1. DataTree Structure

The current directory structure used by xradio might be sufficient as is. The output of the test_alma test case for example produces the following (cut short for brevity) directory tree.

$ tree -L 2 Antennae_North.cal.lsrk.split.ps/
Antennae_North.cal.lsrk.split.ps/
├── Antennae_North.cal.lsrk.split_00
│   ├── antenna_xds
│   ├── correlated_xds
│   └── field_and_source_xds_base
├── Antennae_North.cal.lsrk.split_01
│   ├── antenna_xds
│   ├── correlated_xds
│   └── field_and_source_xds_base
...
└── Antennae_North.cal.lsrk.split_11
    ├── antenna_xds
    ├── correlated_xds
    ├── field_and_source_xds_base
    └── weather_xds

In a DataTree, this might look as follows:

<xarray.DataTree>
Group: /
├── Group: /Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds
│       Dimensions:                     (time: 28760, baseline_id: 2775, frequency: 16,
│                                        polarization: 4, uvw_label: 3)
│       Coordinates:
│           baseline_antenna1_name      (baseline_id) object 22kB ...
│           baseline_antenna2_name      (baseline_id) object 22kB ...
│         * baseline_id                 (baseline_id) int64 22kB 0 1 2 ... 2773 2774
│         * frequency                   (frequency) float64 128B 1.202e+08 ... 1.204e+08
│         * polarization                (polarization) <U2 32B 'XX' 'XY' 'YX' 'YY'
│         * time                        (time) float64 230kB 1.601e+09 ... 1.601e+09
│       Dimensions without coordinates: uvw_label
│       Data variables:
│           EFFECTIVE_INTEGRATION_TIME  (time, baseline_id) float64 638MB ...
│           FLAG                        (time, baseline_id, frequency, polarization) uint8 5GB ...
│           TIME_CENTROID               (time, baseline_id) float64 638MB ...
│           UVW                         (time, baseline_id, uvw_label) float64 2GB ...
│           VISIBILITY                  (time, baseline_id, frequency, polarization) complex64 41GB ...
│           WEIGHT                      (time, baseline_id, frequency, polarization) float32 20GB ...
│       Attributes:
│           version:              4.0.0
│           creation_date:        2025-01-16T16:44:24.072119+00:00
│           data_description_id:  0
└── Group: /Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/antenna_xds
        Dimensions:           (antenna_name: 74,
                               cartesian_pos_label/ellipsoid_pos_label: 3)
        Coordinates:
          * antenna_name      (antenna_name) object 592B 'CS001HBA0' ... 'IE613HBA'
            mount             (antenna_name) object 592B ...
            station           (antenna_name) object 592B ...
        Dimensions without coordinates: cartesian_pos_label/ellipsoid_pos_label
        Data variables:
            ANTENNA_POSITION  (antenna_name, cartesian_pos_label/ellipsoid_pos_label) float64 2kB ...

and the groups are accessed via, for example:

dt["Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds"]
...
dt["Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/antenna_xds"]
...

Possible optimisations might involve storing the antenna_xds and field_and_source_xds_base datasets as first-level tree nodes (similar to the way subtables in MSv2 are organised).

2. Datasets stored in Dataset and DataArray attributes

When xradio loads the above directory structure into a series of xarray Datasets, the antenna_xds and field_and_source_xds_base Datasets are stored as attributes of the correlated_xds Dataset, or its DataArray members. While convenient from a UI perspective, xarray Datasets and DataArrays stored in attributes cannot be written or read by xarray's native zarr routines (to_zarr/open_zarr). This follows from the documentation:

and has been previously discussed here:

Solutions to Datasets as Attributes

1. Use the DataTree structure as is

Given a familiar structure, users can just refer to the related subtable. I prefer OBS.ps/OBS_split_00/ANTENNA over OBS.ps/OBS_split_00/antenna_xds (less typing, similar to MSv2)

2. Use xarray accessors

xarray implements an extension mechanism for DataArrays, Datasets and DataTrees called accessors:

https://docs.xarray.dev/en/latest/internals/extending-xarray.html

One suggestion was to instead store a link to, for example, antenna_xds in the attributes of correlated_xds, and to resolve that link through an accessor:

from xarray import DataTree, register_datatree_accessor

# Store the path of the related antenna dataset as a plain string attribute
dt["/Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds"].attrs["antenna_xds_link"] = (
    "/Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/antenna_xds"
)


@register_datatree_accessor("links")
class SubTableAccessor:
  def __init__(self, node: DataTree):
    self.node = node

  @property
  def antenna(self) -> DataTree:
    """Returns the antenna dataset"""

    try:
      link = self.node.attrs["antenna_xds_link"]
    except KeyError:
      raise ValueError("antenna_xds_link not found") from None
    else:
      # Resolve the stored path starting from the root of the tree
      return self.node.root[link]

Then the following can be used to refer to the associated antenna table

dt["/Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds"].links.antenna

Note that this works as long as dt is a DataTree node (but not if it's a Dataset or DataArray). However, this is only a problem for DataArrays, as DataTree nodes transparently proxy Datasets.

The only case of a DataArray with a Dataset as an attribute is the field_and_source_xds Dataset on the VISIBILITY/SPECTRUM DataArray in correlated_xds. @Jan-Willem, can you remind me of the reason why this needs to be present on the DataArray?

/cc @scpmw

@sjperkins
Author

To summarise my thoughts:

  1. DataTree should be used for grouping because it's a core xarray structure that uses xarray's native I/O routines. Once written to zarr, the MSv4 DataTree is readable and writeable via xarray only.
  2. I'm not strongly attached to how the resulting hierarchy is organised, but
  3. Datasets should not be present as xarray attributes as they break xarray's native I/O routines (zarr in particular)

@dmsadmin137
Contributor

"can you remind me of the reason why this needs to be present on the DataArray?" This is to allow for having multiple versions of the VISIBILITY/SPECTRUM that have been phase-shifted. The VISIBILITY/SPECTRUM versions are tracked by making use of data_groups. When writing to disk the data group name is appended to the related field_and_source_xds.

For example:

  • In memory:
Data Variables:
    VISIBILITY
    VISIBILITY_PS
Attrs:
    data_groups: {'base': {'correlated_data': 'VISIBILITY', ...},
                  'phase_shifted': {'correlated_data': 'VISIBILITY_PS', ...}}

To access a field_and_source_xds:

ms_xds.VISIBILITY.field_and_source_xds
ms_xds.VISIBILITY_PS.field_and_source_xds

or

dg = ms_xds.attrs['data_groups']
ms_xds[dg['base']['correlated_data']].field_and_source_xds
ms_xds[dg['phase_shifted']['correlated_data']].field_and_source_xds

On disk, this would be:

Antennae_North.cal.lsrk.split.ps/
├── Antennae_North.cal.lsrk.split_00
│   ├── antenna_xds
│   ├── correlated_xds
│   ├── field_and_source_xds_base
│   └── field_and_source_xds_phase_shifted

Note that on disk the data_group name is appended to the field_and_source_xds but once loaded it is just field_and_source_xds since it is accessed via the VISIBILITY data variable.

@scpmw
Collaborator

scpmw commented Jan 23, 2025
