DataTree #350

sjperkins · 2025-01-16T17:10:34Z

As discussed in Thursday meetings, xarray.DataTree was proposed as a way of grouping datasets (processing sets in the MSv4 nomenclature). Quoting from the following article, which details a collaboration between xarray and NASA devs:

The DataTree concept allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates.

The benefits of this approach are that:

1. The grouping of multiple datasets is offloaded to the DataTree structure, which is now part of core xarray.
2. Downstream libraries such as xradio can use xarray's native I/O routines for reading/writing DataTrees to disk, or to cloud stores such as AWS S3/Google Cloud Platform etc.
3. Following on from (2), once a DataTree is written to zarr format, only xarray is needed to read the DataTree again, further reducing the number of dependencies for the end user.

As I understand the conversation there are broadly two issues preventing DataTree adoption:

1. The lack of a defined hierarchical structure to which to assign the various MSv4 datasets.
2. The presence of xarray Datasets in Dataset and DataArray attributes.

1. DataTree Structure

The current directory structure used by xradio might be sufficient as is. For example, the test_alma test case produces the following directory tree (cut short for brevity):

$ tree -L 2 Antennae_North.cal.lsrk.split.ps/
Antennae_North.cal.lsrk.split.ps/
├── Antennae_North.cal.lsrk.split_00
│   ├── antenna_xds
│   ├── correlated_xds
│   └── field_and_source_xds_base
├── Antennae_North.cal.lsrk.split_01
│   ├── antenna_xds
│   ├── correlated_xds
│   └── field_and_source_xds_base
...
└── Antennae_North.cal.lsrk.split_11
    ├── antenna_xds
    ├── correlated_xds
    ├── field_and_source_xds_base
    └── weather_xds

In a DataTree, this might look as follows

<xarray.DataTree>
Group: /
└── Group: /Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds
|    │   Dimensions:                     (time: 28760, baseline_id: 2775, frequency: 16,
|    │                                    polarization: 4, uvw_label: 3)
|    │   Coordinates:
|    │       baseline_antenna1_name      (baseline_id) object 22kB ...
|    │       baseline_antenna2_name      (baseline_id) object 22kB ...
|    │     * baseline_id                 (baseline_id) int64 22kB 0 1 2 ... 2773 2774
|    │     * frequency                   (frequency) float64 128B 1.202e+08 ... 1.204e+08
|    │     * polarization                (polarization) <U2 32B 'XX' 'XY' 'YX' 'YY'
|    │     * time                        (time) float64 230kB 1.601e+09 ... 1.601e+09
|    │   Dimensions without coordinates: uvw_label
|    │   Data variables:
|    │       EFFECTIVE_INTEGRATION_TIME  (time, baseline_id) float64 638MB ...
|    │       FLAG                        (time, baseline_id, frequency, polarization) uint8 5GB ...
|    │       TIME_CENTROID               (time, baseline_id) float64 638MB ...
|    │       UVW                         (time, baseline_id, uvw_label) float64 2GB ...
|    │       VISIBILITY                  (time, baseline_id, frequency, polarization) complex64 41GB ...
|    │       WEIGHT                      (time, baseline_id, frequency, polarization) float32 20GB ...
|    │   Attributes:
|    │       version:              4.0.0
|    │       creation_date:        2025-01-16T16:44:24.072119+00:00
|    │       data_description_id:  0
└── Group: /Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/antenna_xds
        Dimensions:           (antenna_name: 74,
                                   cartesian_pos_label/ellipsoid_pos_label: 3)
        Coordinates:
         * antenna_name      (antenna_name) object 592B 'CS001HBA0' ... 'IE613HBA'
            mount             (antenna_name) object 592B ...
            station           (antenna_name) object 592B ...
        Dimensions without coordinates: cartesian_pos_label/ellipsoid_pos_label
        Data variables:
            ANTENNA_POSITION  (antenna_name, cartesian_pos_label/ellipsoid_pos_label) float64 2kB ...

The individual nodes can then be accessed via, for example:

dt["Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds"]
...
dt["Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/antenna_xds"]
...

Possible optimisations might involve storing the antenna_xds and field_and_source_xds_base datasets as first-level tree nodes (similar to the way subtables in MSv2 are organised).

2. Datasets stored in Dataset and DataArray attributes

When xradio loads the above directory structure into a series of xarray Datasets, the antenna_xds and field_and_source_xds_base Datasets are stored as attributes of the correlated_xds Dataset, or its DataArray members. While convenient from a UI perspective, xarray Datasets and DataArrays stored in attributes cannot be written or read by xarray's native to_zarr/open_zarr routines. This follows from the documentation:

https://docs.xarray.dev/en/latest/user-guide/data-structures.html#dataset-contents

Xarray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or numpy.ndarray objects.

https://docs.xarray.dev/en/latest/getting-started-guide/faq.html#what-is-your-approach-to-metadata

In general xarray uses the capabilities of the backends for reading and writing attributes.

and has been previously discussed here:

The processing set partition can not be written back to disk using xarray.dataset.to_zarr #216

3. Solutions to Datasets as Attributes

1. Use the DataTree structure as is

Given a familiar structure, users can just refer to the related subtable. I prefer OBS.ps/OBS_split_00/ANTENNA over OBS.ps/OBS_split_00/antenna_xds (less typing, similar to MSv2)

2. Use xarray accessors

xarray implements an extension mechanism for DataArrays, Datasets and DataTrees called accessors:

https://docs.xarray.dev/en/latest/internals/extending-xarray.html

One suggestion was to instead store a relative link to, for example, antenna_xds on correlated_xds, and to resolve that link through an accessor:

dt["/Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds"].attrs["antenna_xds_link"] = (
    "/Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/antenna_xds"
)


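from xarray import DataTree, register_datatree_accessor


@register_datatree_accessor("links")
class SubTableAccessor:
  def __init__(self, node: DataTree):
    # The DataTree node on which the accessor was invoked
    self.node = node

  @property
  def antenna(self) -> DataTree:
    """Returns the antenna dataset node linked from this node"""
    try:
      link = self.node.attrs["antenna_xds_link"]
    except KeyError as e:
      raise ValueError("antenna_xds_link not found") from e

    # Resolve the stored path starting from the root of the tree
    return self.node.root[link]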

The associated antenna table can then be referenced as follows:

dt["/Antennae_North.cal.lsrk.split.ps/Antennae_North.cal.lsrk.split_00/correlated_xds"].links.antenna

Note that this works as long as dt is a DataTree node (but not if it's a Dataset or DataArray). However, this is only a problem for DataArrays, as DataTree nodes transparently proxy Datasets.

The only case of a DataArray with a Dataset as an attribute is the field_and_source_xds Dataset on the VISIBILITY/SPECTRUM DataArray in correlated_xds. @Jan-Willem, can you remind me of the reason why this needs to be present on the DataArray?

/cc @scpmw


sjperkins · 2025-01-16T17:15:26Z

To summarise my thoughts:

DataTree should be used for grouping because it's a core xarray structure that uses xarray's native I/O routines. Once written to zarr, the MSv4 DataTree is readable and writeable via xarray only.
I'm not strongly attached to how the resulting hierarchy is organised, but Datasets should not be present as xarray attributes, as they break xarray's native I/O routines (zarr in particular).

dmsadmin137 · 2025-01-23T13:18:41Z

"can you remind me of the reason why this needs to be present on the DataArray?" This is to allow for having multiple versions of the VISIBILITY/SPECTRUM that have been phase-shifted. The VISIBILITY/SPECTRUM versions are tracked by making use of data_groups. When writing to disk the data group name is appended to the related field_and_source_xds.

For example:

In memory:

Data Variables:
         VISIBILITY
         VISIBILITY_PS
Attrs:
        data_groups: {'base':{'correlated_data':'VISIBILITY',...},
                                 'phase_shifted':{'correlated_data':'VISIBILITY_PS',...}}

To access a field_and_source_xds:

ms_xds.VISIBILITY.field_and_source_xds
ms_xds.VISIBILITY_PS.field_and_source_xds

or

dg = ms_xds.attrs['data_groups']
ms_xds[dg['base']['correlated_data']].field_and_source_xds
ms_xds[dg['phase_shifted']['correlated_data']].field_and_source_xds

On disk, this would be:

Antennae_North.cal.lsrk.split.ps/
├── Antennae_North.cal.lsrk.split_00
│   ├── antenna_xds
│   ├── correlated_xds
│   ├── field_and_source_xds_base
│   └── field_and_source_xds_phase_shifted

Note that on disk the data_group name is appended to field_and_source_xds, but once loaded it is just field_and_source_xds, since it is accessed via the VISIBILITY data variable.

scpmw · 2025-01-23T14:04:36Z

My thoughts on the topic: https://confluence.skatelescope.org/display/SEC/Datatree+proposal

sjperkins added the MSv4 Review label Jan 16, 2025

sjperkins mentioned this issue Feb 11, 2025

Datatree prototyping #360

Open
