DataArray: propagate index coordinates with non-array dimensions #10116

benbovy · 2025-03-12T12:20:38Z

support DataArray constructor
support DataArray constructed from Dataset (variable access)
support update or assign_coords
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

This PR allows including coordinates in a DataArray that do not need to have dimensions strictly matching (a subset of) the array dimensions. It still requires such coordinates to be associated with an index that shares at least one dimension with the array dimensions.

This is particularly useful for CF cell boundaries coordinates, see #1475.

Here is an example notebook showing this PR in action with an Xarray IntervalIndex that works with CF 2-dimensional bounds coordinates (#8005).

xarray/core/dataarray.py

benbovy · 2025-03-13T08:40:38Z

Activated benchmarks on this PR. There is more logic now executed for the very common operation of dataarray access from dataset (e.g., ds.foo) involving Indexes.get_all_dims(), which may impact performance. I don't think we could really avoid it but hopefully the impact is small.

xarray/core/coordinates.py

xarray/core/dataarray.py

dcherian · 2025-03-13T19:15:51Z

xarray/core/dataarray.py

            raise ValueError(
                f"coordinate {k} has dimensions {v.dims}, but these "
                "are not a subset of the DataArray "
                f"dimensions {dim}"
            )

        for d, s in v.sizes.items():
-            if s != sizes[d]:
+            if d not in extra_index_dims and s != sizes[d]:


Suggested change

if d not in extra_index_dims and s != sizes[d]:

if d not in extra_index_dims or s != sizes[d]:

?

This should be and.

extra_index_dims corresponds to all non-array dimensions of index coordinates to include in the dataarray (size[d] would return a KeyError).

I renamed it and added comment in 695fb86.

xarray/core/dataset.py

xarray/tests/test_dataarray.py

dcherian · 2025-03-13T19:20:10Z

I was really excited to see this PR! Thanks, Benoit.

Given that this is a fairly major data model change for DataArray, it'd be nice to get @shoyer's input too.

benbovy · 2025-03-14T10:34:18Z

Thanks for the review @dcherian! I addressed most of your comments in 695fb86, hopefully it is much clearer now.

I agree this PR needs further careful review from the Xarray core devs!

I didn't add any documentation for this change. There should be no user-facing change until we also provide custom indexes like IntervalIndex, so we should probably wait before advertising it.

@max-sixty I reverted using dims as argument name for the (renamed) check_dataarray_coords internal helper function to make it consistent with the other helper functions below that also use dims. Using dims in check_dataarray_coords makes senses to me since the value is always a tuple. No strong opinion, though, we can address this here or in a follow-up PR. (Could you point me to the discussion of using dim over dims, please?).

... when check_default_indexes=False.

benbovy · 2025-03-14T10:52:25Z

I also have to do some profiling, at least for the failing repr benchmark.

No need to have a mapping of DataArrays to format those coordinate and data variables sections.

benbovy · 2025-03-14T14:35:53Z

Here is the setup I used for profiling dataarray access from a dataset (based on the repr benchmark):

import numpy as np
import xarray as xr

a = np.arange(0, 100)
data_vars = dict()
for i in a:
    data_vars[f"long_variable_name_{i}"] = xr.DataArray(
        name=f"long_variable_name_{i}",
        data=np.arange(0, 20),
        dims=[f"long_coord_name_{i}_x"],
        coords={f"long_coord_name_{i}_x": np.arange(0, 20) * 2},
    )
ds = xr.Dataset(data_vars)
ds.attrs = {f"attr_{k}": 2 for k in a}

def test_func(ds):
    for k in ds.variables:
        da = ds[k]

As expected, Indexes.get_all_dims has a real impact on performance.

this PR:

main branch:

I'm not sure how we can avoid this, though. The new coordinate propagation rules introduced here require calculating the dimensions of each index.

That said, I'm not sure if this benchmark is really representative of any real-world scenario? It is really the worst-case scenario, with 100 data variables each having their own, single-index dimension coordinate. So the cache for Indexes.get_all_dims added in 528f356 has no effect here. It will have a positive effect on repeated dataarray access but only if .xindexes is cached too (#10074).

I could fix the failing repr benchmarks in 5e8093b as we don't need mappings of DataArray objects to format the coordinates and data variables repr sections (mappings of Variable objects are enough).

dcherian · 2025-03-14T14:38:42Z

xarray/testing/assertions.py

@@ -399,9 +402,14 @@ def _assert_dataarray_invariants(da: DataArray, check_default_indexes: bool):
        da.dims,


Are the variables in _coords different from those on .coords? I'm surprised this is still true now...

We should add one more invariant: something like "a dimension in da.coords.dims must be in da.dims or be associated with an index associated with a dimension in da.dims"

Good catch! The tests I added compare Coordinates objects, which thus skip this invariant check that we indeed need to update.

dcherian · 2025-03-14T14:40:02Z

It is really the worst-case scenario, with 100 data variables each having their own, single-index dimension coordinate.

Do we have a benchmark for O(300) vars with the only 4 dimensions? (this is the ERA5 dataset)

dcherian · 2025-03-14T14:43:59Z

Another thing to fix is that the Indexes section of the repr will need its own list of dimensions, since these are not all shared with DataArray.dims. This could be done in a future PR, but we should at least open an issue in that case.

benbovy · 2025-03-14T14:46:37Z

That would useful to test (I can have a look), although I suspect that with 4 dimensions and no custom index (so maximum 4 default single-coordinate indexes) there won't be any significant impact.

shoyer · 2025-03-14T14:32:34Z

xarray/core/dataarray.py

+    # check dimension names
+    for name, var in coords.items():
+        if name in skip_check_coord_name:
+            continue
+        elif name in indexes:
+            index_dims = indexes.get_all_dims(name)
+            if any(d in dims for d in index_dims):
+                # can safely skip checking index's non-array dimensions
+                # and index's other coordinates since those must be all
+                # included in the dataarray so the index is not corrupted
+                skip_check_coord_name.update(indexes.get_all_coords(name))
+                skip_check_dim_size.update(d for d in index_dims if d not in dims)
+                raise_error = False
+            else:
+                raise_error = True
+        else:
+            raise_error = any(d not in dims for d in var.dims)
+        if raise_error:
            raise ValueError(
-                f"coordinate {k} has dimensions {v.dims}, but these "
+                f"coordinate {name} has dimensions {var.dims}, but these "
                "are not a subset of the DataArray "
-                f"dimensions {dim}"
+                f"dimensions {dims}"
            )


I'm having a hard time following the logic in this loop. Could you please separate it into two separate loops:

Update the sets of all coords/names to skip

Raise any necessary errors

Probably even better checking indexes and non-indexed coordinates in two separate loops, as you suggests in #10116 (comment).

shoyer · 2025-03-14T14:35:12Z

xarray/core/dataarray.py

+
+    for name, var in coords.items():
+        for d, s in var.sizes.items():
+            if d not in skip_check_dim_size and s != sizes[d]:


If I understand correctly, the only case in which a dimension is in skip_check_dim_size is when it is defined only on an index, in which case it won't appear in sizes.

In that case, I think we could simplify this by checking sizes instead of creating the skip_check_dim_size set:

Suggested change

if d not in skip_check_dim_size and s != sizes[d]:

if d in sizes and s != sizes[d]:

shoyer · 2025-03-14T14:40:55Z

xarray/core/dataset.py

+                if (
+                    k not in coords
+                    and k in temp_indexes
+                    and set(temp_indexes.get_all_dims(k)) & needed_dims
+                ):
+                    # add all coordinates of each index that shares at least one dimension
+                    # with the dimensions of the extracted variable
+                    coords.update(temp_indexes.get_all_coords(k))


Coud this use a separate loop after the existing loop instead? e.g.,

for k in self._indexes: if k in coords: coords.update(self.xindexes.get_all_coords(k))

Or if we allow indexes without a coordinate of the same name:

for k in self._indexes: if set(self.xindexes.get_all_dims(k)) & needed_dims: coords.update(self.xindexes.get_all_coords(k))

Ideally, I would like the logic here to be just as simple as the words describing how it works, so a comment is not necessary!

shoyer · 2025-03-14T15:16:02Z

Thanks @benbovy for putting this together. Custom indexes do seem like a great way to solve the coordinate propagation problem for cases like cell bounds.

One edge case that comes to mind are coordinates that share dimensions with an index dimension, but that are not associated with an index directly. For example, consider a longitude dimension with a longitude_bounds coordinate with dimensions (longitude, longitude_bounds) and another coordinate variable flux also with dimensions (longitude, longitude_bounds). Would flux be included automatically in every case where longitude is preserved or not? In other words, does including coordinates from indexes happen before or after including all coordinates with dimensions associated with other variables/coordinates?

Another scenario I am concerned about is a 2D spatial index, e.g., along latitude and longitude. If the result needs only the latitude dimension, would the 2D spatial index and the longitude coordinate still be included? This would probably not be desirable.

A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray Index class. This might be a good idea if we are not sure we can come up with a good general rule for all cases.

benbovy · 2025-03-14T17:06:19Z

Thanks for your feedback @shoyer!

Would flux be included automatically in every case where longitude is preserved or not? In other words, does including coordinates from indexes happen before or after including all coordinates with dimensions associated with other variables/coordinates?

In that case flux wouldn't be included but we could easily update the rules so that including non-indexed coordinates depends on a set of dimensions calculated after going over all indexed coordinates.

Another scenario I am concerned about is a 2D spatial index, e.g., along latitude and longitude. If the result needs only the latitude dimension, would the 2D spatial index and the longitude coordinate still be included? This would probably not be desirable.

Yes indeed I can see how it would be confusing, in some cases desirable and in other cases not.

A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray Index class. This might be a good idea if we are not sure we can come up with a good general rule for all cases.

I didn't thought about it but I like this idea! On the plus side this would certainly prevent computing a set of dimensions for each index.

My main concern is that the Index API is already quite flexible (maybe one reason this feature hasn't been much adopted so far? but certainly not the main reason since documentation is still almost non-existent and built-in custom indexes are still missing...), and delegating even more to custom indexes might not help towards driving more adoption. That may be a minor concern, though.

I'm not sure yet how would look like such custom method on the Index class. Do you have any idea? Maybe something like this?

class Index:

    def get_dataarray_coords(
        self,
        variables: dict[Hashable, Variable],
        dims: tuple[Hashable, ...],
    ) -> Coordinates:
        """Select or generate coordinates to be included in a DataArray.

        This method is called when a new DataArray is created either from scratch,
        by assigning new coordinates to an existing DataArray or by extracting a
        variable from a Dataset.

        All the returned coordinate variables (and their indexes) will be included
        in the new DataArray, whether or not their dimensions match (a subset of)
        the array dimensions.

        It is possible to return coordinates with another index instance (possibly of
        an other type) in case the new DataArray has a different (reduced) set of
        dimensions.

        Parameters
        ----------
        variables : dict
            All coordinate variables associated to this index.
        dims: tuple
            Dataarray's dimensions.

        Returns
        -------
        A new :py:class:`xarray.Coordinates` instance that will
        be included in the new DataArray (maybe alongside other
        coordinates).
      
        """
        # probably better to return something meaningful as the default implementation.
        raise NotImplementedError()

The variables argument is needed since The Index base class does not provide any built-in functionality or model that keeps track of coordinate variables (and their dimensions) associated with an index. The return type Coordinates allows maximum flexibility, e.g., return the index itself and all of its coordinates as we would do for an IntervalIndex with center and bounds coordinates, return a 1D PandasIndex from a 2D spatial index if the result is only the latitude dimension, etc.

Note that an index may already control which coordinates (and indexes) are propagated in a DataArray via other Index methods like Index.isel, Index.sel, etc. as shown by the IntervalIndex example linked in the top comment.

benbovy · 2025-03-14T17:27:26Z

probably better to return something meaningful as the default implementation.

It would make sense to return all coordinates and their index if their dimensions are all compatible with the dataarray dimensions, or otherwise raise an error (or drop the index return only the coordinates with compatible dimensions).

benbovy · 2025-03-17T11:37:25Z

A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray Index class. This might be a good idea if we are not sure we can come up with a good general rule for all cases.

See #10137!

benbovy added 2 commits March 12, 2025 12:59

DataArray constructor: update coords dims check

f93b45a

get datarray from dataset: update coords selection

01fc5fd

This was referenced Mar 12, 2025

Allow DataArray to hold cell boundaries as coordinate variables #1475

Open

Design for IntervalIndex #8005

Open

benbovy added 2 commits March 12, 2025 17:18

fix typing issues

75b9ae5

DataArray update coords: update coords dims check

b647df7

max-sixty approved these changes Mar 12, 2025

View reviewed changes

xarray/core/dataarray.py Show resolved Hide resolved

benbovy added 3 commits March 13, 2025 08:44

rename arg dims > dim

90ce0f9

add tests

ebcfa22

update whats new

f27f2d0

benbovy added the run-benchmark Run the ASV benchmark workflow label Mar 13, 2025

benbovy marked this pull request as ready for review March 13, 2025 08:42

benbovy added 2 commits March 13, 2025 14:24

cache Indexes.get_all_dims

528f356

Merge branch 'main' into dataarray-propagate-non-dim-index-coords

27a1e38