Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataArray: propagate index coordinates with non-array dimensions #10116

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Mar 12, 2025

  • support DataArray constructor
  • support DataArray constructed from Dataset (variable access)
  • support update or assign_coords
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

This PR allows including coordinates in a DataArray that do not need to have dimensions strictly matching (a subset of) the array dimensions. It still requires such coordinates to be associated with an index that shares at least one dimension with the array dimensions.

This is particularly useful for CF cell boundaries coordinates, see #1475.

Here is an example notebook showing this PR in action with an Xarray IntervalIndex that works with CF 2-dimensional bounds coordinates (#8005).

@benbovy benbovy added the run-benchmark Run the ASV benchmark workflow label Mar 13, 2025
@benbovy
Copy link
Member Author

benbovy commented Mar 13, 2025

Activated benchmarks on this PR. There is more logic now executed for the very common operation of dataarray access from dataset (e.g., ds.foo) involving Indexes.get_all_dims(), which may impact performance. I don't think we could really avoid it but hopefully the impact is small.

@benbovy benbovy marked this pull request as ready for review March 13, 2025 08:42
raise ValueError(
f"coordinate {k} has dimensions {v.dims}, but these "
"are not a subset of the DataArray "
f"dimensions {dim}"
)

for d, s in v.sizes.items():
if s != sizes[d]:
if d not in extra_index_dims and s != sizes[d]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if d not in extra_index_dims and s != sizes[d]:
if d not in extra_index_dims or s != sizes[d]:

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be and.

extra_index_dims corresponds to all non-array dimensions of index coordinates to include in the dataarray (size[d] would return a KeyError).

I renamed it and added comment in 695fb86.

@dcherian
Copy link
Contributor

I was really excited to see this PR! Thanks, Benoit.

Given that this is a fairly major data model change for DataArray, it'd be nice to get @shoyer's input too.

@dcherian dcherian requested a review from shoyer March 13, 2025 19:20
@benbovy
Copy link
Member Author

benbovy commented Mar 14, 2025

Thanks for the review @dcherian! I addressed most of your comments in 695fb86, hopefully it is much clearer now.

I agree this PR needs further careful review from the Xarray core devs!

I didn't add any documentation for this change. There should be no user-facing change until we also provide custom indexes like IntervalIndex, so we should probably wait before advertising it.

@max-sixty I reverted using dims as argument name for the (renamed) check_dataarray_coords internal helper function to make it consistent with the other helper functions below that also use dims. Using dims in check_dataarray_coords makes senses to me since the value is always a tuple. No strong opinion, though, we can address this here or in a follow-up PR. (Could you point me to the discussion of using dim over dims, please?).

... when check_default_indexes=False.
@benbovy
Copy link
Member Author

benbovy commented Mar 14, 2025

I also have to do some profiling, at least for the failing repr benchmark.

No need to have a mapping of DataArrays to format those coordinate and
data variables sections.
@benbovy
Copy link
Member Author

benbovy commented Mar 14, 2025

Here is the setup I used for profiling dataarray access from a dataset (based on the repr benchmark):

import numpy as np
import xarray as xr

a = np.arange(0, 100)
data_vars = dict()
for i in a:
    data_vars[f"long_variable_name_{i}"] = xr.DataArray(
        name=f"long_variable_name_{i}",
        data=np.arange(0, 20),
        dims=[f"long_coord_name_{i}_x"],
        coords={f"long_coord_name_{i}_x": np.arange(0, 20) * 2},
    )
ds = xr.Dataset(data_vars)
ds.attrs = {f"attr_{k}": 2 for k in a}

def test_func(ds):
    for k in ds.variables:
        da = ds[k]

As expected, Indexes.get_all_dims has a real impact on performance.

this PR:

Screenshot 2025-03-14 at 15 06 56

main branch:

Screenshot 2025-03-14 at 15 08 40

I'm not sure how we can avoid this, though. The new coordinate propagation rules introduced here require calculating the dimensions of each index.

That said, I'm not sure if this benchmark is really representative of any real-world scenario? It is really the worst-case scenario, with 100 data variables each having their own, single-index dimension coordinate. So the cache for Indexes.get_all_dims added in 528f356 has no effect here. It will have a positive effect on repeated dataarray access but only if .xindexes is cached too (#10074).


I could fix the failing repr benchmarks in 5e8093b as we don't need mappings of DataArray objects to format the coordinates and data variables repr sections (mappings of Variable objects are enough).

@@ -399,9 +402,14 @@ def _assert_dataarray_invariants(da: DataArray, check_default_indexes: bool):
da.dims,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the variables in _coords different from those on .coords? I'm surprised this is still true now...

We should add one more invariant: something like "a dimension in da.coords.dims must be in da.dims or be associated with an index associated with a dimension in da.dims"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The tests I added compare Coordinates objects, which thus skip this invariant check that we indeed need to update.

@dcherian
Copy link
Contributor

It is really the worst-case scenario, with 100 data variables each having their own, single-index dimension coordinate.

Do we have a benchmark for O(300) vars with the only 4 dimensions? (this is the ERA5 dataset)

@dcherian
Copy link
Contributor

Another thing to fix is that the Indexes section of the repr will need its own list of dimensions, since these are not all shared with DataArray.dims. This could be done in a future PR, but we should at least open an issue in that case.

@benbovy
Copy link
Member Author

benbovy commented Mar 14, 2025

That would useful to test (I can have a look), although I suspect that with 4 dimensions and no custom index (so maximum 4 default single-coordinate indexes) there won't be any significant impact.

Comment on lines +152 to 174
# check dimension names
for name, var in coords.items():
if name in skip_check_coord_name:
continue
elif name in indexes:
index_dims = indexes.get_all_dims(name)
if any(d in dims for d in index_dims):
# can safely skip checking index's non-array dimensions
# and index's other coordinates since those must be all
# included in the dataarray so the index is not corrupted
skip_check_coord_name.update(indexes.get_all_coords(name))
skip_check_dim_size.update(d for d in index_dims if d not in dims)
raise_error = False
else:
raise_error = True
else:
raise_error = any(d not in dims for d in var.dims)
if raise_error:
raise ValueError(
f"coordinate {k} has dimensions {v.dims}, but these "
f"coordinate {name} has dimensions {var.dims}, but these "
"are not a subset of the DataArray "
f"dimensions {dim}"
f"dimensions {dims}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time following the logic in this loop. Could you please separate it into two separate loops:

  1. Update the sets of all coords/names to skip
  2. Raise any necessary errors

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably even better checking indexes and non-indexed coordinates in two separate loops, as you suggests in #10116 (comment).


for name, var in coords.items():
for d, s in var.sizes.items():
if d not in skip_check_dim_size and s != sizes[d]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the only case in which a dimension is in skip_check_dim_size is when it is defined only on an index, in which case it won't appear in sizes.

In that case, I think we could simplify this by checking sizes instead of creating the skip_check_dim_size set:

Suggested change
if d not in skip_check_dim_size and s != sizes[d]:
if d in sizes and s != sizes[d]:

Comment on lines +1217 to +1224
if (
k not in coords
and k in temp_indexes
and set(temp_indexes.get_all_dims(k)) & needed_dims
):
# add all coordinates of each index that shares at least one dimension
# with the dimensions of the extracted variable
coords.update(temp_indexes.get_all_coords(k))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coud this use a separate loop after the existing loop instead? e.g.,

for k in self._indexes:
    if k in coords:
        coords.update(self.xindexes.get_all_coords(k))

Or if we allow indexes without a coordinate of the same name:

for k in self._indexes:
    if set(self.xindexes.get_all_dims(k)) & needed_dims:
        coords.update(self.xindexes.get_all_coords(k))

Ideally, I would like the logic here to be just as simple as the words describing how it works, so a comment is not necessary!

@shoyer
Copy link
Member

shoyer commented Mar 14, 2025

Thanks @benbovy for putting this together. Custom indexes do seem like a great way to solve the coordinate propagation problem for cases like cell bounds.

One edge case that comes to mind are coordinates that share dimensions with an index dimension, but that are not associated with an index directly. For example, consider a longitude dimension with a longitude_bounds coordinate with dimensions (longitude, longitude_bounds) and another coordinate variable flux also with dimensions (longitude, longitude_bounds). Would flux be included automatically in every case where longitude is preserved or not? In other words, does including coordinates from indexes happen before or after including all coordinates with dimensions associated with other variables/coordinates?

Another scenario I am concerned about is a 2D spatial index, e.g., along latitude and longitude. If the result needs only the latitude dimension, would the 2D spatial index and the longitude coordinate still be included? This would probably not be desirable.

A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray Index class. This might be a good idea if we are not sure we can come up with a good general rule for all cases.

@benbovy
Copy link
Member Author

benbovy commented Mar 14, 2025

Thanks for your feedback @shoyer!

Would flux be included automatically in every case where longitude is preserved or not? In other words, does including coordinates from indexes happen before or after including all coordinates with dimensions associated with other variables/coordinates?

In that case flux wouldn't be included but we could easily update the rules so that including non-indexed coordinates depends on a set of dimensions calculated after going over all indexed coordinates.

Another scenario I am concerned about is a 2D spatial index, e.g., along latitude and longitude. If the result needs only the latitude dimension, would the 2D spatial index and the longitude coordinate still be included? This would probably not be desirable.

Yes indeed I can see how it would be confusing, in some cases desirable and in other cases not.

A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray Index class. This might be a good idea if we are not sure we can come up with a good general rule for all cases.

I didn't thought about it but I like this idea! On the plus side this would certainly prevent computing a set of dimensions for each index.

My main concern is that the Index API is already quite flexible (maybe one reason this feature hasn't been much adopted so far? but certainly not the main reason since documentation is still almost non-existent and built-in custom indexes are still missing...), and delegating even more to custom indexes might not help towards driving more adoption. That may be a minor concern, though.

I'm not sure yet how would look like such custom method on the Index class. Do you have any idea? Maybe something like this?

class Index:

    def get_dataarray_coords(
        self,
        variables: dict[Hashable, Variable],
        dims: tuple[Hashable, ...],
    ) -> Coordinates:
        """Select or generate coordinates to be included in a DataArray.

        This method is called when a new DataArray is created either from scratch,
        by assigning new coordinates to an existing DataArray or by extracting a
        variable from a Dataset.

        All the returned coordinate variables (and their indexes) will be included
        in the new DataArray, whether or not their dimensions match (a subset of)
        the array dimensions.

        It is possible to return coordinates with another index instance (possibly of
        an other type) in case the new DataArray has a different (reduced) set of
        dimensions.

        Parameters
        ----------
        variables : dict
            All coordinate variables associated to this index.
        dims: tuple
            Dataarray's dimensions.

        Returns
        -------
        A new :py:class:`xarray.Coordinates` instance that will
        be included in the new DataArray (maybe alongside other
        coordinates).
      
        """
        # probably better to return something meaningful as the default implementation.
        raise NotImplementedError()
    

The variables argument is needed since The Index base class does not provide any built-in functionality or model that keeps track of coordinate variables (and their dimensions) associated with an index. The return type Coordinates allows maximum flexibility, e.g., return the index itself and all of its coordinates as we would do for an IntervalIndex with center and bounds coordinates, return a 1D PandasIndex from a 2D spatial index if the result is only the latitude dimension, etc.

Note that an index may already control which coordinates (and indexes) are propagated in a DataArray via other Index methods like Index.isel, Index.sel, etc. as shown by the IntervalIndex example linked in the top comment.

@benbovy
Copy link
Member Author

benbovy commented Mar 14, 2025

probably better to return something meaningful as the default implementation.

It would make sense to return all coordinates and their index if their dimensions are all compatible with the dataarray dimensions, or otherwise raise an error (or drop the index return only the coordinates with compatible dimensions).

@benbovy benbovy marked this pull request as draft March 17, 2025 07:28
@benbovy
Copy link
Member Author

benbovy commented Mar 17, 2025

A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray Index class. This might be a good idea if we are not sure we can come up with a good general rule for all cases.

See #10137!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-benchmark Run the ASV benchmark workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants