-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataArray: propagate index coordinates with non-array dimensions #10116
base: main
Are you sure you want to change the base?
DataArray: propagate index coordinates with non-array dimensions #10116
Conversation
Activated benchmarks on this PR. There is more logic now executed for the very common operation of dataarray access from dataset (e.g., |
xarray/core/dataarray.py
Outdated
raise ValueError( | ||
f"coordinate {k} has dimensions {v.dims}, but these " | ||
"are not a subset of the DataArray " | ||
f"dimensions {dim}" | ||
) | ||
|
||
for d, s in v.sizes.items(): | ||
if s != sizes[d]: | ||
if d not in extra_index_dims and s != sizes[d]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if d not in extra_index_dims and s != sizes[d]: | |
if d not in extra_index_dims or s != sizes[d]: |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be and
.
extra_index_dims
corresponds to all non-array dimensions of index coordinates to include in the dataarray (size[d] would return a KeyError).
I renamed it and added comment in 695fb86.
I was really excited to see this PR! Thanks, Benoit. Given that this is a fairly major data model change for DataArray, it'd be nice to get @shoyer's input too. |
Thanks for the review @dcherian! I addressed most of your comments in 695fb86, hopefully it is much clearer now. I agree this PR needs further careful review from the Xarray core devs! I didn't add any documentation for this change. There should be no user-facing change until we also provide custom indexes like IntervalIndex, so we should probably wait before advertising it. @max-sixty I reverted using |
... when check_default_indexes=False.
I also have to do some profiling, at least for the failing repr benchmark. |
No need to have a mapping of DataArrays to format those coordinate and data variables sections.
Here is the setup I used for profiling dataarray access from a dataset (based on the repr benchmark): import numpy as np
import xarray as xr
a = np.arange(0, 100)
data_vars = dict()
for i in a:
data_vars[f"long_variable_name_{i}"] = xr.DataArray(
name=f"long_variable_name_{i}",
data=np.arange(0, 20),
dims=[f"long_coord_name_{i}_x"],
coords={f"long_coord_name_{i}_x": np.arange(0, 20) * 2},
)
ds = xr.Dataset(data_vars)
ds.attrs = {f"attr_{k}": 2 for k in a}
def test_func(ds):
for k in ds.variables:
da = ds[k] As expected, this PR: main branch: ![]() I'm not sure how we can avoid this, though. The new coordinate propagation rules introduced here require calculating the dimensions of each index. That said, I'm not sure if this benchmark is really representative of any real-world scenario? It is really the worst-case scenario, with 100 data variables each having their own, single-index dimension coordinate. So the cache for I could fix the failing repr benchmarks in 5e8093b as we don't need mappings of DataArray objects to format the coordinates and data variables repr sections (mappings of Variable objects are enough). |
@@ -399,9 +402,14 @@ def _assert_dataarray_invariants(da: DataArray, check_default_indexes: bool): | |||
da.dims, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the variables in _coords
different from those on .coords
? I'm surprised this is still true now...
We should add one more invariant: something like "a dimension in da.coords.dims
must be in da.dims
or be associated with an index associated with a dimension in da.dims
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! The tests I added compare Coordinates objects, which thus skip this invariant check that we indeed need to update.
Do we have a benchmark for O(300) vars with the only 4 dimensions? (this is the ERA5 dataset) |
Another thing to fix is that the Indexes section of the repr will need its own list of dimensions, since these are not all shared with DataArray.dims. This could be done in a future PR, but we should at least open an issue in that case. |
That would useful to test (I can have a look), although I suspect that with 4 dimensions and no custom index (so maximum 4 default single-coordinate indexes) there won't be any significant impact. |
# check dimension names | ||
for name, var in coords.items(): | ||
if name in skip_check_coord_name: | ||
continue | ||
elif name in indexes: | ||
index_dims = indexes.get_all_dims(name) | ||
if any(d in dims for d in index_dims): | ||
# can safely skip checking index's non-array dimensions | ||
# and index's other coordinates since those must be all | ||
# included in the dataarray so the index is not corrupted | ||
skip_check_coord_name.update(indexes.get_all_coords(name)) | ||
skip_check_dim_size.update(d for d in index_dims if d not in dims) | ||
raise_error = False | ||
else: | ||
raise_error = True | ||
else: | ||
raise_error = any(d not in dims for d in var.dims) | ||
if raise_error: | ||
raise ValueError( | ||
f"coordinate {k} has dimensions {v.dims}, but these " | ||
f"coordinate {name} has dimensions {var.dims}, but these " | ||
"are not a subset of the DataArray " | ||
f"dimensions {dim}" | ||
f"dimensions {dims}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time following the logic in this loop. Could you please separate it into two separate loops:
- Update the sets of all coords/names to skip
- Raise any necessary errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably even better checking indexes and non-indexed coordinates in two separate loops, as you suggests in #10116 (comment).
|
||
for name, var in coords.items(): | ||
for d, s in var.sizes.items(): | ||
if d not in skip_check_dim_size and s != sizes[d]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, the only case in which a dimension is in skip_check_dim_size
is when it is defined only on an index, in which case it won't appear in sizes
.
In that case, I think we could simplify this by checking sizes
instead of creating the skip_check_dim_size
set:
if d not in skip_check_dim_size and s != sizes[d]: | |
if d in sizes and s != sizes[d]: |
if ( | ||
k not in coords | ||
and k in temp_indexes | ||
and set(temp_indexes.get_all_dims(k)) & needed_dims | ||
): | ||
# add all coordinates of each index that shares at least one dimension | ||
# with the dimensions of the extracted variable | ||
coords.update(temp_indexes.get_all_coords(k)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coud this use a separate loop after the existing loop instead? e.g.,
for k in self._indexes:
if k in coords:
coords.update(self.xindexes.get_all_coords(k))
Or if we allow indexes without a coordinate of the same name:
for k in self._indexes:
if set(self.xindexes.get_all_dims(k)) & needed_dims:
coords.update(self.xindexes.get_all_coords(k))
Ideally, I would like the logic here to be just as simple as the words describing how it works, so a comment is not necessary!
Thanks @benbovy for putting this together. Custom indexes do seem like a great way to solve the coordinate propagation problem for cases like cell bounds. One edge case that comes to mind are coordinates that share dimensions with an index dimension, but that are not associated with an index directly. For example, consider a Another scenario I am concerned about is a 2D spatial index, e.g., along A different way to implement the logic of this PR (in a more general way) would be to delegate deciding which coordinates to include to a custom method on the xarray |
Thanks for your feedback @shoyer!
In that case
Yes indeed I can see how it would be confusing, in some cases desirable and in other cases not.
I didn't thought about it but I like this idea! On the plus side this would certainly prevent computing a set of dimensions for each index. My main concern is that the I'm not sure yet how would look like such custom method on the class Index:
def get_dataarray_coords(
self,
variables: dict[Hashable, Variable],
dims: tuple[Hashable, ...],
) -> Coordinates:
"""Select or generate coordinates to be included in a DataArray.
This method is called when a new DataArray is created either from scratch,
by assigning new coordinates to an existing DataArray or by extracting a
variable from a Dataset.
All the returned coordinate variables (and their indexes) will be included
in the new DataArray, whether or not their dimensions match (a subset of)
the array dimensions.
It is possible to return coordinates with another index instance (possibly of
an other type) in case the new DataArray has a different (reduced) set of
dimensions.
Parameters
----------
variables : dict
All coordinate variables associated to this index.
dims: tuple
Dataarray's dimensions.
Returns
-------
A new :py:class:`xarray.Coordinates` instance that will
be included in the new DataArray (maybe alongside other
coordinates).
"""
# probably better to return something meaningful as the default implementation.
raise NotImplementedError()
The Note that an index may already control which coordinates (and indexes) are propagated in a DataArray via other |
It would make sense to return all coordinates and their index if their dimensions are all compatible with the dataarray dimensions, or otherwise raise an error (or drop the index return only the coordinates with compatible dimensions). |
See #10137! |
assign_coords
whats-new.rst
api.rst
This PR allows including coordinates in a DataArray that do not need to have dimensions strictly matching (a subset of) the array dimensions. It still requires such coordinates to be associated with an index that shares at least one dimension with the array dimensions.
This is particularly useful for CF cell boundaries coordinates, see #1475.
Here is an example notebook showing this PR in action with an Xarray IntervalIndex that works with CF 2-dimensional bounds coordinates (#8005).