Multidimensional non-dimension coordinates in DataArray + Dataset #9579

zerothi · 2024-10-04T09:58:21Z

What is your issue?

I wish to do data analysis on some data in a seemingly weird format. I had hoped I could use xarray for this.

I am using a Dataset variable looking like this:

<xarray.Dataset>
Dimensions:        (size: 29, affinity: 5, thread: 10)
Coordinates:
  * size           (size) float64 0.1625 0.2875 0.4125 0.5375 ... 31.5 33.5 35.5
    affinities     (affinity, thread) object '0' '1' '2' '3' ... '3' '8' '4' '9'
Dimensions without coordinates: affinity, thread
Data variables:
    time_min       (size, affinity) float64 3.327e-06 ... 0.0...
    time_max       (size, affinity) float64 3.327e-06 ... 0.0...

Let me explain why it looks like this:

I have a size dimension coordinate, which is simple and straightforward.
I have two dimensions, affinity and thread. The affinity is a simple index, equivalent to a linear experiment index.
The thread dimension is the number of threads I ran the experiment on. So it is not a dimension of the data, but intrinsic information about each experiment (affinity index).
Then I have a second, non-dimension coordinate called affinities¹, which is defined over the two dimensions above.

This construction seemed natural to me because:

I ran the experiment for each size.
For each size, I ran the experiment affinity times with 10 threads, and I stored the unique placements of the threads in the affinities coordinate.

I.e. the thread placement is an intrinsic part of the experiment index, not a dimension of the data.

The nice thing is that I can do:

for affinity, group in ds.groupby("affinity"):
    # group holds a unique affinity configuration
    ...

The problem comes when I need to extract only one of the variables.

>>> ds.time_min
<xarray.DataArray 'time_min' (size: 29, affinity: 5)>
Coordinates:
  * size     (size) float64 0.1625 0.2875 0.4125 0.5375 ... 29.5 31.5 33.5 35.5

so I lose all information related to the affinity. I can understand why this happens: every dimension of a DataArray is tightly bound to its variable, so its coordinates must be as well.
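The behaviour described above can be reproduced on a small synthetic dataset. This is only a sketch: the sizes, the placement strings, and the timing values are made up for illustration, and only the variable and dimension names follow the report. It prints, for each coordinate, whether all of its dimensions are also dimensions of the extracted array and whether it is present on that array.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in for the dataset in the report; all values are made up.
n_size, n_affinity = 4, 3
placements = np.array([["0", "1"], ["0", "2"], ["1", "3"]], dtype=object)

ds = xr.Dataset(
    data_vars={
        "time_min": (("size", "affinity"), np.zeros((n_size, n_affinity))),
        "time_max": (("size", "affinity"), np.ones((n_size, n_affinity))),
    },
    coords={
        "size": np.linspace(0.5, 2.0, n_size),
        "affinities": (("affinity", "thread"), placements),
    },
)

da = ds["time_min"]

# Compare each coordinate's dimensions with the dimensions of the extracted array.
for name, coord in ds.coords.items():
    fits = set(coord.dims) <= set(da.dims)
    print(f"{name}: dims={coord.dims}, fits={fits}, present={name in da.coords}")
```

Here affinities spans the thread dimension, which time_min does not have, so it does not fit on the extracted DataArray.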

But is this the wrong way to structure things?

I considered turning affinities into an attribute, but that has the problem that it won't get carried over when extracting the variable (I also tried with xr.set_options(keep_attrs=True) to no avail), and it surely won't select the affinity index.
I.e. I wouldn't be able to do the above groupby action easily...

¹ The affinity here refers to process/thread affinity.

keewis · 2024-10-04T10:10:03Z

This is the same issue as #8005: the data model for DataArray only allows coordinates whose dimensions are all dimensions of its data, which means that your data variable would also need to have the thread dimension for this to work.

In the issue above it has been proposed to extend the data model to use indexes to determine the content of DataArray, but as far as I can tell we didn't make a lot of progress on that since last year.

As a workaround, you could use a single-variable Dataset object: ds[["time_min"]], but I'm aware that this is not nearly as easy to use.

cc @dcherian

zerothi · 2024-10-04T10:19:23Z

@keewis thanks for the heads up on the older topic. Hmm... I'll have to do other work-arounds for now then. :)

zerothi · 2024-10-04T11:21:30Z

For what it's worth, I am currently unpacking affinities into one coordinate per thread. That way each coordinate reuses the affinity dimension and is therefore retained.

threads = ds.sizes["thread"]
place_names = [f"place_{i}" for i in range(threads)]
# One 1-D coordinate per thread slot, each defined along "affinity" only
ds = ds.assign_coords(
    {name: ("affinity", aff) for name, aff in zip(place_names, ds.affinities.values.T)}
)
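A minimal, self-contained check of this unpacking workaround, assuming a synthetic dataset with made-up sizes and placement strings (only the names time_min, affinities, affinity, thread and the place_ prefix come from the thread). It shows that the per-thread coordinates stay attached when a single variable is extracted, while the original two-dimensional coordinate does not.

```python
import numpy as np
import xarray as xr

# Synthetic data: 3 affinity configurations, 2 threads each (values are made up).
placements = np.array([["0", "1"], ["0", "2"], ["1", "3"]], dtype=object)
ds = xr.Dataset(
    data_vars={"time_min": (("size", "affinity"), np.zeros((4, 3)))},
    coords={
        "size": np.linspace(0.5, 2.0, 4),
        "affinities": (("affinity", "thread"), placements),
    },
)

# One 1-D coordinate per thread slot, each defined along "affinity" only.
n_threads = ds.sizes["thread"]
place_names = [f"place_{i}" for i in range(n_threads)]
unpacked = ds.assign_coords(
    {
        name: ("affinity", column)
        for name, column in zip(place_names, ds["affinities"].values.T)
    }
)

da = unpacked["time_min"]
print(sorted(da.coords))  # names of the coordinates that survived extraction
```

The trade-off is that the placement of one experiment is now spread over several coordinates, so code that wants the full placement has to collect the place_ coordinates again.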

keewis · 2024-10-06T13:35:40Z

Good to hear you found a workaround.

Thinking about it a bit more, #8005 would require setting an index on the variable to carry over, which means this might not be suitable for every use case, and I don't think we will extend the data model to allow non-indexed coordinates with additional dimensions on DataArray objects.

Either way, I'm closing in favor of #8005.

zerothi added the needs triage Issue that has not been reviewed by xarray team member label Oct 4, 2024

keewis removed the needs triage Issue that has not been reviewed by xarray team member label Oct 4, 2024

keewis mentioned this issue Oct 5, 2024

string coordinate in DataArray versus Dataset #9583

Closed

keewis closed this as completed Oct 6, 2024
