Multidimensional non-dimension coordinates in DataArray + Dataset #9579

Closed
zerothi opened this issue Oct 4, 2024 · 4 comments
Contributor

zerothi commented Oct 4, 2024

What is your issue?

I wish to do data analysis on some data in a seemingly weird format. I had hoped I could use xarray for this.

I am using a Dataset variable looking like this:

<xarray.Dataset>
Dimensions:        (size: 29, affinity: 5, thread: 10)
Coordinates:
  * size           (size) float64 0.1625 0.2875 0.4125 0.5375 ... 31.5 33.5 35.5
    affinities     (affinity, thread) object '0' '1' '2' '3' ... '3' '8' '4' '9'
Dimensions without coordinates: affinity, thread
Data variables:
    time_min       (size, affinity) float64 3.327e-06 ... 0.0...
    time_max       (size, affinity) float64 3.327e-06 ... 0.0...

Let me explain why it looks like this:

  1. I have a size dimension coordinate which is simple and straightforward.
  2. I have two dimensions, affinity and thread. The affinity is a simple index, equivalent to a linear experiment index.
    The thread dimension is the number of threads I ran the experiment on. So this is not a dimension of the data, but
    information intrinsic to each experiment (affinity index).
  3. Then I have a second, non-dimension coordinate called affinities. This affinities coordinate (see footnote 1) spans the two dimensions above.

This construction seemed natural to me because:

  • I ran the experiment for size
  • for each size I ran the experiment affinity times with 10 threads, and I stored the unique placements of the threads in the affinities coordinate.

I.e. the thread placements are an intrinsic part of the experiment index, and not a dimension of the data.

The nice thing is that I can do:

for affinity, group in ds.groupby("affinity"):
    # group holds a unique affinity configuration
    ...

The problem comes when I need to extract only one of the variables.

>>> ds.time_min
<xarray.DataArray 'time_min' (size: 29, affinity: 5)>
Coordinates:
  * size     (size) float64 0.1625 0.2875 0.4125 0.5375 ... 29.5 31.5 33.5 35.5

so I lose all information related to the affinity. I can understand why this happens: every dimension of a DataArray is tightly bound to its variable, so its coordinates must be as well.
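A minimal sketch reproducing this on a toy dataset. All sizes and values below are made up; only the dimension layout mirrors the dataset above.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dataset above (hypothetical values, same dimension layout).
n_size, n_affinity = 3, 2
ds = xr.Dataset(
    data_vars={
        "time_min": (("size", "affinity"), np.zeros((n_size, n_affinity))),
        "time_max": (("size", "affinity"), np.ones((n_size, n_affinity))),
    },
    coords={
        "size": ("size", [0.1625, 0.2875, 0.4125]),
        # 2-D non-dimension coordinate spanning (affinity, thread)
        "affinities": (
            ("affinity", "thread"),
            np.array([["0", "1", "2", "3"], ["0", "2", "4", "6"]], dtype=object),
        ),
    },
)

# Extracting one variable keeps only coordinates whose dimensions
# are all among the variable's own dimensions (size, affinity).
da = ds["time_min"]
kept = list(da.coords)
```

Here `kept` contains only `size`: `affinities` also spans `thread`, which `time_min` does not have, so it is dropped on extraction.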

But is this the wrong way to structure things?

I considered turning affinities into an attribute, but that has the problem that it won't get carried over when extracting the variable (I also tried with xr.set_options(keep_attrs=True) to no avail), and it surely won't select the affinity index.
I.e. I wouldn't be able to do the above groupby action easily...

Footnotes

  1. The affinity here refers to process/thread affinity.

zerothi added the "needs triage" label Oct 4, 2024
Collaborator

keewis commented Oct 4, 2024

This is the same issue as #8005: the data model for DataArray will only contain coordinates whose dimensions are all among the dimensions of its data, which means that your data variable would also need the thread dimension for this to work.

In the issue above it has been proposed to extend the data model to use indexes to determine the content of DataArray, but as far as I can tell we didn't make a lot of progress on that since last year.

As a workaround, you could use a single-variable Dataset object: ds[["time_min"]], but I'm aware that this is not nearly as easy to use.

cc @dcherian

keewis removed the "needs triage" label Oct 4, 2024
Contributor Author

zerothi commented Oct 4, 2024

@keewis thanks for the heads up on the older topic. Hmm... I'll have to do other work-arounds for now then. :)

Contributor Author

zerothi commented Oct 4, 2024

For what it's worth, I am currently unpacking affinities into one coordinate per thread. That way they reuse the affinity dimension and are thus retained.

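# One 1-D coordinate per thread slot, each along the existing "affinity" dimension.
threads = ds.sizes["thread"]  # ds.dims["thread"] on older xarray versions
place_names = [f"place_{i}" for i in range(threads)]
# assign_coords returns a new Dataset, so keep the result.
ds = ds.assign_coords(
    {name: ("affinity", aff) for name, aff in zip(place_names, ds.affinities.values.T)}
)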
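A self-contained sketch of the same unpacking on toy data, to check that the per-thread coordinates survive variable extraction. The placement values and array sizes are hypothetical; the `place_{i}` naming follows the snippet above.

```python
import numpy as np
import xarray as xr

# Hypothetical thread placements: 2 affinity configurations x 3 threads.
placements = np.array([["0", "1", "2"], ["0", "2", "4"]], dtype=object)
ds = xr.Dataset(
    {"time_min": (("size", "affinity"), np.zeros((4, 2)))},
    coords={"affinities": (("affinity", "thread"), placements)},
)

# Split the 2-D coordinate into one 1-D coordinate per thread slot,
# each defined along "affinity" only.
n_threads = ds.sizes["thread"]
unpacked = ds.assign_coords(
    {
        f"place_{i}": ("affinity", ds["affinities"].values[:, i])
        for i in range(n_threads)
    }
)

# The 1-D coordinates share a dimension with time_min, so they are kept.
da = unpacked["time_min"]
second_slot = da["place_1"].values.tolist()
```

The original 2-D `affinities` coordinate is still dropped from `da`, but each `place_{i}` coordinate travels with it, so selecting along `affinity` also selects the matching placements.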

Collaborator

keewis commented Oct 6, 2024

Good to hear you found a workaround.

Thinking about it a bit more, #8005 would require setting an index on the variable to carry over, which means this might not be suitable for every use case, and I don't think we will extend the data model to allow non-indexed coordinates with additional dimensions on DataArray objects.

Either way, I'm closing in favor of #8005.

keewis closed this as completed Oct 6, 2024