Hi Parcels Folks,
Congratulations on this amazing package! It is awesome to see how far it has evolved since Erik's Lagrangian meeting at Imperial a few years ago. Both the code itself and the documentation / example sets are just beautiful.
I have been reading through the code and documentation, and I have a few comments that I want to share. Please take these comments with a giant grain of salt; they are meant purely for discussion, not as a criticism of your package, which, as I said above, is awesome and amazing.
Parcels frequently seems to assume that the velocity data resides in netCDF files. This may lead to problems down the line, because there are many cases when the velocities are in other formats. For example:
- in MITgcm, we have data files in a weird MDS format
- in Pangeo, we are putting our data in the cloud in zarr format
However, all of these formats can be read into xarray. As a specific example, in the Pangeo Sea Surface Height example, I can load the entire AVISO dataset in zarr format from Google Cloud Storage in one line. I would love to be able to just call parcels on this data and compute Lagrangian trajectories. That is not trivial right now.
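For concreteness, here is a minimal sketch of that one-liner and the call I wish existed. The bucket path is illustrative (modeled on the Pangeo SSH example), and `FieldSet.from_xarray` is a hypothetical method, not existing parcels API:

```python
import gcsfs
import xarray as xr

# Open the AVISO dataset lazily from zarr on Google Cloud Storage.
# (Bucket path is illustrative; anonymous access assumed.)
gcs = gcsfs.GCSFileSystem(token="anon")
ds = xr.open_zarr(gcs.get_mapper("pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4"))

# What I would love to do next (hypothetical API, does not exist yet):
# fieldset = FieldSet.from_xarray(ds)
```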
So here are some ideas, in increasing order of difficulty / complexity:
1. I believe it might make sense for you to accept xarray objects when creating velocity `FieldSet` objects, i.e. `fieldset.from_xarray(ds)`. Looking at this code, I don't think this would be too hard. In fact, it might even already work with `fieldset.from_data`, since xarray datasets provide a dictionary-like interface (first sketch after this list). This would solve some of the problems I described above (but not all). You also use xarray internally to read netCDF data, so maybe that path could be followed instead.
2. It might not work with the Pangeo AVISO zarr dataset, because you also frequently coerce inputs to numpy arrays, which would trigger dask arrays to compute (second sketch below). It would be great to operate lazily until data is actually required for computation. This might be easy with duck typing, simply by avoiding explicit coercion to numpy arrays. Maybe the netCDF code path is lazier, but, as noted above, it only accepts netCDF files, not generic objects.
3. More generally, it looks like you are essentially re-implementing large portions of xarray in this package. For example, `FieldSet` is conceptually similar to `xarray.Dataset`, and `Field` is conceptually similar to `xarray.DataArray`. Labelled multi-dimensional arrays are an extremely common pattern in geoscience code. Many packages that once had their own implementations of these data structures have refactored to just use xarray (satpy is a great example). Your various `Field.search_indices` methods are very similar to xarray's indexing operations, and xarray now has multidimensional interpolation based on scipy, which you also implement in parcels (third sketch below). This would of course imply a major refactor of your internal structure, so this is a very presumptuous suggestion on my part. However, there could be major advantages, including:
   - way less code to maintain
   - use xarray's indexers and interpolators where possible
   - dask integration for free
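First, a sketch of idea 1, assuming the current `FieldSet.from_data(data, dimensions)` call; the filename and the variable/coordinate names (`ugos`, `vgos`, `longitude`, ...) are assumptions about the dataset, not parcels API:

```python
import xarray as xr
from parcels import FieldSet

# Any xarray Dataset with velocity components would do; this sample
# file and its variable names are made up for illustration.
ds = xr.open_dataset("aviso_sample.nc")

data = {"U": ds["ugos"], "V": ds["vgos"]}
dimensions = {"lon": ds["longitude"], "lat": ds["latitude"], "time": ds["time"]}

# If from_data only needs dict-like access plus array semantics, this
# may already work as-is; if it coerces to numpy, see idea 2.
fieldset = FieldSet.from_data(data, dimensions)
```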
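Second, a small illustration of the laziness point in idea 2, using plain dask arrays:

```python
import dask.array as da
import numpy as np

lazy = da.zeros((1000, 1000), chunks=(100, 100))  # no data in memory yet

# Explicit coercion defeats the laziness: np.asarray forces the whole
# array to be computed and materialized immediately.
eager = np.asarray(lazy)

# Duck typing keeps the computation deferred: slicing and arithmetic
# just build a task graph, and nothing runs until .compute() is called.
subset = (lazy[:10, :10] * 2.0).compute()
```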
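Third, a sketch of the xarray interpolation that overlaps with `Field.search_indices` and parcels' interpolators; the toy field and particle positions here are made up:

```python
import numpy as np
import xarray as xr

# A toy 2D velocity component on a regular lat/lon grid.
u = xr.DataArray(
    np.random.rand(90, 180),
    dims=("lat", "lon"),
    coords={"lat": np.linspace(-89, 89, 90), "lon": np.linspace(-179, 179, 180)},
)

# Particle positions sharing a 'particle' dimension give pointwise
# (not outer-product) scipy-based interpolation.
plat = xr.DataArray([10.3, -45.7], dims="particle")
plon = xr.DataArray([120.1, 33.3], dims="particle")
u_at_particles = u.interp(lat=plat, lon=plon)
```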
Of course, xarray does not provide all the functionality you need for working with GCM data. Operations related to grid cells are not part of xarray. I am trying (mostly without success) to rally folks to use and contribute to xgcm for this purpose. (So that is part of my ulterior motive here.)
Anyway, of course I don't expect any of these things (other than perhaps 1) to make it onto your todo list any time soon. Clearly this package is amazing and very successful in what it does. But as an xarray developer, oceanographer, and concerned citizen of the scientific python ecosystem, I just can't resist pointing out this opportunity where using xarray more could significantly reduce your development burden and lead to enhanced interoperability.
-Ryan