-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xarray Dataset Support #1490
base: main
Are you sure you want to change the base?
Xarray Dataset Support #1490
Conversation
We discussed this over zoom, but my basic comments were something along the lines of:
Other than that, it's generally difficult to design a useful base class without lots of examples. I didn't write RasterDataset until we already had several examples of the same pattern in TorchGeo. The base class was an attempt to reduce code duplication and share some of the more complicated data access stuff. I think it's fine to design this base class with the few example datasets you have in mind and change it to add new features later. |
I tested the xarray case I have with time-series indexing from #877 and I think this requires a discussion about assumptions about dateformats which is a bit tricky. Currently, the assumption is that the index and the sampler is working with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difficult to review without several examples of what these datasets look like in real life. Maybe do a quick literature review and find a few good examples of popular NetCDF datasets from different sources? I want to see the diversity of file metadata. Like, do all datasets encode CRS/res/time/bounds in the same way or are they all different? This greatly influences how much customizability the base class requires.
@@ -31,6 +31,7 @@ radiant-mlhub==0.3.0 | |||
rarfile==4.0 | |||
scikit-image==0.18.0 | |||
scipy==1.6.2 | |||
xarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to determine the minimum version that works before merging
torchgeo/datasets/rioxarray.py
Outdated
from datetime import datetime | ||
from typing import Any, Callable, Optional, cast | ||
|
||
import netCDF4 # noqa: F401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't used at the moment and could probably be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was running into this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather ignore that warning in pyproject.toml than add a fake import
torchgeo/datasets/rioxarray.py
Outdated
f"query: {query} not found in index with bounds: {self.bounds}" | ||
) | ||
|
||
data_arrays: list["np.typing.NDArray[np.float32]"] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will they always be float32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there could be cases where you also have integers but I would expect most datasets to have float values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably best to keep it dynamic if we can't predict 100% of the time.
torchgeo/datasets/rioxarray.py
Outdated
for item in items: | ||
with xr.open_dataset(item, decode_cf=True) as ds: | ||
if not ds.rio.crs: | ||
ds.rio.write_crs(self._crs, inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does rioxarray automatically reproject to the right CRS if files are in multiple or if the user chooses a different CRS than the default (or if IntersectionDataset changes it)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Point, I think it does not automatically reproject. So far I have only used climate data which doesn't explicitly encode or use a CRS, I should check with some MODIS files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am just doing this with the climate data, where there is no explicit CRS and ds.rio.reproject()
also only works for 2D and 3D arrays, whereas the CMIP data I have has more dimensions so I get rioxarray.exceptions.TooManyDimensions: Only 2D and 3D data arrays supported. Data variable: tos
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, trying it with MODIS files which are .hdf
files, I can only open them with rioxarray.open_rasterio()
and not xr.open_dataset(engine="rasterio")
so maybe one base class is too ambitious and ugly to support climate and satellite data at once.
torchgeo/datasets/rioxarray.py
Outdated
if hasattr(clipped, variable): | ||
data_arrays.append(clipped[variable].data.squeeze()) | ||
|
||
sample = {"image": torch.from_numpy(np.stack(data_arrays)), "bbox": query} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should support both images and masks. See the is_image
attribute in RasterDataset
for how we do this there. You can also copy the dtype
property to automatically choose what dtype to cast to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced this works correctly for multiple overlapping files. We shouldn't be stacking, we should be merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, maybe have to rethink the base class thing and have one for xarray climate type data that is not using crs explicitly (class XarrayDataset(GeoDataset)
) and one that is intended for crs depending data sources like MODIS and more similar to RasterDataset
and using rioxarray
so the current naming class RioXarrayDataset(GeoDataset)
but with the required functionality to handle overlapping files etc.
tests/data/rioxarray/data.py
Outdated
os.makedirs(DIR) | ||
for var_name in VAR_NAMES: | ||
for lats, lons, cf_time in zip(LATS, LONS, CF_TIME): | ||
path = os.path.join(DIR, f"{var_name}_{lats}_{lons}.nc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never thought about the possibility of the filename containing the bounds/res/crs. Have you seen this in the wild? If so, we could add a check for this in the regex and extract it from the filename like we do for time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this was just a dummy naming scheme for test data. Maybe we should explicitly create test data for some of the different data cases like CMIP, ERA5, MODIS and others and write test cases for those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that once we have subclasses for each of those datasets
torchgeo/datasets/rioxarray.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend renaming this file. Otherwise the following become very different things:
import rioxarray
import .rioxarray
Maybe call it rioxr.py
? Or just throw it in geo.py
with the other base classes.
This is probably not exhaustive but I tried finding some sources from climate data and satellite optical data to cover some bases.
I suppose it could potentially be useful to have sub classes for these specific modalities, meaning a |
I found one approach that merges the data back again, but I think it can be quiet slow potentially. Might want to explore that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are failing. Also curious if we can add a dep on only xarray or rioxarray but not both.
Currently getting a |
Potentially interesting utility for data loading |
Seems like things are ongoing but does anyone need some help so that I can also provide a solution to fix this issue. |
I won't have time to work on this in the near future so feel free to suggest an alternative approach or build on top of the ideas found in this PR. |
This PR aims to add support for dataloading based on xarray datasets as the discussion began in #1486.
The application I have for my project is taking in CMIP6 data from separate files that each define a climate variable and I need to extract sampled time-series. The sampling procedure can come in form of certain samplers as I started working on in #877 .
The current draft is based on the closed #509 , but instead of a single xr.DataArray takes in a directory to gather files like
RasterDataset
does.The climate data for example does not have a CRS but they do come in different resolutions. Additionally, it could be that some data is time-series but others is not, so this needs to be merged correctly. Other climate data could also have a "depth" component.
Since there are a lot of things to consider, I would definitely appreciate input from more experienced users that have a better scope of common required work flows.
Closes #1486