chunked reading of files #34

Open
GMoncrieff opened this issue Aug 2, 2023 · 2 comments

@GMoncrieff

Currently, reading files with emit_xarray from emit_tools.py loads everything into a numpy-backed xr.Dataset. An option to read into a chunked, dask-backed xr.Dataset would help prevent out-of-memory errors when reading on machines with limited memory (loading failed on an 8 GB SMCE machine) and could potentially speed up downstream operations using dask.

Adding chunks='auto' to

ds = xr.open_dataset(filepath, engine=engine)

works when ortho=False, but not when ortho=True.
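
For example, a minimal sketch of a chunked open (the file path, chunk sizes, and engine here are made up for illustration, not taken from emit_tools):

import xarray as xr

filepath = "EMIT_L2A_RFL_example.nc"  # placeholder granule path
# chunks="auto" lets dask pick chunk sizes; explicit dims also work, e.g.
# chunks={"downtrack": 512, "crosstrack": 512}
ds = xr.open_dataset(filepath, engine="h5netcdf", chunks="auto")
print(ds)  # variables are dask arrays; nothing is read until .compute()/.load()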

@ebolch
Contributor

ebolch commented Aug 4, 2023

I will look into this some more. When orthorectifying, the geometry lookup table (GLT) included in the location group is used to reshape the data array to fit the GLT dimensions. I think doing that on chunks would require modifying the GLT. My understanding is that chunking along the bands dimension would likely slow down operations focused on imaging spectroscopy, since that would require more reads. Maybe @pgbrodrick has some insight?

At a minimum, we can add notes about memory requirements.
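
For reference, here is a rough numpy sketch of how a GLT is typically applied (an illustration, not the actual emit_tools code; the 1-based indices with 0 as the fill value are my assumption about the EMIT GLT):

import numpy as np

def apply_glt(raw, glt_x, glt_y, fill=np.nan):
    """raw: (downtrack, crosstrack, bands); glt_x/glt_y: (rows, cols) int arrays."""
    rows, cols = glt_x.shape
    out = np.full((rows, cols, raw.shape[-1]), fill, dtype="float32")
    valid = (glt_x > 0) & (glt_y > 0)
    # assumed convention: indices are 1-based and 0 marks pixels with no source
    out[valid] = raw[glt_y[valid] - 1, glt_x[valid] - 1, :]
    return out

The gather pulls arbitrary raw pixels into each output pixel, which is why applying it per dask chunk isn't straightforward without splitting or re-indexing the GLT.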

@GMoncrieff
Author

GMoncrieff commented Aug 11, 2023

Yes, I tried fiddling with the orthorectification but could not get it to work with dask arrays. In the version of emit_xarray I am working with in my project (utils/emit_tools.py), I added an option to chunk along the crosstrack and downtrack dims that is only accepted when ortho=False.

I think this is worthwhile because the recommended workflow is to perform data manipulations and biophysical modelling on the unorthorectified data, and to orthorectify the variables produced downstream afterwards. This works for me because the downstream outputs typically have far fewer variables than the image cube has bands (I go from 250ish bands to 4 endmembers), so it becomes much more feasible to load the data into memory before orthorectifying.

My guess is that this workflow is common enough to warrant accommodating it by offering a chunking option when ortho=False.
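
A rough sketch of the workflow I have in mind, assuming a chunks argument like the one in my modified utils/emit_tools.py (not in the upstream emit_xarray), with placeholder file path and the dimension/variable names I see in the unorthorectified dataset:

from emit_tools import emit_xarray  # my modified version in utils/emit_tools.py

ds = emit_xarray(
    "EMIT_L2A_RFL_example.nc",                     # placeholder granule path
    ortho=False,
    chunks={"downtrack": 256, "crosstrack": 256},  # option only in my version
)

# stand-in for the band-heavy step (unmixing / biophysical modelling);
# everything stays lazy and chunked across downtrack/crosstrack
derived = ds["reflectance"].mean(dim="bands")

# the derived result is tiny compared to the ~250-band cube, so it can be
# loaded into memory and orthorectified afterwards with the existing GLT code
derived = derived.compute()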
