-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308
Comments
This is a really extremely good question that I have also thought about a bit. I think a big difference is that this "map" approach assumes more knowledge the structure of the dataset up front. To use zarr regions you have to know how big the regions are in advance to create an empty zarr array that will be filles, but In the "map-reduce" approach you are able to concatenate files whose shapes you don't know until you open them. Whether that actually matters much in practice is another question. One thing to note is that IIUC having each worker write a region of virtual references would mean one icechunk commit per region, whereas transferring everything back to one node means one commit per logical version of the dataset, which seems a little nicer.
I agree. |
I really like the idea behind this approach @maxrjones! Being able to skip transferring all the references back to a 'reduce' step seem great. Also wondering about how all the commits between workers would work with icechunk. |
This also seems like a good way to protect against a single failed reference killing the entire job. |
👍
It does not, once earth-mover/icechunk#357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case... |
Thanks for this information! Do you think it's accurate that map-reduce on changesets would still be more performant than map-reduce on virtual datasets when operating on thousands or more of files, such that this feature would be helpful despite not truly by a map operation? Also, do you anticipate there being other prerequisites in addition to your PR before VirtualiZarr could have something like |
I'm wondering how we can build virtualization pipelines as a map rather than a map-reduce process. The map paradigm would follow the structure below from Earthmover's excellent blog post on serverless datacube pipelines, where the virtual dataset from each worker would get written directly to an Icechunk Virtual Store rather than being transferred back to the coordination node for a concatenation step. Similar to their post, the parallelization between workers during virtualization could leverage a serverless approach like lithops or coiled. Dask concurrency could speed up virtualization within each worker. I think the main feature needed for this to work is an analogue to
xr.Dataset.to_zarr(region="...")
, complementary toxr.Dataset.to_zarr(append_dim="...")
(documented in https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) (xref #21, #272).I tried to check that this feature request doesn't already exist, but apologies if I missed something.
The text was updated successfully, but these errors were encountered: