pcat.search(...).to_dataset_dict() sometimes slower than it should #253

Open
coxipi opened this issue Sep 8, 2023 · 5 comments

Comments

@coxipi
Contributor

coxipi commented Sep 8, 2023

Setup Information

  • xscen version: 0.6.12-beta
  • Python version: 3.11.4
  • Operating System: CentOS 7 (Doris)

Context

I store my files on "jarre", which is considered a slow disk AFAIK.

Sometimes, pcat.search(...).to_dataset_dict() will take forever to access my files (bad behaviour), while this homemade function:

import xarray as xr

def my_search(**kwargs):
    # Open each matching path directly, bypassing intake-esm's aggregation logic.
    paths = list(pcat.search(**kwargs).df.path)
    return {p: xr.open_zarr(p) for p in paths}

runs at a speed similar to the expected (good) behaviour of pcat.search(...).to_dataset_dict().
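
For reference, a minimal sketch of how the two approaches could be timed against each other (hypothetical snippet; kwargs stands for whatever search criteria are in use):

import time

t0 = time.perf_counter()
dsets = pcat.search(**kwargs).to_dataset_dict()
print(f"to_dataset_dict: {time.perf_counter() - t0:.1f} s")

t0 = time.perf_counter()
dsets = my_search(**kwargs)
print(f"my_search:       {time.perf_counter() - t0:.1f} s")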

I can't tell what conditions on the server could be related to this problem. It appears intermittently: it comes, stays for a bit, and then goes away.

Is this issue known?

@coxipi coxipi changed the title Homemade pcat search function sometimes faster pcat.search(...).to_dataset_dict() sometimes slower than it should Sep 8, 2023
@RondeauG
Copy link
Collaborator

RondeauG commented Sep 8, 2023

to_dataset_dict() does more than just open the files. It groups together the files associated with a given dataset, based on aggregation controls specified in the JSON (by default: id, processing_level, domain, frequency). There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember the reasoning behind it.
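
For context, that grouping is declared in the catalog's JSON definition. A rough sketch of the relevant section, written here as a Python dict; the key names follow the intake-esm ESM-collection spec, and the exact values depend on the catalog:

# Sketch of the catalog's "aggregation_control" section (intake-esm spec),
# shown as a Python dict. groupby_attrs decides how files are grouped into
# the keys of the dataset dict returned by to_dataset_dict().
aggregation_control = {
    "variable_column_name": "variable",
    "groupby_attrs": ["id", "processing_level", "domain", "frequency"],
    "aggregations": [],  # optional rules for joining files along dimensions
}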

For very big catalogs, I could thus see a substantial difference in speed compared to simply opening the files.

That being said, we could look into whether there are speedups to be had.

@aulemahal
Collaborator

@coxipi Is your catalog supposed to have aggregation, or is it indeed just a list of independent datasets?

The aggregation can often be sped up by passing this to to_dataset_dict:

xarray_combine_by_coords_kwargs={'data_vars': 'minimal', 'coords': 'minimal', 'compat': 'override'}

assuming all the elements to be aggregated are well behaved (no overlap between files, and all variables of the same name have the same dimensions and the exact same coordinates on the non-appended dims, etc.).
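
Putting it together, the call would look roughly like this (a sketch, reusing the search criteria from above):

dsets = pcat.search(**kwargs).to_dataset_dict(
    # Relax the compatibility checks done while combining files.
    xarray_combine_by_coords_kwargs={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
    }
)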

@coxipi
Contributor Author

coxipi commented Sep 8, 2023

Not sure what you mean by "independent datasets". Each key in the dataset dict represents a different simulation (each with its own single path to a Zarr store), as created in previous steps of the xscen workflow.

@aulemahal
Collaborator

I meant that they are not meant to be unified into a single dataset, in the way open_mfdataset would combine multiple files.

In that case, I'm not sure why to_dataset_dict would be dramatically slower than your function...

@aulemahal
Collaborator

> There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember the reasoning behind it.

@RondeauG, in to_dataset_dict the aggregation is entirely driven by the catalog columns and configuration. In open_mfdataset, the aggregation is guessed by xarray by analyzing the coordinates.
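
In other words, the two code paths look roughly like this (a sketch; paths stands for the list of files belonging to one dataset):

import xarray as xr

# Catalog-driven (what to_dataset_dict does, roughly): open each file,
# then combine along the dimensions declared by the catalog metadata.
ds = xr.combine_by_coords([xr.open_dataset(p) for p in paths])

# Coordinate-driven: let xarray inspect the coordinates and guess how
# the files fit together.
ds = xr.open_mfdataset(paths, combine="by_coords")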

Note: if the path column contains a *, open_mfdataset will be used, so one can combine both methods.
