Speed up loading for ACCESS-ESM non-CMOR datasets #2487

Open
rhaegar325 opened this issue Jul 22, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@rhaegar325
Contributor

rhaegar325 commented Jul 22, 2024

Hi development team,
Over the last few months we developed a CMORiser for raw ACCESS-ESM data in ESMValCore. However, the data layout differs from the usual one: CMORised data typically stores a single variable over the full time range in one file, whereas ACCESS-ESM stores all variables for a single timestamp in one file. If we use the default ESMValCore loading path for ACCESS-ESM data, it costs a huge amount of time and memory. So I was wondering whether we could build a load method for raw ACCESS-ESM data. That would be super helpful, and it would not need to be complex: a filter that selects only the files within the time range specified in the recipe would be enough.

I opened this issue to see whether anyone has a good idea about how to do that. I am willing to implement it myself; I just need to know which approach would be acceptable to all of us.

rhaegar325 added the enhancement label Jul 22, 2024
@rbeucher
Contributor

Hi @bouweandela, @valeriupredoi,

We are encountering an issue with the output of the ACCESS-ESM model, specifically with the atmospheric data. The data is stored as follows:

./atm/netCDF/:
HI-CN-05.pa-185001_mon.nc
HI-CN-05.pa-185002_mon.nc
HI-CN-05.pa-185003_mon.nc
HI-CN-05.pa-185004_mon.nc
HI-CN-05.pa-185005_mon.nc
HI-CN-05.pa-185006_mon.nc
HI-CN-05.pa-185007_mon.nc
HI-CN-05.pa-185008_mon.nc
HI-CN-05.pa-185009_mon.nc
HI-CN-05.pa-185010_mon.nc

All monthly variables are stored in a single netCDF file.

Currently, our config-developer.yml is configured as follows:

ACCESS:
  cmor_strict: false
  input_dir:
    default:
      - '{dataset}/{sub_dataset}/{exp}/{modeling_realm}/netCDF'
  input_file:
    default: '{sub_dataset}.{special_attr}-*.nc'
  output_file: '{project}_{dataset}_{mip}_{exp}_{institute}_{sub_dataset}_{special_attr}_{short_name}'
  cmor_type: 'CMIP6'
  cmor_default_table_prefix: 'CMIP6_'

This configuration results in ESMValCore analyzing all files and variables, which consumes excessive time and resources.

I have suggested the following to @rhaegar325:

  1. Modify the input_file/default to include a time facet. This change should prevent loading more data than necessary when working within a constrained time range.
  2. Utilize the timerange facet in the input_file name. We could implement the approach described here.
  3. Leverage IRIS constrained loading capabilities. Details are available here. This could potentially speed up the process.
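
As a rough illustration of suggestions 1 and 2, the sketch below filters the monthly files by the YYYYMM stamp in their names before anything is opened. The helper names (`file_month`, `select_files`) and the regular expression are assumptions based only on the file names listed above; this is not ESMValCore's own date parsing.

```python
import re
from pathlib import Path

# Assumed pattern: a 6-digit YYYYMM stamp after a hyphen and before "_mon.nc",
# as in "HI-CN-05.pa-185001_mon.nc".
_STAMP = re.compile(r"-(\d{4})(\d{2})_mon\.nc$")


def file_month(path):
    """Return (year, month) parsed from the file name, or None if absent."""
    match = _STAMP.search(Path(path).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def select_files(paths, start, end):
    """Keep files whose (year, month) lies in the inclusive range [start, end]."""
    selected = []
    for path in paths:
        stamp = file_month(path)
        if stamp is not None and start <= stamp <= end:
            selected.append(path)
    # Sort chronologically by the parsed stamp.
    return sorted(selected, key=file_month)


# The ten monthly files from the listing above.
files = [f"HI-CN-05.pa-1850{month:02d}_mon.nc" for month in range(1, 11)]
subset = select_files(files, start=(1850, 3), end=(1850, 5))
print(subset)
```

With a recipe time range of March to May 1850, only three of the ten files would be passed on to the loader.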

Given that all variables are stored in a single file, we are aware that this setup is not optimal. However, we currently have no alternative.

Any advice would be very welcome!

Thanks,
R

rbeucher changed the title from "A specific method to load ACCESS-ESM raw data" to "Speed up loading for ACCESS-ESM non-CMOR datasets" Jul 23, 2024
@bouweandela
Member

Selecting the files within the specified timerange should already work, as it does for CMIP6 etc. Did you check in the main_log_debug.txt log file which files are actually getting loaded? In order for this to work, your timerange does need to be recognized by the code here:

def _get_start_end_date(
    file: str | Path | LocalFile | ESGFFile) -> tuple[str, str]:
    """Get the start and end dates as a string from a file name.

> Leverage IRIS constrained loading capabilities.

You could probably implement this in the fix_file method of

and make it return a cube instead of a filename, then modify esmvalcore.preprocessor.load so that it skips the actual load step if the input is already a cube, similar to what I tried out in #2454. In the longer term, we would like to implement a more flexible loading mechanism (see #2371), but we will first need to find funding for that.
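
The idea of skipping the load step when a fix has already produced a cube could look roughly like the sketch below. `load_or_passthrough` and the injected `loader` callable are hypothetical stand-ins, not the actual esmvalcore.preprocessor.load signature; the only point illustrated is the type check on the input.

```python
from pathlib import Path


def load_or_passthrough(item, loader):
    """Load ``item`` with ``loader`` only if it is still a file reference.

    ``loader`` is a hypothetical callable (for example a thin wrapper around
    an Iris load function) that turns a path into a cube-like object.
    """
    if isinstance(item, (str, Path)):
        # Still a file name: nothing was pre-loaded by the fix step.
        return loader(item)
    # Anything else is assumed to be an object already loaded by the fix step.
    return item


# Usage with a dummy loader that records which paths were opened.
opened = []


def dummy_loader(path):
    opened.append(str(path))
    return {"source": str(path)}


preloaded = {"source": "returned by fix_file"}
from_fix = load_or_passthrough(preloaded, dummy_loader)
from_disk = load_or_passthrough("HI-CN-05.pa-185001_mon.nc", dummy_loader)
print(opened)
```

Only the second call reaches the loader; the pre-loaded object is handed back unchanged.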

@rbeucher
Contributor

Thanks @bouweandela, that is really useful. We are going to look into this.

@valeriupredoi
Contributor

Time gating is one side of the problem, as Bouwe points out. Another is variable selection, which we no longer do at load point (we used to have an iris Constraint at the raw load point, though). What you can do about it is overload it with a constraint; see load_raw and its usage. If this is a bit too much of a hassle, you can perform the single-variable loading via a fix, so that it runs ahead of everything else: a rather agricultural solution, but a fairly hassle-free one in me books 🍺
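
A minimal sketch of the constraint-based variable selection described above, assuming Iris 3.0 or later (for `iris.NameConstraint`). The function name `load_single_variable` is an illustrative assumption, and this is not how ESMValCore currently loads data.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


def load_single_variable(files: Iterable[str | Path], var_name: str):
    """Return raw cubes whose netCDF variable name equals ``var_name``.

    Assumption: every file in ``files`` holds many variables and only one
    of them is wanted.
    """
    # Imported lazily so this module can be imported without Iris installed.
    import iris

    # Match on the netCDF variable name rather than the standard name,
    # since raw model output may lack CF standard names.
    constraint = iris.NameConstraint(var_name=var_name)
    return iris.load_raw([str(path) for path in files], constraint)
```

Combined with a file list already restricted to the recipe's time range, this keeps only the requested variable. How much time it saves in practice would need to be measured on the real ACCESS-ESM files.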
