Speed up loading for ACCESS-ESM non-CMOR datasets #2487

rhaegar325 · 2024-07-22T04:33:16Z

Hi development team,
Over the last few months we have developed a CMORiser for raw ACCESS-ESM data in ESMValCore. The raw data are laid out differently from CMORised data: CMORised data typically store a single variable over the full time range in each file, whereas ACCESS-ESM stores all variables for a single timestamp in each file. Loading ACCESS-ESM data with the default ESMValCore loader therefore costs a huge amount of time and memory. Could we add a load method for raw ACCESS-ESM data? It would be super helpful and would not need to be complex: a filter that selects only the files within the time range specified in the recipe would be enough.

I am opening this issue to see if anyone has a good idea of how to do this. I am willing to implement it myself; I just need to know which approach would be acceptable to all of us.

rbeucher · 2024-07-22T23:59:42Z

Hi @bouweandela, @valeriupredoi,

We are encountering an issue with the output of the ACCESS-ESM model, specifically with the atmospheric data. The data is stored as follows:

./atm/netCDF/:
HI-CN-05.pa-185001_mon.nc
HI-CN-05.pa-185002_mon.nc
HI-CN-05.pa-185003_mon.nc
HI-CN-05.pa-185004_mon.nc
HI-CN-05.pa-185005_mon.nc
HI-CN-05.pa-185006_mon.nc
HI-CN-05.pa-185007_mon.nc
HI-CN-05.pa-185008_mon.nc
HI-CN-05.pa-185009_mon.nc
HI-CN-05.pa-185010_mon.nc

All monthly variables are stored in a single netCDF file.

Currently, our config-developer.yml is configured as follows:

ACCESS:
  cmor_strict: false
  input_dir:
    default:
      - '{dataset}/{sub_dataset}/{exp}/{modeling_realm}/netCDF'
  input_file:
    default: '{sub_dataset}.{special_attr}-*.nc'
  output_file: '{project}_{dataset}_{mip}_{exp}_{institute}_{sub_dataset}_{special_attr}_{short_name}'
  cmor_type: 'CMIP6'
  cmor_default_table_prefix: 'CMIP6_'

This configuration results in ESMValCore analyzing all files and variables, which consumes excessive time and resources.

I have suggested the following to @rhaegar325:

1. Modify the input_file default to include a time facet. This change should prevent loading more data than necessary when working within a constrained time range.
2. Utilize the timerange facet in the input_file name. We could implement the approach described here.
3. Leverage IRIS constrained loading capabilities. Details are available here. This could potentially speed up the process.
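To make the file-selection idea in these suggestions concrete, here is a minimal, standalone sketch. It is not ESMValCore's own logic: the function name `select_files`, the `(year, month)` tuple arguments, and the assumption that every file name ends in `-YYYYMM_mon.nc` (as in the listing above) are all illustrative.

```python
import re
from pathlib import Path

# Assumed file name pattern, e.g. "HI-CN-05.pa-185001_mon.nc":
# a 6-digit YYYYMM stamp follows a "-" and precedes "_mon.nc".
_STAMP = re.compile(r"-(\d{4})(\d{2})_mon\.nc$")


def select_files(filenames, start, end):
    """Return the files whose YYYYMM stamp lies in [start, end], inclusive.

    `start` and `end` are (year, month) tuples. Files whose names do not
    match the assumed pattern are skipped.
    """
    selected = []
    for name in filenames:
        match = _STAMP.search(Path(name).name)
        if match is None:
            continue
        stamp = (int(match.group(1)), int(match.group(2)))
        if start <= stamp <= end:
            selected.append(name)
    return sorted(selected)


# Hypothetical usage with the ten monthly files listed above.
files = [f"HI-CN-05.pa-1850{month:02d}_mon.nc" for month in range(1, 11)]
subset = select_files(files, start=(1850, 3), end=(1850, 5))
print(subset)
```

Comparing `(year, month)` tuples keeps the range check free of string-ordering pitfalls and makes both bounds inclusive.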

Given that all variables are stored in a single file, we are aware that this setup is not optimal. However, we currently have no alternative.

Any advice would be greatly welcome!

Thanks,
R

bouweandela · 2024-07-23T09:59:02Z

Selecting the files within the specified timerange should already work, as it does for CMIP6 etc. Did you check in the main_log_debug.txt log file which files are actually getting loaded? In order for this to work, your timerange does need to be recognized by the code here:

ESMValCore/esmvalcore/local.py

Lines 66 to 68 in 546937f

    
    def _get_start_end_date(
            file: str | Path | LocalFile | ESGFFile) -> tuple[str, str]:
        """Get the start and end dates as a string from a file name.

> Leverage IRIS constrained loading capabilities.

You could probably implement this in the fix_file method of

ESMValCore/esmvalcore/cmor/_fixes/access/access_esm1_5.py

Line 11 in 546937f

class AllVars(AccessFix):

and make it return a cube instead of a filename, and then modify esmvalcore.preprocessor.load so that it skips the actual load step if the input is already a cube, similar to what I tried out in #2454. In the longer term, we would like to implement a more flexible loading mechanism (see #2371), but we will first need to find funding for that.

rbeucher · 2024-07-30T01:16:02Z

Thanks @bouweandela, that is really useful. We are going to look into this.

valeriupredoi · 2024-07-30T13:40:06Z

Time gating is one side of the problem, as Bouwe points out. The other is variable selection, which we no longer do at load time (we used to have an iris Constraint at the raw-load point, though). What you can do about it is overload the load with a constraint, see load_raw and its usage. If this is a bit too much of a hassle, you can perform the single-variable loading via a fix, so that it runs ahead of everything else: a rather agricultural solution, but a fairly hassle-free one in me books 🍺
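The single-variable loading discussed in this thread could look roughly like the sketch below, assuming iris is installed. `iris.NameConstraint` and `iris.load_cube` are existing iris calls; the helper name `load_single_variable` and its `raw_name` argument are hypothetical, and this is not how ESMValCore currently wires loading or fixes together.

```python
def load_single_variable(filepath, raw_name):
    """Load only the cube whose netCDF variable name equals `raw_name`.

    Hypothetical helper: `raw_name` is the variable name as stored in the
    raw model output, not the CMOR short name.
    """
    # Imported lazily so that this module can be imported without iris.
    import iris

    # Match on the netCDF variable name rather than the standard name.
    constraint = iris.NameConstraint(var_name=raw_name)
    # load_cube raises an error unless exactly one cube matches.
    return iris.load_cube(str(filepath), constraint)
```

A call would take a form such as `load_single_variable("HI-CN-05.pa-185001_mon.nc", "fld_s03i236")`, where the variable name is only an illustrative placeholder for whatever name the raw file actually uses.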

rhaegar325 added the enhancement New feature or request label Jul 22, 2024

rbeucher changed the title ~~A specific method to load ACCESS-ESM raw data~~ Speed up loading for ACCESS-ESM non-CMOR datasets Jul 23, 2024
