ERA5 catalog progress + Python issue #36

Open
meteorologist15 opened this issue Aug 15, 2024 · 7 comments
meteorologist15 commented Aug 15, 2024

The manual catalog for ERA5 data, coupled with the JSON generated by the CatalogBuilder, can be ingested by intake-esm, but only partially. The unmodified catalog contains the following data:

activity_id,institution_id,source_id,experiment_id,frequency,modeling_realm,table_id,member_id,variable_id,temporal_subset,chunk_freq,grid_label,platform,dimensions,cell_methods,path
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Hourly_Data_On_Pressure_Levels,hourly,atmos,,1,specific_humidity,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/1000hPa/1hr-timestep/annual_file-range/specific_humidity/ERA5_1hr_specific_humidity_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,atmos,,1,10m_u_component_of_wind,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/reanalysis/global/annual_file-range/Wind/u_10m/ERA5_monthly_averaged_10m_u_component_of_wind_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,ocean,,1,peak_wave_period,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/reanalysis/global/annual_file-range/Ocean_waves/peak_wave_period/ERA5_monthly_averaged_peak_wave_period_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,atmos,,1,mean_surface_downward_short_wave_radiation_flux,1940-2023,annual,,,longitude|latitude|number|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/ensemble_members/global/annual_file-range/Mean_rates/mean_surface_downward_short-wave_rad_flux/ERA5_monthly_averaged_mean_surface_downward_short_wave_radiation_flux_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Single_Levels,monthly,atmos,,1,10m_wind_speed,1940-2023,annual,,,longitude|latitude|number|time,,/uda/ERA5/Monthly_Averaged_Data_On_Single_Levels/ensemble_members/global/annual_file-range/Wind/10m_wind_speed/ERA5_monthly_averaged_10m_wind_speed_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Hourly_Data_On_Single_Levels,hourly,atmos,,1,mean_surface_downward_short_wave_radiation_flux,1940-2023,annual,,,longitude|latitude|time,,/uda/ERA5/Hourly_Data_On_Single_Levels/ensemble_mean/global/3hr-timestep/annual_file-range/Mean_rates/mean_surface_downward_short-wave_rad_flux/ERA5_3hr_mean_surface_downward_short_wave_radiation_flux_2023.nc
ECMWF_Reanalysis_Phase_5,ECMWF,ECMWF_Reanalysis,Monthly_Averaged_Data_On_Pressure_Levels,monthly,atmos,,1,fraction_of_cloud_cover,1940-2023,annual,,,longitude|latitude|level|time,,/uda/ERA5/Monthly_Averaged_Data_On_Pressure_Levels/reanalysis/global/all_levels/annual_file-range/cloud_cover_fraction/ERA5_monthly_averaged_fraction_of_cloud_cover_2022.nc
ECMWF_Reanalysis_Phase_5_Land,ECMWF,ECMWF_Reanalysis,ERA5-Land_Monthly_Averaged_Data,monthly,land,,1,lake_mix_layer_temperature,1950-2023,annual,,,longitude|latitude|time,,/uda/ERA5/ERA5-Land_Monthly_Averaged_Data/reanalysis/global/annual_file-range/Lakes/lake_mix-layer_temp/ERA5-Land_monthly_averaged_lake_mix_layer_temperature_2023.nc
ECMWF_Reanalysis_Phase_5_Extra,ECMWF,ECMWF_Reanalysis,ERA5_Extra,hourly,atmos,,1,updraught,1979-2023,monthly,,1,initial_time0_hours|forecast_time0|lv_HYBL0|lat_0|lon_0|ncl_strlen_0,,/uda/ERA5/ERA5_Extra/reanalysis/global/monthly_file-range/updraught/ERA5MARS_updraught_202012.nc4
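
For reference, intake-esm pairs a CSV like this with a JSON descriptor that declares the catalog columns, the asset (path) column, and how rows are grouped into datasets. The sketch below, written from Python, shows roughly what such a descriptor contains; it is illustrative only, and the exact contents of ERA5_initCatalog_slimmed.json may differ. The groupby attributes are taken from the dataset-key template that intake-esm reports further down.

import json

# Illustrative esm-collection-spec descriptor for the CSV above (a sketch,
# not the actual ERA5_initCatalog_slimmed.json). The groupby_attrs mirror the
# key template reported by intake-esm:
# 'source_id.experiment_id.frequency.modeling_realm.member_id.chunk_freq'
descriptor = {
    "esmcat_version": "0.1.0",
    "id": "ERA5_initCatalog_slimmed",
    "description": "Manual ERA5 catalog (sketch)",
    "catalog_file": "ERA5_initCatalog_slimmed.csv",
    "attributes": [{"column_name": c} for c in [
        "activity_id", "institution_id", "source_id", "experiment_id",
        "frequency", "modeling_realm", "member_id", "variable_id",
        "temporal_subset", "chunk_freq",
    ]],
    "assets": {"column_name": "path", "format": "netcdf"},
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["source_id", "experiment_id", "frequency",
                          "modeling_realm", "member_id", "chunk_freq"],
        "aggregations": [{"type": "union", "attribute_name": "variable_id"}],
    },
}

with open("ERA5_descriptor_sketch.json", "w") as f:
    json.dump(descriptor, f, indent=2)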

The catalog is then opened with intake-esm:

>>> data_catalog_3 = intake.open_esm_datastore("ERA5_initCatalog_slimmed.json")
/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/cat.py:269: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
  self._df.sample(20, replace=True)
>>> data_catalog_3.df
                       activity_id institution_id         source_id  ...                                         dimensions cell_methods                                               path
0         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Single_Levels/reanaly...
1         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Pressure_Levels/reana...
2         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
3         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
4         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                     longitude|latitude|number|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
5         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                     longitude|latitude|number|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Single_Leve...
6         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Single_Levels/ensembl...
7         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/Hourly_Data_On_Single_Levels/reanaly...
8         ECMWF_Reanalysis_Phase_5          ECMWF  ECMWF_Reanalysis  ...                      longitude|latitude|level|time          NaN  /uda/ERA5/Monthly_Averaged_Data_On_Pressure_Le...
9    ECMWF_Reanalysis_Phase_5_Land          ECMWF  ECMWF_Reanalysis  ...                            longitude|latitude|time          NaN  /uda/ERA5/ERA5-Land_Monthly_Averaged_Data/rean...
10  ECMWF_Reanalysis_Phase_5_Extra          ECMWF  ECMWF_Reanalysis  ...  initial_time0_hours|forecast_time0|lv_HYBL0|la...          NaN  /uda/ERA5/ERA5_Extra/reanalysis/global/monthly...

[11 rows x 16 columns]

Calling to_dataset_dict() then produces the following error:

>>> dsets_3 = data_catalog_3.to_dataset_dict()

--> The keys in the returned dictionary of datasets are constructed as follows:
        'source_id.experiment_id.frequency.modeling_realm.member_id.chunk_freq'
/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/indexing.py:1452: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  value = value[(slice(None),) * axis + (subkey,)]
/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/indexing.py:1452: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  value = value[(slice(None),) * axis + (subkey,)]
/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/indexing.py:1452: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  value = value[(slice(None),) * axis + (subkey,)]
Traceback (most recent call last):
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 259, in _open_dataset
    self._ds = xr.combine_by_coords(datasets, **self.xarray_combine_by_coords_kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 958, in combine_by_coords
    concatenated_grouped_by_data_vars = tuple(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 959, in <genexpr>
    _combine_single_variable_hypercube(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 630, in _combine_single_variable_hypercube
    concatenated = _combine_nd(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 232, in _combine_nd
    combined_ids = _combine_all_along_first_dim(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 267, in _combine_all_along_first_dim
    new_combined_ids[new_id] = _combine_1d(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/combine.py", line 290, in _combine_1d
    combined = concat(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/concat.py", line 252, in concat
    return _dataset_concat(
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/xarray/core/concat.py", line 597, in _dataset_concat
    raise ValueError(
ValueError: coordinate 't2m' not present in all datasets.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/pydantic/deprecated/decorator.py", line 55, in wrapper_function
    return vd.call(*args, **kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/pydantic/deprecated/decorator.py", line 150, in call
    return self.execute(m)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/pydantic/deprecated/decorator.py", line 222, in execute
    return self.raw_function(**d, **var_kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/core.py", line 686, in to_dataset_dict
    raise exc
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/core.py", line 682, in to_dataset_dict
    key, ds = task.result()
  File "/net2/ker/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/net2/ker/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/net2/ker/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/core.py", line 824, in _load_source
    return key, source.to_dask()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 272, in to_dask
    self._load_metadata()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake/source/base.py", line 283, in _load_metadata
    self._schema = self._get_schema()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 208, in _get_schema
    self._open_dataset()
  File "/net2/ker/anaconda3/lib/python3.9/site-packages/intake_esm/source.py", line 264, in _open_dataset
    raise ESMDataSourceError(
intake_esm.source.ESMDataSourceError: Failed to load dataset with key='ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual'
                 You can use `cat['ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual'].df` to inspect the assets/files for this key.
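
As the error text suggests, the assets behind the failing key can be listed before deciding which files to drop. A minimal sketch, continuing the session above (the column subset is just for readability):

# Inspect the files intake-esm tried to combine under the failing key.
failing_key = 'ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual'
assets = data_catalog_3[failing_key].df
print(assets[['variable_id', 'path']])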

After removing the offending datasets, in this case the files containing t2m (2-meter temperature) and blh (boundary layer height), I am able to successfully generate output from the "to_dataset_dict()" method. Example below:

{'ECMWF_Reanalysis.Monthly_Averaged_Data_On_Single_Levels.monthly.ocean.1.annual': <xarray.Dataset>
Dimensions:    (longitude: 720, latitude: 361, time: 12)
Coordinates:
  * longitude  (longitude) float32 0.0 0.5 1.0 1.5 ... 358.0 358.5 359.0 359.5
  * latitude   (latitude) float32 90.0 89.5 89.0 88.5 ... -89.0 -89.5 -90.0
  * time       (time) datetime64[ns] 2023-01-01 2023-02-01 ... 2023-12-01
Data variables:
    pp1d       (time, latitude, longitude) float32 dask.array<chunksize=(12, 361, 720), meta=np.ndarray>
Attributes: (12/17)
    Conventions:                       CF-1.6
    history:                           2024-05-01 21:23:52 GMT by grib_to_net...
    intake_esm_vars:                   ['peak_wave_period']
    intake_esm_attrs:activity_id:      ECMWF_Reanalysis_Phase_5
    intake_esm_attrs:institution_id:   ECMWF
    intake_esm_attrs:source_id:        ECMWF_Reanalysis
    ...                                ...
    intake_esm_attrs:temporal_subset:  1940-2023
    intake_esm_attrs:chunk_freq:       annual
    intake_esm_attrs:dimensions:       longitude|latitude|time
    intake_esm_attrs:path:             /uda/ERA5/Monthly_Averaged_Data_On_Sin...
    intake_esm_attrs:_data_format_:    netcdf
    intake_esm_dataset_key:            ECMWF_Reanalysis.Monthly_Averaged_Data..., 'ECMWF_Reanalysis.Hourly_Data_On_Pressure_Levels.hourly.atmos.1.annual': <xarray.Dataset>
Dimensions:    (longitude: 1440, latitude: 721, time: 8760)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * time       (time) datetime64[ns] 2023-01-01 ... 2023-12-31T23:00:00
Data variables:
    q          (time, latitude, longitude) float32 dask.array<chunksize=(8760, 721, 1440), meta=np.ndarray>
Attributes: (12/17)
    Conventions:                       CF-1.6
    history:                           2024-04-13 15:24:29 GMT by grib_to_net...
    intake_esm_vars:                   ['specific_humidity']
    intake_esm_attrs:activity_id:      ECMWF_Reanalysis_Phase_5
    intake_esm_attrs:institution_id:   ECMWF
    intake_esm_attrs:source_id:        ECMWF_Reanalysis
    ...                                ...
    intake_esm_attrs:temporal_subset:  1940-2023
    intake_esm_attrs:chunk_freq:       annual
    intake_esm_attrs:dimensions:       longitude|latitude|time
    intake_esm_attrs:path:             /uda/ERA5/Hourly_Data_On_Pressure_Leve...
    intake_esm_attrs:_data_format_:    netcdf
    intake_esm_dataset_key:            ECMWF_Reanalysis.Hourly_Data_On_Pressu..., 'ECMWF_Reanalysis.Monthly_Averaged_Data_On_Single_Levels.monthly.atmos.1.annual': <xarray.Dataset>
Dimensions:    (longitude: 1440, latitude: 721, time: 12)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * time       (time) datetime64[ns] 2023-01-01 2023-02-01 ... 2023-12-01
Data variables:
    u10        (time, latitude, longitude) float32 dask.array<chunksize=(12, 721, 1440), meta=np.ndarray>
Attributes: (12/17)
    Conventions:                       CF-1.6
    history:                           2024-05-03 23:14:39 GMT by grib_to_net...
    intake_esm_vars:                   ['10m_u_component_of_wind']
    intake_esm_attrs:activity_id:      ECMWF_Reanalysis_Phase_5
    intake_esm_attrs:institution_id:   ECMWF
    intake_esm_attrs:source_id:        ECMWF_Reanalysis
    ...                                ...
    intake_esm_attrs:temporal_subset:  1940-2023
    intake_esm_attrs:chunk_freq:       annual
    intake_esm_attrs:dimensions:       longitude|latitude|time
    intake_esm_attrs:path:             /uda/ERA5/Monthly_Averaged_Data_On_Sin...
    intake_esm_attrs:_data_format_:    netcdf
    intake_esm_dataset_key:            ECMWF_Reanalysis.Monthly_Averaged_Data..., 'ECMWF_Reanalysis.Hourly_Data_On_Single_Levels.hourly.atmos.1.annual': <xarray.Dataset>

Path to unmodified catalog (CSV): /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed.csv
Path to unmodified catalog's associated JSON: /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed.json

Path to modified catalog (CSV): /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed_modified.csv
Path to modified catalog's associated JSON: /nbhome/Kristopher.Rand/uda/catalogs/ERA5_initCatalog_slimmed_modified.json
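
One way to produce the modified catalog is to drop the offending rows from the CSV with pandas and re-open the result. A sketch, assuming the t2m and blh files can be identified by their variable_id values; "2m_temperature" and "boundary_layer_height" are guesses at those values, not names confirmed by the catalog above:

import pandas as pd

# Drop the rows whose variables break xr.combine_by_coords.
# "2m_temperature" and "boundary_layer_height" are assumed variable_id values
# for the t2m and blh files; adjust to whatever the catalog actually uses.
df = pd.read_csv("ERA5_initCatalog_slimmed.csv")
drop_vars = ["2m_temperature", "boundary_layer_height"]
df[~df["variable_id"].isin(drop_vars)].to_csv(
    "ERA5_initCatalog_slimmed_modified.csv", index=False
)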

@aradhakrishnanGFDL (Collaborator)

Thanks @meteorologist15! This helps us see how we can use the catalog builder to generate the modified CSV, as discussed.


aradhakrishnanGFDL commented Oct 21, 2024

TODO: open new issues for development and testing with the catalog builder.

@meteorologist15 (Author)

Catalog example generated with the Catalog Builder for the ERA5 dataset (pressure levels, geopotential variable, 300 hPa):

activity_id,institution_id,source_id,experiment_id,frequency,realm,table_id,member_id,grid_label,variable_id,time_range,chunk_freq,grid_label,platform,dimensions,cell_methods,path
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1940.nc
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1941.nc
,,,Hourly_Data_On_Pressure_Levels,,,,,,geopotential,,,,,,,/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1942.nc
...etc

The categories preserved are experiment_id, variable_id, and path.

The configuration used:

headerlist: ["activity_id", "institution_id", "source_id", "experiment_id",
                  "frequency", "realm", "table_id",
                  "member_id", "grid_label", "variable_id",
                  "time_range", "chunk_freq","grid_label","platform","dimensions","cell_methods","path"]

output_path_template: ['NA', 'NA', 'experiment_id', 'NA', 'NA', 'NA', 'NA', 'NA', 'variable_id']

output_file_template: ['NA', 'NA', 'variable_id', 'NA']

input_path: "/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/6hr-timestep/annual_file-range/geopotential/"

output_path: "/nbhome/Kristopher.Rand/uda/catalogs/test_catalogbuilder"
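
To make the template semantics concrete, the sketch below lines the two templates up against one of the cataloged paths. The actual gfdlcrawler matching logic may differ in details (direction of matching, handling of the filename extension), so treat this as an illustration of why only experiment_id and variable_id survive, not as the builder's implementation.

# Rough illustration of how the templates map path and filename components
# to catalog columns for one file; not the gfdlcrawler implementation.
path = ("/uda/ERA5/Hourly_Data_On_Pressure_Levels/reanalysis/global/300hPa/"
        "6hr-timestep/annual_file-range/geopotential/ERA5_6hr_geopotential_1940.nc")

dirs = path.rsplit("/", 1)[0].strip("/").split("/")
fname = path.rsplit("/", 1)[1]

output_path_template = ["NA", "NA", "experiment_id", "NA", "NA",
                        "NA", "NA", "NA", "variable_id"]
output_file_template = ["NA", "NA", "variable_id", "NA"]

from_dirs = {k: v for k, v in zip(output_path_template, dirs) if k != "NA"}
from_file = {k: v for k, v in zip(output_file_template, fname.split("_")) if k != "NA"}

print(from_dirs)  # {'experiment_id': 'Hourly_Data_On_Pressure_Levels', 'variable_id': 'geopotential'}
print(from_file)  # {'variable_id': 'geopotential'}

Note that a filename like ERA5_monthly_averaged_10m_u_component_of_wind_2023.nc would not line up with this simple underscore split, which is the parsing problem raised in the later comments.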

@aradhakrishnanGFDL (Collaborator)

@meteorologist15 I’m trying to run this. Are you using the main branch from this repository?

@meteorologist15 (Author)

I locally committed a small change to gfdlcrawler to account for filenames without a "." in them. I am waiting to commit it to a branch on GitHub.


meteorologist15 commented Oct 24, 2024

Three separate issues exist: 1) filenames with multi-word variable names separated by underscores, if the "_" character in filenames is to be used as the separator; 2) if "_" is used as a separator, properly capturing/resolving "monthly_averaged" in the filenames of monthly averaged datasets, which may require more fundamental changes to the crawler script; and 3) variable names in the path that differ from those in the filename. One possible parsing approach is sketched below.
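
One possible direction, sketched here only as an illustration and not as the change committed to gfdlcrawler: anchor the parse on the known product and date tokens, so that the variable name, however many underscores it contains, is whatever sits between them. The sample filenames come from the catalog shown earlier; issue 3 (directory names that differ from the filename's variable spelling) would still need a separate mapping or a decision about which spelling to record.

import re

# Sketch: tolerate multi-word variable names and the "monthly_averaged" token
# by anchoring on the product prefix and the trailing date, rather than
# treating every "_" as a field separator. Assumes filenames look like
# <product>_[<frequency>_]<variable>_<date>.<ext>.
FNAME_RE = re.compile(
    r"^(?P<product>ERA5(?:-Land|MARS)?)_"
    r"(?:(?P<frequency>\d+hr|monthly_averaged)_)?"
    r"(?P<variable>.+?)_"
    r"(?P<date>\d{4,8})\.nc4?$"
)

for fname in [
    "ERA5_1hr_specific_humidity_2023.nc",
    "ERA5_monthly_averaged_10m_u_component_of_wind_2023.nc",
    "ERA5-Land_monthly_averaged_lake_mix_layer_temperature_2023.nc",
    "ERA5MARS_updraught_202012.nc4",
]:
    m = FNAME_RE.match(fname)
    print(fname, "->", m.groupdict() if m else "no match")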

@aradhakrishnanGFDL (Collaborator)

> I locally committed a small change to gfdlcrawler to account for filenames without a "." in them. I am waiting to commit it to a branch on GitHub.

Great, thanks. You may use this as a reference, but the fastest approach, not the perfect one, is good enough for now: https://docs.google.com/document/d/17nlIgSQPwL1MFqwHlRV8R5vCpug08r71tM75poGpQtc/edit#heading=h.60aeh5dnv42m

meteorologist15 pushed a commit to meteorologist15/CatalogBuilder that referenced this issue Oct 24, 2024
meteorologist15 pushed a commit to meteorologist15/CatalogBuilder that referenced this issue Oct 24, 2024
meteorologist15 pushed a commit to meteorologist15/CatalogBuilder that referenced this issue Oct 28, 2024
meteorologist15 pushed a commit to meteorologist15/CatalogBuilder that referenced this issue Oct 30, 2024
meteorologist15 pushed a commit to meteorologist15/CatalogBuilder that referenced this issue Oct 31, 2024