
Preprocessing/Conversion Module #139

Open
Zeitsperre opened this issue Jun 30, 2023 · 0 comments · May be fixed by #165
Labels
enhancement New feature or request

Comments

@Zeitsperre
Collaborator

Proposal

Approach

The scope of what is currently named miranda.convert should be explicitly focused on the treatment of xarray-compatible data formats (i.e. NetCDF and Zarr), operating from Dataset objects to Dataset objects. All functionality focused on conversion from other formats (i.e. CSV, FWF, MySQL, etc.) should live in a separate module. The API for these conversion functions should be similar between projects, with a functional layout as follows:

  • _function_that_contains_logic_to_convert_to_xarray()
    • Private, lots of configuration allowed.
    • Refers to a different source for metadata configurations (no metadata definitions in code logic).
  • _function_that_helps_write_out_datasets_as_files()
    • Private, used to pass keyword arguments to file-naming function and leverage miranda.io for outputting files.
    • Called from the public function via its call signature (e.g. output_dir=Path() and file_format={"zarr", "netcdf"}).
  • function_that_can_be_called_by_user()
    • Public, should have a near-identical call signature between projects.
    • Should be written in such a way that dask or multiprocessing kwargs can be passed along to it for asynchronous conversion (if possible).

Values and metadata corrections should NOT be performed at this step. The goal is to have objects that are easily passed to the existing miranda.convert pipeline. Conversion to CF-compliant values should not be performed here.

Regardless of project, these functions should make use of a handful of common configurations:

  • IO:

    • Function that writes out NetCDF or Zarr files, separated by variable (for gridded data) or by station × variable (for station observations).
    • Function that names files using a standardized approach based on the keyword arguments supplied to it.
  • Metadata:

    • Like in miranda.convert, JSON files should be leveraged for populating the metadata of newly converted datasets.
    • For simplicity (until a common JSON schema is determined), the JSON files used for data corrections should not be shared with those for data conversion.
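
As an illustration of the file-naming piece, a standardized helper driven purely by keyword arguments might look like the sketch below. The function name, facet order, and separator are invented for the example; the real convention would come from miranda.io:

```python
# Hypothetical file-naming helper: builds a standardized filename from the
# keyword arguments supplied to it. Facet order and extensions are
# illustrative only, not an existing miranda convention.
EXTENSIONS = {"netcdf": "nc", "zarr": "zarr"}


def name_output_file(*, variable: str, frequency: str, project: str,
                     file_format: str = "netcdf") -> str:
    if file_format not in EXTENSIONS:
        raise ValueError(f"Unsupported file format: {file_format}")
    return f"{variable}_{frequency}_{project}.{EXTENSIONS[file_format]}"
```

Keyword-only arguments keep the call sites self-documenting and make it easy for each project's private writer to pass its facets straight through.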

It's not clear to me whether we should drill down further by adding sub-submodules per data provider. To be determined.

@Zeitsperre Zeitsperre added the enhancement New feature or request label Jun 30, 2023
@Zeitsperre Zeitsperre self-assigned this Jun 30, 2023
@Zeitsperre Zeitsperre linked a pull request Feb 29, 2024 that will close this issue