
Preprocessing/Conversion Module #139

Open
Zeitsperre opened this issue Jun 30, 2023 · 0 comments · May be fixed by #165
Labels
enhancement New feature or request

Comments

@Zeitsperre
Collaborator

Proposal

Approach

The scope of what is currently named miranda.convert should be explicitly focused on the treatment of xarray-compatible data formats (i.e. NetCDF and Zarr), operating from Dataset objects to Dataset objects. All functionality focused on conversion from other formats (i.e. CSV, FWF, MySQL, etc.) should live in a separate module. The API for these conversion functions should be similar between projects, with a functional layout as follows:

  • _function_that_contains_logic_to_convert_to_xarray()
    • Private, lots of configuration allowed.
    • Refers to a different source for metadata configurations (no metadata definitions in code logic).
  • _function_that_helps_write_out_datasets_as_files()
    • Private, used to pass keyword arguments to file-naming function and leverage miranda.io for outputting files.
    • Called from the public function via its call signature (e.g. output_dir=Path() and file_format={"zarr", "netcdf"}).
  • function_that_can_be_called_by_user()
    • Public, should have a near-identical call signature between projects.
    • Should be written in such a way that dask or multiprocessing kwargs can be passed along to it for asynchronous conversion (if possible).

Values and metadata corrections should NOT be performed at this step. The goal is to have objects that are easily passed to the existing miranda.convert pipeline. Conversion to CF-compliant values should not be performed here.

Regardless of project, these functions should make use of a handful of common configurations:

  • IO:

    • Function that writes out NetCDF or Zarr files, separated by variable (for gridded data) or by station × variable (for station observations).
    • Function that names files using a standardized approach based on the keyword arguments supplied to it.
  • Metadata:

    • Like in miranda.convert, JSON files should be leveraged for populating the metadata of newly converted datasets.
    • For simplicity (until a common JSON schema is determined), the JSON files used for data corrections should not be shared with those for data conversion.
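
As an illustration of the file-naming piece, a standardized helper driven purely by keyword arguments might look like the sketch below. The function name, facet order, and separator are invented for the example; the real convention would come from miranda.io:

```python
# Hypothetical file-naming helper: builds a standardized filename from the
# keyword arguments supplied to it. Facet order and extensions are
# illustrative only, not an existing miranda convention.
EXTENSIONS = {"netcdf": "nc", "zarr": "zarr"}


def name_output_file(*, variable: str, frequency: str, project: str,
                     file_format: str = "netcdf") -> str:
    if file_format not in EXTENSIONS:
        raise ValueError(f"Unsupported file format: {file_format}")
    return f"{variable}_{frequency}_{project}.{EXTENSIONS[file_format]}"
```

Keyword-only arguments keep the call sites self-documenting and make it easy for each project's private writer to pass its facets straight through.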

It's not clear to me whether we should drill down further by adding sub-submodules per data provider. To be determined.

@Zeitsperre Zeitsperre added the enhancement New feature or request label Jun 30, 2023
@Zeitsperre Zeitsperre self-assigned this Jun 30, 2023
@Zeitsperre Zeitsperre linked a pull request Feb 29, 2024 that will close this issue