[WIP] File system/Data storage abstraction #1640

Open
wants to merge 15 commits into
base: master

@MTachon MTachon commented Apr 29, 2024

Overview

This PR abstracts away the local or remote file system (or byte storage) that holds the data used by the providers. The implementation relies on the file-system interface provided by fsspec. fsspec already supports various file systems and cloud storage services through built-in implementations. It can also use other known external implementations, and new backends can be implemented and registered.

The data storage abstraction lives mostly in the implementation of the BaseProvider, which gains a new instance attribute, fs. This attribute is an fsspec file-system interface and is inherited by the other providers, which can use it in their own implementation to access the data.

The instantiation of fsspec file-system objects may use configuration variables set in the providers section (some backends used by fsspec may also use environment variables).

This PR introduces a new optional file_system section in the providers section of pygeoapi's runtime configuration. If the file_system section is omitted, the fs attribute of the BaseProvider is an instance of the fsspec.implementations.local.LocalFileSystem class. In that case, other providers can use the fs attribute to access and read files on the "local" file system. In the provider implementations, calls to the built-in open function can then be replaced by calls to the open method of the LocalFileSystem:

# Calls of 'open' builtin function...
with open(self.data, mode='rt') as f:
    ...

# ... can be replaced with calls of 'open' method of LocalFileSystem instance
with self.fs.open(self.data, mode='rt') as f:
    ...
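
As a minimal sketch of this pattern, the snippet below uses a hypothetical DemoProvider class (a stand-in for illustration, not pygeoapi's BaseProvider) that keeps a LocalFileSystem on self.fs and reads its data through it. The read_text and demo names are assumptions as well.

```python
import os
import tempfile

from fsspec.implementations.local import LocalFileSystem


class DemoProvider:
    """Hypothetical stand-in for a provider; not pygeoapi's BaseProvider."""

    def __init__(self, provider_def):
        self.data = provider_def["data"]
        # Assumption: with no 'file_system' section configured,
        # fall back to the local file system.
        self.fs = LocalFileSystem()

    def read_text(self):
        # Same call shape as the built-in open(), but routed through fsspec.
        with self.fs.open(self.data, mode="rt") as f:
            return f.read()


def demo():
    # Write a small file locally, then read it back through the provider.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "obs.csv")
        with open(path, "w") as f:
            f.write("id,value\n1,42\n")
        return DemoProvider({"data": path}).read_text()
```

Because the provider only talks to self.fs, swapping LocalFileSystem for another fsspec implementation would not require changes to read_text.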

The file_system section, if present in pygeoapi's configuration, has one mandatory field, protocol. Its value must be one of the protocols supported by fsspec (see fsspec.available_protocols()). For faster access, the data can be cached locally (e.g. when it lives on remote storage). Caching is not suitable for very large datasets, since the data has to be downloaded on the first query, which costs both time and disk space. To cache the data locally, pygeoapi's runtime can be configured as follows:

providers:
    - type: ...
      ...
      data: <my-bucket>/<key>
      file_system:  # optional
          protocol: gs  # mandatory, anything from the `fsspec.available_protocols()` list
          storage_options:  # optional
              # Credentials and other keywords parameters, specific to implementations supported by fsspec.
              # See https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations and https://filesystem-spec.readthedocs.io/en/latest/api.html#external-implementations
              ...
          cache_storage: /path/to/cached/data  # optional, if not given, a temporary directory (cleaned up when process ends) will be used
          cache_options:  # optional
              # see https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.cached.WholeFileCacheFileSystem
              expiry_time: ...
              ...
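
One way such a section could be turned into an fsspec file system is sketched below. This is an illustration, not the code in this PR: build_file_system is a hypothetical helper, and it assumes that caching is only enabled when cache_storage or cache_options is present. It relies on fsspec's "filecache" protocol (WholeFileCacheFileSystem), whose cache_storage default of "TMP" selects a temporary directory.

```python
import fsspec
from fsspec.implementations.local import LocalFileSystem


def build_file_system(provider_def):
    """Hypothetical helper: map a provider definition to an fsspec file system."""
    fs_def = provider_def.get("file_system")
    if fs_def is None:
        # No 'file_system' section: use the local file system.
        return LocalFileSystem()

    protocol = fs_def["protocol"]  # the only mandatory field
    if protocol not in fsspec.available_protocols():
        raise ValueError(f"Unsupported fsspec protocol: {protocol!r}")

    storage_options = fs_def.get("storage_options") or {}

    # Assumption: caching is opt-in via either cache key.
    wants_cache = "cache_storage" in fs_def or "cache_options" in fs_def
    if not wants_cache:
        return fsspec.filesystem(protocol, **storage_options)

    # Wrap the target file system in a whole-file local cache.
    return fsspec.filesystem(
        "filecache",
        target_protocol=protocol,
        target_options=storage_options,
        cache_storage=fs_def.get("cache_storage") or "TMP",
        **(fs_def.get("cache_options") or {}),
    )
```

Validating the protocol up front gives a clear configuration error at startup instead of a failure on the first data request.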

Related Issue / discussion

#824

Additional information

  • This PR also adapts the implementation of several providers (GeoJSON, CSV, FileSystem, Rasterio, Xarray) to the data storage abstraction, which should give a good starting point for adapting other providers.
  • For now, the data storage abstraction cannot be used by the OGRProvider, because the Open method of drivers from osgeo.gdal/osgeo.ogr accepts only a path (str) to the data, not a file-like handle. However, this provider can already access data on remote storage through GDAL's virtual file systems. Any environment variables this requires must be set in pygeoapi's parent process.
  • fsspec was added to requirements.txt because it is used in pygeoapi/provider/base.py. fsspec has had an OS package for Ubuntu for some time now, and does not seem to have known high or critical vulnerabilities.
  • fsspec ships with a set of built-in implementations. Packages for the external implementations it supports can be added to the requirements-provider.txt file.
  • netCDF4 was replaced by h5py and h5netcdf to allow reading from remote storage with xarray.
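
For the OGRProvider case above, the data setting would carry a GDAL virtual file system path rather than go through fsspec. The sketch below shows how a bucket/key pair could be mapped to such a path. The to_gdal_vsi_path helper and the GDAL_VSI_PREFIXES mapping are hypothetical and deliberately non-exhaustive, and the example bucket and key are made up. Credentials still have to be supplied through GDAL's own configuration.

```python
# Assumption: a small, non-exhaustive mapping from fsspec-style protocol
# names to GDAL virtual file system prefixes.
GDAL_VSI_PREFIXES = {
    "s3": "/vsis3/",
    "gs": "/vsigs/",
    "az": "/vsiaz/",
}


def to_gdal_vsi_path(protocol, data):
    """Hypothetical helper: build a GDAL /vsi path from a '<bucket>/<key>' string."""
    try:
        prefix = GDAL_VSI_PREFIXES[protocol]
    except KeyError:
        raise ValueError(f"No GDAL prefix known for protocol {protocol!r}")
    # Avoid a doubled slash if the data string starts with one.
    return prefix + data.lstrip("/")


# The resulting string can be passed wherever GDAL/OGR expects a path.
example_path = to_gdal_vsi_path("gs", "my-bucket/data/roads.gpkg")
```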

Dependency policy (RFC2)

  • I have ensured that this PR meets RFC2 requirements

Updates to public demo

Contributions and licensing

(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)

  • I'd like to contribute [feature X|bugfix Y|docs|something else] to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution
  • I have already previously agreed to the pygeoapi Contributions and Licensing Guidelines

@tomkralidis
Member

@MTachon I see this PR is marked as WIP. Is this still the case or is it ready for review? Thanks.
