Definition of a GeoDataCube #12

Open
RyanAhola opened this issue Jul 26, 2024 · 2 comments

Comments

RyanAhola commented Jul 26, 2024

Building on recent discussion in Testbed-20 (https://gitlab.ogc.org/ogc/T20-GDC/-/issues/14), I am setting up a thread to discuss what the definition of a "geodatacube" is. The goal is for the SWG to come up with a definition that can be referenced by OGC.

m-mohr commented Jul 30, 2024

In the following you can read the current write-up from the OGC GitLab.
It is based on, and is a short version of, https://openeo.org/documentation/1.0/datacubes.html; if clarifications are needed, that page is the best source to check.


GeoDataCubes

Datacubes are multi-dimensional arrays with additional information about their dimensionality. Datacubes can provide a tidy interface for spatiotemporal data as well as for the operations you may want to execute on them. Although arrays are close to raster data, datacubes can also hold vector data. GeoDataCubes (GDC) are a special case of datacubes in that they have one or more spatial dimensions, e.g. x and y. GeoDataCubes for raster data often consist of the dimensions x, y, time and bands. Sometimes they also have multiple temporal dimensions. GeoDataCubes for vector data often consist of geometries, time and a variable. Generally, datacubes can consist of any combination of dimensions; the dimensions are unrestricted. The spatial dimensions of a GeoDataCube may be removed during processing.

The following additional information is usually available for datacubes:

  • the dimensions (see below)
  • a grid cell type / sampling method (area or point)
  • a unit for the values

This additional information could be provided upfront via metadata.
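As a rough illustration of the description above, a raster GeoDataCube can be sketched as a nested array plus its metadata. This is only a sketch in plain Python: the class name `Datacube`, its fields and the rule for recognising spatial dimensions by name are assumptions made for this example, not part of any agreed definition.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Datacube:
    """A nested-list array plus the metadata that makes it a datacube."""
    dims: Tuple[str, ...]   # dimension names, outermost first
    values: List[Any]       # nested lists, one nesting level per dimension
    cell_type: str          # grid cell type / sampling method: "area" or "point"
    unit: str               # unit of the stored values

    def shape(self) -> Tuple[int, ...]:
        """Length of each dimension, derived from the nested lists."""
        sizes = []
        level: Any = self.values
        for _ in self.dims:
            sizes.append(len(level))
            level = level[0]
        return tuple(sizes)

    def is_geodatacube(self) -> bool:
        """True if at least one spatial dimension is present (names are assumed)."""
        return any(d in ("x", "y", "geometry") for d in self.dims)


# 2 time steps x 1 band x 2 rows (y) x 3 columns (x)
cube = Datacube(
    dims=("t", "bands", "y", "x"),
    values=[
        [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
        [[[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]],
    ],
    cell_type="area",
    unit="reflectance",
)
print(cube.shape())           # (2, 1, 2, 3)
print(cube.is_geodatacube())  # True
```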

Dimensions

A dimension refers to a certain axis of a datacube. This includes all variables (e.g. bands), which are represented as dimensions. An exemplary raster datacube could have the spatial dimensions x and y, and the temporal dimension t. Furthermore, it could have a bands dimension, extending into the realm of what kind of information is contained in the cube.

The following properties are usually available for dimensions:

  • a name
  • a type (potential types include: spatial (raster or vector data), temporal and other variables such as bands)
  • labels (usually exposed through textual or numerical representations, in the metadata as nominal values and/or extents)
  • a reference system / unit for the labels (a unit may implicitly be defined through the reference system)
  • a resolution / step size
  • other information specific to the dimension type (e.g. the geometry types for a dimension containing geometries)

Specific implementations of datacubes may prescribe details such as sorting orders or representations of labels. For example, some implementations may always sort temporal labels in their inherent order and encode them in an ISO8601 compliant way.
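The dimension properties listed above, and the example of an implementation that sorts temporal labels and encodes them as ISO 8601, could look roughly like this. The `Dimension` class, its field names and the helper `normalise_temporal` are hypothetical and only meant to make the list concrete.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Union

Label = Union[str, float, int]


@dataclass
class Dimension:
    name: str
    type: str                                # "spatial", "temporal", "bands", ...
    labels: List[Label]
    reference_system: Optional[str] = None   # e.g. an EPSG code or a calendar
    step: Optional[Label] = None             # resolution / step size, if regular


def normalise_temporal(dim: Dimension) -> Dimension:
    """Sort temporal labels chronologically and encode them as ISO 8601 dates."""
    parsed = sorted(date.fromisoformat(str(label)) for label in dim.labels)
    return Dimension(dim.name, dim.type, [d.isoformat() for d in parsed],
                     dim.reference_system, dim.step)


t = Dimension("t", "temporal", ["2021-03-01", "2021-01-01", "2021-02-01"],
              reference_system="Gregorian")
x = Dimension("x", "spatial", [500000, 500010, 500020],
              reference_system="EPSG:32632", step=10)

print(normalise_temporal(t).labels)
# ['2021-01-01', '2021-02-01', '2021-03-01']
```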

Datacubes contain scalar values (e.g. strings, numbers or boolean values), with all other associated attributes stored in dimensions (e.g. coordinates or timestamps). Attributes such as the CRS or the sensor can also be turned into dimensions. Be advised that in such a case, the uniqueness of pixel coordinates may be affected. While (x, y) usually refers to a unique location, this changes to (x, y, CRS) when (x, y) values are reused in other coordinate reference systems (e.g. two neighboring UTM zones).
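The uniqueness issue can be shown with a small sketch. The coordinates and values below are made up for illustration; EPSG:32632 and EPSG:32633 are the two neighboring UTM zones 32N and 33N.

```python
from typing import Dict, Tuple

# Hypothetical samples: the same easting/northing pair occurs in two
# neighbouring UTM zones, so it denotes two different places on Earth.
samples = [
    ((500000.0, 5300000.0, "EPSG:32632"), 0.41),  # UTM zone 32N
    ((500000.0, 5300000.0, "EPSG:32633"), 0.57),  # UTM zone 33N
]

by_xy: Dict[Tuple[float, float], float] = {}
by_xy_crs: Dict[Tuple[float, float, str], float] = {}
for (x, y, crs), value in samples:
    by_xy[(x, y)] = value            # the second sample overwrites the first
    by_xy_crs[(x, y, crs)] = value   # both samples are kept

print(len(by_xy))      # 1
print(len(by_xy_crs))  # 2
```

Once the CRS is a dimension, (x, y) alone is no longer a sufficient key; the lookup has to use (x, y, CRS).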

Common Operations

A couple of operations are commonly applied to datacubes:

  • subset - Restrict the extent of dimensions, e.g. remove all temporal information not in the year 2021
  • apply (map) - Compute values from operations on single values, e.g. multiply all values by 10
  • reduce - Reduce a dimension by computing a single value for all values along a dimension, e.g. compute the maximum value along the temporal dimension
  • resample - Change the layout of a certain dimension into another layout, most likely also changing the resolution of that dimension, e.g. downsampling from daily to monthly values

Every operation that returns a subset of the datacube or the complete datacube is considered to be datacube access.

Every operation that is computing new values is considered to be datacube processing.
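The four operations and the access/processing distinction can be sketched on a toy cube with the dimensions (t, x). This is an illustration in plain Python, not how any particular implementation works; the function names and the use of a monthly mean for resampling are assumptions of this example.

```python
from statistics import mean
from typing import Callable, Dict, List

# Toy cube with dimensions (t, x): one list of x values per time label.
Cube = Dict[str, List[float]]

cube: Cube = {
    "2020-12-31": [1.0, 2.0],
    "2021-01-01": [3.0, 4.0],
    "2021-01-02": [5.0, 2.0],
    "2021-02-01": [7.0, 8.0],
}


def subset(c: Cube, year: str) -> Cube:
    """Datacube access: keep only the time labels within the given year."""
    return {t: v for t, v in c.items() if t.startswith(year)}


def apply(c: Cube, fn: Callable[[float], float]) -> Cube:
    """Processing: compute a new value from every single value."""
    return {t: [fn(x) for x in v] for t, v in c.items()}


def reduce_time(c: Cube, reducer: Callable[[List[float]], float]) -> List[float]:
    """Processing: collapse the temporal dimension into one value per x."""
    return [reducer(list(column)) for column in zip(*c.values())]


def resample_monthly(c: Cube) -> Cube:
    """Processing: change the temporal layout from daily to monthly means."""
    months: Dict[str, List[List[float]]] = {}
    for t, v in c.items():
        months.setdefault(t[:7], []).append(v)
    return {m: [mean(col) for col in zip(*rows)] for m, rows in months.items()}


in_2021 = subset(cube, "2021")
print(sorted(in_2021))                                  # ['2021-01-01', '2021-01-02', '2021-02-01']
print(apply(in_2021, lambda x: x * 10)["2021-01-01"])   # [30.0, 40.0]
print(reduce_time(in_2021, max))                        # [7.0, 8.0]
print(resample_monthly(in_2021))                        # {'2021-01': [4.0, 3.0], '2021-02': [7.0, 8.0]}
```

Note that `reduce_time` returns a result without the temporal dimension, whereas `subset` and `resample_monthly` keep it.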

Comparison

Coverages

  • A coverage is a function which returns values from its range for a direct position within its domain, where the meaning of range and domain follow the usual definitions for a mathematical function. In practice, a data cube is more or less the same as a coverage, depending on the definition of a data cube. The concept of a coverage is agnostic of the mechanisms to generate, observe/measure, store or access data.

  • The domain of the coverage is made up of all dimensions where the coverage function can return a value (spatial, temporal, pressure levels...). Extra dimensions can be used beyond spatial and temporal, as long as the field values have a homogeneous value along the dimension (e.g., the frequency of hyperspectral could be considered a dimension).

  • The individual values of the range can consist of one or more fields, which are the observed/measured properties at each position within the domain. The different fields are not considered a dimension.

  • The range set is the set of values within the range (the actual values, which can take the form of a multidimensional array in a gridded coverage)

  • The range type describes what kind of information is contained in each field of the range of the coverage.

  • The domain set is the description of the domain of the coverage, which in the case of irregular gridded coverages and non-gridded coverages (e.g., point clouds) would contain the set of coordinates where values are available.

  • In coverages, there are two types of dimensional subsetting / filtering: slicing, which removes the dimension on which the slicing occurs; and trimming, which preserves the dimensionality of the output coverage.

  • There is the concept of "range subsetting" (called "field selection" in OGC API - Coverages), which can return a subset of the available fields.

  • Other operations, such as aggregation on one or more dimensions of the domain and down/upsampling of the coverage, can also be performed on a coverage.
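To make the difference between trimming, slicing and range subsetting concrete, here is a small sketch on a toy gridded coverage. The data layout (a dictionary of axes and a dictionary of fields) and the function names are assumptions for this example only and do not reflect the OGC API - Coverages interface.

```python
from typing import Dict, List

# Toy gridded coverage: domain axes (t, x); each range field is a t-by-x grid.
axes: Dict[str, List] = {"t": ["2021-01", "2021-02", "2021-03"], "x": [0, 10]}
fields: Dict[str, List[List[float]]] = {
    "red": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    "nir": [[9.0, 8.0], [7.0, 6.0], [5.0, 4.0]],
}


def trim_t(axes, fields, low, high):
    """Trimming: keep t labels in [low, high]; the t axis remains."""
    keep = [i for i, t in enumerate(axes["t"]) if low <= t <= high]
    new_axes = {"t": [axes["t"][i] for i in keep], "x": axes["x"]}
    new_fields = {name: [grid[i] for i in keep] for name, grid in fields.items()}
    return new_axes, new_fields


def slice_t(axes, fields, at):
    """Slicing: pick one t label; the t axis is removed from the result."""
    i = axes["t"].index(at)
    return {"x": axes["x"]}, {name: grid[i] for name, grid in fields.items()}


def select_fields(fields, names):
    """Range subsetting (field selection): keep only the named fields."""
    return {name: fields[name] for name in names}


trim_axes, trim_fields = trim_t(axes, fields, "2021-02", "2021-03")
slice_axes, slice_fields = slice_t(axes, fields, "2021-02")

print(list(trim_axes))                        # ['t', 'x']
print(list(slice_axes))                       # ['x']
print(slice_fields["red"])                    # [3.0, 4.0]
print(list(select_fields(fields, ["nir"])))   # ['nir']
```

Field selection leaves the domain untouched, which matches the point above that the fields are not considered a dimension.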

Further information: Open-EO/openeo-api#502

openEO

  • Subsetting is called filtering

xarray

A datacube as described here is closely related to the concept of a single xarray DataArray.

netCDF

A datacube is comparable to a netCDF variable with its dimensions.

Raster file formats

  • A datacube can be created from raster file formats such as GeoTIFF, but cannot always be exported to such file formats.
    GeoTIFFs, for example, can usually only represent x, y and a third dimension (usually bands).

strobpr commented Aug 16, 2024

Based on what is discussed here and previously in https://gitlab.ogc.org/ogc/T20-GDC/-/issues/14 and #502, I started wondering whether finding a single definition for all these different incarnations of (geo)datacubes is possible at all. Maybe the only commonality is that a ‘(geospatial) Datacube’ stands for the desire to render a multitude of (geospatial) data interoperable and to organise them such that working with them as an ensemble is more efficient than working with them individually. This is of course too vague to build a good definition on, one that could help to distinguish what is considered in and what is out. Settling for that kind of loose agreement would mean renegotiating the term each time a concrete project is started (as seems to be the case here). This does not sound very efficient either.

A possible way out could be to understand ‘(geo)datacubing’ as a process with several stages which render (geospatial) data increasingly organised and interoperable, thus enhancing the efficiency of dealing with them. Below is what that could look like (6 stages only because of the analogy to cube faces). I would hope that agreeing on certain ‘datacube stages’ might be easier than reserving the name for just one stage or for everything from a specific stage onwards.

Curious to hear other opinions, maybe it's just too hot an August afternoon here.

| Stage | Description | Notes |
|-------|-------------|-------|
| 1 | Multitude of data which have sufficient metadata to allow ordering them along certain dimensions | |
| 2 | Multitude of data which have declared dimensions to which all single data items can be referenced | |
| 3 | Multitude of data which are referenced to more than one standardized dimension (one of them being a geospatial domain) | At this stage, we have essentially a point cloud in an established CRS |
| 4 | Multitude of data block-wise co-registered (aligned or binned) along at least one identified standardised dimension with all blocks sharing a common geospatial range | This stage marks the forming of layers or coverages which can be ordered and show a geospatial overlap |
| 5 | All layers are co-gridded to a regular grid system | At this stage, all data are organized in layers sharing the same grid or grid system (Q: Are the layers supposed to be gap-free?) |
| 6 | All layers have homologous discretisation (‘gridding’) along all their declared dimensions | At this final (ideal?) stage, the dimensions follow the same algorithmic set of rules, so that operations can equally be applied across all dimensions or domains |

Applicable definitions:
Data
Value and (usually) uncertainty of a trait of a specific entity

Dimension
direction or aspect in which a trait can vary or be measured (a single type domain)

Domain
n-dimensional space created by individual dimensions

Standardised dimension
Dimension with a standardised (ISO, OGC, SI) reference system (Q: needs to have an axis?)

Layer
A multitude of data in which all items share at least one metadata value (e.g. being on the earth surface or constant elevation)

Value
state of a trait within a class or type (domain)
