NetCDF Subsetting Investigation for MARVL Use
MARVL developers would like to be able to access a service that will allow them to download NetCDF data within a temporal/spatial range. In particular they would like to be able to access this subset of the collection data using OPeNDAP or as a subsetted NetCDF file.
They would like collections that support this service to list this service as being available in the collection metadata.
Note that this document considers horizontal spatial subsetting only. Vertical subsetting is not considered.
IMOS NetCDF data can be grouped into types with similar characteristics, as follows (using the NODC Feature Types definitions):
Data Type | Definition (From NODC Feature Types) | IMOS Data |
---|---|---|
grid | Data represented or projected on a regular or irregular grid | ACORN, SRS, Ocean Current GSLA |
trajectory | A series of data points along a path through space with monotonically increasing times. | AUV, SOOP TRV, SOOP TMV, SOOP SST, SOOP FRRF, SOOP CO2, SOOP BA, SOOP ASF (Flux product), ANFOG |
profile | An ordered set of data points along a vertical line at a fixed horizontal position and fixed time | Argo, SOOP XBT |
trajectoryProfile | A series of profile features located at points ordered along a trajectory. | AATAMS SATTAG*, SOOP BA (Argo and SOOP XBT could be this as well). |
timeseries | A series of data points at the same spatial location with monotonically increasing times | FAIMMS, ANMN (Temperature, CTD, CO2), ABOS SOFS (Surface waves)**, SRS OC RAD** |
timeseriesProfile | A series of profile features at the same horizontal position with monotonically increasing times. | ANMN (Velocity), ABOS SOTS, ABOS SOFS (Surface properties), SRS OC BODAW* |
* stored in continuous ragged and/or indexed ragged format - harder to subset!
** data point is vector e.g. DISP[TIME][NUM_SPECTRUM], lsky[TIME][WAVELENGTH_lsky]
Not all IMOS data is accessible in NetCDF files, for example the AATAMS Biologging Penguin, Shearwater and SOOP Aus CPR collections. In most cases, they still fit into the feature type categories described above, but won't be accessible using a NetCDF based solution unless we make them available as such.
We currently support subsetting and aggregation of gridded data sets using gogoduck and aodaac. The NetCDF Subset Service, OPeNDAP service and NetCDF library routines described below also support subsetting of gridded data, but won't be examined for this purpose as we already have services which do this and more.
Non-gridded data is currently accessible for download via the portal in unsubsetted format via the BODAAC service. Only files containing data meeting the requested spatial/temporal subsetting criteria are returned, but data in the file which does not meet the criteria is not removed.
The initial advice from the MARVL developers was that Ramadda could perform the desired subsetting of NetCDF data. For this reason, some time was spent looking at Ramadda's subsetting options, with a view to understanding how different data types could be subsetted and what lower-level APIs might be available for re-use.
The result of this analysis was that Ramadda can actually only subset IMOS gridded data and IMOS timeseries data that has been modified to look like gridded data.
According to Seb, the supported timeseries data had previously been modified to look like gridded data so that it could be downloaded through Ramadda in this way. I'm not sure this follows the established (CF) conventions for timeseries data; an approach that does would be preferable.
Point subsetting options were also presented for ANFOG, SOOP SST, SOOP TRV, SOOP ASF, AATAMS SATTAG and ABOS ASFS datasets but did not work (they are actually trajectory datasets).
It was later confirmed with the MARVL developers that the subsetting options tested/used were for gridded data.
Note that Ramadda uses the NetCDF Java library, and in particular its support for the Scientific Feature Types described below, to provide subsetting functionality, although it uses an earlier version than the one now available.
The option discussed with the MARVL developers was to use GeoServer to return OPeNDAP URLs that subset each variable in the dataset using OPeNDAP Constraint Expressions matching the requested spatial/temporal parameters.
To create this OPeNDAP subset URL, the base OPeNDAP URL for the dataset needs to be determined, the variables to be subsetted need to be identified, the shape (dimensions) of each variable needs to be determined, and the index ranges satisfying the temporal/spatial criteria need to be calculated.
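The steps above can be sketched roughly as follows. This is only an illustration of assembling a constraint expression in the DAP2 hyperslab form `VAR[start:stride:stop]`; the function names, the `example.org` URL and the `TIME`/`TEMP` variables are hypothetical, and the index ranges are assumed to have been calculated already.

```python
def index_slice(start, stop, stride=1):
    """Format one DAP2 hyperslab such as [120:1:480] (stop is inclusive)."""
    return f"[{start}:{stride}:{stop}]"


def build_subset_url(base_url, variables, index_ranges):
    """Assemble an OPeNDAP subset URL.

    variables:    dict mapping variable name -> tuple of its dimension names
    index_ranges: dict mapping dimension name -> (start, stop), inclusive
    """
    projections = []
    for name, dims in variables.items():
        # One hyperslab per dimension, in the variable's dimension order.
        slabs = "".join(index_slice(*index_ranges[d]) for d in dims)
        projections.append(name + slabs)
    return base_url + "?" + ",".join(projections)


# Hypothetical dataset and variables, for illustration only.
url = build_subset_url(
    "http://example.org/thredds/dodsC/IMOS/example.nc",
    {"TIME": ("TIME",), "TEMP": ("TIME",)},
    {"TIME": (120, 480)},
)
print(url)
```

Every variable that shares a dimension gets the same index range for that dimension, which is why the shape of each variable has to be known before the URL can be built.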
For profile data, files will be returned containing only data for the required spatial/temporal range.
For current IMOS timeseries data, files will be returned containing only data for the required spatial range. The index range to be applied to the TIME dimension of the variables can be determined by examining the TIME variable values. This requires either access to the dataset via OPeNDAP at the time the URL is created, or harvesting this information when the dataset is harvested and looking it up at the time the URL is created.
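As a minimal sketch of the TIME lookup, assuming the TIME values have already been read (via OPeNDAP or from harvested information), are monotonically increasing, and are expressed in the same units as the requested range, a binary search gives the inclusive index range. The function name and sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def time_index_range(time_values, t_start, t_end):
    """Return inclusive (first, last) indices where t_start <= TIME <= t_end.

    Returns None when no TIME value falls inside the requested range.
    Assumes time_values is sorted in increasing order.
    """
    first = bisect_left(time_values, t_start)
    last = bisect_right(time_values, t_end) - 1
    if first > last:
        return None
    return first, last


# Hypothetical TIME values, in the units of the TIME variable.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(time_index_range(times, 0.75, 2.0))  # (2, 4)
```

The resulting pair can be used directly as the TIME index range in the constraint expression.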
For trajectory data or trajectory profile data, it is not possible to create a single OPeNDAP subset URL to subset the dataset for the required spatial/temporal range except in very specific circumstances.
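To illustrate why a single index range is usually not enough, the sketch below finds the contiguous runs of trajectory points that fall inside a bounding box. The function name and the sample coordinates are made up for illustration. A platform that leaves the box and re-enters it produces more than one run, and one hyperslab cannot select those runs without also selecting the points between them.

```python
def matching_index_runs(lats, lons, south, north, west, east):
    """Return a list of inclusive (start, stop) index runs inside the box."""
    runs = []
    run_start = None
    for i, (lat, lon) in enumerate(zip(lats, lons)):
        inside = south <= lat <= north and west <= lon <= east
        if inside and run_start is None:
            run_start = i  # entering the box
        elif not inside and run_start is not None:
            runs.append((run_start, i - 1))  # just left the box
            run_start = None
    if run_start is not None:
        # Trajectory ends while still inside the box.
        runs.append((run_start, len(lats) - 1))
    return runs


# Hypothetical trajectory that leaves the box and comes back.
lats = [-43.0, -42.5, -42.0, -41.5, -42.0, -42.4]
lons = [147.0, 147.5, 148.0, 148.5, 148.0, 147.6]
runs = matching_index_runs(lats, lons, south=-42.6, north=-41.9, west=147.4, east=148.1)
print(runs)  # [(1, 2), (4, 5)]
```

Only when the result is a single run (the "very specific circumstances") can the subset be expressed as one OPeNDAP subset URL.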
This option does not support trajectory data sets, is not extensible to other common data types, provides only low-level access to data, is specific to current IMOS data, and is not convention based (e.g. CF conventions).
The NetCDF Subset Service (NCSS) is a web service for subsetting CDM scientific datasets. The subsetting is specified using earth coordinates, such as lat/lon or projection coordinates bounding boxes and date ranges, rather than index ranges that refer to the underlying data arrays. The data arrays are subsetted but not resampled or reprojected, and preserve the resolution and accuracy of the original dataset.
Example URL:
http://thredds.ucar.edu/thredds/ncss/grib/NCEP/GFS/Global_0p5deg/best?north=47.0126&west=-114.841&east=-112.641&south=44.8534&time_start=present&time_duration=PT3H&accept=netcdf&var=v-component_of_wind_height_above_ground,u-component_of_wind_height_above_ground
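A request like the example above could be assembled programmatically along these lines. This is a sketch only: `build_ncss_url` is a hypothetical helper, and it uses just the query parameters that appear in the example URL (`north`, `west`, `east`, `south`, `time_start`, `time_duration`, `accept`, `var`).

```python
from urllib.parse import urlencode


def build_ncss_url(dataset_url, variables, north, west, east, south,
                   time_start, time_duration, accept="netcdf"):
    """Build a NetCDF Subset Service request using earth coordinates."""
    params = [
        ("north", north),
        ("west", west),
        ("east", east),
        ("south", south),
        ("time_start", time_start),
        ("time_duration", time_duration),
        ("accept", accept),
        ("var", ",".join(variables)),
    ]
    # Keep commas and colons readable rather than percent-encoding them.
    return dataset_url + "?" + urlencode(params, safe=",:")


ncss_url = build_ncss_url(
    "http://thredds.ucar.edu/thredds/ncss/grib/NCEP/GFS/Global_0p5deg/best",
    ["v-component_of_wind_height_above_ground",
     "u-component_of_wind_height_above_ground"],
    north=47.0126, west=-114.841, east=-112.641, south=44.8534,
    time_start="present", time_duration="PT3H",
)
print(ncss_url)
```

Unlike the OPeNDAP approach, nothing here depends on array indices or on the shape of the variables.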
The NetCDF Subset Service works with NetCDF files that follow known conventions, in particular conventions that can be used to identify the type of data contained in the dataset and how it can be subsetted.
For example, conventions for CF Features and Feature Types can be used by the NetCDF Subset Service for supported feature types.
The NetCDF Subset Service is new functionality that Unidata is introducing. It is currently available as a configurable service in Thredds 4.3 for gridded data, where it supports subsetting of a grid (returned as NetCDF) and extraction of a point from a grid (returned as NetCDF, CSV or XML).
According to the documentation, subsetting of grids, stations and points will be included in Thredds 4.5, which is currently in beta release. Stations here refer to timeseries data. It's unclear whether point data refers to just points or also includes profiles, trajectories and trajectory profiles, as I was unable to get this functionality working in a downloaded version.
The NetCDF Subset Service uses an underlying Feature Dataset API made available in the NetCDF Java tools library. This API includes profile and trajectory Feature Type access, but it's unclear whether this is still experimental.
Currently, no IMOS trajectory/timeseries datasets follow the CF conventions required for them to be used by this service, but I was able to use NcML to change the metadata for an ABOS dataset (timeseries) such that it could be subsetted and downloaded. This option warrants further investigation and monitoring of progress, as it would provide a more widely available, conventions-based subsetting option supported by Unidata. We should be looking at conforming to the required conventions to enable this regardless.
Until such time as we can make use of a more generic, conventions-based subsetting service such as the NetCDF Subset Service, we could provide our own. I'd suggest modelling this on the NetCDF Subset Service itself. This would be targeted at current IMOS datasets. GeoServer could then return the IMOS Subset Service URLs for performing the subset. The portal could use this service to return subsetted NetCDF files instead of unsubsetted ones.
All of the options above map the functionality we provide through the portal via our database to a NetCDF file based subsetting approach. This creates a mismatch in terms of the subsetting capability we can provide for the NetCDF output option. We also can't currently return NetCDF subset files for data that isn't already in NetCDF format. We should really decide whether we are delivering data from the database or from the NetCDF files, and then either write a NetCDF output format for GeoServer (or some other database-based service) or ensure all data is packaged as NetCDF so we can use NetCDF subsetting services on all our data.