Harmonization

Ana edited this page May 22, 2019 · 11 revisions

This worked example uses the same local dataset as section 2.1 (Dataset definition and loading local grid data). The dataset comes from the NCEP/NCAR Reanalysis 1, covers the period 1961-2010 over the Iberian Peninsula domain, and is distributed as a tar.gz file that can be downloaded and extracted into a local directory as follows:

download.file("http://meteo.unican.es/work/loadeR/data/Iberia_NCEP.tar.gz", 
              destfile = "mydirectory/Iberia_NCEP.tar.gz")
# Extract files from the tar.gz file
untar("mydirectory/Iberia_NCEP.tar.gz", exdir = "mydirectory")
# First, the path to the ncml file is defined:
ncep.local <- "mydirectory/Iberia_NCEP/Iberia_NCEP.ncml"

Direct data loading

Before loading the data, we run an inventory of the NcML file to identify the desired variable.

di <- dataInventory(ncep.local)

## [2016-02-17 20:02:39] Doing inventory ...
## [2016-02-17 20:02:39] Retrieving info for 'Z' (5 vars remaining)
## [2016-02-17 20:02:39] Retrieving info for 'T' (4 vars remaining)
## [2016-02-17 20:02:40] Retrieving info for 'Q' (3 vars remaining)
## [2016-02-17 20:02:40] Retrieving info for '2T' (2 vars remaining)
## [2016-02-17 20:02:40] Retrieving info for 'SLP' (1 vars remaining)
## [2016-02-17 20:02:40] Retrieving info for 'pr' (0 vars remaining)
## [2016-02-17 20:02:40] Done.

# e.g. temperature
str(di$`2T`)

## List of 4
## $ Description: chr "2m Temperature"
## $ DataType   : chr "float"
## $ Units      : chr "K"
## $ Dimensions :List of 4
##  ..$ time :List of 4
##  .. ..$ Type      : chr "Time"
##  .. ..$ TimeStep  : chr "1.0 days"
##  .. ..$ Units     : chr "days since 1950-01-01 00:00:00"
##  .. ..$ Date_range: chr "1961-01-01T00:00:00Z - 2010-12-31T00:00:00Z"
##  ..$ level:List of 3
##  .. ..$ Type  : chr "Height"
##  .. ..$ Units : chr "m"
##  .. ..$ Values: num 2
##  ..$ lat  :List of 3
##  .. ..$ Type  : chr "Lat"
##  .. ..$ Units : chr "degrees north"
##  .. ..$ Values: num [1:6] 35 37.5 40 42.5 45 47.5
##  ..$ lon  :List of 3
##  .. ..$ Type  : chr "Lon"
##  .. ..$ Units : chr "degrees east"
##  .. ..$ Values: num [1:9] -15 -12.5 -10 -7.5 -5 -2.5 0 2.5 5

The variable name for 2-meter air temperature in this dataset is "2T" and its units are Kelvin. We can load these data with `loadGridData` as follows:

```r
tas <- loadGridData(ncep.local,
                    var = "2T",
                    lonLim = c(-12, 5),
                    latLim = c(35, 45),
                    season = 6:8,
                    years = 1981:2000)

## [2019-05-22 09:37:23] Defining geo-location parameters
## [2019-05-22 09:37:23] Defining time selection parameters
## [2019-05-22 09:37:23] Retrieving data subset ...
## [2019-05-22 09:37:23] Done
```

Data loading with the dictionary file (harmonization)

The dictionary file is the tool used in loadeR to harmonize the variables according to the climate4R vocabulary:

C4R.vocabulary()

##    identifier                           standard_name   units
## 1        hurs               2-meter relative humidity       %
## 2     hursmax       maximum 2-meter relative humidity       %
## 3     hursmin       minimum 2-meter relative humidity       %
## 4         hus                       specific humidity kg.kg-1
## 5        huss               2-meter specific humidity kg.kg-1
## 6     hussmax       maximum 2-meter specific humidity kg.kg-1
## 7     hussmin       minimum 2-meter specific humidity kg.kg-1
## 8          lm                        land binary mask       1
## 9        orog                        surface altitude       m
## 10         ps           air pressure at surface level      Pa
## 11        psl               air pressure at sea level      Pa
## 12       rlds  surface downwelling longwave radiation   W.m-2
## 13       rlut              toa outgoing longwave flux   W.m-2
## 14       rlus  surface upwelling longwave flux in air   W.m-2
## 15       rsus surface upwelling shortwave flux in air   W.m-2
## 16       rsds surface downwelling shortwave radiation   W.m-2
## 17      sftlf                      land area fraction       1
## 18         ta                         air temperature    degC
## 19        tas                 2-meter air temperature    degC
## 20     tasmax             maximum 2-m air temperature    degC
## 21     tasmin             minimum 2-m air temperature    degC
## 22       tdps            2-meter dewpoint temperature    degC
## 23         ts                     surface_temperature    degC
## 24         pr              total precipitation amount      mm
## 25        prr                   total rainfall amount      mm
## 26       prsn                   total snowfall amount      mm
## 27         ua                           eastward wind   m.s-1
## 28        uas              eastward near-surface wind   m.s-1
## 29         va                          northward wind   m.s-1
## 30        vas             northward near-surface wind   m.s-1
## 31        wss                 near-surface wind speed   m.s-1
## 32     wssmax         maximum near-surface wind speed   m.s-1
## 33        wsg                      wind speed of gust   m.s-1
## 34     wsgmax              maximum wind speed of gust   m.s-1
## 35          z                            geopotential  m2.s-2
## 36         zg                     geopotential height       m
## 37         zs                    surface geopotential  m2.s-2
## 38        zgs             surface geopotential height       m

The dictionary maps each standard name in the climate4R vocabulary to the native name used in the dataset. In this example, the dictionary file (Iberia_NCEP.dic) is included in the tar.gz. In general, users create the dictionary file themselves for each dataset.

dictionary <- "mydirectory/Iberia_NCEP/Iberia_NCEP.dic"
read.table(dictionary, header = TRUE, sep = ",")

##   identifier short_name time_step lower_time_bound upper_time_bound aggr_fun  offset        scale deaccum
## 1        hus          Q       24h                0               24     mean    0.00    1.0000000       0
## 2        psl        SLP       24h                0               24     mean    0.00    1.0000000       0
## 3         ta          T       24h                0               24     mean -273.15    1.0000000       0
## 4          z          Z       24h                0               24     mean    0.00    0.1020408       0
## 5        tas         2T       24h                0               24     mean -273.15    1.0000000       0
## 6         pr         pr       24h                0               24      sum    0.00 1000.0000000       0
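Since the dictionary is a plain comma-separated text file, a user-defined one can be written with base R. The following is a minimal sketch, assuming the same column layout as the table printed above. The file goes to a temporary path (a hypothetical location chosen for illustration), and only the "tas" row of the example is reproduced.

```r
# Minimal sketch: write a one-row dictionary and read it back.
# The temporary file path is hypothetical; the row copies the "tas" entry shown above.
my.dic <- data.frame(identifier = "tas",
                     short_name = "2T",
                     time_step = "24h",
                     lower_time_bound = 0,
                     upper_time_bound = 24,
                     aggr_fun = "mean",
                     offset = -273.15,
                     scale = 1,
                     deaccum = 0,
                     stringsAsFactors = FALSE)
dic.file <- tempfile(fileext = ".dic")
write.table(my.dic, file = dic.file, sep = ",", row.names = FALSE, quote = FALSE)

# Read it back in the same way as the example dictionary
dic.check <- read.table(dic.file, header = TRUE, sep = ",", stringsAsFactors = FALSE)
print(dic.check)
```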

Variable name

When loading data with the function loadGridData, the dataset-specific variables are translated (and transformed if necessary) into the common vocabulary by means of a dictionary if the argument dictionary = TRUE is specified. The function performs all the transformations needed to return the standard variables, as defined in the vocabulary. Thus, thanks to the dictionary, users do not need to worry about the specific variable names and units used in each dataset, as long as the identifier is compliant with the climate4R vocabulary.

Next, we illustrate a simple example of the use of the dictionary file.

The standard variable name for 2-meter air temperature is "tas".

tas2 <- loadGridData(ncep.local, var = "tas", dictionary = TRUE)

## [2019-05-22 09:57:34] Defining harmonization parameters for variable "tas"
## [2019-05-22 09:57:34] Defining geo-location parameters
## [2019-05-22 09:57:34] Defining time selection parameters
## [2019-05-22 09:57:34] Retrieving data subset ...
## [2019-05-22 09:57:35] Done

The NCEP dataset uses the variable name "2T" for 2-meter air temperature. As a result, if we use the standard name "tas" to load the data without a dictionary, the function will return an error:

tas3 <- try(loadGridData(ncep.local, var = "tas", dictionary = FALSE))
# Returns the error message:
## Error in loadGridData("mydirectory/Iberia_NCEP/Iberia_NCEP.ncml",  : 
##  Variable requested not found
##  Check 'dataInventory' output and/or dictionary 'identifier'.
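The name translation can be pictured as a simple table lookup. The sketch below is illustrative only and is not loadeR's internal code: `nativeName` is a hypothetical helper, and `dic` reproduces just the identifier and short_name columns of the example dictionary.

```r
# Illustrative lookup only; this is not how loadeR is implemented internally.
# 'dic' copies two columns of the example dictionary shown earlier.
dic <- data.frame(identifier = c("hus", "psl", "ta", "z", "tas", "pr"),
                  short_name = c("Q", "SLP", "T", "Z", "2T", "pr"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: translate a standard identifier into the native name
nativeName <- function(var, dic) {
    ind <- match(var, dic$identifier)
    if (is.na(ind)) {
        stop("Variable requested not found in the dictionary 'identifier' column")
    }
    dic$short_name[ind]
}

nativeName("tas", dic)
```

An identifier that is absent from the dictionary stops with an error, which mirrors the behaviour seen above when the standard name is requested without a dictionary.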

Variable units

Another useful feature of the dictionary is on-the-fly unit transformation. Since the standard units for "tas" are degrees Celsius, in this example, the dictionary also transforms Kelvin into degC. This is done by the offset parameter that is set in the dictionary file (in this case -273.15).
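The arithmetic behind this conversion can be sketched in base R. This is a minimal illustration, not loadeR's internal code: it assumes the harmonized value is computed as `x * scale + offset` (the order of the two operations is an assumption, and it makes no difference for the "tas" row, where scale is 1). The Kelvin values are made up.

```r
# Sketch of the unit conversion driven by the dictionary (not loadeR's own code).
# Assumption: harmonized value = original * scale + offset
temp.k <- c(273.15, 288.15, 300)  # made-up 2T values in Kelvin
dic.offset <- -273.15             # 'offset' of the "tas" row in the example dictionary
dic.scale <- 1                    # 'scale' of the "tas" row
temp.degC <- temp.k * dic.scale + dic.offset
print(temp.degC)
```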

Note the differences in the attributes of objects tas and tas2 regarding variable names and units:

str(tas$Variable)

## List of 2
##  $ varName: chr "2T"
##  $ level  : num 2
##  - attr(*, "use_dictionary")= logi FALSE
##  - attr(*, "description")= chr "2m Temperature"
##  - attr(*, "units")= chr "K"
##  - attr(*, "longname")= chr "2T"
##  - attr(*, "daily_agg_cellfun")= chr "none"
##  - attr(*, "monthly_agg_cellfun")= chr "none"
##  - attr(*, "verification_time")= chr "none"

str(tas2$Variable)

## List of 2
##  $ varName: chr "tas"
##  $ level  : num 2
##  - attr(*, "use_dictionary")= logi TRUE
##  - attr(*, "description")= chr "2m Temperature"
##  - attr(*, "units")= chr "degC"
##  - attr(*, "longname")= chr "2-meter air temperature"
##  - attr(*, "daily_agg_cellfun")= chr "none"
##  - attr(*, "monthly_agg_cellfun")= chr "none"
##  - attr(*, "verification_time")= chr "none"

NOTE: more advanced features for unit handling and conversion after data loading are available through the climate4R package convertR.


Other parameters

As shown in the example, there are other parameters which define the temporal characteristics of the data and other conversion operations needed to obtain the final data according to the user's needs. The following parameters need to be included in the .dic file:

  • identifier: this is the name of the standard variable, as defined in the vocabulary
  • short_name: this is the name with which the original variable has been coded in the dataset
  • time_step: time scale of the data. For instance, 24h (for daily data), 3h ...
  • lower_time_bound and upper_time_bound: temporal range of the data. These parameters indicate the lower and upper bounds of the time interval that each value represents. For instance, instantaneous variables have identical lower and upper bounds, while a value representative of a daily amount (e.g. total accumulated precipitation in 24 h, or mean daily temperature) has the bounds of the interval to which the value applies (e.g. from 00:00 of day 1 to 00:00 of day 2). The interval is closed on the left and open on the right.
  • cell_method: time-aggregation function applied between the lower and upper time bounds. For instance, its value is "none" for instantaneous variables, "mean" for mean daily temperatures or "sum" for daily precipitation values. In the example dictionary above, the aggregation function appears in the aggr_fun column.
  • offset: constant added to the original variable for unit conversion (e.g. offset = -273.15 for conversion from Kelvin to degrees Celsius)
  • scale: factor by which the original variable is multiplied for unit conversion (e.g. scale = 1000 for conversion from m to mm)
  • deaccum: logical flag (0 = FALSE, 1 = TRUE) indicating whether the variable should be de-accumulated at each time step. For cumulative variables (e.g. precipitation), obtaining the value for a particular period sometimes requires subtracting two consecutive values; setting deaccum = 1 handles this case. It is typically applied to precipitation in some forecast datasets.
  • derived: this value is used internally by the loading functions to know whether the variable is derived from other variable(s) or can be read directly from the dataset.
  • interface: this is an internal value used by the loading functions.
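The de-accumulation controlled by deaccum can be illustrated with base R. This is only a sketch with made-up running totals. How loadeR treats the first time step is not stated here, so the example simply returns the differences between consecutive values.

```r
# Made-up running totals of precipitation (mm) at consecutive time steps
pr.accum <- c(0, 1.5, 1.5, 4, 10)

# Subtract each value from the next one to recover the amount per interval
pr.step <- diff(pr.accum)
print(pr.step)
```

The result has one element fewer than the input, because each de-accumulated value needs a pair of consecutive totals.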

Note that all the fields above need to be included in the dictionary file. Their ordering is not important, as long as their names are preserved.
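A quick way to check a user-made dictionary against this list is to compare column names, which by construction ignores their order. The sketch below is an assumption-laden illustration: `missingFields` is a hypothetical helper, the field names are copied from the list above, and the one-row dictionary is made up.

```r
# Field names copied from the list above (assumed to be the expected column names)
required.fields <- c("identifier", "short_name", "time_step",
                     "lower_time_bound", "upper_time_bound", "cell_method",
                     "offset", "scale", "deaccum", "derived", "interface")

# Hypothetical helper: names of required fields that are absent from a dictionary
missingFields <- function(dic, required = required.fields) {
    setdiff(required, names(dic))
}

# Made-up dictionary with columns in a different order and two fields left out
dic <- data.frame(short_name = "2T", identifier = "tas", scale = 1, offset = -273.15,
                  time_step = "24h", upper_time_bound = 24, lower_time_bound = 0,
                  cell_method = "mean", deaccum = 0, stringsAsFactors = FALSE)
missingFields(dic)
```

Here the helper reports the two omitted fields, while the shuffled column order has no effect on the result.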



print(sessionInfo())

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS

## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

## Random number generation:
##  RNG:     Mersenne-Twister 
##  Normal:  Inversion 
##  Sample:  Rounding 
##  
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8          LC_NUMERIC=C                  LC_TIME=es_ES.UTF-8          
##  [4] LC_COLLATE=es_ES.UTF-8        LC_MONETARY=es_ES.UTF-8       LC_MESSAGES=es_ES.UTF-8      
##  [7] LC_PAPER=es_ES.UTF-8          LC_NAME=es_ES.UTF-8           LC_ADDRESS=es_ES.UTF-8       
## [10] LC_TELEPHONE=es_ES.UTF-8      LC_MEASUREMENT=es_ES.UTF-8    LC_IDENTIFICATION=es_ES.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] visualizeR_1.3.2  transformeR_1.4.8 loadeR_1.4.12     loadeR.java_1.1.1 rJava_0.9-11     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1              compiler_3.6.0          RColorBrewer_1.1-2      bitops_1.0-6           
##  [5] tools_3.6.0             boot_1.3-20             dotCall64_1.0-0         vioplot_0.3.0          
##  [9] lattice_0.20-38         Matrix_1.2-17           parallel_3.6.0          spam_2.2-2             
## [13] akima_0.6-2             padr_0.4.2              raster_2.9-5            mapplots_1.5.1         
## [17] fields_9.8-1            maps_3.3.0              grid_3.6.0              data.table_1.12.2      
## [21] dtw_1.20-1              pbapply_1.4-0           tcltk_3.6.0             sm_2.2-5.6             
## [25] SpecsVerification_0.5-2 sp_1.3-1                latticeExtra_0.6-28     magrittr_1.5           
## [29] scales_1.0.0            codetools_0.2-16        CircStats_0.2-6         MASS_7.3-51.1          
## [33] abind_1.4-5             colorspace_1.4-1        proxy_0.4-23            munsell_0.5.0          
## [37] RCurl_1.95-4.12         verification_1.42       easyVerification_0.4.4  RcppEigen_0.3.3.5.0    
## [41] zoo_1.8-5