D-vine copula based postprocessing (DVQR) scripts for the ESSD benchmark.
The code is provided as supplementary material with
- Demaeyer, J., Bhend, J., Lerch, S., Primo, C., Van Schaeybroeck, B., Atencia, A., Ben Bouallègue, Z., Chen, J., Dabernig, M., Evans, G., Faganeli Pucer, J., Hooper, B., Horat, N., Jobst, D., Merše, J., Mlakar, P., Möller, A., Mestre, O., Taillardat, M., and Vannitsem, S.: The EUPPBench postprocessing benchmark dataset v1.0, Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023, 2023.
Please cite this article if you use (a part of) this code for a publication.
In the D-vine (drawable vine) copula based postprocessing, a multivariate conditional copula
$$F^{-1}{y\vert x_1, \ldots, x_m}(\alpha\vert x_1(t), \ldots, x_m(t)):=F_y^{-1}\left(C^{-1}(\alpha\vert F{x_1}(x_1(t)),\ldots, F_{x_m}(x_m(t)))\right),$$
where
The whole method is implemented in the programming language R.
- To get the ESSD benchmark data set you can use the download script.
- To construct data frames from the NetCDF files of the ESSD benchmark, you can use the the R-script
benchmarkdata.R
. With this script, you receivebenchmark_t2m_train_ext.Rdata
, which denotes the training data andbenchmark_t2m_test_ext.Rdata
, which contains the test/validation data. These constructed data sets will be used in the ESSD benchmark for this method. To reveive the NetCDF files from the .Rdata files as final output, use thecreate_netcdf_output.R
file.
The following table describes variable names that are referred to in training as well as test/validation data:
Variable | Description |
---|---|
init_date |
Initialization date of the ensemble forecasts. |
valid_date |
Valid date of the ensemble forecasts. |
leadtime |
Lead time of the ensemble forecasts in hours. |
id |
Station id. |
name |
Station name. |
station_altitude |
Station altitude in meter. |
station_longitude |
Station longitude. |
station_latitude |
Station latitude. |
station_land_usage |
Station land usage. |
model_altitude |
Model altitude in meter. |
model_longitude |
Model longitude. |
model_latitude |
Model latitude. |
model_land_usage |
Model land usage. |
observation |
Observation of t2m in Kelvin. |
ctrl |
Control ensemble forecast of t2m in Kelvin. |
ensi |
|
mean |
Mean of the ensemble forecasts of t2m in Kelvin. |
q10 |
1st decile of the ensemble forecasts of t2m in Kelvin. |
median |
Median of the ensemble forecasts of t2m in Kelvin. |
q90 |
9th decile of the ensemble forecasts of t2m in Kelvin. |
sd |
Standard deviation of the ensemble forecasts of t2m in Kelvin. |
iqr |
Interquartil range of the ensemble forecasts of t2m in Kelvin. |
rand |
Random ensemble forecasts of the t2m ensemble forecasts in Kelvin. |
For the D-vine copula based postprocessing we applied the R-script dvine_pp.R
.
- A local postprocessing is selected, i.e. the locations are postprocessed (for each lead time) separately.
- A weekly training period is used to account for seasonality, i.e. a symmetric interval of
nweeks
around the week where the forecast day belongs to is chosen. For example, if you would like to forecast any day in the 8th week of the year and you setnweeks = 3
, you would take the data of the 5th, 6th, 7th, 8th, 9th, 10th, 11th week in the training datatdata
to estimate the model for the 8th week. This yields in total 52 estimated models for each station having a fixed lead time, i.e. for each week of the year you have one model (week 53 is set to week 52). For our final postprocessing models we selectnweeks = 2
, which is determined by the in-sample mean CRPS on the training data over all locations andleadtimes
. Consequently we have for each week, each location and each lead time a model which is used to produce the postprocessed quantile ensemble forecasts in the validation datavdata
. - The t2m observation is selected as response variable for the training data, i.e.
tresponse = "observation"
. - The control and mean ensemble forecast of t2m is provided as predictor variable for the training and validation data, i.e.
tpredictors = c("ctrl", "mean")
andvpredictors = c("ctrl", "mean")
, respectively. - All copula families in the R-package vinereg are available, i.e.
family_set = "all"
. - As forward covariate selection criterion the AIC-corrected conditional log-likelihood is selected, i.e.
selcrit = "aic"
. - The order of the covariates in the D-vine is automatically determined by the algorithm itself, i.e.
order = NA
. ffd
denotes the first forecast date andlfd
the last forecast date.- The parameter
cores
specifies the number of cores used for code parallelization.
Remark: The computational effort of this method could be enormously lessened by e.g. reducing the
family_set
to a smaller subset of copulas or by the pre-selection of the variableorder
.
- vinereg: For the D-vine copula quantile regression proposed by Kraus and Czado, (2016).
- lubridate: For calculating the week in a year.
- doMC: For multi-core usage.
- ncdf4: For reading the NetCDF-files in R.