forked from EcoForecast/EF_Activities
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Exercise_06_StateSpace.Rmd
222 lines (173 loc) · 13.7 KB
/
Exercise_06_StateSpace.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
Activity 6 - State-space models
========================================================
This activity will explore the state-space framework for modeling time-series and spatial data sets. Chapter 8 provides a more in-depth description of the state-space model, but in a nutshell it is based on separating the process model, which describes how the system evolves in time or space, from the observation error model. Furthermore, the state-space model gets its name because the model estimates that true value of the underlying **latent** state variables.
For this activity we will write all the code, process all the data, and visualize all the outputs in R, but the core of the Bayesian computation will be handled by JAGS (Just Another Gibbs Sampler, http://mcmc-jags.sourceforge.net). Therefore, before we get started you will want to download both the JAGS software and the rjags library, which allows R to call JAGS. We're also going to install our `ecoforecastR` package, which has some helper functions we will use.
```{r}
library(rjags)
#library(rnoaa)
library(daymetr)
devtools::install_github("EcoForecast/ecoforecastR",force=TRUE)
```
Next we'll want to grab the data we want to analyze. For this example we'll use the Google Flu Trends data for the state of Massachusetts, which we saw how to pull directly off the web in Activity 3.
```{r}
gflu = read.csv("data/gflu_data.txt",skip=11)
time = as.Date(gflu$Date)
y = gflu$Massachusetts
plot(time,y,type='l',ylab="Flu Index",lwd=2,log='y')
```
Next we'll want to define the JAGS code, which we'll do by writing the code as a string in R. The code itself has three components, the data model, the process model, and the priors. The data model relates the observed data, y, at any time point to the latent variable, x. For this example we'll assume that the observation model just consists of Gaussian observation error. The process model relates the state of the system at one point in time to the state one time step ahead. In this case we'll start with the simplest possible process model, a random walk, which just consists of Gaussian process error centered around the current value of the system.
$$X_{t+1} \sim N(X_{t},\tau_{add})$$
Finally, for the priors we need to define priors for the initial condition, the process error, and the observation error.
```{r}
RandomWalk = "
model{
#### Data Model
for(t in 1:n){
y[t] ~ dnorm(x[t],tau_obs)
}
#### Process Model
for(t in 2:n){
x[t]~dnorm(x[t-1],tau_add)
}
#### Priors
x[1] ~ dnorm(x_ic,tau_ic)
tau_obs ~ dgamma(a_obs,r_obs)
tau_add ~ dgamma(a_add,r_add)
}
"
```
Next we need to define the data and priors as a list. For this analysis we'll work with the log of the Google flu index since the zero-bound on the index and the magnitudes of the changes appear much closer to a log-normal distribution than to a normal.
```{r}
data <- list(y=log(y),n=length(y),x_ic=log(1000),tau_ic=100,a_obs=1,r_obs=1,a_add=1,r_add=1)
```
Next we need to definite the initial state of the model's parameters for each chain in the MCMC. The overall initialization is stored as a list the same length as the number of chains, where each chain is passed a list of the initial values for each parameter. Unlike the definition of the priors, which had to be done independent of the data, the initialization of the MCMC is allowed (and even encouraged) to use the data. However, each chain should be started from different initial conditions. We handle this below by basing the initial conditions for each chain off of a different random sample of the original data.
```{r}
nchain = 3
init <- list()
for(i in 1:nchain){
y.samp = sample(y,length(y),replace=TRUE)
init[[i]] <- list(tau_add=1/var(diff(log(y.samp))),tau_obs=5/var(log(y.samp)))
}
```
Now that we've defined the model, the data, and the initialization, we need to send all this info to JAGS, which will return the JAGS model object.
```{r}
j.model <- jags.model (file = textConnection(RandomWalk),
data = data,
inits = init,
n.chains = 3)
```
Next, given the defined JAGS model, we'll want to take a few samples from the MCMC chain and assess when the model has converged. To take samples from the MCMC object we'll need to tell JAGS what variables to track and how many samples to take.
```{r, fig.asp = 1.0}
## burn-in
jags.out <- coda.samples (model = j.model,
variable.names = c("tau_add","tau_obs"),
n.iter = 1000)
plot(jags.out)
```
Here we see that the model converges rapidly. Since rjags returns the samples as a CODA object, we can use any of the diagnositics in the R *coda* library to test for convergence, summarize the output, or visualize the chains.
Now that the model has converged we'll want to take a much larger sample from the MCMC and include the full vector of X's in the output
```{r}
jags.out <- coda.samples (model = j.model,
variable.names = c("x","tau_add","tau_obs"),
n.iter = 10000)
```
Given the full joint posteror samples, we're next going to visualize the output by just looking at the 95% credible interval of the timeseries of X's and compare that to the observed Y's. To do so we'll convert the coda output into a matrix and then calculate the quantiles. Looking at colnames(out) will show you that the first two columns are `tau_add` and `tau_obs`, so we calculate the CI starting from the 3rd column. We also transform the samples back from the log domain to the linear domain.
```{r}
time.rng = c(1,length(time)) ## adjust to zoom in and out
out <- as.matrix(jags.out)
x.cols <- grep("^x",colnames(out)) ## grab all columns that start with the letter x
ci <- apply(exp(out[,x.cols]),2,quantile,c(0.025,0.5,0.975)) ## model was fit on log scale
plot(time,ci[2,],type='n',ylim=range(y,na.rm=TRUE),ylab="Flu Index",log='y',xlim=time[time.rng])
## adjust x-axis label to be monthly if zoomed
if(diff(time.rng) < 100){
axis.Date(1, at=seq(time[time.rng[1]],time[time.rng[2]],by='month'), format = "%Y-%m")
}
ecoforecastR::ciEnvelope(time,ci[1,],ci[3,],col=ecoforecastR::col.alpha("lightBlue",0.75))
points(time,y,pch="+",cex=0.5)
```
Next, lets look at the posterior distributions for `tau_add` and `tau_obs`, which we'll convert from precisions back into standard deviations.
```{r}
hist(1/sqrt(out[,1]),main=colnames(out)[1])
hist(1/sqrt(out[,2]),main=colnames(out)[2])
```
We'll also want to look at the joint distribution of the two parameters to check whether the two parameters strongly covary.
```{r, fig.asp = 1.0}
plot(out[,1],out[,2],pch=".",xlab=colnames(out)[1],ylab=colnames(out)[2])
cor(out[,1:2])
```
Assignment:
-----------
To explore the ability of state space models to generate forecasts (or in this case, a hindcast) remove the last 40 observations (convert to NA) and refit the model.
* Generate a time-series plot for the CI of x that includes the observations (as above but zoom the plot on the last ~80 observations). Use a different color and symbol to differentiate observations that were included in the model versus those that were converted to NA's.
* Comment on how well the random walk model performed (both accuracy and precision) and how it might be modified to improve both these criteria.
Extra Credit (Part 1):
----------------------
To look at how observation frequency affects data assimilation, convert 3 out of every 4 observations to NA (i.e. treat the data as approximately monthly) and refit the model.
* Generate a time-series plot for the CI of x that includes the observations (as above). Use a different color and symbol to differentiate observations that were included in the model versus those that were converted to NA's.
* Compare the CI between the two runs.
* Generate a predicted (median) vs observed plot for the data points that were removed
* Comment on the accuracy and precision of the state estimates.
* How does the reduction in data volume affect the parameter estimates (taus)
# Dynamic Linear Models
The random walk model can easily be generalized to more sophisiticated models describing the dynamics of the system. One simple but useful extension is the class of dynamic linear models (DLMs) -- linear models where the future state depends on the current state and other covariates, $z_t$
$$X_{t+1} \sim N(X_t + \beta_0 + \beta_1 z_t + \beta_{X} X_{t}, \tau_{add})$$
where $\beta_0$ is the intercept, $\beta_1$ is the slope of the covariate effect, and $\beta_{X}$ is the slope of the initial condition effect, expressed as a deviation from the random walk default (i.e. the actual slope is $1 + \beta_X$). Rather than implement this model in JAGS directly, we're going to rely on the ecoforecastR package, which accepts a `lm` like syntax for specifying covariates (with the notable exception that the response variable, which is our latent X, is not specified explictly). Here we're going to use the Daymet product to get daily weather estimates, and then use daily minimum temperature (Tmin) as the covariate in our influenza model
```{r}
## grab weather data
df <- daymetr::download_daymet(site = "Boston",
lat = 42.36,
lon = -71.06,
start = 2003,
end = 2016,
internal = TRUE)$data
df$date <- as.Date(paste(df$year,df$yday,sep = "-"),"%Y-%j")
data$Tmin = df$tmin..deg.c.[match(time,df$date)]
## fit the model
ef.out <- ecoforecastR::fit_dlm(model=list(obs="y",fixed="~ 1 + X + Tmin"),data)
names(ef.out)
```
The package returns a list with four elements. `params` and `predict` are both the same mcmc.list objects we get back from JAGS, only split between the parameters and the latent state variables, respectively, to make it easier to perform diagnostics and visualizations:
```{r, fig.asp = 1.0}
## parameter diagnostics
params <- window(ef.out$params,start=1000) ## remove burn-in
plot(params)
summary(params)
cor(as.matrix(params))
pairs(as.matrix(params))
## confidence interval
out <- as.matrix(ef.out$predict)
ci <- apply(exp(out),2,quantile,c(0.025,0.5,0.975))
plot(time,ci[2,],type='n',ylim=range(y,na.rm=TRUE),ylab="Flu Index",log='y',xlim=time[time.rng])
## adjust x-axis label to be monthly if zoomed
if(diff(time.rng) < 100){
axis.Date(1, at=seq(time[time.rng[1]],time[time.rng[2]],by='month'), format = "%Y-%m")
}
ecoforecastR::ciEnvelope(time,ci[1,],ci[3,],col=ecoforecastR::col.alpha("lightBlue",0.75))
points(time,y,pch="+",cex=0.5)
```
The JAGS model that was fit 'under the hood' is returned as `model` which we can view as:
```{r, echo=FALSE}
strsplit(ef.out$model,"\n",fixed = TRUE)[[1]]
```
This code illustrates a few things:
* The "Priors" section is identical to our earlier random walk model
* The "Random Effects" section, which is currently commented out, illustrates that the `ecoforcastR::fit_dlm` function supports random effects, which can be turned on via the `model$random` argument
* The "Fixed Effects" section contains additional priors for our fixed effects as well as priors on the means (mu) and precisions (tau) of the covariates.
* The "Data Model" section is the same as in our random walk except for the addition of code for the means of the covariates. This code is here as a very simple missing data model -- any time the covariate is observed it is used to estimate the mean and precision, but any time the covariate is missing (NA) it is imputed.
* The "Process Model" is very similar to the random walk, except now the expected value (mu) is calculated according to the linear model described earlier
Finally, the returned object also includes the `data` that was used to fit the model.
Part 2:
-----------
* Compare the process and observation error estimates and model CI between this fit and the original random walk model. How much has the residual variance been reduced by?
* Because a state-space model returns X's that are close to the Y's, metrics such as R2 and RMSE aren't great metrics of model performance. Besides looking at the taus, how else could we judge which model is doing better (in a way that avoids/penalizes overfitting)?
* Explain and discuss the parameter estimates (betas) from the linear model (what do they mean both biologically and in terms of the predictability of the system) and their correlations
Part 3:
-----------
Repeat the process of forecasting the last 40 observations (convert to NA), this time using the DLM with temperature as a covariate
* Generate a time-series plot for the CI that includes the observations and both the random walk and DLM models (Hint, think about the order you plot in so you can see both models, also consider including transpancy [alpha] in the CI color)
* Comment on how well the DLM model performed (both accuracy and precision) relative to the random walk and the true observations. How could the model be further improved?
# Next steps
Apply these modeling approaches to you own time-series data! As a simple place to start note that you can fit the basic Random Walk model using `fit_dlm` just by setting `fixed = ""`. Also, as with standard `lm` syntax, you can suppress the intercept by including -1 in fixed, specify interaction terms using multiplication (e.g. X * Tmin), and express polynomials both on X and on covariates (e.g. X^2 or Tmin^2). The latter allows you to construct models with stabalizing feedbacks, for example:
$$ N_{t+1} = N_t + rN_t \left( 1 + {{N_t}\over{K}} \right) = (1+r)N_t + {{r}\over{K}}N_t^2$$
can be expressed as fixed = "-1 + X + X^2" where $\beta_X = r$ and $\beta_{X^2} = r/K$.
Within the ecoforecastR package, the ParseFixed function (which is used by fit_dlm) can also construct text strings for process models, priors, and missing data models that can be inserted into other JAGS models, which allows you to easily construct non-Gaussian dynamic generalized linear mixed models (DGLMMs), data fusion models, or more complex nonlinear models.