-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path02-Software.qmd
366 lines (266 loc) · 13.7 KB
/
02-Software.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
# Software and packages
Bayesian analysis faces non-trivial mathematical problems, for example, solution of integrals for posterior summaries. Likely, the most popular and convenient way to approximate complex integrals is to use Markov Chain Monte Carlo (MCMC) algorithms. The Bayesian Inference Using Gibbs Sampling (BUGS) software introduced in 1989 was probably one of the first software for that purpose. Since then, other software and approaches emerged allowing an accessible (for example, as R package) and user-friendly (for example, using specific syntax) for researchers. In this section we provide an overview of available statistical software for Bayesian analysis.
## Available software
### BUGS
BUGS is "Bayesian Inference Using Gibbs Sampling". BUGS used a Gibbs Sampler or the Metropolis-Hastings algorithm for the simulation of posterior distributions and was likely the first accessible approach to Bayesian simulations for researchers. While WinBUGS was used in Windows systems, OpenBUGS was for Linux systems. BUGS is neither further developed nor maintained anymore.
### INLA
INLA is "Integrated Nested Laplace Approximation" ([https://www.r-inla.org/](https://www.r-inla.org/)). INLA uses latent Gaussian models for approximate Bayesian inference. INLA is an alternative to MCMC. Because of it's approximation approach it is much faster than MCMC.
INLA interfaces with R but is not on CRAN. It can be manually installed as described on [https://www.r-inla.org/](https://www.r-inla.org/).
### JAGS
JAGS is "Just Another Gibbs Sampler" written in C++. JAGS is a dialect of the BUGS language. JAGS interfaces with R using the package **rjags**.
### Stan
"Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation." ([https://mc-stan.org/](https://mc-stan.org/)) It uses a probabilistic programming language using MCMC sampling, approximate Bayesian inference and penalized maximum likelihood estimation.
STAN interfaces with R using the package **rstan**. Regression approaches are implemented in the packages **rstanarm** and **brms**.
## Evaluation
We evaluated point estimates, uncertainty intervals and computational time of the above mentioned software approaches using simulated survival data. We assumed for 200 observations an exponential time to event assuming a hazard ratio of 0.6.
```{r}
#| echo: true
#| message: false
#| warning: false
library(brms)
require(survival)
require(ggplot2)
library(splines)
library(tidyverse)
library(marginaleffects)
library(tinytable)
library(INLA)
library(rstanarm)
library(rjags)
library(runjags)
# Set seed
set.seed(1)
# Simulate survival data
n <- 200
hr <- 0.6
rando <- rbinom(n, size = 1, prob = 0.5)
cens <- 15*runif(n)
h <- .1*exp(log(hr)*(rando == 1))
dt <- -log(runif(n))/h
event <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
data <- data.frame(time = pmin(dt, cens), event, rando, id = 1:n)
```
The Kaplan-Meier plot is shown in the graph below
```{r}
#| echo: true
#| message: false
#| warning: false
# Set significance level
alpha <- 0.1
# Kaplan-Meier
mod_km <- survfit(Surv(time, event) ~ rando, data=data, conf.int = 1 - alpha)
plot(mod_km, main = "Kaplan-Meier plot")
```
### Poisson count process: Frequentist approach
We model the data in small pieces of follow-up time with a Poisson regression with offset, that is, we count the number of events in small follow-up time windows (i.e. event rate): A discrete Poisson process.
```{r}
#| echo: true
#| message: false
#| warning: false
# Split data set into small time windows
data_split <- survSplit(Surv(time, event) ~., data=data, cut=seq(0, max(data$time), 0.5))
data_split$fup <- data_split$time-data_split$tstart
# Save spline and boundary knots information
spline_info <- bs(data_split$time, df = 3)
boundary_knots <- attr(spline_info, "Boundary.knots")
# Attach splines to dataset
data_split$spline_1 <- spline_info[,1]
data_split$spline_2 <- spline_info[,2]
data_split$spline_3 <- spline_info[,3]
# Start computational time
start_freq <- Sys.time()
# Piecewise exponential model
mod <- glm(event~rando+spline_1+spline_2+spline_3+
spline_1:rando+spline_2:rando+spline_3:rando+
offset(log(fup)), data=data_split, family=poisson())
# End computational time
end_freq <- Sys.time()
# Computational time in seconds
time_freq <- difftime(end_freq, start_freq, units = "secs")
```
The marginal hazard ratio is estimated as
```{r}
#| echo: true
#| message: false
#| warning: false
# Marginal hazard ratio
avg_comparisons(mod, variable="rando",
comparison = "ratio",
conf_level = 0.95)[,c("estimate", "conf.low", "conf.high")]
```
### Bayesian modelling: Priors
In a Bayesian setting one defines prior distributions on the unknown parameters. In the above used Poisson regression model these are priors on
- the log baseline hazard ("Intercept"),
- the fixed effect parameters (on log scale) which includes at least a time variable (likely modelled in non-linear way) and a group variable.
In the following we set the prior distribution of all above mentioned parameters to a centered Gaussian distribution with variance 10 (informative prior).
### INLA
In INLA the priors can be specified via the 'control.fixed' specification. Important: The scale parameter is expressed as a precision parameter.
```{r}
#| echo: true
#| message: false
#| warning: false
# Define priors
inla_priors <- list(mean.intercept=0, prec.intercept=1/10, mean=0, prec=1/10)
# Start computational time
start_inla <- Sys.time()
mod_inla <- inla(event~rando+spline_1+spline_2+spline_3+
spline_1:rando+spline_2:rando+spline_3:rando+offset(log(fup)),
data=data_split,
family="poisson", control.fixed=inla_priors,
control.compute=list(config = TRUE, return.marginals.predictor = T)
)
# End computational time
end_inla <- Sys.time()
# Computational time in seconds
time_inla <- difftime(end_inla, start_inla, units = "secs")
```
### Stan
In Stan the priors can be specified via the 'prior' specification. Important: The scale parameter is expressed as standard deviation.
```{r}
#| echo: true
#| message: false
#| warning: false
### Define priors
prior_brm <- c(prior_string("normal(0,sqrt(10))", class = "b", coef = "rando"),
prior_string("normal(0,sqrt(10))", class = "b", coef = "spline_1"),
prior_string("normal(0,sqrt(10))", class = "b", coef = "spline_2"),
prior_string("normal(0,sqrt(10))", class = "b", coef = "spline_3"),
prior_string("normal(0,sqrt(10))", class = "b", coef = "rando:spline_1"),
prior_string("normal(0,sqrt(10))", class = "b", coef = "rando:spline_2"),
prior_string("normal(0,sqrt(10))", class = "b", coef = "rando:spline_3"),
prior_string("normal(0,sqrt(10))", class = "Intercept"))
# Seed
seed <- 1
# Start computational time
start_brm <- Sys.time()
mod_brm <- brm(event ~ rando+spline_1+spline_2+spline_3+
spline_1:rando+spline_2:rando+spline_3:rando+offset(log(fup)),
data=data_split,
family=poisson(),
prior = prior_brm,
seed=seed,
silent= 2 ,
refresh = 0)
# End computational time
end_brm <- Sys.time()
# Computational time in seconds
time_brm <- difftime(end_brm, start_brm, units = "secs")
# Summary
summary_brm <- summary(mod_brm)
# Start computational time
start_rstanarm <- Sys.time()
mod_rstanarm <- stan_glm(event ~ rando+spline_1+spline_2+spline_3+
spline_1:rando+spline_2:rando+spline_3:rando+offset(log(fup)),
data=data_split,
family=poisson(),
prior_intercept = normal(0, sqrt(10)),
prior = normal(0, sqrt(10)),
seed=seed,
refresh = 0)
# End computational time
end_rstanarm <- Sys.time()
# Computational time in seconds
time_rstanarm <- difftime(end_rstanarm, start_rstanarm, units = "secs")
# Summary
summary_rstanarm <- mod_rstanarm$stan_summary
```
### JAGS
In JAGS the priors can be specified via the 'prior' specification. Important: The scale parameter is expressed as precision parameter.
```{r}
#| echo: true
#| message: false
#| warning: false
#| cache: false
# JAGS silent warnings
runjags.options(silent.jags=T)
# Prepare data
jags_data <- list(y=data_split$event,
rando = data_split$rando,
spline_1 = data_split$spline_1,
spline_2 = data_split$spline_2,
spline_3 = data_split$spline_3,
spline_1_rando = data_split$spline_1*data_split$rando,
spline_2_rando = data_split$spline_2*data_split$rando,
spline_3_rando = data_split$spline_3*data_split$rando,
fup=data_split$fup,
n = nrow(data_split))
# JAGS code
jags.script <- "
model{
# likelihood
for( i in 1:n) {
y[i] ~ dpois(lambda[i])
log(lambda[i]) <- beta0 +
beta1*rando[i] +
beta2*spline_1[i] +
beta3*spline_2[i] +
beta4*spline_3[i] +
beta5*spline_1_rando[i] +
beta6*spline_2_rando[i] +
beta7*spline_3_rando[i] +
log(fup[i])
}
# priors
beta0 ~ dnorm(0, 0.1)
beta1 ~ dnorm(0, 0.1)
beta2 ~ dnorm(0, 0.1)
beta3 ~ dnorm(0, 0.1)
beta4 ~ dnorm(0, 0.1)
beta5 ~ dnorm(0, 0.1)
beta6 ~ dnorm(0, 0.1)
beta7 ~ dnorm(0, 0.1)
}
"
# Start computational time
start_jags <- Sys.time()
# Run the model
mod_jags <- jags.model(textConnection(jags.script),
data = jags_data,
n.chains = 4,
n.adapt = 4000,
quiet = T)
output_jags <- coda.samples(
mod_jags,
variable.names = c("beta0", "beta1",
"beta2", "beta3",
"beta4", "beta5",
"beta6", "beta7"),
n.iter=200)
# Summary
summary_jags <- summary(output_jags)
# End computational time
end_jags <- Sys.time()
# Computational time
time_jags <- difftime(end_jags, start_jags, units = "secs")
```
### Results
We conclude that point estimates and credible intervals between the Bayesian software approaches are comparable. As expected, INLA outperforms the other Bayesian software approaches but only provides an approximation to the posterior distributions. Among the MCMC approaches, **rstanarm** reveals the best computational time.
```{r}
#| echo: false
#| message: false
#| warning: false
#| cache: false
output_all <- data.frame(type="INLA", var=dimnames(mod_inla$summary.fixed)[[1]], est=mod_inla$summary.fixed$mean, lci=mod_inla$summary.fixed$`0.025quant`, uci=mod_inla$summary.fixed$`0.975quant`, time=as.numeric(time_inla))
output_all <- bind_rows(output_all, data.frame(type="Stan (brm)", var=dimnames(summary_brm$fixed)[[1]], est=summary_brm$fixed$Estimate, lci=summary_brm$fixed$`l-95% CI`, uci=summary_brm$fixed$`u-95% CI`, time=as.numeric(time_brm)))
output_all <- bind_rows(output_all, data.frame(type="Stan (rstanarm)", var=dimnames(summary_rstanarm)[[1]][1:8], est=summary_rstanarm[1:8,"mean"], lci=summary_rstanarm[1:8,"2.5%"], uci=summary_rstanarm[1:8,"97.5%"], time=as.numeric(time_rstanarm)))
output_all <- bind_rows(output_all, data.frame(type="JAGS", var=dimnames(summary_rstanarm)[[1]][1:8], est=summary_jags$statistics[, "Mean"], lci=summary_jags$quantiles[, "2.5%"], uci=summary_jags$quantiles[, "97.5%"], time=as.numeric(time_jags)))
output_all <- bind_rows(output_all, data.frame(type="Frequentist", var=names(mod$coefficients), est=mod$coefficients, lci=confint.default(mod)[,1], uci=confint.default(mod)[,2], time=as.numeric(time_freq)))
output_all$var[output_all$var=="(Intercept)"] <- "Intercept"
ggplot(output_all, aes(x=var, y=est, colour=type))+geom_point(position=position_dodge2(width=0.5))+geom_linerange(aes(ymin=lci, ymax=uci), position=position_dodge2(width=0.5))+theme_bw()+theme(axis.text.x = element_text(angle=30, vjust = 1, hjust=1), panel.grid.minor = element_blank())+scale_colour_brewer("", palette="Dark2")+ylab("Estimate on log scale")+xlab("")+ggtitle("Point estimates and uncertainty intervals")
time_all <- output_all |> group_by(type) |> summarise(time=first(time))
time_all$type <- factor(time_all$type, levels=time_all$type[order(time_all$time)])
ggplot(time_all, aes(x=type, y=time))+geom_point(position=position_dodge2(width=0.5))+theme_bw()+theme(axis.text.x = element_text(angle=30, vjust = 1, hjust=1), panel.grid.minor = element_blank())+scale_colour_brewer("", palette="Dark2")+ylab("Computational time (in seconds)")+xlab("Model type")+ggtitle("Computational time (in seconds)")
```
## Reproducibility and SCTO validation
Reproducibility of MCMC results depends on several factors (see [https://mc-stan.org/docs/reference-manual/reproducibility.html](https://mc-stan.org/docs/reference-manual/reproducibility.html)), among others,
- Operation system,
- MCMC language, interface and version,
- Versions of included libraries (Boost and Eigen),
- C++ compiler, including version, compiler flags, and linked libraries.
All above mentioned packages can be validated according to the guidelines of the SCTO validation platform ([https://swissclinicaltrialorganisation.github.io/validation_platform/](https://swissclinicaltrialorganisation.github.io/validation_platform/)).
## Author {-}
André Moser, Senior Statistician\
Department of Clinical Research\
University of Bern\
3012 Bern, Switzerland