KintorModel/ANCOVA.Rmd at master · the8thday/KintorModel · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
---
title: "ANCOVA"
author: "liuc"
date: "2/11/2022"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 协方差分析

此文通过一个小示例记录协方差分析以及计算 emmeans时的应用。 比如在具有主要结局指标的重复测量数据分析中， 以baseline作为协变量，以某一时期的结果作为主要结局变量，则通常会采用协变量的方法进行分析。

*什么是协变量分析：*
\[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon \]

一个设计简单的方差分析，一般需要计算出组内方差和组间方差，然后求出F值，并根据F值阈值做出判断。在协方差中，在这些计算之前需要对协变量进行adjust。协方差分析是将线性回归与方差分析相结合的一种分析方法。把对因变量Y有影响的因素X看作协变量，建立Y对X的线性回归，利用回归关系把X值化为相等，再进行各组Y的修正均数间比较。
在协方差分析（ANCOVA）中，协变量（covariates）是用于控制研究中潜在的混淆因素。协变量的主要目的是降低误差的方差，从而增加实验中处理效应的统计效能。具体来说，协变量可以帮助我们更准确地估计处理效应，因为它们允许我们在考虑其他影响因变量的变量的同时，考虑处理变量的影响。


协变量分析的效应值就是矫正过的在各个分组中的均值，协方差分析的自变量至少包括一个分类变量和一个数值型变量。
协方差分析也即是矫正混杂因素。

emmeans为在各个分组中outcome的均值；contrasts为不同分组间的差异，汇报的结果一般还需要包括Pvalue和95%CI。


*对于如何纳入交互作用的一些记录：*主要指two-way分析中两分组变量间的交互。
虽则一般在主要研究终点上不会引入交互作用，但后续的次要终点研究是可以考虑交互作用的。
When using frequentist methods, it is invalid to remove the interaction from a linear model just because of a hypothesis test, because that invalidates the estimate of residual variance, i.e., it results in an underestimation of 𝜎2. This is a form of model uncertainty discussed at more length in RMS.

In a confirmatory randomized experiment it is commonplace to have a "no interaction" model for the primary analysis and a pre-specified secondary analysis using an interaction model. The most cohesive approach IMHO is a Bayesian analysis in which the prior distribution for the interaction effect is elicited before data analysis. This allows interactions to be "half in" and "half out" of the model.

对于交互作用主要如此考虑：如果研究目的有必要去考虑交互作用，则不论是否P<0.05，都按照分组间做分析。如果只是探索性的分析，则在<0.05的时候，按照一个分组去分析另一个分组的亚组分析；如果>0.05，则可以观察具体有意义的分组变量或者直接就做两个单独的ANCOVA。


*顺便理清一下EMMs：*Estimated marginal means (EMMs)是一个由模型估算出来的值，其are defined as equally weighted means of these predictions at specified margins。In models with covariates, EMMs are often called adjusted means.
同时和模型自身的特征有关，不同模型需要不同的EMM估算方法。
在计算各种EMMs时，模型的正确构建是最为重要的，对分组变量间求contrast往往是最终目的，在设计碰到interaction时往往不易解释结果，此处做一个总结。首先是，`Interaction contrasts`，其是contrasts的contrast，比如在一个双factor交互作用的简单
模型中`ll <- lm(y ~ a * b)`，其的contrasts为`c <- contrast(ll, pairwise ~ a | b)`，
而在得到的此contrasts的基础上再进一步得到contrast`IC_res <- contrast(c, interaction = c("poly", "consec"), by = NULL)`，即是a分组之间的差和b分组间的差的组合，`emmeans::test(IC_res, joint = T)`提供整体的TypeIII检验。 函数`joint_tests(aov_res2)`obtains and tests the interaction contrasts for all effects in the model and compiles them in one Type-III-ANOVA-like table.
对于一个covariate和factor的交互，一般计算covariate在factor每一个水平上的slope.


```{r, include=FALSE}
library(tidyverse)
library(easystats)
# library(rstatix)
library(grafify)
library(emmeans)
library(marginaleffects)
library(multcomp)
```


```{r}
df <- readr::read_rds('./datasets/foo3.rds')

# 此数据集中以W24D1作为主要结局指标。
# W1D1作为协变量，可以理解为W1为basline数据
# 研究项目分组 为分类变量。

skimr::skim(df)
```

首先是否满足的分析条件，和一般线性回归一样，需要满足LINE，同时还需要协变量和因变量之间不存在交互效应。
以下分步骤所做的检测，亦可在模型建立后performance::check_model进行（推荐，针对残差的部分较为方便）。

*对于不满足条件的，可以用sm包的非参数检验方法的sm.ancova()函数进行分析, 或者robust ANCOVA using the WRS2 package.*

ANCOVA分析需要数据满足一定的假设：

1. Linearity between the covariate and the outcome variable
2. Homogeneity of regression slopes
3. The outcome variable should be approximately normally distributed
4. Homoscedasticity
5. No significant outliers，经验表明Outlier对协方差的影响还是蛮大的，尤其在一些临床指标中，Outlier出现的频次还是很高的。


```{r}
# raw means
# 可以初步看待下不同分组中的raw means
df %>% select(W1D1, `研究项目分组`, diffW24) %>% distinct() %>%
  group_by(`研究项目分组`) %>% summarise(mm = mean(diffW24))


# 正态性
df %>% group_by(`研究项目分组`) %>% identify_outliers(diffW24)
df %>% group_by(`研究项目分组`) %>% shapiro_test(diffW24)
df %>% levene_test(diffW24 ~ `研究项目分组`)

# 线性，不同分组
# 不同的线之间最好平行
ggpubr::ggscatter(
  df, x = "W1D1", y = "diffW24",
  color = "研究项目分组", add = "reg.line"
)+
  ggpubr::stat_regline_equation(
    aes(label =  paste(..eq.label.., ..rr.label.., sep = "~~~~"), color = `研究项目分组`)
  )

# 分组变量和协变量的交互效应，最好是没有交互作用，即P>0.05
anova_test(data = df,
       formula = diffW24 ~ `研究项目分组`* W1D1
       )

# 除了上述检验外还需要检查各种残差的分布，详见
# https://www.datanovia.com/en/lessons/ancova-in-r/#:~:text=The%20Analysis%20of%20Covariance%20(ANCOVA,two%20or%20more%20independent%20groups.
```


```{r}
# fit two nested models (equal and varying slopes across groups)
aov0 <- aov(diffW24 ~ W1D1 + `研究项目分组`, data=df) # ≈ lm(Postwt ~ Prewt + Treat + offset(Prewt), data=anorexia)
aov1 <- aov(diffW24 ~ W1D1 * `研究项目分组`, data=df)
# check if we need the interaction term
anova(aov0, aov1)

```


```{r}
summary.lm(aov1)
```


当以上条件基本满足后就可以展开协方差分析。作为其实本质是线性分析的方法，其可以纳入的协变量的个数不应该只有一个。。

one-way ANCOVA
构建模型。在R中需要注意*aov函数要求协变量写在效应变量前面*。

```{r}
#
aov_res <- aov(diffW24 ~ W1D1 + `研究项目分组`, # W1D1为需考虑的协变量
               data = df)

# type III
car::Anova(aov_res, type =  'III')

# 两两post hoc 分析
pwc3 <- emmeans_test(
  diffW24 ~ `研究项目分组`,
  covariate = W1D1,
  p.adjust.method = "bonferroni",
  data = df,
  detailed = TRUE
  )
pwc3 # pairwise comparison p-value, 可以看到其95%CI没有经过矫正
get_emmeans(pwc3)

grafify::posthoc_Levelwise(Model = aov_res,
                           Fixed_Factor = c("研究项目分组"),
                           infer = c(TRUE, TRUE)
                           )
```

emmeans 计算以及分析.
以上所用到的函数emmeans_test和posthoc_Levelwise的结果在95%CI上有些许出入，应该和不同的矫正方法有关。
下面对协方差的emmeans利用emmeans包分步进行。可以看到emmeans_test的结果没有经过P值矫正，而posthoc_Levelwise的
95%CI经过了bonferroni矫正。

```{r}
# 每一个变量及其因子的emmeans
(emm_res <- emmeans::emmeans(aov_res, specs = '研究项目分组'))

# 两两比对的contrast
pairs(emm_res, adjust = 'bonferroni', infer = c(TRUE, TRUE))

contrast(emm_res, method = 'pairwise', adjust = 'none', infer = c(T, T))

plot(emm_res, comparisons = TRUE) + theme_bw() +
  labs(x = "Estimated Marginal Mean")

```


`multcomp`包的事后分析：
```{r}
postHocs <- multcomp::glht(aov_res, linfct = multcomp::mcp(`研究项目分组` = "Tukey"))

summary(postHocs)
```


### sm 包所提供的非参数模型

各参数定义：
`model`指的是，

```{r}
sm_res <- with(df, sm::sm.ancova(W1D1, dW24, ARM, model="equal"))
```


*除了上述非参数模型外，在临床应用中，因为aov的假设，比如每个分组中的error SD is the same:*
但是gls函数would be a much better model to use when you have wildly different sample SDs.
The standard errors of the EMMs will depend on the individual sample variances, and the t tests of the comparisons will be in essence the Welch t statistics with Satterthwaite degrees of freedom.
```{r}
mod2 = nlme::gls(response ~ treat, data = mydata,
                 weights = varIdent(form = ~1 | treat))

emmeans(mod2, pairwise ~ treat)
```


### 和线性模型的对比

将下面的模型和aov_res对比，可以见到二者是一致的。

可以推测到对于简单线性模型`lm(diffW24 ~ `研究项目分组`, data = df)`和直接计算`mean`值得到的结果应该是一致的。

```{r}
# ?aov 本就是利用线性模型得到的分析

lm_fit <- lm(diffW24 ~ W1D1 + `研究项目分组`,
               data = df)

summary(lm_fit)

```


```{r}
anova_table <- anova(lm_fit)

anova_table
```


不同`type`间的anova差别在于计算方式的不同，比如`anova`函数默认采用'I'型，即是
```{r}
car::Anova(lm_fit, type = 'III')
```


```{r}
#通过anova表获取平方和
SS_group = anova_table[1,2]
SS_x = anova_table[2,2]
SS_error = anova_table[3,2]

#计算自由度
df_group = anova_table[1,1]
df_x = anova_table[2,1]
df_error = anova_table[3,1]

#计算均方
MS_group = SS_group/df_group
MS_x = SS_x/df_x
MS_error = SS_error/df_error

#计算F值
F_group = MS_group/MS_error
F_x = MS_x/MS_error

#进行F检验
pf(F_group, df_group, df_error, lower.tail=FALSE)
pf(F_x, df_x, df_error, lower.tail=FALSE)
```


### 手动计算

```{r}
# 生成数据
y = c(7.1, 6.8, 7.2, 8.1, 8.3, 7.9, 8.5, 7.4, 8.0, 8.6)
group = c('a','a','a','b','b','b','c','c','c','c')
x = c(65, 70, 75, 85, 80, 75, 90, 85, 80, 95)

data <- data.frame(Y = y, # 因变量
                   X = group,# 自变量
                   Z = x # 协变量
                   )

# 建立协变量和因变量的线性方程
lm_cov <- lm(Y ~ Z, data = data)

y_predict <- predict(lm_cov, newdata = data)


# 各组中协变量的均值
cov_group_mean <- tapply(x, group, mean)
# 协变量均值的预测值
c <- cov_group_mean * (coefficients(lm_cov)[2]) + coefficients(lm_cov)[1]

# 用y值减去对应组的c值
y_adj <- c()
for(i in 1:length(y)){
  y_adj[i] = y[i] - c[group[i]]
}


# 协变量的方差分析
anova(lm_cov)
```

对矫正后的Y值做方差分析，注意在这里还需要
*好难实现哦*
```{r}
df <- data.frame(Y = y_adj, X = group)

aov(Y ~ X, data = df) |> summary()
```

和R语言的`aov`函数做对比：
```{r}
aov(Y ~ Z + X, data = data) |> summary()
```


### 多因素协方差分析

多因素方差分析和多因素协方差分析类同。
在进行post hoc分析时，要依据交互效应是否有统计学意义，进行不同的分析思路。
如果没有交互效应则按照一个分组分别汇报另一个分组的主效应。
如果交互效应有统计学意义，则可以按照dislevel的分组分别对`研究项目分组`进行oneway ANCOVA的分析. 然后Compute pairwise comparisons between treatment groups


```{r}
aov.res2 <- aov(formula = diffW24 ~ W1D1 + `研究项目分组` * dislevel,
                data = df
                )

performance(aov.res2)

car::Anova(aov.res2, type = 'III')
```

下面进行post-hoc 分析. 考虑没有交互效应。
按照dislevel分组，分别计算`研究项目分组`分组中的emmeans。

```{r}
pc <- emmeans::emmeans(aov.res2,
                       specs = trt.vs.ctrl ~ `研究项目分组`|dislevel,
                       type = "response",
                       ref = 1,
                       adjust = 'fdr', infer = c(TRUE, TRUE))
pc

(emm_res <- emmeans(aov.res2, specs = c("研究项目分组", 'dislevel')))
contrast(emm_res, method = 'trt.vs.ctrl',
         type = 'response', adjust = 'fdr',by = 'dislevel', infer = c(TRUE, TRUE))
```


#### 基于`modelbased`包的分析

`modelbased`包提供包括，也都是对emmeans的包装:

- `estimate_means()`, Estimates the average values at each factor levels
- `estimate_contrasts()`, Estimates and tests contrasts between different factor levels
- `estimate_expectation()`等方便的函数。

```{r}
# model的emmeans
# 每一个变量的factor level的emmeans
# at 参数自动选择
# This is typically used when some of the predictors of interest are factors.
emm = modelbased::estimate_means(model = aov_res,
                           at = '研究项目分组',
                           transform = 'response',
                           ci = 0.95
                           )
modelbased::estimate_means(model = aov_res,
                           at = '研究项目分组=c("安慰剂BID", "5mgBID")',
                           transform = 'response',
                           ci = 0.95
                           )
plot(emm)
# standardize(emm)

# contrast
# 可以指定reference contrast(ref)
con <- modelbased::estimate_contrasts(model = aov_res,
                               contrast = "研究项目分组",
                               ci = 0.95,
                               adjust = 'holm', # 怎么此参数没用？一直都是tukey的结果
                               method = 'revpairwise'
                               )
# 二者的结果不一样，是因为上面好像有bug
# contrast(emm_res, method = 'revpairwise', p_adjust = 'holm', infer = c(T, T))

plot(con, emm)


# is not values on the response variable, but the model's parameters
# marginal effects return averages of coefficients
# 在对"研究项目分组"这一分类变量矫正后，W1D1的effect is negative and significant
modelbased::estimate_slopes(model = aov_res)
slopes <- estimate_slopes(aov_res, trend = "W1D1", at = "研究项目分组")

slopes

```


预测值和真实值:

从图中的结果可以看到预测的效果似乎一般。

```{r}
pred_data <- estimate_expectation(aov_res)
head(pred_data)

pred_data$diffW24 <- df$diffW24
pred_data$Predicted_2 <- estimate_expectation(aov.res2)$Predicted

pred_data %>%
  ggplot() +
  geom_line(aes(x = diffW24, y = diffW24), linetype = "dashed") +
  geom_point(aes(x = diffW24, y = Predicted), color = "grey") +
  geom_point(aes(x = diffW24, y = Predicted_2), color = "red") +
  ylab("diffW24 (predicted)") +
  theme_modern()
```

estimate the relationship between the response and the two predictors.

三组用药组的diffW24都随着W1D1的增加有下降的趋势，可以考虑在入组时患者screen期的数据不要太大。

```{r}
predicted <- estimate_expectation(aov_res, data = "grid")

df %>%
  ggplot(aes(x = W1D1)) +
  geom_point(aes(y = diffW24, color = `研究项目分组`)) +
  geom_ribbon(data = predicted, aes(ymin = CI_low, ymax = CI_high,
                                    fill = `研究项目分组`), alpha = 0.3) +
  geom_line(data = predicted, aes(y = Predicted, color = `研究项目分组`), size = 1) +
  theme_modern()
```


####  将结果输出成所需要的表格并绘制森林图

*森林图*

```{r}
con_p <- pc$contrasts %>% as.data.frame() %>%
  as_tibble()

plot(pc) + theme_bw()
```

下面利用forestplot包进行绘制，不过此包的使用主要是还是以整理成其所需要的数据格式为主，似乎可操作性差一些。

```{r}
library(forestplot)

ss <- tibble(mean  = 0.531,
             lower = 0.386,
             upper = 0.731,
             dislevel = "Summary",
             contrast = NA,
             summary = TRUE)

header <- tibble(dislevel = c("Dislevel"),
                 contrast = c("Contrast"),
                 summary = TRUE)

empty_row <- tibble(mean = NA_real_)

con_p <- con_p %>%
  rename('mean'=estimate,
         'lower' = lower.CL,
         'upper' = upper.CL
         ) %>% select(dislevel, contrast, mean, lower, upper)

header %>% bind_rows(con_p) %>% bind_rows(empty_row) %>% bind_rows(ss) %>%
  forestplot(
    labeltext = c(dislevel, contrast),
    mean = mean,
    lower = lower,
    upper = upper,
    hrzl_lines = gpar(col = "#444444"),
    is.summary = summary,
    zero = 12,
    grid = structure(c(10),
                     gp = gpar(lty = 2, col = "#CCCCFF")),
    col = fpColors(box = "royalblue",
                   line = "darkblue",
                   summary = "royalblue"),
    title = 'Forest Plot'
  )
```

forest plot by forester
麻烦的永远是整理森林图的数据。不知能否基于某些模型的输出自动整理待分析的数据。

```{r}
# devtools::install_github("rdboyes/forester")
library(forester)


demo_data <- read_csv('./datasets/forest_data.csv')
demo_data$dislevel <- ifelse(is.na(demo_data$contrast),
                          demo_data$dislevel,
                          paste0("  ", demo_data$dislevel))


forester(left_side_data = demo_data[,1:2],
         estimate = demo_data$mean,
         ci_low = demo_data$lower,
         ci_high = demo_data$upper,
         display = FALSE,
         null_line_at = 10,
         font_family = "sans",
         estimate_col_name = "Estimate (95% CI)",
         file_path = here::here("/Users/congliu/Downloads/forester_plot.png"),
         xlim = c(-10, 50),
         ggplot_width = 20,
         arrows = TRUE,
         arrow_labels = c("kx826 Better", "Placebo Better"),
         )
```


R里面合适的森林图包还是需要一个的呀。

```{r}
library(forestploter)


f <- data.frame(aa)

# Add blank column for the forest plot to display CI.
# Adjust the column width with space.
# 这里用空白行放在画图部分的前面
f$` ` <- paste(rep(" ", 20), collapse = " ")

# Create confidence interval column to display
f$`Diff (95% CI)` <- ifelse(is.na(f$SE), "",
                             sprintf("%.2f (%.2f to %.2f)",
                                     f$estimate, f$lower.CL, f$upper.CL))

# Define theme
tm <- forest_theme(base_size = 10,
                   refline_col = "red",
                   arrow_type = "closed",
                   footnote_col = "blue")

p <- forest(f[,c(1:2, 10:11)],
            est = f$estimate,
            lower = f$lower.CL,
            upper = f$upper.CL,
            sizes = f$SE,
            ci_column = 3, # 即是画出来的CI放在第几列，要和空白行连在一起
            ref_line = 0,
            arrow_lab = c("Placebo Better", "Treatment Better"),
            # xlim = c(0, 4),
            # ticks_at = c(0.5, 1, 2, 3),
            footnote = "This is the demo data. Please feel free to change\nanything you want.",
            # theme = tm
            )

# Print plot
plot(p)
```