forked from opain/GenoPred
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Z-test_vs_ANOVA_V3.Rmd
334 lines (257 loc) · 11.4 KB
/
Z-test_vs_ANOVA_V3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "Estimating significance of prediction differences"
output:
html_document:
toc: true
theme: united
toc_depth: 3
number_sections: true
toc_float: true
fig_caption: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
***
(under development)
This analysis was to compare different methods for estimating the statistical significance of prediction differences across polygenic scoring methods, and models in general.
Here I simulate data with one continuous outcome, and two correlated continuous sets of predictions The correlation between the predictors and the outcome is varied, and the correlation between the predictors themselves is varied.
As I do in my other analyses, I estimate the correlation between the predictors and the outcome to determine their predictive utility. Then I compare different methods by comparing the observed-predicted correlations. Several method comparing correlation are considered:
* Two-sample Z-test
* Compares estimates from two populations and does not account for the correlation between predictors
* Permutation based
* Randomises the phenotype, retaining the correlation between predictors, and then estimates the number times the difference in correlation is larger than the observed difference.
* Cox test
* Method for comparing non-nested models
* Pearson and Filon’s (1898):
* Method for comparing correlations between variables that are correlated and measured in a single sample
* Implemented using cocor.dep.groups.overlap function in the cocor package
* Other methods implemented by this function are highly concordant
* Fisher r-to-z :
* Method for comparing correlations between variables that are independent and measures in different samples.
* Implemented using psych package
* Accounts for non-normal error of correlations
* Williams test :
* Method for comparing correlations between variables that are dependent and measured in the same sample.
* Implemented using psych package
* Accounts for non-normal error of correlations
We will simulate the following scenarios:
* Estimates from two independent samples
* Estimates from one sample, but predictors are uncorrelated
* Estimates from one sample, and predictors are highly correlated
***
# Estimates from two independent samples
```{R, eval=T}
library(data.table)
set.seed(1)
# Sample 1
N<-200
y1<-rnorm(N)
x1<-scale(y1+rnorm(N,0,3))
dat1<-data.table(y1,x1)
cor(dat1)
# Sample 2
N<-200
y2<-rnorm(N)
x2<-scale(y2+rnorm(N,0,10))
dat2<-data.table(y2,x2)
cor(dat2)
### Derive models using each predictor
mod1<-lm(y1 ~ x1, data=dat1)
mod2<-lm(y2 ~ x2, data=dat2)
dat1$pred1<-predict(mod1, dat1)
dat2$pred2<-predict(mod2, dat2)
dat_cor<-data.frame(model1='x1',
model2='x2',
cor_x1_x2=NA,
cor_y_x1=coef(summary(mod1))[2,1],
cor_y_x1_se=coef(summary(mod1))[2,2],
cor_y_x2=coef(summary(mod2))[2,1],
cor_y_x2_se=coef(summary(mod2))[2,2],
cor_diff=coef(summary(mod1))[2,1]-coef(summary(mod2))[2,1])
### Test difference between predictors using a Z-test
dat_cor$cor_diff_Ztest_P<-pnorm(-(dat_cor$cor_diff/sqrt((dat_cor$cor_y_x1_se^2)+(dat_cor$cor_y_x2_se^2))))
### Test difference between predictors using a permutation test
n_perm<-500
diff<-NULL
for(i in 1:n_perm){
y1_sample<-sample(y1)
y2_sample<-sample(y2)
mod1_i<-lm(y1_sample ~ x1, data=dat1)
mod2_i<-lm(y2_sample ~ x2, data=dat2)
diff_i<-coef(summary(mod1_i))[2,1]-coef(summary(mod2_i))[2,1]
diff<-c(diff,diff_i)
}
dat_cor$cor_diff_perm_P<-sum(diff > dat_cor$cor_diff[1])/n_perm
### Test difference between predictors using coxtest
# Not possible as different samples
### Test difference between predictors using Fisher's Z transformation
library(cocor)
dat_cor$cocor_fisherZ_P<-cocor.indep.groups(r1.jk=dat_cor$cor_y_x1, r2.hm=dat_cor$cor_y_x2, n1=N, n2=N, alternative='greater')@fisher1925$p.value
library(psych)
dat_cor$psych_fisherZ_P<-paired.r(xy=dat_cor$cor_y_x1, xz=dat_cor$cor_y_x2, n=N, n2=N, twotailed=F)$p
dat_cor
```
The Fisher's r-to-z transformation is a commonly used approach for comparing correlations from different samples. The two-sample z-test does not account for the non-normal error distribution of Pearson correlations. Results indicate the fisher r-to-z method is concordant with the two-sample z-test. The permutation-based approach is more conservative than other methods.
***
# Estimates from one sample, but predictors are uncorrelated
```{R, eval=T}
library(data.table)
set.seed(1)
N<-200
y<-rnorm(N)
x1<-as.numeric(scale(y+rnorm(N,0,3)))
x2<-as.numeric(scale(y+rnorm(N,0,10)))
dat<-data.table(y,x1,x2)
cor(dat)
### Derive models using each predictor
mod1<-lm(y ~ x1, data=dat)
mod2<-lm(y ~ x2, data=dat)
dat$pred1<-predict(mod1, dat)
dat$pred2<-predict(mod2, dat)
dat_cor<-data.frame(model1='x1',
model2='x2',
cor_x1_x2=cor(dat$x1,dat$x2),
cor_y_x1=coef(summary(mod1))[2,1],
cor_y_x1_se=coef(summary(mod1))[2,2],
cor_y_x2=coef(summary(mod2))[2,1],
cor_y_x2_se=coef(summary(mod2))[2,2],
cor_diff=coef(summary(mod1))[2,1]-coef(summary(mod2))[2,1])
### Test difference between predictors using a Z-test
dat_cor$cor_diff_Ztest_P<-pnorm(-(dat_cor$cor_diff/sqrt((dat_cor$cor_y_x1_se^2)+(dat_cor$cor_y_x2_se^2))))
### Test difference between predictors using a permutation test
n_perm<-500
diff<-NULL
for(i in 1:n_perm){
y_sample<-sample(y)
mod1_i<-lm(y_sample ~ x1, data=dat)
mod2_i<-lm(y_sample ~ x2, data=dat)
diff_i<-coef(summary(mod1_i))[2,1]-coef(summary(mod2_i))[2,1]
diff<-c(diff,diff_i)
}
dat_cor$cor_diff_perm_P<-sum(diff > dat_cor$cor_diff[1])/n_perm
### Test difference between predictors using coxtest
library(lmtest)
dat_cor$cox_diff_P<-coxtest(mod1, mod2)$P[2]
### Test difference between predictors using Pearson
library(cocor)
dat_cor$pearson_diff_P<-cocor.dep.groups.overlap(dat_cor$cor_y_x1, dat_cor$cor_y_x2, dat_cor$cor_x1_x2, N,alternative='greater')@pearson1898$p.value
### Test difference between predictors using Williams's Test
library(psych)
dat_cor$williams_diff_P<-paired.r(xy=dat_cor$cor_y_x1, xz=dat_cor$cor_y_x2, yz=dat_cor$cor_x1_x2, n=N, twotailed=F)$p
dat_cor
```
Apart fromt the coxtest method, results are similar across methods, with pearson and williams methods being highly concordant.
***
# Estimates from one sample, and predictors are highly correlated
```{R, eval=T}
library(data.table)
set.seed(1)
N<-200
y<-rnorm(N)
set.seed(2)
x1<-as.numeric(scale(y+rnorm(N,0,3)))
set.seed(3)
x2<-as.numeric(scale(x1+rnorm(N,0,1)))
dat<-data.table(y,x1,x2)
cor(dat)
### Derive models using each predictor
mod1<-lm(y ~ x1, data=dat)
mod2<-lm(y ~ x2, data=dat)
dat$pred1<-predict(mod1, dat)
dat$pred2<-predict(mod2, dat)
dat_cor<-data.frame(model1='x1',
model2='x2',
cor_x1_x2=cor(dat$x1,dat$x2),
cor_y_x1=coef(summary(mod1))[2,1],
cor_y_x1_se=coef(summary(mod1))[2,2],
cor_y_x2=coef(summary(mod2))[2,1],
cor_y_x2_se=coef(summary(mod2))[2,2],
cor_diff=coef(summary(mod1))[2,1]-coef(summary(mod2))[2,1])
### Test difference between predictors using a Z-test
dat_cor$cor_diff_Ztest_P<-pnorm(-(dat_cor$cor_diff/sqrt((dat_cor$cor_y_x1_se^2)+(dat_cor$cor_y_x2_se^2))))
### Test difference between predictors using a permutation test
n_perm<-500
diff<-NULL
for(i in 1:n_perm){
y_sample<-sample(y)
mod1_i<-lm(y_sample ~ x1, data=dat)
mod2_i<-lm(y_sample ~ x2, data=dat)
diff_i<-coef(summary(mod1_i))[2,1]-coef(summary(mod2_i))[2,1]
diff<-c(diff,diff_i)
}
dat_cor$cor_diff_perm_P<-sum(diff > dat_cor$cor_diff[1])/n_perm
### Test difference between predictors using coxtest
library(lmtest)
dat_cor$cox_diff_P<-coxtest(mod1, mod2)$P[2]
### Test difference between predictors using Pearson
library(cocor)
dat_cor$pearson_diff_P<-cocor.dep.groups.overlap(dat_cor$cor_y_x1, dat_cor$cor_y_x2, dat_cor$cor_x1_x2, N,alternative='greater')@pearson1898$p.value
### Test difference between predictors using Williams's Test
library(psych)
dat_cor$williams_diff_P<-paired.r(xy=dat_cor$cor_y_x1, xz=dat_cor$cor_y_x2, yz=dat_cor$cor_x1_x2, n=N, twotailed=F)$p
### Test difference between predictors using ROC test
library(pROC)
mod1_roc<-roc(y ~ x1)
mod2_roc<-roc(y ~ x2)
dat_cor$mod1_auc<-mod1_roc$auc
dat_cor$mod2_auc<-mod2_roc$auc
dat_cor$roc_diff_P<-roc.test(mod1_roc,mod2_roc, paired=T, alternative='greater', method='bootstrap')$p.value
dat_cor
```
Again the coxtest method is very different from other methods. The two-sample Z test is now deviating from the other methods because it doesn't account for the correlation between predictors. The results for the permutation, pearson and williams method are highly concordant.
The psych package is a solid package and the Williams method is recommended by Steiger who worked alot in this area. It is also faster than the permutation based approach. From here on use the Williams test to test for significant differences between correlations between outcomes and correlated predictors.
***
# Under development...
Another method that is used to compare models is to compare AUC curves. Ths can be implemented using the pROC package in R. This test is only suitable for binary outcomes. For comparison, compare the AUC method to the permutation and wiliams methods.
***
## Estimates from one sample, and predictors are highly correlated
```{R, eval=T}
library(data.table)
set.seed(1)
N<-200
y<-rbinom(n=N, size=1, prob=0.5)
set.seed(2)
x1<-as.numeric(scale(y+rnorm(N,0,2)))
set.seed(3)
x2<-as.numeric(scale(x1+rnorm(N,0,1)))
dat<-data.table(y,x1,x2)
cor(dat)
### Derive models using each predictor
mod1<-glm(y ~ x1, data=dat, family='binomial')
mod2<-glm(y ~ x2, data=dat, family='binomial')
dat$pred1<-predict(mod1, dat)
dat$pred2<-predict(mod2, dat)
dat_cor<-data.frame(model1='x1',
model2='x2',
cor_x1_x2=cor(dat$x1,dat$x2),
cor_y_x1=coef(summary(mod1))[2,1],
cor_y_x1_se=coef(summary(mod1))[2,2],
cor_y_x2=coef(summary(mod2))[2,1],
cor_y_x2_se=coef(summary(mod2))[2,2],
cor_diff=coef(summary(mod1))[2,1]-coef(summary(mod2))[2,1])
### Test difference between predictors using a permutation test
n_perm<-500
diff<-NULL
for(i in 1:n_perm){
y_sample<-sample(y)
mod1_i<-glm(y_sample ~ x1, data=dat, family='binomial')
mod2_i<-glm(y_sample ~ x2, data=dat, family='binomial')
diff_i<-coef(summary(mod1_i))[2,1]-coef(summary(mod2_i))[2,1]
diff<-c(diff,diff_i)
}
dat_cor$cor_diff_perm_P<-sum(diff > dat_cor$cor_diff[1])/n_perm
### Test difference between predictors using Williams's Test
library(psych)
dat_cor$williams_diff_P<-paired.r(xy=dat_cor$cor_y_x1, xz=dat_cor$cor_y_x2, yz=dat_cor$cor_x1_x2, n=N, twotailed=F)$p
### Test difference between predictors using ROC test
library(pROC)
mod1_roc<-roc(y ~ x1)
mod2_roc<-roc(y ~ x2)
dat_cor$mod1_auc<-mod1_roc$auc
dat_cor$mod2_auc<-mod2_roc$auc
dat_cor$roc_diff_P<-roc.test(mod1_roc,mod2_roc, paired=T, alternative='greater')$p.value
dat_cor
```
This analysis raises some concerns about the validity of the William's test when the outcome is binary. We should check the false positive rate of the Williams test by testing how many <0.05 tests there are under the null.
***