---
title: "Session 5: Advanced statistical analyses"
author:
  name: Jalal Al-Tamimi
  affiliation: Université Paris Cité
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_notebook:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 6
    toc_float:
      collapsed: yes
    fig_crop: no
---
# Loading packages
```{r warning=FALSE, message=FALSE, error=FALSE}
## Use the code below to check if you have all required packages installed. If some are not installed already, the code below will install these. If you have all packages installed, then you could load them with the second code.
requiredPackages = c('tidyverse', 'broom', 'knitr', 'Hmisc', 'corrplot', 'lme4', 'lmerTest', 'party', 'ranger','doFuture', 'tidymodels', 'pROC', 'varImp', 'lattice', 'vip', 'emmeans', 'ggsignif', 'PresenceAbsence', 'languageR', 'FactoMineR', 'factoextra', 'RColorBrewer', 'scatterplot3d', 'cowplot', 'psycho', 'ordinal')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
```
# Correlation tests {.tabset .tabset-fade .tabset-pills}
## Basic correlations
Let us start with a basic correlation test. We want to evaluate if two numeric variables are correlated with each other.
We use the function `cor` to obtain the Pearson correlation, and `cor.test` to run a correlation test on our data with significance testing.
```{r}
cor(english$RTlexdec, english$RTnaming, method = "pearson")
cor.test(english$RTlexdec, english$RTnaming)
```
What are these results telling us? There is a positive correlation between `RTlexdec` and `RTnaming`. The correlation coefficient (r) is 0.76 (r is bounded between -1 and 1). This correlation is statistically significant, with a t value of 78.699, 4566 degrees of freedom and a p-value < 2.2e-16.
What are the degrees of freedom? They equal the total number of observations minus the number of estimated parameters. Here we have 4568 observations in the dataset and two parameters, hence 4568 - 2 = 4566.
For the p-value, we usually apply a threshold of p = 0.05: the minimum level below which we consider a difference significant. It corresponds to a 5% (or lower) probability of observing such a result when there is no real effect. In our case, the p-value is lower than 2.2e-16. How should we interpret this number? It is 0.00000000000000022, i.e., a decimal point followed by 15 zeros before the 22. This probability is very (very!!) low, so we conclude that there is a statistically significant correlation between the two variables.
The formula to calculate the t value is below.
![](t-score.jpg)
- x̄ = sample mean
- μ0 = population mean
- s = sample standard deviation
- n = sample size
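For a correlation test specifically, `cor.test` computes the t statistic as t = r √(n − 2) / √(1 − r²). A minimal sketch on simulated data (the vectors `x` and `y` are illustrative, not from the `english` dataset):

```r
# t statistic of a correlation test computed by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2)
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100)                 # correlated with x by construction
r <- cor(x, y)
n <- length(x)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_manual
unname(cor.test(x, y)$statistic)    # should match the manual computation
```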
The p-value is influenced by various factors: the number of observations, the strength of the difference, the mean values, etc. You should always be careful when interpreting p-values, taking everything else into account.
## Using the package `corrplot`
Above, we did a correlation test on two predictors.
What if we want to obtain a nice plot of all numeric predictors and add significance levels?
### Correlation plots
```{r fig.height=6}
corr <-
  english %>%
  select(where(is.numeric)) %>%
  cor()
print(corr)
corrplot(corr, method = 'ellipse', type = 'upper')
```
### More advanced
Let's first compute the correlations between all numeric variables and plot these with the p values
```{r fig.height=15}
## correlation using "corrplot"
## based on the function `rcorr' from the `Hmisc` package
## Need to change dataframe into a matrix
corr <-
english %>%
select(where(is.numeric)) %>%
as.matrix() %>%
rcorr(type = "pearson")
print(corr)
# use corrplot to obtain a nice correlation plot!
corrplot(corr$r, p.mat = corr$P,
addCoef.col = "black", diag = FALSE, type = "upper", tl.srt = 55)
```
We can also obtain descriptive summaries by group; here, the mean and standard deviation of `RTlexdec` for each `AgeSubject`:
```{r}
english %>%
group_by(AgeSubject) %>%
summarise(mean = mean(RTlexdec),
sd = sd(RTlexdec))
```
# Linear Models {.tabset .tabset-fade .tabset-pills}
Up to now, we have looked at descriptive statistics, and evaluated summaries, correlations in the data (with p values).
We are now interested in looking at group differences.
## Introduction
A linear model fits a regression to the data. We have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is `outcome ~ predictor`, which can be read as "outcome as a function of the predictor". We can add "1" to specify an intercept, but this is added to the model by default.
### Model estimation
```{r}
english2 <- english %>%
mutate(AgeSubject = factor(AgeSubject, levels = c("young", "old")))
mdl.lm <- english2 %>%
lm(RTlexdec ~ AgeSubject, data = .)
#lm(RTlexdec ~ AgeSubject, data = english)
mdl.lm #also print(mdl.lm)
summary(mdl.lm)
```
### Tidying the output
```{r}
# from library(broom)
tidy(mdl.lm) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm) %>% pull(estimate)
```
Obtaining mean values from our model
```{r}
#old
mycoefE[1]
#young
mycoefE[1] + mycoefE[2]
```
### Nice table of our model summary
We can also obtain a nice table of our model summary. We can use the package `knitr` or `xtable`
#### Directly from model summary
```{r}
kable(summary(mdl.lm)$coef, digits = 3)
```
#### From the `tidy` output
```{r}
mdl.lmT <- tidy(mdl.lm)
kable(mdl.lmT, digits = 3)
```
### Dissecting the model
Let us dissect the model. If you use `str`, you will see everything that is stored in our linear model object; we can then access specific pieces of information from it.
#### "str" and "coef"
```{r}
str(mdl.lm)
```
```{r}
coef(mdl.lm)
## same as
## mdl.lm$coefficients
```
#### "coef" and "coefficients"
What if I want to obtain the "Intercept"? Or the coefficient for `AgeSubject`? What if I want the full row for `AgeSubject`?
```{r}
coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
```
```{r}
summary(mdl.lm)$coefficients[2, ] # full row
summary(mdl.lm)$coefficients[2, 4] #for p value
```
#### Residuals
What about the residuals (the differences between the observed values and the fitted values)? Inspecting them allows us to evaluate how close the residuals are to a normal distribution.
```{r warning=FALSE, message=FALSE, error=FALSE}
hist(residuals(mdl.lm))
qqnorm(residuals(mdl.lm)); qqline(residuals(mdl.lm))
plot(fitted(mdl.lm), residuals(mdl.lm), cex = 4)
```
#### Goodness of fit?
```{r warning=FALSE, message=FALSE, error=FALSE}
AIC(mdl.lm) # Akaike's Information Criterion, lower values are better
BIC(mdl.lm) # Bayesian Information Criterion
logLik(mdl.lm) # log likelihood
```
Or use the following from `broom`
```{r}
glance(mdl.lm)
```
#### Significance testing
Are the above informative? Not directly. If we want to test for the overall significance of the model, we fit a null model (i.e., intercept-only) and compare the two models.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.Null <- english %>%
lm(RTlexdec ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
```
The results show that adding the variable "AgeSubject" improves the model fit. We can report this as follows: model comparison showed that the addition of AgeSubject improved the model fit when compared with an intercept-only model ($F$(`r mdl.comp[2,3]`) = `r round(mdl.comp[2,5], 2)`, *p* < `r mdl.comp[2,6]`), which renders as (F(1) = 4552, p < 2.2e-16).
## Plotting fitted values
### Trend line
Let's plot our fitted values but only for the trend line
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = RTlexdec))+
geom_boxplot()+
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Boxplot and predicted trend line", subtitle = "with ggplot2")
```
This allows us to plot the fitted values from our model with the predicted linear trend. This is exactly the same as our original data.
### Predicted means and the trend line
We can also plot the predicted means and linear trend
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = predict(mdl.lm)))+
geom_boxplot(color = "blue") +
theme_bw() + theme(text = element_text(size = 15)) +
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Predicted means and trend line", subtitle = "with ggplot2")
```
### Raw data, predicted means and the trend line
We can also plot the actual data, the predicted means and linear trend
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = RTlexdec))+
geom_boxplot() +
geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
theme_bw() + theme(text = element_text(size = 15)) +
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2")
```
### Add significance levels and trend line on a plot?
We can use the p values generated from either our linear model to add significance levels on a plot. We use the code from above and add the significance level. We also add a trend line
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = RTlexdec))+
geom_boxplot() +
geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
theme_bw() + theme(text = element_text(size = 15)) +
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with significance testing") +
geom_signif(comparison = list(c("old", "young")),
map_signif_level = TRUE, test = function(a, b) {
list(p.value = summary(mdl.lm)$coefficients[2, 4])})
```
## What about pairwise comparison?
When we have three or more levels in our predictor, we can use pairwise comparisons, with corrections, to evaluate differences between each pair of levels.
```{r}
summary(mdl.lm)
```
```{r}
mdl.lm %>% emmeans(pairwise ~ AgeSubject, adjust = "fdr") -> mdl.emmeans
mdl.emmeans
```
How to interpret the output? Discuss with your neighbour and share with the group.
Hint... Look at the emmeans values for each level of our factor "AgeSubject" and at the contrasts.
## Multiple predictors?
Linear models require a numeric outcome, but predictors can be either numeric or factors, and we can include more than one predictor. The only caveat is that this complicates the interpretation of the results.
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>%
summary()
```
And with an Anova
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>%
anova()
```
The results above tell us that the predictors all contribute significantly to the model fit.
# Generalised Linear Models {.tabset .tabset-fade .tabset-pills}
Here we will look at an example where the outcome is binary. This simulated dataset is structured as follows: we asked one participant to listen to 165 sentences and to judge whether they are "grammatical" or "ungrammatical". There were 105 "grammatical" sentences and 60 "ungrammatical" ones. This fictitious example transfers to any other situation. Think of geography: 165 areas of land, 105 "flat" and 60 "non-flat", etc. It applies to any case where you need to categorise the outcome into two groups.
## Load and summaries
Let's load in the data and do some basic summaries
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- read_csv("grammatical.csv")
grammatical
str(grammatical)
head(grammatical)
```
## GLM - Categorical predictors
Let's run a first GLM (Generalised Linear Model). A GLM uses the "binomial" family, as it assumes the outcome follows a binomial distribution. In general, results from a logistic regression are close to what we get from Signal Detection Theory (see below).
To interpret the results, we change the reference levels for both response and grammaticality. The reference level is the "no" response to the "ungrammatical" category; the coefficients then express the change towards "yes" responses to the "grammatical" category.
### Model estimation and results
The results below show the logodds for our model.
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("no", "yes")),
grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))
grammatical %>%
group_by(grammaticality, response) %>%
table()
mdl.glm <- grammatical %>%
glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)
tidy(mdl.glm) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm) %>% pull(estimate)
```
The results show that moving from "ungrammatical" to "grammatical" increases the logodds of a "yes" response by `r mycoef2[2]` (the intercept shows that for "ungrammatical" items, the logodds of a "yes" response are `r mycoef2[1]`). The actual logodds of a "yes" response to "grammatical" items are therefore `r mycoef2[1]+mycoef2[2]`
### Logodds to Odd ratios
Logodds can be transformed to talk about the odds of an event. For our model above, the odds of an "ungrammatical" item receiving a "yes" response are a mere 0.2; the odds of a "grammatical" item receiving a "yes" are 20, i.e., a "yes" is 20 times more likely than a "no"
```{r warning=FALSE, message=FALSE, error=FALSE}
exp(mycoef2[1])
exp(mycoef2[1] + mycoef2[2])
```
### LogOdds to proportions
If you want to talk about the percentage "accuracy" of our model, we can transform our logodds into proportions. This shows that the predicted proportion of "yes" responses is about 95% for "grammatical" items, against about 17% for "ungrammatical" items
```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
### Plotting
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>%
mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>%
ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
```
## GLM - Numeric predictors {.tabset .tabset-fade .tabset-pills}
In this example, we will run a GLM model using a similar technique to that used in `Al-Tamimi (2017)` and `Baumann & Winter (2018)`. We use the package `languageR` and its dataset `english`.
In the model above, we used the equation lm(RTlexdec ~ AgeSubject): we were interested in the impact of subjects' age on reaction times in a lexical decision task. In this section, we are interested in how well reaction time differentiates the participants based on their age. We therefore use `AgeSubject` as our outcome and `RTlexdec` as our predictor, with the equation glm(AgeSubject ~ RTlexdec). We could use `RTlexdec` as is, but due to a possible quasi-separation, and because we may want to compare coefficients across multiple acoustic metrics, we will z-score our predictor. Below we run two models, without and with z-scoring.
For the glm model, we need to specify `family = "binomial"`.
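Z-scoring itself is just centring a variable and dividing by its standard deviation. A quick sketch showing that doing it by hand matches `scale()` (the vector `x` is illustrative):

```r
# z-scoring by hand: subtract the mean, divide by the standard deviation
x <- c(5, 7, 9, 11)
z_manual <- (x - mean(x)) / sd(x)
all.equal(z_manual, as.numeric(scale(x)))  # TRUE: scale() does the same thing
```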
### Without z-scoring of predictor
#### Model estimation
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.glm2 <- english2 %>%
glm(AgeSubject ~ RTlexdec, data = ., family = "binomial")
tidy(mdl.glm2) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm2) %>% pull(estimate)
```
#### LogOdds to proportions
If you want to talk about the percentage "accuracy" of our model, then we can transform our logodds into proportions.
```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
#### Plotting
```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>%
mutate(prob = predict(mdl.glm2, type = "response"))
english2 %>%
ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Young", "Old"))
```
The plot above shows how the two groups differ under the GLM: the predicted probability of belonging to the "old" group increases with reaction time.
Let's use z-scoring next
### With z-scoring of predictor
#### Model estimation
```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>%
  mutate(RTlexdec_z = as.numeric(scale(RTlexdec, center = TRUE, scale = TRUE)))
mdl.glm3 <- english2 %>%
glm(AgeSubject ~ RTlexdec_z, data = ., family = "binomial")
tidy(mdl.glm3) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm3) %>% pull(estimate)
```
#### LogOdds to proportions
If you want to talk about the percentage "accuracy" of our model, then we can transform our logodds into proportions.
```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
#### Plotting
##### Normal
```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>%
mutate(prob = predict(mdl.glm3, type = "response"))
english2 %>%
ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Young", "Old"))
```
We obtain the exact same plots, but the model estimations are different. Let's use another type of predictions
##### z-scores
```{r warning=FALSE, message=FALSE, error=FALSE}
z_vals <- seq(-3, 3, 0.01)
dfPredNew <- data.frame(RTlexdec_z = z_vals)
## store the predicted probabilities for each value of RTlexdec_z
pp <- cbind(dfPredNew, prob = predict(mdl.glm3, newdata = dfPredNew, type = "response"))
pp %>%
ggplot(aes(x = RTlexdec_z, y = prob)) +
geom_point() +
theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3))
```
Here the model's predictions are plotted across the full range of z-scored values of the predictor, rather than per group.
## Accuracy and Signal Detection Theory
### Rationale
We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome, and at the same time we want to evaluate our response biases. When deciding on categories, we are usually biased in our selection.
Let's ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in anyway by your decision?
To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases.
### Running stats
Let's do some stats on this
| | Yes | No | Total |
|----------------------------|--------------------|------------------|------------------|
| Grammatical (Yes Actual) | TP = 100 | FN = 5 | (Yes Actual) 105 |
| Ungrammatical (No Actual) | FP = 10 | TN = 50 | (No Actual) 60 |
| Total | (Yes Response) 110 | (No Response) 55 | 165 |
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("yes", "no")),
grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))
## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative
TP <- nrow(grammatical %>%
filter(grammaticality == "grammatical" &
response == "yes"))
FN <- nrow(grammatical %>%
filter(grammaticality == "grammatical" &
response == "no"))
FP <- nrow(grammatical %>%
filter(grammaticality == "ungrammatical" &
response == "yes"))
TN <- nrow(grammatical %>%
filter(grammaticality == "ungrammatical" &
response == "no"))
TP
FN
FP
TN
Total <- nrow(grammatical)
Total
(TP+TN)/Total # accuracy
(FP+FN)/Total # error, also 1-accuracy
# When the stimulus is "yes", how often is the response "yes"?
TP/(TP+FN) # True Positive Rate (Sensitivity, or Recall)
# When the stimulus is "no", how often is the response "yes"?
FP/(FP+TN) # False Positive Rate
# When the stimulus is "no", how often is the response "no"?
TN/(FP+TN) # True Negative Rate (Specificity)
# When the subject responds "yes", how often is (s)he correct?
TP/(TP+FP) # Precision
# Signal Detection Theory indices obtained below:
#   dprime: the sensitivity index
#   beta: the bias criterion (lower values = more "yes" responses)
#   aprime: estimate of discriminability (0-1; 1 = perfect discrimination, 0.5 = chance)
#   bppd: b'' d (-1 to 1; 0 = no bias, negative = tendency to respond "yes",
#         positive = tendency to respond "no")
#   c: index of bias (in standard deviation units)
# (see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/)
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)
```
The most important measure above is d-prime. It models the difference between the "True Positive" and "False Positive" rates in standard units (z-scores). The formula can be written as:
`d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)`
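Using the counts from the table above (TP = 100, FN = 5, FP = 10, TN = 50), this formula can be computed directly with `qnorm`, the inverse of the standard normal cumulative distribution:

```r
# d' by hand from the confusion-matrix counts above
TP <- 100; FN <- 5; FP <- 10; TN <- 50
tpr <- TP / (TP + FN)                    # True Positive Rate
fpr <- FP / (FP + TN)                    # False Positive Rate
dprime_manual <- qnorm(tpr) - qnorm(fpr)
dprime_manual                            # should match psycho::dprime() above
```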
### GLM as a classification tool
The code below demonstrates the links between our GLM model and what we obtained above from SDT. The prediction table shows that our GLM produced predictions identical to our initial data setup: compare the table here with the table above. Once we have created the table of outcomes, we can compute percent correct, specificity, sensitivity, the Kappa score, etc.; these yield the point estimates together with the variability related to variations in responses.
```{r}
## predict(mdl.glm)>0.5 is identical to
## predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("yes", "no")),
grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))
mdl.glm.C <- grammatical %>%
glm(response ~ grammaticality, data = .,family = binomial)
tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
###etc..
```
If you look at the results from SDT above, these results are the same as
the following
Accuracy: (TP+TN)/Total (`r (TP+TN)/Total`)
True Positive Rate (or Sensitivity): TP/(TP+FN) (`r TP/(TP+FN)`)
True Negative Rate (or Specificity): TN/(FP+TN) (`r TN/(FP+TN)`)
### GLM and d prime
The values obtained here match those obtained from SDT. For d prime, the difference stems from the use of the logit link of the binomial family; using a probit link yields the same values ([see here](https://stats.idre.ucla.edu/r/dae/probit-regression/) for more details). A probit model evaluates changes in the outcome in standard (z-score) units, here modelling the change from "no" responses to "ungrammatical" items into "yes" responses to "grammatical" items in z-scores. This shares the conceptual underpinnings of d-prime from Signal Detection Theory.
```{r}
## d prime
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)$dprime
## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]
```
## GLM: Other distributions
If your outcome does not fit a binomial distribution and is instead multinomial (i.e., three or more response categories) or Poisson (count data), then you need a different family or a dedicated function, as sketched below.
```{r warning=FALSE, message=FALSE, error=FALSE, echo=FALSE}
## For a multinomial (3 or more response categories), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
## mdl.multi <- nnet::multinom(outcome~predictor, data=data)
## For a poisson (count data), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/poisson-regression/
## mdl.poisson <- glm(outcome~predictor, data = data, family = "poisson")
```
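As a minimal, self-contained Poisson sketch (on simulated counts; the variable names are illustrative, not from our datasets):

```r
# Poisson GLM on simulated count data
set.seed(1)
d <- data.frame(group = rep(c("a", "b"), each = 500))
d$count <- rpois(1000, lambda = ifelse(d$group == "a", 2, 4))
mdl.poisson <- glm(count ~ group, data = d, family = "poisson")
coef(mdl.poisson)             # coefficients are on the log scale
exp(coef(mdl.poisson)[2])     # rate ratio for group b vs a, close to 4/2 = 2
```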
# Cumulative Logit Link Models
These models work perfectly with rating data. Ratings are inherently ordered (1, 2, ..., n), and we expect an overall increase (or decrease) in ratings from 1 to n. To demonstrate this, we use an example based on the package "ordinal". The data come from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic, produced by nine subjects; ratings ranged from 1 to 5. This example applies to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments are in an article or in interview responses, etc.
## Importing and pre-processing
We start by importing the data and process it. We change the reference level in the predictor
```{r warning=FALSE, message=FALSE, error=FALSE}
rating <- read_csv("rating.csv")
rating
rating <- rating %>%
mutate(Response = factor(Response),
Context = factor(Context)) %>%
mutate(Context = relevel(Context, "isolation"))
rating
```
## Our first model
We run our first clm model as a simple model, i.e., with no random effects.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm <- rating %>%
clm(Response ~ Context, data = .)
summary(mdl.clm)
```
## Testing significance
We can evaluate whether "Context" improves the model fit, by comparing a null model with our model. Of course "Context" is improving the model fit.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm.Null <- rating %>%
clm(Response ~ 1, data = .)
anova(mdl.clm, mdl.clm.Null)
```
## Interpreting a cumulative model
As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a logistic model with a cumulative link. The "Coefficients" are the estimates for each level of the fixed effect: a negative coefficient indicates a negative association with the response, a positive coefficient a positive association, and the p-values indicate the significance of each level. The "Threshold coefficients" are the cutpoints of the response: the cumulative thresholds 1|2, 2|3, 3|4 and 4|5 track the overall increase in the ratings from 1 to 5.
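The threshold coefficients can be turned into probabilities with `plogis`. A minimal sketch with illustrative threshold values (not the ones fitted in `mdl.clm`): at the reference level of the predictor, `plogis(theta_k)` gives the cumulative probability of responding at or below rating k, and successive differences give the probability of each individual rating.

```r
# Illustrative thresholds (theta), NOT the fitted values from mdl.clm
theta <- c("1|2" = -2.0, "2|3" = -0.5, "3|4" = 0.8, "4|5" = 2.1)
cum_prob <- plogis(theta)                 # P(Response <= k) at the reference level
rating_prob <- diff(c(0, cum_prob, 1))    # probability of each rating 1-5
round(rating_prob, 3)                     # the five probabilities sum to 1
```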
## Plotting
### No confidence intervals
We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions. The version below is without confidence intervals.
```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=0.2, lty="dashed")
abline(h=0.4, lty="dashed")
abline(h=0.6, lty="dashed")
abline(h=0.8, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"),
legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)
```
### With confidence intervals
Here is an attempt to add 95% confidence intervals to these plots (each bound sits at the 2.5th/97.5th percentile, i.e. ±1.96 standard errors). This is an experimental attempt and any feedback is welcome!
```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
#standard errors of the four thresholds (rows 1 to 4, column 2 of the coefficient table)
seNas <- summary(mdl.clm)$coefficients[1:4, 2]
#+CI (upper bounds: Theta + 1.96 * SE)
lines(xsNas, plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas), col='orange')
lines(xsNas, 1-plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas), col='blue')
#-CI (lower bounds: Theta - 1.96 * SE)
lines(xsNas, plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas), col='orange')
lines(xsNas, 1-plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas), col='blue')
# fill area between the bounds using c(x, rev(x)), c(y2, rev(y1))
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas),
          rev(plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas),
          rev(plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas),
          rev(plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas),
          rev(plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(1-plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas),
          rev(1-plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas))), col = "gray90")
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=0.2, lty="dashed")
abline(h=0.4, lty="dashed")
abline(h=0.6, lty="dashed")
abline(h=0.8, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"),
legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)
```
# Linear Mixed-effects Models. Why random effects matter {.tabset .tabset-fade .tabset-pills}
Let's generate a new dataframe that we will use later on for our mixed models.
```{r warning=FALSE, message=FALSE, error=FALSE}
## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats the 20 items for each of the 6 subjects
items <- rep(items, 6)
#below simulates positive (exponentially distributed) values to act as log frequencies
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)
xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting ensures the 20 consecutive rows of each subject share the same subject mean
submeans <- sort(submeans)
xdf$submeans <- submeans
#we do the same for items: item means are allowed to vary between words
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create the fixed-effect column:
#for each unit of logFreq, duration decreases by 5 ms
xdf$effect <- -5 * xdf$logFreq
xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below subsets the data to keep only the first three columns; -c(4:8) drops columns 4 to 8
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf, submeans, itsmeans)
```
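The subject-by-item crossing above hinges on the difference between `rep(..., each =)` and `rep(..., times =)`; a tiny sketch with 2 subjects and 3 items (hypothetical labels, smaller than our actual design):

```{r warning=FALSE, message=FALSE, error=FALSE}
subj <- rep(paste0('S', 1:2), each = 3) #"S1" "S1" "S1" "S2" "S2" "S2"
item <- rep(paste0('Item', 1:3), 2)     #"Item1" "Item2" "Item3" "Item1" "Item2" "Item3"
data.frame(subj, item) #each item occurs exactly once for every subject
```

With `each =` the subject labels are repeated in blocks, while the plain second argument tiles the item labels; together they give the fully crossed design our simulated dataframe relies on.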
## Plots
Let's start by running a correlation test and plotting the data. The results show a negative correlation between duration and LogFrequency, and the plot shows this decrease.
```{r warning=FALSE, message=FALSE, error=FALSE}
corrMixed <- as.matrix(xreal[-c(1:2)]) %>%
rcorr(type="pearson")
print(corrMixed)
corrplot(corrMixed$r, method = "circle", type = "upper", tl.srt = 45,
addCoef.col = "black", diag = FALSE,
p.mat = corrMixed$p, sig.level = 0.05)
ggplot.xreal <- xreal %>%
ggplot(aes(x = logFreq, y = dur)) +
geom_point()+ theme_bw(base_size = 20) +
labs(y = "Duration", x = "Frequency (Log)") +
geom_smooth(method = lm, se=F)
ggplot.xreal
```
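`rcorr()` (from `Hmisc`) reports the same Pearson r as base R: the coefficient is just the covariance standardised by the two standard deviations. A self-contained check on simulated data (the variables here are made up for the example):

```{r warning=FALSE, message=FALSE, error=FALSE}
set.seed(42)
x <- rnorm(50)
y <- -0.5 * x + rnorm(50)
#Pearson's r: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))
#cor.test() adds the significance test to the same coefficient
cor.test(x, y, method = "pearson")$p.value
```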
## Linear model
Let's run a simple linear model on the data. As we can see below, there are some issues with this "simple" linear model: we had set the by-subject SD to 40, but the model cannot separate this between-subject variability from the residual error (see the histogram of residuals), and the QQ plot is not "normal".
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.xreal <- xreal %>%
lm(dur ~ logFreq, data = .)
summary(mdl.lm.xreal)
hist(residuals(mdl.lm.xreal))
qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))
plot(fitted(mdl.lm.xreal), residuals(mdl.lm.xreal), cex = 4)
```
## Linear Mixed Model
Our Linear Mixed effects Model takes into account the random effects we added, as per our model specifications. We use Maximum Likelihood estimation (REML = FALSE), as this is what we need for model comparison. The Linear Mixed Model reflects our specifications: the SD of our subjects is now picked up correctly. The fixed-effect results are "almost" the same as in our linear model above: the coefficient for the intercept is 337.973 and the coefficient for LogFrequency is -5.460, indicating that for each unit increase in LogFrequency, duration decreases by 5.460 ms.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal <- xreal %>%
lmer(dur ~ logFreq +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
hist(residuals(mdl.lmer.xreal))
qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))
plot(fitted(mdl.lmer.xreal), residuals(mdl.lmer.xreal), cex = 4)
```
## Our second Mixed model
This second model adds a by-subject random slope for logFreq. Random slopes allow the effect of interest to vary across levels of the random effect; an intercept-only model assumes all participants share the same slope and lets only their baselines vary.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.2 <- xreal %>%
lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal.2)
hist(residuals(mdl.lmer.xreal.2))
qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))
plot(fitted(mdl.lmer.xreal.2), residuals(mdl.lmer.xreal.2), cex = 4)
```
## Model comparison
But where are our p values? The lme4 developers deliberately do not report p values, because of the difficulties involved in estimating the degrees of freedom. What we can do instead is compare models: we create a null model to allow significance testing. As expected, our predictor contributes significantly to the model fit.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.Null <- xreal %>%
lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
```
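What `anova()` computes for two ML-fitted models is a likelihood-ratio test: twice the difference in log-likelihoods, referred to a chi-squared distribution with as many degrees of freedom as there are extra parameters. The same arithmetic done by hand on two simple `lm()` fits, just to expose the computation (the data are simulated for illustration; `anova.lm` itself would report an F test):

```{r warning=FALSE, message=FALSE, error=FALSE}
set.seed(123)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
m0 <- lm(y ~ 1) #null model
m1 <- lm(y ~ x) #model with the predictor
#likelihood-ratio statistic: twice the gain in log-likelihood
lrt <- as.numeric(2 * (logLik(m1) - logLik(m0)))
#p value from a chi-squared distribution with 1 df (one extra parameter)
pchisq(lrt, df = 1, lower.tail = FALSE)
```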
Also, do we really need random slopes? From the result below, we do not seem to need them here: adding random slopes does not improve the model fit. I always recommend testing this; most of the time I keep random slopes.
```{r warning=FALSE, message=FALSE, error=FALSE}
anova(mdl.lmer.xreal, mdl.lmer.xreal.2)
```
But if you are really (really!!!) obsessed with p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.
```{r warning=FALSE, message=FALSE, error=FALSE}
library(lmerTest)
mdl.lmer.xreal.lmerTest <- xreal %>%
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.lmerTest)
detach("package:lmerTest", unload = TRUE)