---
title: "Session 5: Advanced statistical analyses"
author:
  name: Jalal Al-Tamimi
  affiliation: Université Paris Cité
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_notebook:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 6
    toc_float:
      collapsed: yes
    fig_crop: no
---
# Loading packages
```{r warning=FALSE, message=FALSE, error=FALSE}
## Use the code below to check if you have all required packages installed. If some are not installed already, the code below will install these. If you have all packages installed, then you could load them with the second code.
requiredPackages = c('tidyverse', 'broom', 'knitr', 'Hmisc', 'corrplot', 'lme4', 'lmerTest', 'party', 'ranger','doFuture', 'tidymodels', 'pROC', 'varImp', 'lattice', 'vip', 'emmeans', 'ggsignif', 'PresenceAbsence', 'languageR', 'FactoMineR', 'factoextra', 'RColorBrewer', 'scatterplot3d', 'cowplot', 'psycho', 'ordinal')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
```
# Correlation tests {.tabset .tabset-fade .tabset-pills}
## Basic correlations
Let us start with a basic correlation test. We want to evaluate if two numeric variables are correlated with each other.
We use the function `cor` to obtain the Pearson correlation, and `cor.test` to run a correlation test on our data with significance testing.
```{r}
cor(english$RTlexdec, english$RTnaming, method = "pearson")
cor.test(english$RTlexdec, english$RTnaming)
```
What are these results telling us? There is a positive correlation between `RTlexdec` and `RTnaming`. The correlation coefficient (r) is 0.76 (r is bounded between -1 and 1). This correlation is statistically significant, with a t value of 78.699, 4566 degrees of freedom and a p-value < 2.2e-16.
What are the degrees of freedom? They equal the total number of observations minus the number of estimated parameters. Here we have 4568 observations in the dataset and two parameters, hence 4568 - 2 = 4566.
For the p-value, we usually apply a threshold of p = 0.05: the minimum level below which we consider a difference significant. It corresponds to a 5% (or lower) probability of observing such a result when there is no real effect. In our case, the p-value is lower than 2.2e-16. How should we interpret this number? It is 0.00000000000000022, i.e., a decimal point followed by 15 zeros before the 22. This probability is very (very!!) low, so we conclude that there is a statistically significant correlation between the two variables.
The formula to calculate the t value is below.
![](t-score.jpg)
- x̄ = sample mean
- μ0 = population mean
- s = sample standard deviation
- n = sample size
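For a correlation test specifically, `cor.test` computes the t statistic as t = r √(n − 2) / √(1 − r²). A minimal sketch on simulated data (the vectors `x` and `y` are illustrative, not from the `english` dataset):

```r
# t statistic of a correlation test computed by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2)
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100)                 # correlated with x by construction
r <- cor(x, y)
n <- length(x)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_manual
unname(cor.test(x, y)$statistic)    # should match the manual computation
```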
The p-value is influenced by various factors: the number of observations, the strength of the difference, the mean values, etc. You should always be careful when interpreting p-values, taking everything else into account.
## Using the package `corrplot`
Above, we did a correlation test on two predictors.
What if we want to obtain a nice plot of all numeric predictors and add significance levels?
### Correlation plots
```{r fig.height=6}
corr <-
  english %>%
  select(where(is.numeric)) %>%
  cor()
print(corr)
corrplot(corr, method = 'ellipse', type = 'upper')
```
### More advanced
Let's first compute the correlations between all numeric variables and plot these with the p values
```{r fig.height=15}
## correlation using "corrplot"
## based on the function `rcorr' from the `Hmisc` package
## Need to change dataframe into a matrix
corr <-
english %>%
select(where(is.numeric)) %>%
as.matrix() %>%
rcorr(type = "pearson")
print(corr)
# use corrplot to obtain a nice correlation plot!
corrplot(corr$r, p.mat = corr$P,
addCoef.col = "black", diag = FALSE, type = "upper", tl.srt = 55)
```
We can also obtain descriptive summaries by group; here, the mean and standard deviation of `RTlexdec` for each `AgeSubject`:
```{r}
english %>%
group_by(AgeSubject) %>%
summarise(mean = mean(RTlexdec),
sd = sd(RTlexdec))
```
# Linear Models {.tabset .tabset-fade .tabset-pills}
Up to now, we have looked at descriptive statistics, and evaluated summaries, correlations in the data (with p values).
We are now interested in looking at group differences.
## Introduction
A linear model fits a regression to the data. We have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is `outcome ~ predictor`, which can be read as "outcome as a function of the predictor". We can add "1" to specify an intercept, but this is added to the model by default.
### Model estimation
```{r}
english2 <- english %>%
mutate(AgeSubject = factor(AgeSubject, levels = c("young", "old")))
mdl.lm <- english2 %>%
lm(RTlexdec ~ AgeSubject, data = .)
#lm(RTlexdec ~ AgeSubject, data = english)
mdl.lm #also print(mdl.lm)
summary(mdl.lm)
```
### Tidying the output
```{r}
# from library(broom)
tidy(mdl.lm) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm) %>% pull(estimate)
```
Obtaining mean values from our model
```{r}
#old
mycoefE[1]
#young
mycoefE[1] + mycoefE[2]
```
### Nice table of our model summary
We can also obtain a nice table of our model summary. We can use the package `knitr` or `xtable`
#### Directly from model summary
```{r}
kable(summary(mdl.lm)$coef, digits = 3)
```
#### From the `tidy` output
```{r}
mdl.lmT <- tidy(mdl.lm)
kable(mdl.lmT, digits = 3)
```
### Dissecting the model
Let us dissect the model. If you use `str`, you will see everything that is stored in our linear model object; we can then access specific pieces of information from it.
#### "str" and "coef"
```{r}
str(mdl.lm)
```
```{r}
coef(mdl.lm)
## same as
## mdl.lm$coefficients
```
#### "coef" and "coefficients"
What if I want to obtain the "Intercept"? Or the coefficient for `AgeSubject`? What if I want the full row for `AgeSubject`?
```{r}
coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
```
```{r}
summary(mdl.lm)$coefficients[2, ] # full row
summary(mdl.lm)$coefficients[2, 4] #for p value
```
#### Residuals
What about the residuals (the differences between the observed values and the fitted values)? Inspecting them allows us to evaluate how close the residuals are to a normal distribution.
```{r warning=FALSE, message=FALSE, error=FALSE}
hist(residuals(mdl.lm))
qqnorm(residuals(mdl.lm)); qqline(residuals(mdl.lm))
plot(fitted(mdl.lm), residuals(mdl.lm), cex = 4)
```
#### Goodness of fit?
```{r warning=FALSE, message=FALSE, error=FALSE}
AIC(mdl.lm) # Akaike's Information Criterion, lower values are better
BIC(mdl.lm) # Bayesian Information Criterion
logLik(mdl.lm) # log likelihood
```
Or use the following from `broom`
```{r}
glance(mdl.lm)
```
#### Significance testing
Are the above informative? Not directly. If we want to test for the overall significance of the model, we fit a null model (i.e., intercept-only) and compare the two models.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.Null <- english %>%
lm(RTlexdec ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
```
The results show that adding the variable "AgeSubject" improves the model fit. We can report this as follows: model comparison showed that the addition of AgeSubject improved the model fit when compared with an intercept-only model ($F$(`r mdl.comp[2,3]`) = `r round(mdl.comp[2,5], 2)`, *p* < `r mdl.comp[2,6]`), which renders as (F(1) = 4552, p < 2.2e-16).
## Plotting fitted values
### Trend line
Let's plot our fitted values but only for the trend line
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = RTlexdec))+
geom_boxplot()+
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Boxplot and predicted trend line", subtitle = "with ggplot2")
```
This allows us to plot the fitted values from our model with the predicted linear trend. This is exactly the same as our original data.
### Predicted means and the trend line
We can also plot the predicted means and linear trend
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = predict(mdl.lm)))+
geom_boxplot(color = "blue") +
theme_bw() + theme(text = element_text(size = 15)) +
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Predicted means and trend line", subtitle = "with ggplot2")
```
### Raw data, predicted means and the trend line
We can also plot the actual data, the predicted means and linear trend
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = RTlexdec))+
geom_boxplot() +
geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
theme_bw() + theme(text = element_text(size = 15)) +
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2")
```
### Add significance levels and trend line on a plot?
We can use the p values generated from either our linear model to add significance levels on a plot. We use the code from above and add the significance level. We also add a trend line
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
ggplot(aes(x = AgeSubject, y = RTlexdec))+
geom_boxplot() +
geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
theme_bw() + theme(text = element_text(size = 15)) +
geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with significance testing") +
geom_signif(comparison = list(c("old", "young")),
map_signif_level = TRUE, test = function(a, b) {
list(p.value = summary(mdl.lm)$coefficients[2, 4])})
```
## What about pairwise comparison?
When we have three or more levels in our predictor, we can use pairwise comparisons, with corrections, to evaluate differences between each pair of levels.
```{r}
summary(mdl.lm)
```
```{r}
mdl.lm %>% emmeans(pairwise ~ AgeSubject, adjust = "fdr") -> mdl.emmeans
mdl.emmeans
```
How to interpret the output? Discuss with your neighbour and share with the group.
Hint... Look at the emmeans values for each level of our factor "AgeSubject" and at the contrasts.
## Multiple predictors?
Linear models require a numeric outcome, but predictors can be either numeric or factors, and we can include more than one predictor. The only caveat is that this complicates the interpretation of the results.
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>%
summary()
```
And with an Anova
```{r warning=FALSE, message=FALSE, error=FALSE}
english %>%
lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>%
anova()
```
The results above tell us that the predictors all contribute significantly to the model fit.
# Generalised Linear Models {.tabset .tabset-fade .tabset-pills}
Here we will look at an example where the outcome is binary. This simulated dataset is structured as follows: we asked one participant to listen to 165 sentences and to judge whether they are "grammatical" or "ungrammatical". There were 105 "grammatical" sentences and 60 "ungrammatical" ones. This fictitious example transfers to any other situation. Think of geography: 165 areas of land, 105 "flat" and 60 "non-flat", etc. It applies to any case where you need to categorise the outcome into two groups.
## Load and summaries
Let's load in the data and do some basic summaries
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- read_csv("grammatical.csv")
grammatical
str(grammatical)
head(grammatical)
```
## GLM - Categorical predictors
Let's run a first GLM (Generalised Linear Model). A GLM uses the "binomial" family, as it assumes the outcome follows a binomial distribution. In general, results from a logistic regression are close to what we get from Signal Detection Theory (see below).
To interpret the results, we change the reference levels for both response and grammaticality. The reference level is the "no" response to the "ungrammatical" category; the coefficients then express the change towards "yes" responses to the "grammatical" category.
### Model estimation and results
The results below show the logodds for our model.
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("no", "yes")),
grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))
grammatical %>%
group_by(grammaticality, response) %>%
table()
mdl.glm <- grammatical %>%
glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)
tidy(mdl.glm) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm) %>% pull(estimate)
```
The results show that moving from "ungrammatical" to "grammatical" increases the logodds of a "yes" response by `r mycoef2[2]` (the intercept shows that for "ungrammatical" items, the logodds of a "yes" response are `r mycoef2[1]`). The actual logodds of a "yes" response to "grammatical" items are therefore `r mycoef2[1]+mycoef2[2]`
### Logodds to Odd ratios
Logodds can be transformed to talk about the odds of an event. For our model above, the odds of an "ungrammatical" item receiving a "yes" response are a mere 0.2; the odds of a "grammatical" item receiving a "yes" are 20, i.e., a "yes" is 20 times more likely than a "no"
```{r warning=FALSE, message=FALSE, error=FALSE}
exp(mycoef2[1])
exp(mycoef2[1] + mycoef2[2])
```
### LogOdds to proportions
If you want to talk about the percentage "accuracy" of our model, we can transform our logodds into proportions. This shows that the predicted proportion of "yes" responses is about 95% for "grammatical" items, against about 17% for "ungrammatical" items
```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
### Plotting
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>%
mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>%
ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
```
## GLM - Numeric predictors {.tabset .tabset-fade .tabset-pills}
In this example, we will run a GLM model using a similar technique to that used in `Al-Tamimi (2017)` and `Baumann & Winter (2018)`. We use the package `languageR` and its dataset `english`.
In the model above, we used the equation lm(RTlexdec ~ AgeSubject): we were interested in the impact of subjects' age on reaction times in a lexical decision task. In this section, we are interested in how well reaction time differentiates the participants based on their age. We therefore use `AgeSubject` as our outcome and `RTlexdec` as our predictor, with the equation glm(AgeSubject ~ RTlexdec). We could use `RTlexdec` as is, but due to a possible quasi-separation, and because we may want to compare coefficients across multiple acoustic metrics, we will z-score our predictor. Below we run two models, without and with z-scoring.
For the glm model, we need to specify `family = "binomial"`.
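Z-scoring itself is just centring a variable and dividing by its standard deviation. A quick sketch showing that doing it by hand matches `scale()` (the vector `x` is illustrative):

```r
# z-scoring by hand: subtract the mean, divide by the standard deviation
x <- c(5, 7, 9, 11)
z_manual <- (x - mean(x)) / sd(x)
all.equal(z_manual, as.numeric(scale(x)))  # TRUE: scale() does the same thing
```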
### Without z-scoring of predictor
#### Model estimation
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.glm2 <- english2 %>%
glm(AgeSubject ~ RTlexdec, data = ., family = "binomial")
tidy(mdl.glm2) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm2) %>% pull(estimate)
```
#### LogOdds to proportions
If you want to talk about the percentage "accuracy" of our model, then we can transform our logodds into proportions.
```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
#### Plotting
```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>%
mutate(prob = predict(mdl.glm2, type = "response"))
english2 %>%
ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Young", "Old"))
```
The plot above shows how the two groups differ under the GLM: the predicted probability of belonging to the "old" group increases with reaction time.
Let's use z-scoring next
### With z-scoring of predictor
#### Model estimation
```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>%
  mutate(RTlexdec_z = as.numeric(scale(RTlexdec, center = TRUE, scale = TRUE)))
mdl.glm3 <- english2 %>%
glm(AgeSubject ~ RTlexdec_z, data = ., family = "binomial")
tidy(mdl.glm3) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm3) %>% pull(estimate)
```
#### LogOdds to proportions
If you want to talk about the percentage "accuracy" of our model, then we can transform our logodds into proportions.
```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
#### Plotting
##### Normal
```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>%
mutate(prob = predict(mdl.glm3, type = "response"))
english2 %>%
ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Young", "Old"))
```
We obtain the exact same plots, but the model estimations are different. Let's use another type of predictions
##### z-scores
```{r warning=FALSE, message=FALSE, error=FALSE}
z_vals <- seq(-3, 3, 0.01)
dfPredNew <- data.frame(RTlexdec_z = z_vals)
## store the predicted probabilities for each value of RTlexdec_z
pp <- cbind(dfPredNew, prob = predict(mdl.glm3, newdata = dfPredNew, type = "response"))
pp %>%
ggplot(aes(x = RTlexdec_z, y = prob)) +
geom_point() +
theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3))
```
Here the model's predictions are plotted across the full range of z-scored values of the predictor, rather than per group.
## Accuracy and Signal Detection Theory
### Rationale
We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome, and at the same time we want to evaluate our response biases. When deciding on categories, we are usually biased in our selection.
Let's ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in anyway by your decision?
To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases.
### Running stats
Let's do some stats on this
| | Yes | No | Total |
|----------------------------|--------------------|------------------|------------------|
| Grammatical (Yes Actual) | TP = 100 | FN = 5 | (Yes Actual) 105 |
| Ungrammatical (No Actual) | FP = 10 | TN = 50 | (No Actual) 60 |
| Total | (Yes Response) 110 | (No Response) 55 | 165 |
```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("yes", "no")),
grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))
## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative
TP <- nrow(grammatical %>%
filter(grammaticality == "grammatical" &
response == "yes"))
FN <- nrow(grammatical %>%
filter(grammaticality == "grammatical" &
response == "no"))
FP <- nrow(grammatical %>%
filter(grammaticality == "ungrammatical" &
response == "yes"))
TN <- nrow(grammatical %>%
filter(grammaticality == "ungrammatical" &
response == "no"))
TP
FN
FP
TN
Total <- nrow(grammatical)
Total
(TP+TN)/Total # accuracy
(FP+FN)/Total # error, also 1-accuracy
# When the stimulus is "yes", how often is the response "yes"?
TP/(TP+FN) # True Positive Rate (Sensitivity, or Recall)
# When the stimulus is "no", how often is the response "yes"?
FP/(FP+TN) # False Positive Rate
# When the stimulus is "no", how often is the response "no"?
TN/(FP+TN) # True Negative Rate (Specificity)
# When the subject responds "yes", how often is (s)he correct?
TP/(TP+FP) # Precision
# Signal Detection Theory indices obtained below:
#   dprime: the sensitivity index
#   beta: the bias criterion (lower values = more "yes" responses)
#   aprime: estimate of discriminability (0-1; 1 = perfect discrimination, 0.5 = chance)
#   bppd: b'' d (-1 to 1; 0 = no bias, negative = tendency to respond "yes",
#         positive = tendency to respond "no")
#   c: index of bias (in standard deviation units)
# (see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/)
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)
```
The most important measure above is d-prime. It models the difference between the "True Positive" and "False Positive" rates in standard units (z-scores). The formula can be written as:
`d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)`
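Using the counts from the table above (TP = 100, FN = 5, FP = 10, TN = 50), this formula can be computed directly with `qnorm`, the inverse of the standard normal cumulative distribution:

```r
# d' by hand from the confusion-matrix counts above
TP <- 100; FN <- 5; FP <- 10; TN <- 50
tpr <- TP / (TP + FN)                    # True Positive Rate
fpr <- FP / (FP + TN)                    # False Positive Rate
dprime_manual <- qnorm(tpr) - qnorm(fpr)
dprime_manual                            # should match psycho::dprime() above
```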
### GLM as a classification tool
The code below demonstrates the links between our GLM model and what we obtained above from SDT. The prediction table shows that our GLM produced predictions identical to our initial data setup: compare the table here with the table above. Once we have created the table of outcomes, we can compute percent correct, specificity, sensitivity, the Kappa score, etc.; these yield the point estimates together with the variability related to variations in responses.
```{r}
## predict(mdl.glm)>0.5 is identical to
## predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("yes", "no")),
grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))
mdl.glm.C <- grammatical %>%
glm(response ~ grammaticality, data = .,family = binomial)
tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
###etc..
```
If you look at the results from SDT above, these results are the same as
the following
Accuracy: (TP+TN)/Total (`r (TP+TN)/Total`)
True Positive Rate (or Sensitivity): TP/(TP+FN) (`r TP/(TP+FN)`)
True Negative Rate (or Specificity): TN/(FP+TN) (`r TN/(FP+TN)`)
### GLM and d prime
The values obtained here match those obtained from SDT. For d prime, the difference stems from the use of the logit link of the binomial family; using a probit link yields the same values ([see here](https://stats.idre.ucla.edu/r/dae/probit-regression/) for more details). A probit model evaluates changes in the outcome in standard (z-score) units, here modelling the change from "no" responses to "ungrammatical" items into "yes" responses to "grammatical" items in z-scores. This shares the conceptual underpinnings of d-prime from Signal Detection Theory.
```{r}
## d prime
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)$dprime
## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]
```
## GLM: Other distributions
If your outcome does not fit a binomial distribution and is instead multinomial (i.e., three or more response categories) or Poisson (count data), then you need a different family or a dedicated function, as sketched below.
```{r warning=FALSE, message=FALSE, error=FALSE, echo=FALSE}
## For a multinomial (3 or more response categories), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
## mdl.multi <- nnet::multinom(outcome~predictor, data=data)
## For a poisson (count data), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/poisson-regression/
## mdl.poisson <- glm(outcome~predictor, data = data, family = "poisson")
```
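As a minimal, self-contained Poisson sketch (on simulated counts; the variable names are illustrative, not from our datasets):

```r
# Poisson GLM on simulated count data
set.seed(1)
d <- data.frame(group = rep(c("a", "b"), each = 500))
d$count <- rpois(1000, lambda = ifelse(d$group == "a", 2, 4))
mdl.poisson <- glm(count ~ group, data = d, family = "poisson")
coef(mdl.poisson)             # coefficients are on the log scale
exp(coef(mdl.poisson)[2])     # rate ratio for group b vs a, close to 4/2 = 2
```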
# Cumulative Logit Link Models
These models work perfectly with rating data. Ratings are inherently ordered (1, 2, ..., n), and we expect an overall increase (or decrease) in ratings from 1 to n. To demonstrate this, we use an example based on the package "ordinal". The data come from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic, produced by nine subjects; ratings ranged from 1 to 5. This example applies to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments are in an article or in interview responses, etc.
## Importing and pre-processing
We start by importing the data and process it. We change the reference level in the predictor
```{r warning=FALSE, message=FALSE, error=FALSE}
rating <- read_csv("rating.csv")
rating
rating <- rating %>%
mutate(Response = factor(Response),
Context = factor(Context)) %>%
mutate(Context = relevel(Context, "isolation"))
rating
```
## Our first model
We run our first clm model as a simple model, i.e., with no random effects.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm <- rating %>%
clm(Response ~ Context, data = .)
summary(mdl.clm)
```
## Testing significance
We can evaluate whether "Context" improves the model fit, by comparing a null model with our model. Of course "Context" is improving the model fit.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm.Null <- rating %>%
clm(Response ~ 1, data = .)
anova(mdl.clm, mdl.clm.Null)
```
## Interpreting a cumulative model
As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a logistic model with a cumulative link. The "Coefficients" are the estimates for each level of the fixed effect: a negative coefficient indicates a negative association with the response, a positive coefficient a positive association, and the p-values indicate the significance of each level. The "Threshold coefficients" are the cutpoints of the response: the cumulative thresholds 1|2, 2|3, 3|4 and 4|5 track the overall increase in the ratings from 1 to 5.
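The threshold coefficients can be turned into probabilities with `plogis`. A minimal sketch with illustrative threshold values (not the ones fitted in `mdl.clm`): at the reference level of the predictor, `plogis(theta_k)` gives the cumulative probability of responding at or below rating k, and successive differences give the probability of each individual rating.

```r
# Illustrative thresholds (theta), NOT the fitted values from mdl.clm
theta <- c("1|2" = -2.0, "2|3" = -0.5, "3|4" = 0.8, "4|5" = 2.1)
cum_prob <- plogis(theta)                 # P(Response <= k) at the reference level
rating_prob <- diff(c(0, cum_prob, 1))    # probability of each rating 1-5
round(rating_prob, 3)                     # the five probabilities sum to 1
```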
## Plotting
### No confidence intervals
We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions. The version below is without confidence intervals.
```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=0.2, lty="dashed")
abline(h=0.4, lty="dashed")
abline(h=0.6, lty="dashed")
abline(h=0.8, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"),
legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)
```
### With confidence intervals
Here is an attempt to add 95% confidence intervals to these plots (each bound sits at the 2.5th/97.5th percentile, i.e. ±1.96 standard errors). This is an experimental attempt and any feedback is welcome!
```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
#standard errors of the four thresholds (rows 1 to 4, column 2 of the coefficient table)
seNas <- summary(mdl.clm)$coefficients[1:4, 2]
#+CI (upper bounds: Theta + 1.96 * SE)
lines(xsNas, plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas), col='orange')
lines(xsNas, 1-plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas), col='blue')
#-CI (lower bounds: Theta - 1.96 * SE)
lines(xsNas, plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas), col='orange')
lines(xsNas, 1-plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas), col='blue')
# fill area between the bounds using c(x, rev(x)), c(y2, rev(y1))
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas),
          rev(plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]+1.96*seNas[1] - xsNas),
          rev(plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas)-plogis(mdl.clm$Theta[1]-1.96*seNas[1] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]+1.96*seNas[2] - xsNas),
          rev(plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas)-plogis(mdl.clm$Theta[2]-1.96*seNas[2] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]+1.96*seNas[3] - xsNas),
          rev(plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas)-plogis(mdl.clm$Theta[3]-1.96*seNas[3] - xsNas))), col = "gray90")
polygon(c(xsNas, rev(xsNas)),
        c(1-plogis(mdl.clm$Theta[4]-1.96*seNas[4] - xsNas),
          rev(1-plogis(mdl.clm$Theta[4]+1.96*seNas[4] - xsNas))), col = "gray90")
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=0.2, lty="dashed")
abline(h=0.4, lty="dashed")
abline(h=0.6, lty="dashed")
abline(h=0.8, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"),
legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)
```
# Linear Mixed-effects Models. Why random effects matter {.tabset .tabset-fade .tabset-pills}
Let's generate a new dataframe that we will use later on for our mixed models.
```{r warning=FALSE, message=FALSE, error=FALSE}
## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats the 20 items for each of the 6 subjects
items <- rep(items, 6)
#below simulates positive (exponentially distributed) values to act as log frequencies
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)
xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting ensures the 20 consecutive rows of each subject share the same subject mean
submeans <- sort(submeans)
xdf$submeans <- submeans
#we do the same for items: item means are allowed to vary between words
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create the fixed-effect column:
#for each unit of logFreq, duration decreases by 5 ms
xdf$effect <- -5 * xdf$logFreq
xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below subsets the data to keep only the first three columns; -c(4:8) drops columns 4 to 8
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf, submeans, itsmeans)
```
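The subject-by-item crossing above hinges on the difference between `rep(..., each =)` and `rep(..., times =)`; a tiny sketch with 2 subjects and 3 items (hypothetical labels, smaller than our actual design):

```{r warning=FALSE, message=FALSE, error=FALSE}
subj <- rep(paste0('S', 1:2), each = 3) #"S1" "S1" "S1" "S2" "S2" "S2"
item <- rep(paste0('Item', 1:3), 2)     #"Item1" "Item2" "Item3" "Item1" "Item2" "Item3"
data.frame(subj, item) #each item occurs exactly once for every subject
```

With `each =` the subject labels are repeated in blocks, while the plain second argument tiles the item labels; together they give the fully crossed design our simulated dataframe relies on.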
## Plots
Let's start by running a correlation test and plotting the data. The results show a negative correlation between duration and LogFrequency, and the plot shows this decrease.
```{r warning=FALSE, message=FALSE, error=FALSE}
corrMixed <- as.matrix(xreal[-c(1:2)]) %>%
rcorr(type="pearson")
print(corrMixed)
corrplot(corrMixed$r, method = "circle", type = "upper", tl.srt = 45,
addCoef.col = "black", diag = FALSE,
p.mat = corrMixed$p, sig.level = 0.05)
ggplot.xreal <- xreal %>%
ggplot(aes(x = logFreq, y = dur)) +
geom_point()+ theme_bw(base_size = 20) +
labs(y = "Duration", x = "Frequency (Log)") +
geom_smooth(method = lm, se=F)
ggplot.xreal
```
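`rcorr()` (from `Hmisc`) reports the same Pearson r as base R: the coefficient is just the covariance standardised by the two standard deviations. A self-contained check on simulated data (the variables here are made up for the example):

```{r warning=FALSE, message=FALSE, error=FALSE}
set.seed(42)
x <- rnorm(50)
y <- -0.5 * x + rnorm(50)
#Pearson's r: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))
#cor.test() adds the significance test to the same coefficient
cor.test(x, y, method = "pearson")$p.value
```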
## Linear model
Let's run a simple linear model on the data. As we can see below, there are some issues with this "simple" linear model: we had set the by-subject SD to 40, but the model cannot separate this between-subject variability from the residual error (see the histogram of residuals), and the QQ plot is not "normal".
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.xreal <- xreal %>%
lm(dur ~ logFreq, data = .)
summary(mdl.lm.xreal)
hist(residuals(mdl.lm.xreal))
qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))
plot(fitted(mdl.lm.xreal), residuals(mdl.lm.xreal), cex = 4)
```
## Linear Mixed Model
Our Linear Mixed effects Model takes into account the random effects we added, as per our model specifications. We use Maximum Likelihood estimation (REML = FALSE), as this is what we need for model comparison. The Linear Mixed Model reflects our specifications: the SD of our subjects is now picked up correctly. The fixed-effect results are "almost" the same as in our linear model above: the coefficient for the intercept is 337.973 and the coefficient for LogFrequency is -5.460, indicating that for each unit increase in LogFrequency, duration decreases by 5.460 ms.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal <- xreal %>%
lmer(dur ~ logFreq +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
hist(residuals(mdl.lmer.xreal))
qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))
plot(fitted(mdl.lmer.xreal), residuals(mdl.lmer.xreal), cex = 4)
```
## Our second Mixed model
This second model adds a by-subject random slope for logFreq. Random slopes allow the effect of interest to vary across levels of the random effect; an intercept-only model assumes all participants share the same slope and lets only their baselines vary.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.2 <- xreal %>%
lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal.2)
hist(residuals(mdl.lmer.xreal.2))
qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))
plot(fitted(mdl.lmer.xreal.2), residuals(mdl.lmer.xreal.2), cex = 4)
```
## Model comparison
But where are our p values? The lme4 developers deliberately do not report p values, because of the difficulties involved in estimating the degrees of freedom. What we can do instead is compare models: we create a null model to allow significance testing. As expected, our predictor contributes significantly to the model fit.
```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.Null <- xreal %>%
lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
```
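What `anova()` computes for two ML-fitted models is a likelihood-ratio test: twice the difference in log-likelihoods, referred to a chi-squared distribution with as many degrees of freedom as there are extra parameters. The same arithmetic done by hand on two simple `lm()` fits, just to expose the computation (the data are simulated for illustration; `anova.lm` itself would report an F test):

```{r warning=FALSE, message=FALSE, error=FALSE}
set.seed(123)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
m0 <- lm(y ~ 1) #null model
m1 <- lm(y ~ x) #model with the predictor
#likelihood-ratio statistic: twice the gain in log-likelihood
lrt <- as.numeric(2 * (logLik(m1) - logLik(m0)))
#p value from a chi-squared distribution with 1 df (one extra parameter)
pchisq(lrt, df = 1, lower.tail = FALSE)
```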
Also, do we really need random slopes? From the result below, we do not seem to need them here: adding random slopes does not improve the model fit. I always recommend testing this; most of the time I keep random slopes.
```{r warning=FALSE, message=FALSE, error=FALSE}
anova(mdl.lmer.xreal, mdl.lmer.xreal.2)
```
But if you are really (really!!!) obsessed with p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.
```{r warning=FALSE, message=FALSE, error=FALSE}
library(lmerTest)
mdl.lmer.xreal.lmerTest <- xreal %>%
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.lmerTest)
detach("package:lmerTest", unload = TRUE)