StatisticalLearning_AboveTheNorm.Rmd

---
title: "Statistical Learning Project"
author: "Above the norm - Cardarello, Ciarrocchi, Cutrera, Fusar Bassini, Maroni, Rossi"
date: "5/21/2021"
output:
  html_document:
    df_print: paged
  pdf_document:
    latex_engine: xelatex
abstract: Environmental protection nowadays is a more and more important issue on which countries are investing many resources. What we are interested in is to discover the most relevant factors which affect pollution in US and to find a way to classify each State in the United States as polluted or not on a bunch of explanatory variables we thought could be relevant. Therefore, we let the data tell us how we can explain pollution in the United States. In the second part we try also to make an exploratory analysis of our dataset finding the principal components which are more relevant, and at least some hidden ways in which countries can be grouped through a hierarchical clustering.
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
```

```{r include=FALSE}
data <- read.csv('usa_final.csv', sep=',')
data <- data[,-c(20,21)]
log_aqi <- log(data$aqi)
data$ln_aqi <- log_aqi
```

# Data and Problem Understanding

The starting point of our research question was the effect of the lockdown we experienced in 2020 on the environment. From media and social networks, we have always heard about the positive effects that the lockdowns spread all over the world had on our air quality. Then, we start considering the variable that explains the strictness of lockdown (lockdown yes = 1, lockdown no = 0, soft-lockdown = 0.5) in relation with the air quality index. As you can notice from the graph below, the implementation of lockdown did not have any real effect on the air quality. Indeed, the distributions in the three boxplots are not statistically different: they have quite similar means and the distributions seem to be overlapping. 

```{r include=FALSE}
library(car)
library(readr)
library(dplyr)
library(tidyr)
library(readxl)
library(ggpubr)
library(ggplot2)
library(Hmisc)
library(olsrr)
library(tidyverse)
library(caret)
library(Metrics)
library(corrplot)
library(glmnet)
library(leaps)
library(class)
library(selectiveInference)
library(RCurl)
library(tree)
library(ISLR)
library(plot3D)
library(sjPlot)
```


```{r, fig.align="center", out.width="70%"}
# Visualize the expression profile
dataset <- read.csv("usa_final.csv", sep=',', header = TRUE, dec = ".")
dataset <- dataset %>% mutate(ln_aqi=log(dataset$aqi))
ggboxplot(dataset, x = "lockdown", y = "aqi", color = "lockdown", 
          add = "jitter", legend = "none") +
  geom_hline(yintercept = mean(dataset$aqi), linetype = 2)+ # Add horizontal line at base mean
  ylim(0, 200)+
  stat_compare_means(method = "anova", label.y = 200)+        # Add global ANOVA p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.", hide.ns = TRUE)      # Pairwise comparison against all


# We can conclude that AQI is not significantly different between States by lockdown measures.
```

Since we got these unexpected results, we decided to go further with our analysis. 
Therefore, from now on we would like to find the real factors that explain the difference of the air quality among US countries in 2020. 
We started constructing our dataset scraping many sources on the web, such as Bureau of Economic Analysis, the Environmental protection agency of US, United States Census Bureau, among the most important. 


The first approach we had with our data consisted in the simple full model of linear regression (OLS). Then, we computed the variance inflation factor (vif) for each variable and we found some 'dangerous' values: according to Hair et al. (1995) the maximum acceptable level of vif is 10, whereas according to Ringle et al. (2015) the maximum acceptable level of vif is 5. 
Starting with the approach of Hair et al. (threshold = 10), we excluded from our dataset the following regressors: waste and healthcare; thereafter, we performed also the vif according to Ringle et al. approach (threshold = 5) and we drop construction, utilities, professional, retail and finance from our analysis. 
Then, we computed the OLS with the remaining variables and we looked for normality on the residuals: they are normally distributed.
We went on performing a model selection in order to find the most relevant regressors among the others.
Given that our number of predictors (17) is not higher than the number of observations (51) we could not perform the backward stepwise model selection; then we decided to implement the forward search in order to choose the "best" predictors for our dependent variable (air quality). 

For the classification task, we decided to adopt the Nearest Neighbor algorithm (K-NN) which seemed to perform pretty well (accuracy = 81.25%). Then, we run the classification trees but it did not output results as good as the K-NN. Therefore, we tried to improve its performance implementing the random forest algorithm and we got an accuracy of 75% (but it is still not good as the one of K-NN).

To sum up our main results we can say that there exist at least one model which describes relevant factors for distinguishing countries with good air quality from those which are more polluted. The relevant factors we found are: the share of rural population out of the total, the manufacturing contribution to GDP, the annual rainfalls and the density of the factories for each State (n° of factories / squared kilometers). 


## Data Description

We started our data scraping from many sources, and we found a dependent variable to describe air quality, and many other independent variables for each of the 51 States of the United States.

We list below the full dataset we built: 

- aqi: Air Quality Index
the Air Quality Index is a yardstick that runs from 0 to 500. The higher the Air Quality Index value, the greater the level of air pollution^[Source: EPA - United States Environmental Protection Agency <https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual>]. 
We computed aqi as the average of the median value for each county of each state in 2020 and we found a minimum of 59.25 for Hawaii and a maximum of 197.81 for California. 
The Air Quality Index presented six possible discrete categories: 

  + Green (good) = from 0 to 50 

  + Yellow (moderate) = from 51 to 100 

  + Orange (unhealthy for sensitive groups) = from 101 to 150 

  + Red (unhealthy) = from 151 to 200 

  + Purple (very unhealthy) = from 201 to 300 

  + Maroon (hazardous) = 301 and higher. 

From this Air Quality Index we discretize the variable called 'polluted' that categorize our entities in three levels of pollution: Yellow, Orange and Red (since we only have States that range from 59.25 to 197.81).
 
- accommodation, construction, education, finance, healthcare, information, manufacturing, mining, professional, retail, transportation, utilities, waste: they are contributions to Gross Domestic Product by State^[Source: BEA - Bureau of Economic Analysis
<https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1#reqid=70&step=1&isuri=1>] for the industry specified in 2020.
They are measured in millions of current dollars. 


- mining: it represents the value of nonfuel mineral production per square kilometer in dollars (2017)^[Source: USGS Minerals Yearbook 2018
<https://prd-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/atoms/files/myb1-2017-stati-adv.xlsx>]. 
(The District of Columbia has no mineral production). 


- precipitations: it is measured in inches and it specifies how much in a country has rained in a year (2020)^[Source: Statista
<https://www.statista.com/statistics/1101518/annual-precipitation-by-us-state/>]. 


- lockdown: it indicates if the state had a stay-at-home order (1), an advisory or a regional measure (0.5) or nothing (0). In the State of Wisconsin, the lockdown was declared unconstitutional after a month so we decided to input it as lockdown=0.5.^[Source: Wikipedia
<https://en.wikipedia.org/wiki/U.S._state_and_local_government_responses_to_the_COVID-19_pandemic#Initial_pandemic_responses,_including_full_lockdowns>]. 


- pop_rural: According to data from the decennial census of the population for each State, it distinguishes rural from urban population^[Source: United States Census Bureau
<https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural/2010-urban-rural.html>].
We considered the percentage of the rural population out of the total as a measure of how rural intense is a State. 


- n_factories: Number of manufacturing firms in each state in 2017^[Source: National Association of Manufacturers
<https://www.nam.org/state-manufacturing-data/>] divided by the surface of each country. Therefore, it represents the density of factories in each State.

The structure of our dataset, formed by 51 observations and 19 variables without missing values, is as follows:

```{r include=FALSE}
x <- getURL("https://raw.githubusercontent.com/mathicard/Statistical-Learning-DSE/main/usa_final.csv")
dataset <- read.csv(text = x)
```

```{r, out.width="50%", fig.align="center"}
str(dataset[,1:19])
```


## Data Visualization and Exploration

We start the exploration analysis with our dependent variable *aqi* that describes the Air Quality of each of the 51 States of U.S. in 2020. The mean is 95.57 with a standard deviation of 25.76. The complete descriptive statistics are in the following table.

<br/>

```{r message=FALSE, out.width="50%", fig.align="center"}
dataset %>% summarise(
  count = n(), 
  mean = mean(aqi, na.rm = TRUE),
  median = median(aqi),
  min = min(aqi),
  max = max(aqi),
  sd = sd(aqi, na.rm = TRUE)
)
```

<br/>

From the boxplot we can easily visualize the distribution of the Air quality index, and especially the names of the States that were the most polluted in 2020. Therefore, we can consider California (197.81), District of Columbia (166) and Wyoming (148.22) as outliers in the dataset. 

<br/>

```{r, message=FALSE, out.width="60%", fig.align="center"}
Boxplot(~aqi, data=data, col="#69b3a2", id=list(labels=data$state))
```

<br/>

Then, we compare the distribution of the Air Quality Index within the three levels of the factor *polluted*. The share of entities that have a "yellow" level of pollution is more than 70%, while 26% are "orange" and the left are "red" (4%). As we can see in the boxplots, except for the only two States that have a "red" AQI, "orange" has the most disperse distribution with a standard deviation of 14.8. In fact, the two extreme values 101 and 148 are almost in the minimum and maximum boundaries of that level of pollution. Finally, the presence of two States with a high AQI could introduce some noise in our analysis that should be take it into account. 

<br/>

```{r message=FALSE, include=FALSE}

# AQI categorization
dataset$polluted <- cut(dataset$aqi, breaks = c(50,100,150,200),
               labels = c('yellow', 'orange', 'red'))

```


```{r message=FALSE, out.width="50%", fig.align="center"}

## Descriptive statistics by group ##

group_by(dataset, polluted) %>% 
  summarise(
    count = n(), 
    share = n()/51*100,
    mean = mean(aqi, na.rm = TRUE),
    median = median(aqi),
    min = min(aqi),
    max = max(aqi),
    sd = sd(aqi, na.rm = TRUE)
  )

#Box plot

ggboxplot(dataset, x = "polluted", y = "aqi", color = "polluted",
          palette = c("#E7B800", "#FC4E07", "red"), 
          add = "jitter", legend = "none") +
  geom_hline(yintercept = mean(dataset$aqi), linetype = 2)+ # Add horizontal line at base mean
  ylim(0, 200)
```

<br/>

Looking at the distribution plot it seems that AQI has a right-skewed Normal distribution. To be sure, we use the well-known Shapiro-Wilk test in order to test the null hypothesis of normality. Unfortunately it is rejected at 95% of confidence, so our variable needs to be transformed.

<br/>

```{r message=FALSE, out.width="50%", fig.align="center"}
dataset %>%
    ggplot(aes(x=aqi)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
    labs(title = 'Density Plot of Air Quality Index', subtitle='Levels')

shapiro.test(dataset$aqi)
```

<br/>

A possible way to improve our results could be by applying the logarithmic transformation. With the logarithm it could be seen that the skewness to the left is sharpened, and now as a matter of fact, Shapiro-Wilk test is not rejecting the normality hypothesis at 95% of confidence (p-value>0.05).

```{r, include=FALSE}
dataset <- dataset %>% mutate(ln_aqi=log(dataset$aqi))
```

```{r, out.width="80%", fig.align="center"}
shapiro.test(dataset$ln_aqi)
```

In the following density plots we can see the improvement after applying the logarithm function to our dependent variable, which now seems more similar to a Normal distribution. 

<br/>

```{r, fig.align="center", out.width='60%', out.height='45%'}
d1 <- dataset %>%
  ggplot(aes(x=ln_aqi)) +
  geom_density(fill="orangered3", color=FALSE, alpha=0.5)+
  labs(title = 'Density Plot of Air Quality Index', subtitle='Logarithmic scale')
d2 <- dataset %>%
    ggplot(aes(x=aqi)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
    labs(title = 'Density Plot of Air Quality Index', subtitle='Levels')
ggarrange(d2, d1, 
          ncol = 2, nrow = 1)
```

<br/>

As almost all the points of our sample fall approximately along the reference line in the QQ-plot, we can assume normality of the AQI variable. This assumption allows us to work with statistical hypothesis tests that assume that the data follow a Normal distribution, and also to met the Central Limit Theorem assumptions and the normality of the residuals to build a regression model. 

<br/>


```{r, fig.align="center", out.width="50%"}
ggqqplot(dataset$ln_aqi, title = "QQ Plot of log AQI")
```

<br/>

In the following correlation matrix we analyze the relationship between the variables of our dataset, computing the correlation coefficients with their p-values as to check for the significance. 

<br/>

```{r, message=FALSE, out.width="60%", fig.align="center"}
data_corr <- dataset[,-c(1,20:23)]
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
M <- cor(data_corr)

cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat<- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

# matrix of the p-value of the correlation
p.mat <- cor.mtest(data_corr)
colnames(M) <- c("aqi", "accom", "const", "edu", "fin", "health", "info", 
                 "manuf", "mining", "prof", "retail", "trans", "util", "waste",
                 "precip", "lock", "p_rural", "n_fact")

corrplot(M, method="color", col=col(200),  
         type="upper", order="hclust", 
         addCoef.col = "black", 
         tl.col="black", tl.srt=45, 
         p.mat = p.mat, sig.level = 0.05, 
         number.cex = .6,
         diag=FALSE)
```

<br/>

Almost all variables have a positive correlation with AQI, except for pop_rural which has an expected negative sign due to the fact that the air quality improves when the share of rural population of the State is greater. While education, mining,  precipitations and lockdown have a non significative correlation with our variable of interest. Moreover, as it was expected, the variables related to the GDP of each State show a strong positive correlation between them indicating the presence of a possible multicollinearity problem. Finally, pop_rural has a negative correlation with a large number of variables of the dataset, including AQI as we saw.

<br/>

```{r, message=FALSE, out.width="60%", fig.align="center"}

# Perform the ANOVA test
compare_means(aqi ~ lockdown,  data = dataset,
              ref.group = ".all.", method = "t.test")

```

<br/>

As to check if the air quality differs among the intensity of lockdown measures applied by each State in 2020, we perform an ANOVA test. According to the results, where the t-statistic is lower than 2 with a p-value large than 0.05 (0.37), the *aqi* doesn't differ for *lockdown* at 95% of confidence. Therefore, we can conclude that AQI is not significantly different between States by lockdown measures.

<br/>


```{r, include=FALSE}
log_aqi <- log(data$aqi)
data$ln_aqi <- log_aqi
dt <- data %>% drop_na()
dt <- dt[,-c(1,2)]
```

\tiny
```{r, include=FALSE}
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt)
summary(full.model)
```
\normalsize

# Data Analysis for Supervised Learning

## Model Selection - Forward stepwise

Our dataset, which is composed by 51 observations (the 51 States of the United States of America) and 18 explanatory variables, does not need any other manipulation since we do not have any missing values. The only things we have done are: the transformation of the dependent variable (Air Quality) with the logarithmic scale, in order to get a "more" normal distribution as explained in the previous section, and the standardization of all our variables, which has given us the re-scaling with the same unit of measurement. 
Once our dataset was ready, we had an attempt with the simplest way we know to construct a model: the Ordinary Least Square. In the full model the most significative variables seem to be the number of factories (with positive sign, as expected) and precipitations (with negative sign, as expected). 

```{r}
full.model <- lm(ln_aqi~., data = dt)
summary(full.model)
```

But before reaching conclusions, we have to control many things in our model of regression; one of the most important is the variance inflation factor (vif) which gives us insights into the variables which can be "dangerous" to what concern multicollinearity. 

```{r}
vif(full.model)
sqrt(vif(full.model))>10
```

In fact some explanatory variables have a very high variance inflation factor, and we decided to drop these ones taking as benchmark the thresholds cited in Hair et al. (in which the maximum acceptable level of vif is 10) before, and then also the more strict one according to Ringle et al. (in which it was 5). 
After that, our linear regression model still has the number of factories and precipitations as the most statistically significatives explanatory variables, excluding the predictors with high vif.

```{r}
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>5
```

Fitting our model of regression on a random training set, we can visualize that residuals are normally distributed.

```{r, include=FALSE}
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) 
}
dt <- as.data.frame(lapply(dt, normalize))

set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-18]
dt_test = dt[-train,-18]
dt_train_labels <- dt[train, 18]
dt_test_labels <- dt[-train, 18]
```


```{r, message=FALSE, out.width="50%", fig.align="center"}
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt[train,])
ols_plot_resid_fit(full.model)
ols_plot_resid_hist(full.model)
```

And making the predictions on our unaccessed test set, we can see the plot actual versus predicted values on the test set, to see whether they are far from being in the straight line of zero errors:

```{r, message=FALSE, out.width="50%", fig.align="center"}
pred_ols <- predict(full.model, dt[-train,])
ggplot(dt_test, aes(x=pred_ols, y=dt_test_labels)) + 
  geom_point() +
  geom_abline(intercept=0, slope=1) +
  labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')
```

## Model Selection - LASSO
```{r, message=FALSE, out.width="50%", fig.align="center"}
x=model.matrix(ln_aqi~., dt)[,-1]
x = scale(x, TRUE, TRUE)
y=dt$ln_aqi

fit.lasso=glmnet(x,y, standardize = FALSE)
plot(fit.lasso,xvar="lambda",label=TRUE)

cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)

coef(cv.lasso)

```

We use our earlier train/validation division to select the lambda for the lasso.

```{r, message=FALSE, out.width="50%", fig.align="center"}
lasso.tr=glmnet(x[train,],y[train], standardize = FALSE)
#lasso.tr

pred=predict(lasso.tr,x[-train,])
#dim(pred)

rmse= sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")

lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best

```

The best lambda is smaller than rsme in the linear regression model.

```{r, message=FALSE, out.width="50%", fig.align="center"}
coef(lasso.tr,s=lam.best)
```

Variables with the highest relevance are precipitations, pop_rural and n_factories.
It also adds information in the sub-selection. Manufacture and lockdown have less relevance.

Next, we compute post-lasso inference to check the highest dependent variables on the sub-selection.

```{r, message=FALSE, out.width="50%", fig.align="center"}
sigma = estimateSigma(x,y)$sigmahat

beta = coef(lasso.tr, x=x, y=y, s=lam.best/51, exact = TRUE)[-1]

out_lasso_inf = fixedLassoInf(x,y,beta,lam.best,sigma=sigma)
out_lasso_inf
```

With a lambda of 0,026 the most relevant variables that satisfy the p-value are **precipitations** and **n_factories**. While pop_rural, manufacture and lockdown show a high p-value and are taken out.
With the Lasso, the results are similar to the other models used, but the post-selection inference
reduces the variables just to the two before mentioned.


```{r, include=FALSE, message=FALSE, warning=FALSE}
dt <- dt[,-c(13, 5, 2, 12, 9, 10, 4)]
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) 
}
#dt <- as.data.frame(lapply(dt, normalize))
set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-11]
dt_test = dt[-train,-11]
dt_train_labels <- dt[train, 11]
dt_test_labels <- dt[-train, 11]
```

With the method of the forward search which starts from the null model, and step by step adds the most relevant variable (in the bunch of the remaining ones) included in the model with the lowest Residual Sum of Squares, or highest R^2. Plotting the cross-validated prediction error (with Bayes Information Criterion - BIC) we can find the parsimonious model which is performing best, because the BIC takes also into account the number of predictors which are included in our models (as a penalization term for the models with more variables). 

```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
regfit.fwd=regsubsets(ln_aqi~.,data=dt,method="forward", nvmax=10)
summary(regfit.fwd)
```

```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
reg.summary<-summary(regfit.fwd)
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",types="l")
```

Now in our linear model made up by the 4 variables selected, the only one which becomes no more significative is the rural population indicator, but the signs are all as expected.

```{r, message=FALSE, out.width="50%", fig.align="center"}
model <- lm(ln_aqi~ pop_rural + manufacturing + precipitations + n_factories, 
            data = dt[train,])
summary(model)
```

```{r, message=FALSE, out.width="50%", fig.align="center"}
pred_fwd <- predict(model, dt[-train,])
root_mse = rmse(dt_test_labels, pred_fwd)
root_mse
```

At least our new model created with the forward search has a lower Mean Squared Error on the prediction made on the test set. Before it was 0.1823493, now it is 0.1317688.

## Tree classifiers 

In this section, we try to classify the 51 States according to the pollution of their air quality, that is, to predict the air quality category to which each State belongs in 2020.

We started including all the variables in our analysis since we want the tree to select the most important ones for classification. 
In the plot below there is a graphical representation of the procedure of how the tree splits the data. 
This tree has a misclassification error rate of approximately 18%.

```{r, message=FALSE, out.width="50%", fig.align="center"}
data$cl <- cut(data$aqi, breaks = c(50,100,150,200),
               labels = c('yellow', 'orange', 'red'))
dt <- dt[,-11]
dt$polluted <- data$cl
dt$polluted <- as.factor(dt$polluted)

#train test split

set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-11]
dt_test = dt[-train,-11]
dt_train_labels <- dt[train, 11]
dt_test_labels <- dt[-train, 11]
summary(dt_train_labels)
summary(dt_test_labels)

###############################################################################
tree = tree(polluted~., dt)
summary(tree)
plot(tree)
text(tree, pretty = 0)
```

We tried to do the train and the test and we got a low accuracy of 62.5%. 

```{r, message=FALSE, out.width="50%", fig.align="center"}
tree_train <- tree(polluted~., dt[train,])
tree_pred <- predict(tree_train, dt[-train,], type = 'class')
table(tree_pred, dt_test_labels)

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(tree_pred, dt_test_labels))
```

We wanted to improve the performance of our classification task and we decided to reduce the dimensionality of our dataset selecting only the most relevant variables that were chosen in the model selection section.

Reducing the number of variables lead us to get a slightly lower misclassification error rate of 16%. Below you can find the plot of the tree. 

```{r, message=FALSE, out.width="50%", fig.align="center"}
#TREE with relevant variables
tree = tree(polluted~pop_rural + manufacturing + precipitations 
            + n_factories, dt)
summary(tree)
plot(tree)
text(tree, pretty = 0)
```

We train our algorithm and we then test since we thought we have found better results. 
However, as it can notice we could not increase our accuracy that remains equal to 62.5%. 

```{r, message=FALSE, out.width="50%", fig.align="center"}
tree_train <- tree(polluted~pop_rural + manufacturing + precipitations 
                   + n_factories, dt[train,])
tree_pred <- predict(tree_train, dt[-train,], type = 'class')
table(tree_pred, dt_test_labels)

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(tree_pred, dt_test_labels))
```

We wanted to do one last attempt to improve the accuracy of our tree classifier and we applied the random forest. 

```{r, message=FALSE, out.width="50%", fig.align="center"}
#random forest
library(randomForest)
rf.tree=randomForest(polluted~.,data=dt,subset=train)
rf.tree

pred_rf <- predict(rf.tree, dt_test)
table(pred_rf, dt_test_labels)
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(pred_rf, dt_test_labels))
```

As it can be seen, the predictions on the test set output an improved accuracy of 75%. 

Even though we applied the random forest and we got better results, we were not completely satisfied with the accuracy of this method of classification. Therefore, we decided to apply the K-Nearest Neighbors algorithm which will perform better and it will achieve a higher accuracy. 

## k-Nearest Neighbours

In this section we apply the K-Nearest Neighbor supervised algorithm (KNN) with the aim of classifying the air quality category to which each State belongs, according to the AQI 2020. 

In first place, as we have already done in the previous models, we normalize the variables in our dataset so that distances between variables with larger ranges will not be over-emphasized. Therefore, the values of all the features are in the range of 0 and 1.

```{r, include=FALSE, warning = FALSE}
x <- getURL("https://raw.githubusercontent.com/mathicard/Statistical-Learning-DSE/main/usa_final.csv")
dataset <- read.csv(text = x)

#AQI categorization
dataset$polluted <- cut(dataset$aqi, breaks = c(50,100,150,200),
                        labels = c('yellow', 'orange', 'red'))

dt <- as.data.frame(cbind(lapply(dataset[,-c(1,2,20,21,22)], normalize), dataset[,c(1,20,21,22)]))
```

Then, we split the original labeled dataset into training (70%) and test data (30%), keeping a similar distribution of instances so that we do not favor one or the other class in the predictions.


```{r, include=FALSE, warning = FALSE}

set.seed(12)
library(caret)
train.index <- createDataPartition(dt$polluted, p = .7, list = FALSE)
dt_train <- dt[ train.index,-c(18,19,20)]
dt_test  <- dt[-train.index,-c(18,19,20)]

dt_train_labels <- dt[train.index, 21]
dt_test_labels <- dt[-train.index, 21]

table1 <- table(dt_train_labels)
round(prop.table(table1), 2)

table2 <- table(dt_test_labels)
round(prop.table(table2), 2)

dt_train <- dt_train[,-18]
dt_test <- dt_test[,-18]
```


In order to assess the classification accuracy, we plot the relationship between the training and test error error rates as a function of the number of neighbors selected (K). As the hyperparameter **K** increases, the method becomes less flexible. The training error rate consistently increases as the flexibility decreases. Both error rates declines at first, reaching a minimum rate at K=5, before increasing again when the method becomes more inflexible. Therefore, we chose K=5 as the best number of neighbors for KNN in our classification. 

```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
##load the package class
library(class)

### Plot training-test error by K value

library(class)
error.train <- replicate(0,20)

for(k in 1:20) {
  pred_pol <- knn(train = dt_train, test = dt_test, cl = dt_train_labels, k)
  error.train[k]<-1-mean(pred_pol==dt_train_labels)
}

error.train <- unlist(error.train, use.names=FALSE)

error.test <- replicate(0,20)
for(k in 1:20) {
  pred_pol <- knn(train = dt_train, test = dt_test, cl = dt_train_labels, k)
  error.test[k]<-1-mean(pred_pol==dt_test_labels)
}

error.test <- unlist(error.test, use.names = FALSE)

plot(error.train, type="o", col="blue", ylim=c(0,0.5), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)

#which(error.train==min(error.train)) # k=5
#which(error.test==min(error.test)) # k=5
```

By setting K = 5, we get a model with an accuracy of 85% with the misclassification of only two "orange" States (Arizona and Michigan) which are categorized as "yellow".

```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}

##run knn function (k=5)
pr <- knn(dt_train,dt_test,cl=dt_train_labels,k=5)

##create confusion matrix
tab <- table(pr,dt_test_labels)
tab

##this function divides the correct predictions by total number of predictions that tell us how accurate the model is.

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)

#Accuracy of 85%
```

Moreover, we perform a repeated Cross Validation algorithm to choose K among the performed models according to its accuracy, our evaluation measure. In the next figure it is plotted the Number of Neighbors Vs Accuracy and we can see that by fixing K = 7 we get the model with the largest accuracy (72%).

```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
set.seed(12)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)

knnFit <- train(polluted ~ .,
                      data = dt[ train.index,-c(18,19,20)], method = "knn", trControl = ctrl, 
                      preProcess = c("center","scale"), tuneLength = 20)

plot(knnFit)
#knnFit$bestTune #higher accuracy with k=7

#Output of kNN fit
knnFit
```

In spite of the presence of multicollinearity among our variables, as we saw in the previous section, the predictions should not been affected in the KNN modeling. Therefore, we decided to run this classification algorithm with only the four most relevant variables to predict AQI selected by the Forward Search that was already performed (*"manufacturing"*,*"pop_rural"*,*"precipitations"* and *"n_factories"*). However, the accuracy was a little bit improved, by choosing K=5 or K=9 the metric was almost 77% in both cases.

```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
error.train <- replicate(0,20)

dt_train2 <- dt_train[,c("manufacturing","pop_rural","precipitations")]
dt_test2 <- dt_test[,c("manufacturing","pop_rural","precipitations")]


for(k in 1:20) {
  pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
  error.train[k]<-1-mean(pred_pol==dt_train_labels)
}

error.train <- unlist(error.train, use.names=FALSE)

error.test <- replicate(0,20)
for(k in 1:20) {
  pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
  error.test[k]<-1-mean(pred_pol==dt_test_labels)
}

error.test <- unlist(error.test, use.names = FALSE)

plot(error.train, type="o", col="blue", ylim=c(0,0.7), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)

#which(error.train==min(error.train)) # k>=5
#which(error.test==min(error.test)) # k=5


##run knn function (k=5)
pr5 <- knn(dt_train2,dt_test2,cl=dt_train_labels,k=5)

##create confusion matrix
tab <- table(pr5,dt_test_labels)
tab

accuracy(tab)

#Accuracy of 84.6%

combinetest <- cbind(dt_test, dt[-train.index, 18])
combinetest[dt_test_labels != pr5,]

#Michigan and Oklahoma are the States misclassified. Let see the AQI

dataset[dataset$state=="Oklahoma", 2] #102.1
dataset[dataset$state=="Michigan", 2] #114.11
```


Finally, we keep the three most important variables (*"manufacturing"*,*"pop_rural"*,*"precipitations"*) and with K = 5 the accuracy obtained was 85%, equal to the largest value we saw in the KNN model performed with all the variables. In this case, the only two States that were misclassified are Michigan (114.1) and Oklahoma (102.1), both belonging to the "orange" category and which the model categorized as "yellow". 


```{r, message=FALSE, out.width="70%", fig.align="center"}

#Plot in 3 dimensional space

#install.packages("plot3D")
colVar <- factor(dt$polluted)


scatter3D(dt$pop_rural, dt$manufacturing, dt$precipitations, 
          labels = rownames(dt),
          colvar=as.integer(colVar),
          phi = 0, bty ="g",
          pch = 20, cex = 1.5,
          col = c("gold2", "chocolate1", "red3"),
          xlab = "Rural pop",
          ylab ="Manufacturing", zlab = "Precipitations",
          colkey = list(at = c(1, 2, 3), side = 4, 
                        addlines = TRUE, length = 0.5, width = 0.5,
                        labels = c("yelow", "orange", "red")))

text3D(dt$pop_rural, dt$manufacturing, dt$precipitations,  
       labels = dt$Code,
       add = TRUE, colkey = FALSE, cex = 0.5)
```


In conclusion, we keep this last KNN model to predict the category at which State belongs according to its level of air quality, by only using the GDP of Manufacturing, the Share of Rural population and the Quantity of Precipitations.


# Data Analysis for Unsupervised Learning

## Unsupervised learning

We enrich the analysis performing two unsupervised learning methods. In particular, we use PCA and hierarchical clustering in order to visualize and understand similarities between States.


```{r, message=FALSE, out.width="50%", fig.align="center"}
#load dataset
dataset <- read.csv('usa_final.csv', sep=',')
#prepare dataset for both PCA and clustering
rownames(dataset)<-dataset[,1]
datanames <- dataset
datanames$cl <- cut(datanames$aqi, breaks = c(50,100,150,200),
                   labels = c('yellow', 'orange', 'red'))
datanames$cl <- as.factor(datanames$cl)
aqi_plot <- dataset[,2]
dataset <- dataset[,3:19]
dataset <- scale(dataset)
ds<-dist(dataset)
```

### PCA

PCA highlights the possibility to reduce the model to a much simple one. As our main goal is to explain the variance in the data within a small number of components, we proceed by extracting the correlation matrix from the dataset and then its eigenvalues. As the underlying table shows, there is a first principal component that explains more than 66% of the variance. There are other two components deriving from an eigenvalue just above one.

```{r, message=FALSE, out.width="50%", fig.align="center"}
n <- nrow(dataset)
p <- ncol(dataset)
#create correlation matrix
rho <- cor(dataset)
autoval <- eigen(rho)$values
autovec <- eigen(rho)$vectors
pvarsp = autoval/p
pvarspcum = cumsum(pvarsp)
pvarsp
tab<-round(cbind(autoval,pvarsp*100,pvarspcum*100),3)
colnames(tab)<-c("eigenval","%var","%cumvar")
tab
```
An easy way to select components is a scree plot. Since the table witnesses the presence of a first principal component followed by two other ones lying an order of magnitude below, we prefer to avoid the elbow rule and instead select every component above one. This choice allows us to end up with a model explaining 84% of the variance.

```{r, message=FALSE, out.width="50%", fig.align="center"}
#scree diagram to select the components
plot(autoval,type="b",main="Scree Diagram", xlab="Number of Components", ylab="Eigenvalues")
abline(h=1,lwd=3,col="red")
```

As mentioned before, PCA is a powerful tool for data visualization. In this case, there is a tradeoff between visualization and reliability of the model. On one hand, the fourth component has slightly significant explanatory importance since the eigenvalue is just below 1 and it moves the cumulative variance explained from 84% to 90%. On the other hand, it is still below the threshold and could be neglected without a significant loss. In order to have a higher explained variance one would select a four-dimensional model, while a lower-dimensionality fashion for plotting requires to have at most three components.


#### Three components analysis

Since in this analysis PCA is used for description and visualization, we focus on a three components analysis. The three component analysis is the most useful one to read loadings and thus understand how variables are explained by the principal components. First of all, we select the corresponding eigenvectors. 

We proceed generating the loadings for each component. As shown in the following table, the first principal component has a heavy role in explaining almost every variable, peaking at waste, professional, and healthcare.

```{r, message=FALSE, out.width="50%", fig.align="center"}
#we investigate what is explained by the components
comp<-round(cbind(
  -eigen(rho)$vectors[,1]*sqrt(autoval[1]),
  -eigen(rho)$vectors[,2]*sqrt(autoval[2]),
  -eigen(rho)$vectors[,3]*sqrt(autoval[3])
),3)
rownames(comp)<-colnames(dataset)
colnames(comp)<-c("comp1","comp2","comp3")
communality<-comp[,1]^2+comp[,2]^2+comp[,3]^2
comp<-cbind(comp,communality)
comp
```

The scores for each State in the three components are reported below. Noteworthy, California, a very polluted State, stand out as its far from every other one. The sign of the components varies according to each State therefore we cannot draw a global interpretation of them. In this case, we can interpret them as coordinates of points in a tridimensional space and plot them.

```{r, message=FALSE, out.width="50%", fig.align="center"}
#calculate the scores for the selected components
score <- dataset%*%autovec[,1:3]
round(score,3)
```

We scale the scores and plot them.

```{r, message=FALSE, out.width="80%", fig.align="center"}
#score plot
scorez<-round(cbind
              (-score[,1]/sqrt(autoval[1]),
                score[,2]/sqrt(autoval[2]),
                score[,3]/sqrt(autoval[3])),2)
x <- scorez[,1]
y <- scorez[,2]
z <- scorez[,3]
#plots
scatter3D(x, y, z, colvar = aqi_plot,
          xlab = "comp1", ylab = "comp2", zlab = "comp3",
          main = "PCA - 3 components",
          bty = "g", ticktype = "detailed", d = 2,
          theta = 60, phi = 20, col = gg.col(100),
          clab = "aqi",
          pch = 19, cex = 0.8)
text3D(x,y,z,labels = rownames(dataset),add = TRUE,
       cex = 0.5, col = gg.col(100),
       adj = 0.5, font = 1)
```


### Clustering

The main goal of clustering in this analysis is on one hand grouping States which share common features. On the other hand, it results useful also to discover peculiar States which tend to have marked different behaviours. This could be the hint for a further analysis. To add robustness to the analysis, we perform and compare different clustering methods. In particular, we compare the **average linkage**, the **complete linkage** and the **Ward linkage** methods by plotting the respective dendrogram.

```{r, message=FALSE, out.width="90%", fig.align="center", out.extra='angle=90'}
agglo <- function(hc){
  data.frame(row.names=paste0("Cluster",seq_along(hc$height)),
             height=hc$height,
             compoments=ifelse(hc$merge<0,
        hc$labels[abs(hc$merge)], paste0("Cluster",hc$merge))
  )}
#first clustering: average linkage
h1 <- hclust(ds,method="average")
par(mfrow=c(1,3)) 
plot(h1, main="Average linkage")
#add robustness to the analysis trying other methods
h2 <- hclust(ds,method="complete")
plot(h2, main="Complete linkage")
#Ward method can help us in reducing distances of outliers such as California
h3 <- hclust(ds,method="ward.D2")
plot(h3, main="Ward linkage")
```

The results are pretty similar, with Texas, New York and especially California disjoint from the main tree almost to the root. The result is consistent with the PCA, that already highlighted California as a strong outlier. We choose to go on with Ward method for two reasons. First of all, it is in its implementation to reduce the distances of the outliers, as we can clearly confirm comparing it to the average link method. Secondly, it is the one that subdivides the States in the clearer way. We can see by sight a three clusters dendrogram. To confirm it, we test it with the elbow method. The method relies on the fact that one looks for a number of clusters (k) that minimizes the total within-cluster sum of squares (wss). We plot the curve of wss according to k and locate the bending point (or elbow), a good indicator to decide k.

```{r, message=FALSE, out.width="50%", fig.align="center"}
#choice of number of clusters - elbow method
set.seed(123)
#compute and plot wss
wss <- sapply(1:12, 
              function(k){kmeans(ds, k, nstart=50,iter.max = 12 )$tot.withinss})
wss
plot(1:12, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")
```

The test confirms the presence of three main clusters.

```{r, message=FALSE, out.width="50%", fig.align="center"}
plot(h3, main="Clusters with Ward linkage method")
rect.hclust(h3,3)
```

Worth mentioning, we have to be aware that Ward method is different from the other two, that share the exact same clusters except from one State. The quick comparison below shows that Ward method leads to a more balanced subdivision between clusters.

```{r, message=FALSE, out.width="50%", fig.align="center"}
cut_av <- cutree(h1, k = 3)
cut_compl <- cutree(h2, k = 3)
cut_ward <- cutree(h3, k = 3)
#compare the different clustering methods
table(cut_av,cut_compl)
table(cut_av,cut_ward)
table(cut_compl,cut_ward)
```

#### Comparison

We question now if the clusters resemble the level of pollution for the States. Being the dataset subdivided into three clusters like the qualitative labels for the level of pollution ("red", "orange" and "yellow") we can easily group them to see if they have similar results.

```{r, message=FALSE, out.width="50%", fig.align="center"}
# Lets compare the share of polluted groups within each of the 3 clusters 
cut_ward <- as.data.frame(cut_ward)
cut_ward <- cbind(rownames(cut_ward), cut_ward)
names(cut_ward) <- c("state","cluster")
cut_ward <- merge(cut_ward, datanames[,c(1,22)], by = "state")
group_by(cut_ward, cluster, cl) %>% 
  summarise(
    count = n(),
    share = n()/51*100
  )
tab_xtab(cut_ward$cluster, cut_ward$cl, show.row.prc = TRUE)
```

The table shows no correlation between the labels and the clusters. The correlation test below has a non-significative p-value of 0.708.

```{r, message=FALSE, out.width="50%", fig.align="center"}
#Correlation tests
cut_ward$cl2 <- as.numeric(cut_ward$cl)
cor(cut_ward$cluster, cut_ward$cl2, method = "pearson")
cor.test(cut_ward$cluster, cut_ward$cl2, method = "pearson")
```

# Theoretical Background 

## Supervised learning

### Model selection
In order to perform a *model selection*, we need to focus on the goals of a generic model: maximize the accuracy of the prediction, especially controlling the variance, and make the analysis as interpretable as possible, removing irrelavant features. To pursue these aims, we performed a subset selection, identifying the *p* predictors more related to the response variable and then fitting a model using least squares on the reduced set of variables. There are different models to obtain this best selection, each one with its pros and cons, and the ones we used in our analysis are presented in the following sections:

### Best subset selection
The main idea of this model is that exist a best subset which we should identify.
The model is developed through 3 steps:

1. We start with the *null model* which contains no predictors and then it simply predicts the sample mean for each observation.
2. Considering *p* predictors, for each *k = 1,.., p* we first compute all the possible models whose number is equal to the combinatorial calculus *(p over k)*. Among all the models within the same value of *k* and for each value of *k*, the one with the smallest residual sum of squares or the highest R^2 is picked. 
3. Having computed the best model for each level of *k*, it's possible to select the best one modeling the training error (C~p, AIC, BIC, R^2) or directly computing the test error with a validation set or a cross-validation.

Best subset selection checks for the best subset among all possible values of k in order to find the best subset. Nevertheless,it is a heavy method computationally speaking, especially with large values of *p*; furthermore, from a statistical point of view, this large search space could bring to end up with an overfitted model. 
For this reasons, stepwise approaches could seem appealing alternatives.


### Forward stepwise selection
This method starts with an empty model and then it adds at each step the predictor that provides the greatest additional improvement, following this structure:

1. we start with the *null model* which contains no predictors.
2. then, for *k = 1,.., p-1*, it chooses the variable among the *p - k* still available that added the bigger improvement to the model. Again, the selection is made according to the smallest RSS or the best R^2.
3. at the end, the best model among the *k* available is selected, according to appropriate variation of the training error (C~p, AIC, BIC, R^2) or using a validation set or cross-validation.

This method presents clear computational advantages, but it's not guaranteed to find the best subset selection among all the 2^p models containing *p* predictors.

### Choosing the optimal model
Residual sum of squares and R^2 are not proper tools to select the best model, because they are related to the training error, which too often is a poor estimate of the test error and this could cause overfitting. To solve this problem there are 2 ways:

- *indirectly* estimate of the test error by making adjustment on the training error to avoid overfitting.
- *directly* estimate of the test set using a validation set or a cross-validation approach.

#### Indirect estimates: C~p, AIC, BIC, Adjusted R^2
The *Mallow's C~p* is computed as *C~p = 1/n(RSS + 2dsigma^2)* where *d* is the number of parameters and *sigma^2* an estimate of the variance of the error associated with each response measurement. The lower it is, the smaller the test error is.
*AIC and BIC (Akaike and Bayes information criterion)* are based on the likelihood function and these are the formulas: *AIC = -2logL + 2d* and *BIC = 1/n(RSS + log(n)dsigma^2)*, where *L* is the likelihood function. AS in Mallow, the lower values stand for smaller test errors. AIC is defined for large class of models and BIC is the most cautelative and it ends up with a smaller number of predictors respect to Mallow if *n* is large.
*Adjusted R^2* is computed as *Adjusted R^2 = 1- (RSS/(n + d - 1))/(TSS/(n - 1))*. Unlike the previous indicators, the larger it is, the smaller the test error it is. Respect to R^2, it takes inot account penalties for the inclusion of unnecessary variables.

#### Validation set and Cross-validation
As advantage respect the indicators, they don't require an estimate of the variance *sigma^2*. 
The validation errors are calculated by randomly selecting 3/4 of the observations as training set and th remainder as validation set. 
Cross-validation is computed by considering *k* folders which our dataset is divided in. Then, we consider at each time *t=1,..,k* the dataset without the folder t as training set and observations in the folder t as test set. Repeating the operation *t* times, the test error will be computed by the average over the *t-test errors*.  

### Tree-based methods
Used for both regression and classification problems, *tree-based methods* are so called because the set of splitting rules with which the predictor space is divided can be summarized with a tree. The prediction space is partitioned in simple regions and the method is characterized by its simplicity and interpretability, however it's usually less powerful than other approaches with regard to accuracy.
Since it's computationally unfeasible to consider every possible split of the feature space, the tree is built with a *top-down* and *greedy* approach, respectively because it begins at the top of the tree, splitting then the predictor space, and because at each step the split that adds the best possible improvement is made, without a strategy that would be careful to the future steps.
The tree starts from the root and, passing through *internal nodes*, which represent the results of a splitting internal process, arrives to its *final nodes*, also called *leaves*, which represent the resulting split of the feature space. The latter is composed by rectangles at the end for the sake of simplicity and interpretability, but the split could have any shape.
The aim of the split is to minimize the variance within a group and, therefore, to find boxes which minimize the residual sum of squares. For every observation that falls into the same region, the same prediction is made according to the type of problem:

- *regression problem*: the predicted value is simply the mean of the response values of the training set observations; this holds for each observation which falls into the region.
- *classification problem*: the predicted value is the most common one according to the values of the training set observations which fall in the region, according to the misclassification rate or, also better, to Gini's index or cross entropy's index.

A tree without limitation in its growing could guarantee good performances on the training set, but it might probably end up with overfitting. It seems to be needed some kind of threshold to avoid this problem: *pruning the tree*.

### Pruning a tree
One possible solution might be to grow the tree only if the decrease in the RSS is greater than a certain threshold, but this solution is short-sighted, because it could be not the best one in a deeper perspective.
Instead, *the pruning method* start with a very large tree and then prune it bach in order to obtain a subtree which perform better. This is done with the *cost complexity pruning*, which consider a parameter *alpha* such that, for each *alpha* value, a subtree is corresponded. *alpha* controls a trade-off between the subtree's complexity and its fit to the training data and its optimal value is chosen using cross-validation.
In order to perform a pruning, this process is followed:

1. recursive binary split is used to grow a large tree on the training set, stopping only when the minimum number of observations allowed is reached.
2. cost complexity pruning is applied to the large tree to obtain a sequence of subtrees as a function of *alpha*.
3. using cross validation, choose *alpha*: for each *k = 1,.., K*, repeat steps 1 and 2 on the *(K - 1)/K* fraction of the training data, excluding the *k-th* fold and then, using this latter, evaluate the mean squared prediction error. Then an average of the results is made to take the best *alpha*, which minimize the average error.
4. taking from step 2, the subtree with the chosen *alpha* is returned.

Trees are very simple to explain to people, its interpretability is an important feature, it's common the thought that they reflect the human decision-making process and they can be displayed graphically. 
However, they perform worse than other classification and regression approaches, even if the predictive accuracy can be substantially improved combining decision trees, as with **random forest**.

### Random forest

Random forest is a procedure to reduce the variance of a statistical learning method, especially used with decision trees. This method faces the trade-off between accuracy and interpretability, fostering the first one. In fact, considering a given set of *n* observations each with variance *sigma^2*, the variance of their mean will be *sigma^2/n* and this will reduce the variance. The process starts with a bootstrap from the training data set, obtaining repeated samples. The bootstrapped training sets are trained with the selected method and, at the end, an average of all the prediction is made. This process is shared between *bagging* and *random forest*, but what characterizes random forest is that, when the decision trees are building, *a random selection of m predictors among the whole set of p predictors* is made, where m is usually equal to *m = sqrt(p)*, and one of the *m* selected predictors is chosen for the split, instead from *p* variables. This additional step improves the performance respect to bagging because it allows to decorrelate the trees and improve the reduction in the variance when the average among the trees is made.

### K-NN nearest neighbours selection
The *k-NN nearest neighbours* algorithm belongs to that category which solves a classification problem; in fact, the response variable *Y* is defined as *qualitative* and it should be taken among a defined and finite set of possible values. This algorithm respond to a simple rule: it predicts every point with the label that is more present among the *k* neighbours of the point we are analyzing and, in case of tie, it responds to a predefined decision rule. It's relevant to focus on how it differently works depending on *k*, which takes value *k = 1,.., n* with *n* equal to the size of the training set:

- *1-NN nearest neighbours* has an the best accuracy for point belonging to the training set, since each point would be predicted with its own label. Nevertheless, it provides bad performances on the test points with bad accuracy. For this reason, the case *k = 1* is characterized by *overfitting* because of the different between training and test error and this is due to the excessive variance of a model that consider for its prediction only one neighbor.
- *K-NN* generalizes the previous particular case. The parameter *k* is usually chosen odd to avoid tie situations and if it is different from 1, in general the training error is different from 0. Moreover, as *k* grows, the classifiers generated become simpler. In particular, when *k = n* the classifier is constant and equal to the most common label in the training set. Furthermore, as *k* increases too much, the algorithm ends up with an underfitting situation. In conclusion, there isn't a best value of *k* a priori, but it is usually set a a level where the training error is different from 0.

*K-NN* works with binary classification problem and with multiclass classification problem using same modality, but id could operate also in a regression problem, simply considering the prediction as the average of
the labels of the k closest training points.

## Unsupervised learning

In the unsupervised learning only independent variables are observed and there is no interest in prediction because there isn't a response variable.Here, The goals are to find interesting ways to visualize data and to discover subgroups among the variables and observations. We discussed two methods during the analysis:
- **principal component analysis (PCA)**
- **clustering**

### Principal Component Analysis

Used for data visualization or data pre-processing, the PCA is a technique which gives us a low dimensional representation of the dataset, exploiting covariance matrix of the starting dataset. The aim is to find a sequence of combination of variables that have maximal variance and are uncorrelated. For this reason, the process id developed sequentially.
First, it's needed to find the *first principal component* of a set of features: this is that normalized linear combination of features which has the largest variance, where the sum of squared coefficients is equal to 1 and the set of the coefficients is called the *principal component loading vector*. Then, the first principal component is computed starting from this vector, using singular value decomposition and with the aim to get as much variance as possible. In other words, we want to identify the direction among the feature space where the data vary the most and that is more relevant to explain how data are spread in the our space. The PCA is useful for visualization because we can project data along this dimension to understand a good part of data distribution.
The representation can be done in two dimensions: for this reason, it's relevant do determine the *second principal component*. This is that linear combination of features that has the largest variance and, at the same time, is uncorrelated with the first principal component. Given this last assumption, second component turns out to lay on the orthogonal direction of the first component and this is a great feature to visualize the data in a two dimensional space. Principal components can be more than two but not more the number of observations minus one or number of features and all of them need to be orthogonal to the previous component. 
Principal components are also used to get portion of the total variance present in a dataset and this is the rule used to select the best number of principal components. The idea is to add them to the dataset as new variables until the next one improves significantly the explained variance of our observations. A general rule is based on the use of the singular values associated with each component, which recommends to add components until they have values greater than a threshold, usually 1, because it would mean they are able to explain the spread of the data as much as more than one variable of the original dataset.

### Clustering

The aim of the clustering is different respect to PCA: here, we want to detect if subgroups or clusters are present in the dataset, seeking a data partition such that observations within the same group are *similar*. The concept of similarity becomes central with clustering and it depends on type of variables, which need to be numeric, and on the type of distance we are considering (usually Euclidean, but there are a lot of types: City-Block, Minkowski,..).
The algorithms can be several, but there is a main difference between two groups:
- **K-means**, where the number of clusters is pre-specified
- **hierarchical clustering**, where the number of clusters id not fixed and we end up with a tree-like visual representation, called *dendrogram*.

Our analysis is focused on hierarchical clustering and its different algorithms based on different assumptions.

### Hierachial clustering

This method follow a simple process:
1. it starts with all observations grouped by themselves as clusters.
2. The closest two clusters are identified and merged.
3. The process is repeated until all the points are in the same cluster.

As defined before, the grouping process depends on the kind of distance we are using, but also on the way we want to link the point. 
With regard to the distance-type problem, if the euclidean one is used, we need to standardize the variables previously. Respect to the linkage, there are different **linkage types** with which an algorithm can be developed:

- **complete linkage**, it maximizes inter-cluster dissimilarities, computing all pairwise dissimilarities between observations of different clusters and it records the largest one.
- **average linkage**, computing the mean of the inter-cluster dissimilarities.
- **single linkage**, where the minimal inter-cluster dissimilarity is considered,recording the smallest among all the pairwise dissimilarities between observations of different groups.
- **centroid linkage**, where dissimilarity between centroids of cluster A and cluster B is considered.

At the same time, there is a ig variety of dissimilarities' measures which we can use, depending on the type of data, ordinal or numeric. Here are reported a few examples:
- **Gower index**, which takes value for the each quantitative variable equal to *1 - |x(is) - x(js)|/K(s)*  where *|x(is) - x(js)| is the difference between the values of two observations respect to the variable and K(s) is the range within the variable value can vary.* In the case of qualitative variables, the Gower index takes value 1 if two observations have same mode for each feature, otherwise 0. In case of only binary variables, this index is equal to *Jaccard index*.
- **Jaccard index**, for qualitative variables, it excludes the co-absence of the features.
- **Russel Rao index**, equal to the ratio between number of co-presence and the number of binary variables.
- **Sochal-Mikener index**, equal to the ratio between the sum of co-absence and co-presence and total number of binary variables.

In general, there isn't a univoque best method to determine the best number of cluster. It's possible to use multiple methods to detect potential clusters and a good way to work is to repeat the analysis with appropriate different dissimilarities index and linkage types to check the robustness of the results. 


# Conclusions 

The starting point of our research was to check if lockdowns, used as measures of containment of COVID-19, had influenced the quality of air in the United States in 2020, as often stated by media.

We started using the AQI as a measure of the level of air pollution in each State. Then, we subdivided the States according to their air quality in three categories: yellow, orange and red (from least to most polluted).  
Proceeding with the analysis of the datasets through the boxplots and the ANOVA test, we see that there is no strong relation between lockdowns and the AQI,
which presents a mean and a standard deviation normally distributed in almost all the states, with the exception of a few outliers.
After this first analysis, there's evidence that the AQI doesn't differ significantly in states where there has been a lockdown. In summary, the implementation of this measurements did not have any real effect on the air quality.

Proceeding with the OLS on the best parsimonious model selected through the BIC, we found that the most relevant variables to explain the AQI are: number of factories per sq. km, precipitations, share of rural population and the level of GDP in manufacturing.

The first two present the most significant statistical relation with the AQI, with a positive and negative sign respectively as expected.
Moreover, the level of population in rural areas presents a negative sign, pointing out that the more important the rural population in a State, the more clean would be the air.

We tried another approach using the Lasso. Its sub-selection provides similar results, indicating the same variables aforementioned, with an addiction of two more
(also lockdown is selected, but with a very low coefficient). Proceeding with the post-selection inference on the same subset of variables, only two were picked as relevant,
with a small p-value. These are: precipitations and the number of factories.

Finally, we computed a KNN algorithm to predict the AQI, based on all the variables and the three categories aforementioned, obtaining an 85% accuracy with only two misclassification.
The subselection of the variables most relevant to this model drops to three, the number of factories gets dropped, but the others are the same as above.

The unsupervised analysis purpose was twofold: explore the data through visualization and check if more polluted States are similar in other characteristics. Through PCA we found a way to describe the data using a strong principal component and other smaller ones. The representation highlighted the presence of well defined outliers. Despite clustering subdivided the States into three groups, we could not find any significant correlation between the categories in which we split the data before ("yellow", "orange" and "red") and the clusters we obtained with the unsupervised learning algorithm.

On of the main conclusions is that in the US, contrarily from a lot of statements circulated through the media,
which suggested that lockdowns had a positive influence on the air quality, the relation between these two factors is not conclusive.
In fact, other indicator have much more influence on air quality, such as precipitations, people living in rural areas, 
number of factories (many of them still operating during the pandemic, especially if related to primary goods), all of which
couldn't be controlled by the imposition of a lockdown. According to the models selected, the lockdown doesn't seem to have had a relevant impact on air quality.

To conclude, we can say that the analysis could be improved by adding other factors that could explain the level of pollution of each State. With the analysis we carried out, we tried to explain and categorizes the States of the U.S. with a pollution perspective, using a group of different models and tools, but using almost the same dataset. Future works should also have a deep look on the particularities within the States, by developing a method to better treat the outliers we have.


# Appendix

All the pieces of code not evaluated for all the parts of our analysis.

\tiny
```{r, eval=FALSE, include=TRUE, echo=TRUE}

################################################################################

#SUPERVISED

#load dataset

data <- read.csv('usa_final.csv', sep=',')
data <- data[,-c(20,21)]
str(data)

#Check NORMALITY
qqnorm(data$aqi)
hist(data$aqi)
shapiro.test(data$aqi)

#try with log
log_aqi <- log(data$aqi)
par(mfrow=c(1,2))
qqnorm(data$aqi,  main="(aqi)")
qqnorm(log_aqi,  main="Transformed log(aqi)")

par(mfrow=c(1,2))
hist(data$aqi, main='aqi')
hist(log_aqi, main="Transformed log(aqi)")

#now seems normal
shapiro.test(log_aqi)

#we can substitute the variable with the log
data$ln_aqi <- log_aqi

#California, District of Columbia and Wyoming the cleanest states
library(car)
par(mfrow=c(1,1))
Boxplot(~aqi, data=data, id=list(labels=data$state))
Boxplot(~ln_aqi, data=data, id=list(labels=data$state))


#data without NA
library(tidyr)
dt <- data %>% drop_na()
#and without the first 2 variables
dt <- dt[,-c(1,2)]

################################################################################

#linear FULL model - Adjusted R-squared:  0.4933
full.model <- lm(ln_aqi~., data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10

#DROP WASTE - Adjusted R-squared:  0.4966 
full.model <- lm(ln_aqi~.-waste, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

#DROP HEALTHCARE - Adjusted R-squared:  0.5044 
full.model <- lm(ln_aqi~.-waste-healthcare, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

#DROP CONSTRUCTION - Adjusted R-squared:  0.503
full.model <- lm(ln_aqi~.-waste-healthcare-construction, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

#DROP UTILITIES - Adjusted R-squared:  0.5021 
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

#DROP PROFESSIONAL - Adjusted R-squared:  0.4296
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

#DROP RETAIL - Adjusted R-squared:  0.3978
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

#DROP FINANCE - Adjusted R-squared:  0.4129 
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt)
summary(full.model)
vif(full.model)
sqrt(vif(full.model))>10
sqrt(vif(full.model))>5

################################################################################
#predictions
set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-18]
dt_test = dt[-train,-18]
dt_train_labels <- dt[train, 18]
dt_test_labels <- dt[-train, 18]
summary(dt_train_labels)
summary(dt_test_labels)

library(Metrics)
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt[train,])
pred_ols <- predict(full.model, dt[-train,])
cbind(pred_ols, dt_test_labels)
#is the square root of the mean of the square of all of the error
root_mse = rmse(dt_test_labels, pred_ols)
root_mse
library(tidyverse)
library(caret)
R2(pred_ols, dt_test_labels)
library(olsrr)
#plot of the residuals
ols_plot_resid_qq(full.model)
ols_plot_resid_fit(full.model)
ols_plot_resid_hist(full.model)

#LASSO
library(glmnet)

x=model.matrix(ln_aqi~., dt)[,-1]
x = scale(x, TRUE, TRUE)
y=dt$ln_aqi

fit.lasso=glmnet(x,y, standardize = FALSE)
plot(fit.lasso,xvar="lambda",label=TRUE)

cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)

coef(cv.lasso)

#we use our earlier train/validation division to select the lambda for the lasso.

lasso.tr=glmnet(x[train,],y[train], standardize = FALSE)
lasso.tr

pred=predict(lasso.tr,x[-train,])
dim(pred)

rmse= sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")

lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best

#the best lambda is smaller than rsme in the linear regression model

coef(lasso.tr,s=lam.best)

#vars with the highest relevance are precipitations, pop_rural and n_factories
#it also adds information in the sub-selection. Manufacture and lockdown have less relevance

#now compute post-lasso inference to check the highest dependent variables on the sub-selection

library(selectiveInference)

sigma = estimateSigma(x,y)$sigmahat

beta = coef(lasso.tr, x=x, y=y, s=lam.best/51, exact = TRUE)[-1]

out_lasso_inf = fixedLassoInf(x,y,beta,lam.best,sigma=sigma)
out_lasso_inf

#with a lambda of 0,026 the most relevant variables that satisfy the p-value are precipitations and
#n_factories. pop_rural, manufacture and lockdown show a high p-value and are taken out.
#with the Lasso, the results are similar to the other models used, but the post-selection inference
#reduces the variables just to the two aforementioned.

################################################################################

#deleting all
dt <- dt[,-c(13, 5, 2, 12, 9, 10, 4)]
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) 
}
dt <- as.data.frame(lapply(dt, normalize))
################################################################################

#best subset
library(leaps)
regfit.full=regsubsets(ln_aqi~.,data=dt)
reg.summary=summary(regfit.full)
names(reg.summary)
par(mfrow=c(2,2))
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp")
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="AdjR2")
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC")

#STEPWISE FORWARD 
#pop_rural, manufacturing, precipitations, n_factories
regfit.fwd=regsubsets(ln_aqi~.,data=dt,method="forward", nvmax=10)
summary(regfit.fwd)

reg.summary<-summary(regfit.fwd)
par(mfrow=c(1,1))
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",types="l")


#linear model of selected - Adjusted R-squared:  0.3853 
set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-11]
dt_test = dt[-train,-11]
dt_train_labels <- dt[train, 11]
dt_test_labels <- dt[-train, 11]
summary(dt_train_labels)
summary(dt_test_labels)
model <- lm(ln_aqi~ pop_rural + manufacturing + precipitations + n_factories, 
            data = dt[train,])
summary(model)

pred_fwd <- predict(model, dt[-train,])
cbind(pred_fwd, dt_test_labels)
#is the square root of the mean of the square of all of the error
root_mse = rmse(dt_test_labels, pred_fwd)
root_mse
################################################################################
################################################################################
################################################################################
################################################################################
################################################################################
################################################################################
################################################################################

#CATEGORIZATION - CLASSIFICATION
library(dplyr)
data$cl <- cut(data$aqi, breaks = c(50,100,150,200),
               labels = c('yellow', 'orange', 'red'))
dt <- dt[,-11]
dt$polluted <- data$cl
dt$polluted <- as.factor(dt$polluted)
summary(dt$polluted)
str(dt)

#train test split

set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-11]
dt_test = dt[-train,-11]
dt_train_labels <- dt[train, 11]
dt_test_labels <- dt[-train, 11]
summary(dt_train_labels)
summary(dt_test_labels)

###############################################################################
library(tree)
library(ISLR)

tree = tree(polluted~., dt)
summary(tree)
plot(tree)
text(tree, pretty = 0)

tree_train <- tree(polluted~., dt[train,])
tree_pred <- predict(tree_train, dt[-train,], type = 'class')
table(tree_pred, dt_test_labels)

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(tree_pred, dt_test_labels))

#TREE with relevant variables
tree = tree(polluted~pop_rural + manufacturing + precipitations 
            + n_factories, dt)
summary(tree)
plot(tree)
text(tree, pretty = 0)

tree_train <- tree(polluted~pop_rural + manufacturing + precipitations 
                   + n_factories, dt[train,])
tree_pred <- predict(tree_train, dt[-train,], type = 'class')
table(tree_pred, dt_test_labels)

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(tree_pred, dt_test_labels))

#PRUNING
tree_cv <- prune.misclass(tree_train, k = NULL, best = NULL, dt[-train,],
                          eps = 1e-3)

plot(tree_cv)

#cv
dtt <- dt[,c(4,7,9,10,11)]
dtt
tree1=tree(polluted~.,dtt,subset=train)
plot(tree1);text(tree1,pretty=0)
cv_tree1=cv.tree(tree1,FUN=prune.misclass)
cv_tree1
plot(cv_tree1)

#random forest
library(randomForest)
rf.tree=randomForest(polluted~.,data=dt,subset=train)
rf.tree

pred_rf <- predict(rf.tree, dt_test)
table(pred_rf, dt_test_labels)
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(pred_rf, dt_test_labels))

### K-NN with Caret package ###

library(RCurl)
x <- getURL("https://raw.githubusercontent.com/mathicard/Statistical-Learning-DSE/main/usa_final.csv")
dataset <- read.csv(text = x)


#AQI categorization
dataset$polluted <- cut(dataset$aqi, breaks = c(50,100,150,200),
                        labels = c('yellow', 'orange', 'red'))


#kNN requires variables to be normalized

#Clustering analysis is not negatively affected by heteroscedasticity 
#but the results are negatively impacted by multicollinearity of features/variables 
#used in clustering as the correlated feature/variable will carry extra weight 
#on the distance calculation than desired.
#These will influence our coefficients, but not the predictions. 
#Therefore, we decided to keep them.


normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) 
}

dt <- as.data.frame(cbind(lapply(dataset[,-c(1,2,20,21,22)], normalize), dataset[,c(1,20,21,22)]))

summary(dt)

#Extract training and test sets (stratified)

set.seed(12)
library(caret)
train.index <- createDataPartition(dt$polluted, p = .7, list = FALSE)
dt_train <- dt[ train.index,-c(18,19,20)]
dt_test  <- dt[-train.index,-c(18,19,20)]

dt_train_labels <- dt[train.index, 21]
dt_test_labels <- dt[-train.index, 21]

table1 <- table(dt_train_labels)
round(prop.table(table1), 2)

table2 <- table(dt_test_labels)
round(prop.table(table2), 2)

dt_train <- dt_train[,-18]
dt_test <- dt_test[,-18]


##load the package class
library(class)


### Plot training-test error by K value

library(class)
error.train <- replicate(0,20)

for(k in 1:20) {
  pred_pol <- knn(train = dt_train, test = dt_test, cl = dt_train_labels, k)
  error.train[k]<-1-mean(pred_pol==dt_train_labels)
}

error.train <- unlist(error.train, use.names=FALSE)

error.test <- replicate(0,20)
for(k in 1:20) {
  pred_pol <- knn(train = dt_train, test = dt_test, cl = dt_train_labels, k)
  error.test[k]<-1-mean(pred_pol==dt_test_labels)
}

error.test <- unlist(error.test, use.names = FALSE)

plot(error.train, type="o", col="blue", ylim=c(0,0.5), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)

which(error.train==min(error.train)) # k=5
which(error.test==min(error.test)) # k=5

##run knn function (k=5)
pr <- knn(dt_train,dt_test,cl=dt_train_labels,k=5)

##create confusion matrix
tab <- table(pr,dt_test_labels)
tab

##this function divides the correct predictions by total number of predictions that tell us how accurate the model is.

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)

#Accuracy of 85%

combinetest <- cbind(dt_test, dt[-train.index, 18])
combinetest[dt_test_labels != pr,]

#Arizona and Michigan are the States misclassified. Let see the AQI

dataset[dataset$state=="Arizona", 2] #128.23
dataset[dataset$state=="Michigan", 2] #114.11


## Cross Validation to choose K

set.seed(12)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)

knnFit <- train(polluted ~ .,
                data = dt[ train.index,-c(18,19,20)], method = "knn", trControl = ctrl, 
                preProcess = c("center","scale"), tuneLength = 20)

plot(knnFit)
knnFit$bestTune #higher accuracy with k=7

#Output of kNN fit
knnFit


##run knn function (k=7)
pr <- knn(dt_train,dt_test,cl=dt_train_labels,k=7)

##create confusion matrix
tab <- table(pr,dt_test_labels)
tab

accuracy(tab)

#Accuracy of 77%

### Conclusion: we work with only 5 nearest neighbors to classify the air quality of the States.


### Now we only use most relevant variables to predict AQI (upon FORWARD SEARCH)
# (pop_rural + manufacturing + precipitations + n_factories)

### Plot training-test error by K value
set.seed(12)

library(class)
error.train <- replicate(0,20)

dt_train2 <- dt_train[,c("manufacturing","pop_rural","precipitations",'n_factories')]
dt_test2 <- dt_test[,c("manufacturing","pop_rural","precipitations",'n_factories')]


for(k in 1:20) {
  pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
  error.train[k]<-1-mean(pred_pol==dt_train_labels)
}

error.train <- unlist(error.train, use.names=FALSE)

error.test <- replicate(0,20)
for(k in 1:20) {
  pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
  error.test[k]<-1-mean(pred_pol==dt_test_labels)
}

error.test <- unlist(error.test, use.names = FALSE)

plot(error.train, type="o", col="blue", ylim=c(0,0.7), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)

which(error.train==min(error.train)) # k>=8
which(error.test==min(error.test)) # k>=3 

##run knn function (k=9)
pr9 <- knn(dt_train2,dt_test2,cl=dt_train_labels,k=9)

##create confusion matrix
tab <- table(pr9,dt_test_labels)
tab

accuracy(tab)

#Accuracy of 77%


##run knn function (k=5)
pr5 <- knn(dt_train2,dt_test2,cl=dt_train_labels,k=5)

##create confusion matrix
tab <- table(pr5,dt_test_labels)
tab

accuracy(tab)

#Accuracy of 77%


##run knn function (k=6)
pr6 <- knn(dt_train2,dt_test2,cl=dt_train_labels,k=6)

##create confusion matrix
tab <- table(pr6,dt_test_labels)
tab

accuracy(tab)

#Accuracy of 69%


### Now we only use the 3 most relevant variables to predict AQI (upon FORWARD SEARCH)
# (pop_rural + manufacturing + precipitations)

### Plot training-test error by K value

library(class)
error.train <- replicate(0,20)

dt_train2 <- dt_train[,c("manufacturing","pop_rural","precipitations")]
dt_test2 <- dt_test[,c("manufacturing","pop_rural","precipitations")]


for(k in 1:20) {
  pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
  error.train[k]<-1-mean(pred_pol==dt_train_labels)
}

error.train <- unlist(error.train, use.names=FALSE)

error.test <- replicate(0,20)
for(k in 1:20) {
  pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
  error.test[k]<-1-mean(pred_pol==dt_test_labels)
}

error.test <- unlist(error.test, use.names = FALSE)

plot(error.train, type="o", col="blue", ylim=c(0,0.7), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)

which(error.train==min(error.train)) # k>=5
which(error.test==min(error.test)) # k=5


##run knn function (k=5)
pr5 <- knn(dt_train2,dt_test2,cl=dt_train_labels,k=5)

##create confusion matrix
tab <- table(pr5,dt_test_labels)
tab

accuracy(tab)

#Accuracy of 84.6%

combinetest <- cbind(dt_test, dt[-train.index, 18])
combinetest[dt_test_labels != pr5,]

#Michigan and Oklahoma are the States misclassified. Let see the AQI

dataset[dataset$state=="Oklahoma", 2] #102.1
dataset[dataset$state=="Michigan", 2] #114.11


#Plot in 3 dimensional space

#install.packages("plot3D")
library(plot3D)

colVar <- factor(dt$polluted)


scatter3D(dt$pop_rural, dt$manufacturing, dt$precipitations, 
          labels = rownames(dt),
          colvar=as.integer(colVar),
          phi = 0, bty ="g",
          pch = 20, cex = 1.5,
          col = c("gold2", "chocolate1", "red3"),
          xlab = "Rural pop",
          ylab ="Manufacturing", zlab = "Precipitations",
          colkey = list(at = c(1, 2, 3), side = 4, 
                        addlines = TRUE, length = 0.5, width = 0.5,
                        labels = c("yelow", "orange", "red")))

text3D(dt$pop_rural, dt$manufacturing, dt$precipitations,  
       labels = dt$Code,
       add = TRUE, colkey = FALSE, cex = 0.5)

library(rgl)
plotrgl()


### CONCLUSION --> We keep the 3 most relevant variables with K=5.

#UNSUPERVISED

#PCA

library("plot3D")

#load dataset
dataset <- read.csv('usa_final.csv', sep=',')
rownames(dataset)<-dataset[,1]
dataset <- dataset[,2:19]
dataset <- scale(dataset)

n <- nrow(dataset)
p <- ncol(dataset)

#generate mean and standard deviation
M <- colMeans(dataset)
sigma <- apply(dataset,2,sd)
descriptive<-round(cbind(M,sigma),2)
descriptive

#create correlation matrix
rho <- cor(dataset)

autoval <- eigen(rho)$values
autovec <- eigen(rho)$vectors

pvarsp = autoval/p
pvarspcum = cumsum(pvarsp)
pvarsp

tab<-round(cbind(autoval,pvarsp*100,pvarspcum*100),3)
colnames(tab)<-c("eigenval","%var","%cumvar")
tab

#scree diagram to select the components
plot(autoval,type="b",main="Scree Diagram", xlab="Number of Component", ylab="Eigenvalues")
abline(h=1,lwd=3,col="red")


#get only the significant eigenvectors

#The main issue in this analysis is the forth component:
#in fact, it has a significant explanatory importance
#since the eigenvalue is above 1 
#and it moves the cumulative variance explained from (82% to 88%).
#On the other hand, using four components would be a problem when we plot the results.

#The analysis will be split into two parts: the first one will deal with four components
#while the second one will use only the first three of them in order to visualize the results.

#part 1 - four components
eigen(rho)$vectors[,1:4]


#we investigate what is explained by the components
comp<-round(cbind(
  -eigen(rho)$vectors[,1]*sqrt(autoval[1]),
  -eigen(rho)$vectors[,2]*sqrt(autoval[2]),
  -eigen(rho)$vectors[,3]*sqrt(autoval[3]),
  -eigen(rho)$vectors[,4]*sqrt(autoval[4])
),3)

rownames(comp)<-row.names(descriptive)
colnames(comp)<-c("comp1","comp2","comp3","comp4")
comp


communality<-comp[,1]^2+comp[,2]^2+comp[,3]^2+comp[,4]^2
comp<-cbind(comp,communality)
comp


#part 2 - three components
eigen(rho)$vectors[,1:3]

#we investigate what is explained by the components
comp<-round(cbind(
  -eigen(rho)$vectors[,1]*sqrt(autoval[1]),
  -eigen(rho)$vectors[,2]*sqrt(autoval[2]),
  -eigen(rho)$vectors[,3]*sqrt(autoval[3])
),3)

rownames(comp)<-row.names(descriptive)

colnames(comp)<-c("comp1","comp2","comp3")
communality<-comp[,1]^2+comp[,2]^2+comp[,3]^2
comp<-cbind(comp,communality)
comp

#calculate the scores for the selected components and graph them
#calculate components for each unit
score <- dataset%*%autovec[,1:3]
round(score,3)


#score plot
scorez<-round(cbind
              (-score[,1]/sqrt(autoval[1]),
                score[,2]/sqrt(autoval[2]),
                score[,3]/sqrt(autoval[3])),2)

x <- scorez[,1]
y <- scorez[,2]
z <- scorez[,3]

#plots
scatter3D(x, y, z, colvar = dataset[,1],
          xlab = "comp1", ylab = "comp2", zlab = "comp3",
          main = "PCA - 3 components",
          bty = "g", ticktype = "detailed", d = 2,
          theta = 60, phi = 20, col = gg.col(100),
          clab = "aqi",
          pch = 19, cex = 0.8)

text3D(x,y,z,labels = rownames(dataset),add = TRUE,
       cex = 0.5, col = gg.col(100),
       adj = 0.5, font = 1)

compz<-round(cbind
             (-comp[,1]/sqrt(autoval[1]),
               comp[,2]/sqrt(autoval[2]),
               comp[,3]/sqrt(autoval[3])),2)
x <- compz[,1]
y <- compz[,2]
z <- compz[,3]

scatter3D(x,y,z,
          col = "red",add = TRUE,
          pch = 19, cex = 0.8)

text3D(x,y,z,labels = row.names(comp),add = TRUE,
       cex = 0.5, col = "red",
       adj = 0.5, font = 1)

#Clustering 

dataset <- read.csv('usa_final.csv', sep=',')
rownames(dataset)<-dataset[,1]
datanames <- dataset
datanames$cl <- cut(datanames$aqi, breaks = c(50,100,150,200),
                    labels = c('yellow', 'orange', 'red'))
datanames$cl <- as.factor(datanames$cl)


dataset <- dataset[,2:19]

ds<-dist(scale(dataset))

agglo <- function(hc){
  data.frame(row.names=paste0("Cluster",seq_along(hc$height)),
             height=hc$height,
             compoments=ifelse(hc$merge<0,
                               hc$labels[abs(hc$merge)], paste0("Cluster",hc$merge))
  )}

opar <- par(mfrow = c(1, 3))


par(mfrow=c(1,3)) 
#first clustering: average linkage
h1 <- hclust(ds,method="average")
agglo(h1)

plot(h1, main="Average linkage")

#add robustness to the analysis trying other methods
h2 <- hclust(ds,method="complete")
agglo(h2)
plot(h2, main="Complete linkage")

#Ward method can help us in reducing distances of outliers such as California
h3 <- hclust(ds,method="ward.D2")
agglo(h3)
plot(h3, main="Ward linkage")

#choice of number of clusters - elbow method
set.seed(123)
#compute and plot wss
wss <- sapply(1:12, 
              function(k){kmeans(ds, k, nstart=50,iter.max = 12 )$tot.withinss})
wss
plot(1:12, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

#the plot suggests the number k of clusters
cut_av <- cutree(h1, k = 3)
cut_compl <- cutree(h2, k = 3)
cut_ward <- cutree(h3, k = 3)

#confront the different clustering methods
table(cut_av,cut_compl)
table(cut_av,cut_ward)

plot(h3, main="Clusters with Ward linkage method")
rect.hclust(h3,3)

# Lets compare the share of polluted groups within each of the 3 clusters 
cut_ward <- as.data.frame(cut_ward)
cut_ward <- cbind(rownames(cut_ward), cut_ward)
names(cut_ward) <- c("state","cluster")
cut_ward <- merge(cut_ward, datanames[,c(1,22)], by = "state")

library(dplyr)

group_by(cut_ward, cluster, cl) %>% 
  summarise(
    count = n(),
    share = n()/51*100
  )

install.packages("sjPlot")
library(sjPlot)

sjPlot::tab_xtab(cut_ward$cluster, cut_ward$cl, show.row.prc = TRUE)

#Correlation tests

cut_ward$cl2 <- as.numeric(cut_ward$cl)

cor(cut_ward$cluster, cut_ward$cl2, method = "pearson")
cor.test(cut_ward$cluster, cut_ward$cl2, method = "pearson") 
# There is no relationship between both indexes (cluster and pollution category)
```