---
title: "Statistical Learning Project"
author: "Above the norm - Cardarello, Ciarrocchi, Cutrera, Fusar Bassini, Maroni, Rossi"
date: "5/21/2021"
output:
  html_document:
    df_print: paged
  pdf_document:
    latex_engine: xelatex
abstract: Environmental protection is an increasingly important issue on which countries are investing substantial resources. We are interested in discovering the most relevant factors that affect pollution in the US, and in finding a way to classify each State of the United States as polluted or not on the basis of a set of explanatory variables we considered potentially relevant. In other words, we let the data tell us how pollution in the United States can be explained. In the second part we also carry out an exploratory analysis of our dataset, identifying the most relevant principal components and, finally, some hidden ways in which States can be grouped through hierarchical clustering.
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
```
```{r include=FALSE}
data <- read.csv('usa_final.csv', sep=',')
data <- data[,-c(20,21)]
log_aqi <- log(data$aqi)
data$ln_aqi <- log_aqi
```
# Data and Problem Understanding
The starting point of our research question was the effect of the lockdowns we experienced in 2020 on the environment. From media and social networks, we kept hearing about the positive effects that the lockdowns imposed all over the world had on air quality. We therefore started by considering the variable that captures the strictness of lockdown (lockdown = 1, no lockdown = 0, soft lockdown = 0.5) in relation to the air quality index. As you can notice from the graph below, the implementation of a lockdown did not have any real effect on air quality. Indeed, the distributions in the three boxplots are not statistically different: they have quite similar means and largely overlap.
```{r include=FALSE}
library(car)
library(readr)
library(dplyr)
library(tidyr)
library(readxl)
library(ggpubr)
library(ggplot2)
library(Hmisc)
library(olsrr)
library(tidyverse)
library(caret)
library(Metrics)
library(corrplot)
library(glmnet)
library(leaps)
library(class)
library(selectiveInference)
library(RCurl)
library(tree)
library(ISLR)
library(plot3D)
library(sjPlot)
```
```{r, fig.align="center", out.width="70%"}
# Visualize the AQI distribution by lockdown measure
dataset <- read.csv("usa_final.csv", sep=',', header = TRUE, dec = ".")
dataset <- dataset %>% mutate(ln_aqi=log(dataset$aqi))
ggboxplot(dataset, x = "lockdown", y = "aqi", color = "lockdown",
add = "jitter", legend = "none") +
geom_hline(yintercept = mean(dataset$aqi), linetype = 2)+ # Add horizontal line at base mean
ylim(0, 200)+
stat_compare_means(method = "anova", label.y = 200)+ # Add global ANOVA p-value
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.", hide.ns = TRUE) # Pairwise comparison against all
# We can conclude that AQI is not significantly different between States by lockdown measures.
```
Since we got these unexpected results, we decided to push our analysis further.
Therefore, from now on we aim to find the factors that actually explain the differences in air quality among US States in 2020.
We built our dataset by scraping several web sources, the most important being the Bureau of Economic Analysis, the US Environmental Protection Agency and the United States Census Bureau.
Our first approach to the data was the simple full linear regression model (OLS). We then computed the variance inflation factor (VIF) for each variable and found some 'dangerous' values: according to Hair et al. (1995) the maximum acceptable level of VIF is 10, whereas according to Ringle et al. (2015) it is 5.
Starting with the Hair et al. threshold (VIF = 10), we excluded the following regressors from our dataset: waste and healthcare; thereafter, applying the stricter Ringle et al. threshold (VIF = 5), we also dropped construction, utilities, professional, retail and finance from the analysis.
We then fitted the OLS model with the remaining variables and checked the residuals for normality: they are normally distributed.
We went on to perform model selection in order to find the most relevant regressors.
Since the number of observations (51) is not much larger than the number of predictors (17), we preferred not to rely on backward stepwise selection; we decided instead to implement the forward search in order to choose the "best" predictors for our dependent variable (air quality).
For the classification task, we adopted the K-Nearest Neighbors algorithm (K-NN), which performed quite well (accuracy = 81.25%). We then ran classification trees, but they did not produce results as good as K-NN's. We therefore tried to improve their performance with the random forest algorithm and obtained an accuracy of 75% (still not as good as K-NN's).
To sum up our main results, there exists at least one model that identifies relevant factors for distinguishing States with good air quality from more polluted ones. The relevant factors we found are: the share of rural population out of the total, the manufacturing contribution to GDP, the annual rainfall, and the density of factories in each State (number of factories per square kilometer).
## Data Description
We scraped our data from many sources, finding a dependent variable to describe air quality and many independent variables for each of the 51 States of the United States.
We list below the full dataset we built:
- aqi: Air Quality Index
the Air Quality Index is a yardstick that runs from 0 to 500. The higher the Air Quality Index value, the greater the level of air pollution^[Source: EPA - United States Environmental Protection Agency <https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual>].
We computed aqi as the average of the median value for each county of each state in 2020 and we found a minimum of 59.25 for Hawaii and a maximum of 197.81 for California.
The Air Quality Index has six possible discrete categories:
+ Green (good) = from 0 to 50
+ Yellow (moderate) = from 51 to 100
+ Orange (unhealthy for sensitive groups) = from 101 to 150
+ Red (unhealthy) = from 151 to 200
+ Purple (very unhealthy) = from 201 to 300
+ Maroon (hazardous) = 301 and higher.
From this Air Quality Index we derived a discrete variable called 'polluted' that categorizes our entities into three levels of pollution: Yellow, Orange and Red (since our States only range from 59.25 to 197.81).
- accommodation, construction, education, finance, healthcare, information, manufacturing, professional, retail, transportation, utilities, waste: contributions to Gross Domestic Product by State^[Source: BEA - Bureau of Economic Analysis
<https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1#reqid=70&step=1&isuri=1>] for the industry specified in 2020.
They are measured in millions of current dollars.
- mining: it represents the value of nonfuel mineral production per square kilometer in dollars (2017)^[Source: USGS Minerals Yearbook 2018
<https://prd-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/atoms/files/myb1-2017-stati-adv.xlsx>].
(The District of Columbia has no mineral production).
- precipitations: measured in inches, it specifies how much rain fell in each State over the year (2020)^[Source: Statista
<https://www.statista.com/statistics/1101518/annual-precipitation-by-us-state/>].
- lockdown: it indicates whether the State had a stay-at-home order (1), an advisory or a regional measure (0.5), or nothing (0). In the State of Wisconsin, the lockdown was declared unconstitutional after a month, so we decided to code it as lockdown = 0.5.^[Source: Wikipedia
<https://en.wikipedia.org/wiki/U.S._state_and_local_government_responses_to_the_COVID-19_pandemic#Initial_pandemic_responses,_including_full_lockdowns>].
- pop_rural: According to data from the decennial census of the population for each State, it distinguishes rural from urban population^[Source: United States Census Bureau
<https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural/2010-urban-rural.html>].
We considered the percentage of rural population out of the total as a measure of how rural a State is.
- n_factories: Number of manufacturing firms in each state in 2017^[Source: National Association of Manufacturers
<https://www.nam.org/state-manufacturing-data/>] divided by the surface area of each State. Therefore, it represents the density of factories in each State.
The structure of our dataset, formed by 51 observations and 19 variables without missing values, is as follows:
```{r include=FALSE}
x <- getURL("https://raw.githubusercontent.com/mathicard/Statistical-Learning-DSE/main/usa_final.csv")
dataset <- read.csv(text = x)
```
```{r, out.width="50%", fig.align="center"}
str(dataset[,1:19])
```
## Data Visualization and Exploration
We start the exploratory analysis with our dependent variable *aqi*, which describes the air quality of each of the 51 US States in 2020. The mean is 95.57 with a standard deviation of 25.76. The complete descriptive statistics are in the following table.
<br/>
```{r message=FALSE, out.width="50%", fig.align="center"}
dataset %>% summarise(
count = n(),
mean = mean(aqi, na.rm = TRUE),
median = median(aqi),
min = min(aqi),
max = max(aqi),
sd = sd(aqi, na.rm = TRUE)
)
```
<br/>
From the boxplot we can easily visualize the distribution of the Air Quality Index and, in particular, identify the most polluted States in 2020. We can consider California (197.81), District of Columbia (166) and Wyoming (148.22) as outliers in the dataset.
<br/>
```{r, message=FALSE, out.width="60%", fig.align="center"}
Boxplot(~aqi, data=data, col="#69b3a2", id=list(labels=data$state))
```
<br/>
Then, we compare the distribution of the Air Quality Index across the three levels of the factor *polluted*. The share of entities with a "yellow" level of pollution is more than 70%, while 26% are "orange" and the rest are "red" (4%). As we can see in the boxplots, leaving aside the only two States with a "red" AQI, "orange" has the most dispersed distribution, with a standard deviation of 14.8. In fact, its two extreme values, 101 and 148, lie almost at the minimum and maximum boundaries of that level of pollution. Finally, the presence of two States with a very high AQI could introduce some noise into our analysis and should be taken into account.
<br/>
```{r message=FALSE, include=FALSE}
# AQI categorization
dataset$polluted <- cut(dataset$aqi, breaks = c(50,100,150,200),
labels = c('yellow', 'orange', 'red'))
```
```{r message=FALSE, out.width="50%", fig.align="center"}
## Descriptive statistics by group ##
group_by(dataset, polluted) %>%
summarise(
count = n(),
share = n()/51*100,
mean = mean(aqi, na.rm = TRUE),
median = median(aqi),
min = min(aqi),
max = max(aqi),
sd = sd(aqi, na.rm = TRUE)
)
#Box plot
ggboxplot(dataset, x = "polluted", y = "aqi", color = "polluted",
palette = c("#E7B800", "#FC4E07", "red"),
add = "jitter", legend = "none") +
geom_hline(yintercept = mean(dataset$aqi), linetype = 2)+ # Add horizontal line at base mean
ylim(0, 200)
```
<br/>
Looking at the density plot, AQI seems to follow a right-skewed, roughly Normal distribution. To be sure, we use the well-known Shapiro-Wilk test of the null hypothesis of normality. Unfortunately, the null is rejected at the 95% confidence level, so our variable needs to be transformed.
<br/>
```{r message=FALSE, out.width="50%", fig.align="center"}
dataset %>%
ggplot(aes(x=aqi)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = 'Density Plot of Air Quality Index', subtitle='Levels')
shapiro.test(dataset$aqi)
```
<br/>
A possible way to improve our results is to apply a logarithmic transformation. The logarithm reduces the right skew, and indeed the Shapiro-Wilk test no longer rejects the normality hypothesis at the 95% confidence level (p-value > 0.05).
```{r, include=FALSE}
dataset <- dataset %>% mutate(ln_aqi=log(dataset$aqi))
```
```{r, out.width="80%", fig.align="center"}
shapiro.test(dataset$ln_aqi)
```
In the following density plots we can see the improvement after applying the logarithm function to our dependent variable, which now seems more similar to a Normal distribution.
<br/>
```{r, fig.align="center", out.width='60%', out.height='45%'}
d1 <- dataset %>%
ggplot(aes(x=ln_aqi)) +
geom_density(fill="orangered3", color=FALSE, alpha=0.5)+
labs(title = 'Density Plot of Air Quality Index', subtitle='Logarithmic scale')
d2 <- dataset %>%
ggplot(aes(x=aqi)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = 'Density Plot of Air Quality Index', subtitle='Levels')
ggarrange(d2, d1,
ncol = 2, nrow = 1)
```
<br/>
As almost all the points of our sample fall approximately along the reference line in the QQ-plot, we can assume normality of the (log) AQI variable. This assumption allows us to work with statistical hypothesis tests that assume the data follow a Normal distribution, to rely on the Central Limit Theorem, and to expect normality of the residuals when building a regression model.
<br/>
```{r, fig.align="center", out.width="50%"}
ggqqplot(dataset$ln_aqi, title = "QQ Plot of log AQI")
```
<br/>
In the following correlation matrix we analyze the relationships between the variables of our dataset, computing the correlation coefficients together with their p-values so as to check for significance.
<br/>
```{r, message=FALSE, out.width="60%", fig.align="center"}
data_corr <- dataset[,-c(1,20:23)]
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
M <- cor(data_corr)
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(data_corr)
colnames(M) <- c("aqi", "accom", "const", "edu", "fin", "health", "info",
"manuf", "mining", "prof", "retail", "trans", "util", "waste",
"precip", "lock", "p_rural", "n_fact")
corrplot(M, method="color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black",
tl.col="black", tl.srt=45,
p.mat = p.mat, sig.level = 0.05,
number.cex = .6,
diag=FALSE)
```
<br/>
Almost all variables have a positive correlation with AQI, except for pop_rural, which has the expected negative sign: air quality improves when the share of rural population of a State is greater. Education, mining, precipitations and lockdown, instead, show a non-significant correlation with our variable of interest. Moreover, as expected, the variables related to the GDP of each State are strongly positively correlated with one another, indicating a possible multicollinearity problem. Finally, pop_rural is negatively correlated with a large number of variables in the dataset, including AQI as we saw.
<br/>
```{r, message=FALSE, out.width="60%", fig.align="center"}
# Perform the ANOVA test
compare_means(aqi ~ lockdown, data = dataset,
ref.group = ".all.", method = "t.test")
```
<br/>
To check whether air quality differs with the intensity of the lockdown measures applied by each State in 2020, we perform an ANOVA test. According to the results, where the t-statistic is lower than 2 with a p-value larger than 0.05 (0.37), *aqi* does not differ across *lockdown* levels at the 95% confidence level. Therefore, we can conclude that AQI is not significantly different between States by lockdown measures.
<br/>
```{r, include=FALSE}
log_aqi <- log(data$aqi)
data$ln_aqi <- log_aqi
dt <- data %>% drop_na()
dt <- dt[,-c(1,2)]
```
\tiny
```{r, include=FALSE}
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt)
summary(full.model)
```
\normalsize
# Data Analysis for Supervised Learning
## Model Selection - Forward stepwise
Our dataset, composed of 51 observations (the 51 States of the United States of America) and 18 explanatory variables, does not need any further manipulation since we have no missing values. The only things we have done are: the transformation of the dependent variable (air quality) to the logarithmic scale, in order to get a "more" normal distribution as explained in the previous section, and the standardization of all our variables, which puts them all on the same scale.
Once our dataset was ready, we made a first attempt with the simplest way we know to build a model: Ordinary Least Squares. In the full model the most significant variables seem to be the number of factories (with a positive sign, as expected) and precipitations (with a negative sign, as expected).
```{r}
full.model <- lm(ln_aqi~., data = dt)
summary(full.model)
```
But before drawing conclusions, we have to check many things in our regression model; one of the most important is the variance inflation factor (VIF), which gives us insights into the variables that can be "dangerous" with regard to multicollinearity.
```{r}
vif(full.model)
vif(full.model) > 10  # flag predictors exceeding the Hair et al. threshold
```
In fact, some explanatory variables have a very high variance inflation factor, and we decided to drop them, taking as benchmark first the threshold cited in Hair et al. (a maximum acceptable VIF of 10) and then the stricter one of Ringle et al. (a maximum of 5).
After excluding the predictors with high VIF, our linear regression model still has the number of factories and precipitations as the most statistically significant explanatory variables.
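For reference, the variance inflation factor of the $j$-th predictor is obtained from the $R^2$ of the regression of that predictor on all the other predictors:

$$\mathrm{VIF}_j=\frac{1}{1-R_j^2},$$

so the Hair et al. threshold $\mathrm{VIF}=10$ corresponds to $R_j^2=0.9$, and the Ringle et al. threshold $\mathrm{VIF}=5$ to $R_j^2=0.8$.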
```{r}
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt)
summary(full.model)
vif(full.model)
vif(full.model) > 5  # flag predictors exceeding the Ringle et al. threshold
```
Fitting our regression model on a random training set, we can verify that the residuals are normally distributed.
```{r, include=FALSE}
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
dt <- as.data.frame(lapply(dt, normalize))
set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-18]
dt_test = dt[-train,-18]
dt_train_labels <- dt[train, 18]
dt_test_labels <- dt[-train, 18]
```
```{r, message=FALSE, out.width="50%", fig.align="center"}
full.model <- lm(ln_aqi~.-waste-healthcare-construction-utilities-professional-retail-finance, data = dt[train,])
ols_plot_resid_fit(full.model)
ols_plot_resid_hist(full.model)
```
Making predictions on our held-out test set, we can plot actual versus predicted values to see how far they fall from the straight line of zero error:
```{r, message=FALSE, out.width="50%", fig.align="center"}
pred_ols <- predict(full.model, dt[-train,])
ggplot(dt_test, aes(x=pred_ols, y=dt_test_labels)) +
geom_point() +
geom_abline(intercept=0, slope=1) +
labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')
```
## Model Selection - LASSO
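As a brief reminder of the criterion being fitted below (in `glmnet`'s scaling for Gaussian responses):

$$\hat\beta_\lambda=\arg\min_{\beta_0,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert.$$

The $\ell_1$ penalty shrinks the coefficients and sets some of them exactly to zero, so the choice of $\lambda$ doubles as variable selection.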
```{r, message=FALSE, out.width="50%", fig.align="center"}
x=model.matrix(ln_aqi~., dt)[,-1]
x = scale(x, TRUE, TRUE)
y=dt$ln_aqi
fit.lasso=glmnet(x,y, standardize = FALSE)
plot(fit.lasso,xvar="lambda",label=TRUE)
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
coef(cv.lasso)
```
We use our earlier train/validation division to select the lambda for the lasso.
```{r, message=FALSE, out.width="50%", fig.align="center"}
lasso.tr=glmnet(x[train,],y[train], standardize = FALSE)
#lasso.tr
pred=predict(lasso.tr,x[-train,])
#dim(pred)
rmse= sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")
lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best
```
At the best lambda, the RMSE is smaller than that of the linear regression model.
```{r, message=FALSE, out.width="50%", fig.align="center"}
coef(lasso.tr,s=lam.best)
```
The variables with the highest relevance are precipitations, pop_rural and n_factories.
The sub-selection also retains manufacturing and lockdown, which have less relevance.
Next, we compute post-lasso inference to check which variables in the sub-selection remain significant.
```{r, message=FALSE, out.width="50%", fig.align="center"}
sigma = estimateSigma(x,y)$sigmahat
beta = coef(lasso.tr, x=x, y=y, s=lam.best/51, exact = TRUE)[-1]
out_lasso_inf = fixedLassoInf(x,y,beta,lam.best,sigma=sigma)
out_lasso_inf
```
With a lambda of 0.026, the most relevant variables by p-value are **precipitations** and **n_factories**, while pop_rural, manufacturing and lockdown show high p-values and are dropped.
With the lasso, the results are similar to those of the other models used, but the post-selection inference
reduces the set of variables to just the two mentioned above.
```{r, include=FALSE, message=FALSE, warning=FALSE}
dt <- dt[,-c(13, 5, 2, 12, 9, 10, 4)]
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
#dt <- as.data.frame(lapply(dt, normalize))
set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-11]
dt_test = dt[-train,-11]
dt_train_labels <- dt[train, 11]
dt_test_labels <- dt[-train, 11]
```
We now apply the forward search method, which starts from the null model and, step by step, adds the variable (among the remaining ones) whose inclusion yields the model with the lowest Residual Sum of Squares, or equivalently the highest R^2. Plotting an estimate of the prediction error (the Bayesian Information Criterion - BIC, recalled below) we can find the most parsimonious model that performs best, because the BIC also takes into account the number of predictors included in each model (as a penalization term for models with more variables).
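One common form of the criterion, for a least-squares model with $d$ predictors (as given in ISLR), is

$$\mathrm{BIC}=\frac{1}{n}\left(\mathrm{RSS}+\log(n)\,d\,\hat\sigma^{2}\right),$$

where $\hat\sigma^{2}$ is an estimate of the error variance; since $\log n>2$ whenever $n>7$, the BIC penalizes additional predictors more heavily than $C_p$ or the AIC.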
```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
regfit.fwd=regsubsets(ln_aqi~.,data=dt,method="forward", nvmax=10)
summary(regfit.fwd)
```
```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
reg.summary<-summary(regfit.fwd)
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type="l")
```
In our linear model made up of the 4 selected variables, the only one that is no longer significant is the rural population indicator, but the signs are all as expected.
```{r, message=FALSE, out.width="50%", fig.align="center"}
model <- lm(ln_aqi~ pop_rural + manufacturing + precipitations + n_factories,
data = dt[train,])
summary(model)
```
```{r, message=FALSE, out.width="50%", fig.align="center"}
pred_fwd <- predict(model, dt[-train,])
root_mse = rmse(dt_test_labels, pred_fwd)
root_mse
```
In the end, our new model built with the forward search has a lower root mean squared error on the test-set predictions: it was 0.1823493 for the full model, and it is now 0.1317688.
## Tree classifiers
In this section, we try to classify the 51 States according to the pollution of their air, that is, to predict the air-quality category to which each State belonged in 2020.
We started by including all the variables in our analysis, since we want the tree to select the most important ones for the classification.
The plot below gives a graphical representation of how the tree splits the data; the split criterion used by `tree()` is recalled below.
This tree has a misclassification error rate of approximately 18%.
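As standard background (not a result of this project): `tree()` grows the tree by recursive binary splitting, at each node choosing the split that most reduces the node impurity, measured by default through the deviance

$$D_m=-2\sum_{k}n_{mk}\log\hat p_{mk},$$

where $n_{mk}$ is the number of training observations of class $k$ falling in node $m$ and $\hat p_{mk}$ is their proportion.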
```{r, message=FALSE, out.width="50%", fig.align="center"}
data$cl <- cut(data$aqi, breaks = c(50,100,150,200),
labels = c('yellow', 'orange', 'red'))
dt <- dt[,-11]
dt$polluted <- data$cl
dt$polluted <- as.factor(dt$polluted)
#train test split
set.seed(123)
train = sample(1:nrow(dt), 0.7*nrow(dt))
dt_train = dt[train,-11]
dt_test = dt[-train,-11]
dt_train_labels <- dt[train, 11]
dt_test_labels <- dt[-train, 11]
summary(dt_train_labels)
summary(dt_test_labels)
###############################################################################
tree = tree(polluted~., dt)
summary(tree)
plot(tree)
text(tree, pretty = 0)
```
We trained the tree on the training set and tested it, obtaining a low accuracy of 62.5%.
```{r, message=FALSE, out.width="50%", fig.align="center"}
tree_train <- tree(polluted~., dt[train,])
tree_pred <- predict(tree_train, dt[-train,], type = 'class')
table(tree_pred, dt_test_labels)
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(tree_pred, dt_test_labels))
```
We wanted to improve the performance of our classification task, so we reduced the dimensionality of our dataset by keeping only the most relevant variables chosen in the model-selection section.
Reducing the number of variables led to a slightly lower misclassification error rate of 16%. Below you can find the plot of the tree.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#TREE with relevant variables
tree = tree(polluted~pop_rural + manufacturing + precipitations
+ n_factories, dt)
summary(tree)
plot(tree)
text(tree, pretty = 0)
```
We trained the algorithm again and tested it, expecting better results.
However, as can be seen, the accuracy does not increase: it remains equal to 62.5%.
```{r, message=FALSE, out.width="50%", fig.align="center"}
tree_train <- tree(polluted~pop_rural + manufacturing + precipitations
+ n_factories, dt[train,])
tree_pred <- predict(tree_train, dt[-train,], type = 'class')
table(tree_pred, dt_test_labels)
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(tree_pred, dt_test_labels))
```
As one last attempt to improve the accuracy of our tree classifier, we applied a random forest.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#random forest
library(randomForest)
rf.tree=randomForest(polluted~.,data=dt,subset=train)
rf.tree
pred_rf <- predict(rf.tree, dt_test)
table(pred_rf, dt_test_labels)
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(table(pred_rf, dt_test_labels))
```
As can be seen, the predictions on the test set yield an improved accuracy of 75%.
Even though the random forest gave better results, we were not completely satisfied with the accuracy of this classification method. Therefore, we moved to the K-Nearest Neighbors algorithm, which, as shown below, performs better and achieves a higher accuracy.
## k-Nearest Neighbours
In this section we apply the K-Nearest Neighbors supervised algorithm (KNN), with the aim of classifying the air-quality category to which each State belongs according to the 2020 AQI; the classification rule is recalled below.
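Concretely, for a test point $x_0$ the KNN classifier estimates the class probabilities from the $K$ nearest training points,

$$\widehat{\Pr}(Y=j\mid X=x_0)=\frac{1}{K}\sum_{i\in\mathcal{N}_0} I(y_i=j),$$

where $\mathcal{N}_0$ contains the $K$ training observations closest to $x_0$ in Euclidean distance, and predicts the class with the largest estimated probability; this is why the rescaling of the features described next matters.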
First, as we already did in the previous models, we normalize the variables in our dataset so that distances along variables with larger ranges are not over-emphasized. The values of all features thus lie in the range of 0 to 1.
```{r, include=FALSE, warning = FALSE}
x <- getURL("https://raw.githubusercontent.com/mathicard/Statistical-Learning-DSE/main/usa_final.csv")
dataset <- read.csv(text = x)
#AQI categorization
dataset$polluted <- cut(dataset$aqi, breaks = c(50,100,150,200),
labels = c('yellow', 'orange', 'red'))
dt <- as.data.frame(cbind(lapply(dataset[,-c(1,2,20,21,22)], normalize), dataset[,c(1,20,21,22)]))
```
Then, we split the original labeled dataset into training (70%) and test (30%) sets, keeping a similar distribution of classes so that we do not favor one class over the others in the predictions.
```{r, include=FALSE, warning = FALSE}
set.seed(12)
library(caret)
train.index <- createDataPartition(dt$polluted, p = .7, list = FALSE)
dt_train <- dt[ train.index,-c(18,19,20)]
dt_test <- dt[-train.index,-c(18,19,20)]
dt_train_labels <- dt[train.index, 21]
dt_test_labels <- dt[-train.index, 21]
table1 <- table(dt_train_labels)
round(prop.table(table1), 2)
table2 <- table(dt_test_labels)
round(prop.table(table2), 2)
dt_train <- dt_train[,-18]
dt_test <- dt_test[,-18]
```
In order to assess the classification accuracy, we plot the training and test error rates as a function of the number of neighbors K. As the hyperparameter **K** increases, the method becomes less flexible. The training error rate consistently increases as the flexibility decreases. Both error rates decline at first, reaching a minimum at K=5, before increasing again as the method becomes more inflexible. Therefore, we chose K=5 as the best number of neighbors for KNN in our classification.
```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
##load the package class
library(class)
### Plot training-test error by K value
error.train <- rep(0, 20)
for(k in 1:20) {
# predict on the training set itself to obtain the training error
pred_pol <- knn(train = dt_train, test = dt_train, cl = dt_train_labels, k)
error.train[k]<-1-mean(pred_pol==dt_train_labels)
}
error.test <- rep(0, 20)
for(k in 1:20) {
# predict on the held-out test set to obtain the test error
pred_pol <- knn(train = dt_train, test = dt_test, cl = dt_train_labels, k)
error.test[k]<-1-mean(pred_pol==dt_test_labels)
}
plot(error.train, type="o", col="blue", ylim=c(0,0.5), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)
#which(error.train==min(error.train)) # k=5
#which(error.test==min(error.test)) # k=5
```
By setting K = 5, we get a model with an accuracy of 85% with the misclassification of only two "orange" States (Arizona and Michigan) which are categorized as "yellow".
```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
##run knn function (k=5)
pr <- knn(dt_train,dt_test,cl=dt_train_labels,k=5)
##create confusion matrix
tab <- table(pr,dt_test_labels)
tab
##this function divides the correct predictions by total number of predictions that tell us how accurate the model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
#Accuracy of 85%
```
Moreover, we perform a repeated cross-validation procedure to choose K among the fitted models according to accuracy, our evaluation measure. The next figure plots the number of neighbors versus accuracy: fixing K = 7 gives the model with the largest accuracy (72%).
```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
set.seed(12)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(polluted ~ .,
data = dt[ train.index,-c(18,19,20)], method = "knn", trControl = ctrl,
preProcess = c("center","scale"), tuneLength = 20)
plot(knnFit)
#knnFit$bestTune #higher accuracy with k=7
#Output of kNN fit
knnFit
```
In spite of the presence of multicollinearity among our variables, as we saw in the previous section, predictions should not be affected in KNN modeling. Therefore, we ran this classification algorithm with only the four most relevant variables for predicting AQI selected by the forward search performed earlier (*"manufacturing"*, *"pop_rural"*, *"precipitations"* and *"n_factories"*). However, the accuracy barely improved: with K=5 or K=9 the metric was about 77% in both cases.
```{r, message=FALSE, out.width="50%", fig.align="center", warning=FALSE}
dt_train2 <- dt_train[,c("manufacturing","pop_rural","precipitations")]
dt_test2 <- dt_test[,c("manufacturing","pop_rural","precipitations")]
error.train <- rep(0, 20)
for(k in 1:20) {
# predict on the training set itself to obtain the training error
pred_pol <- knn(train = dt_train2, test = dt_train2, cl = dt_train_labels, k)
error.train[k]<-1-mean(pred_pol==dt_train_labels)
}
error.test <- rep(0, 20)
for(k in 1:20) {
# predict on the held-out test set to obtain the test error
pred_pol <- knn(train = dt_train2, test = dt_test2, cl = dt_train_labels, k)
error.test[k]<-1-mean(pred_pol==dt_test_labels)
}
plot(error.train, type="o", col="blue", ylim=c(0,0.7), xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", ylim=c(0,0.5), col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)
#which(error.train==min(error.train)) # k>=5
#which(error.test==min(error.test)) # k=5
##run knn function (k=5)
pr5 <- knn(dt_train2,dt_test2,cl=dt_train_labels,k=5)
##create confusion matrix
tab <- table(pr5,dt_test_labels)
tab
accuracy(tab)
#Accuracy of 84.6%
combinetest <- cbind(dt_test, dt[-train.index, 18])
combinetest[dt_test_labels != pr5,]
#Michigan and Oklahoma are the States misclassified. Let see the AQI
dataset[dataset$state=="Oklahoma", 2] #102.1
dataset[dataset$state=="Michigan", 2] #114.11
```
Finally, we keep the three most important variables (*"manufacturing"*, *"pop_rural"*, *"precipitations"*) and with K = 5 the accuracy obtained was about 85%, equal to the largest value we saw in the KNN model performed with all the variables. In this case, the only two misclassified States are Michigan (114.1) and Oklahoma (102.1), both belonging to the "orange" category and categorized by the model as "yellow".
```{r, message=FALSE, out.width="70%", fig.align="center"}
#Plot in 3 dimensional space
#install.packages("plot3D")
colVar <- factor(dt$polluted)
scatter3D(dt$pop_rural, dt$manufacturing, dt$precipitations,
labels = rownames(dt),
colvar=as.integer(colVar),
phi = 0, bty ="g",
pch = 20, cex = 1.5,
col = c("gold2", "chocolate1", "red3"),
xlab = "Rural pop",
ylab ="Manufacturing", zlab = "Precipitations",
colkey = list(at = c(1, 2, 3), side = 4,
addlines = TRUE, length = 0.5, width = 0.5,
labels = c("yelow", "orange", "red")))
text3D(dt$pop_rural, dt$manufacturing, dt$precipitations,
labels = dt$Code,
add = TRUE, colkey = FALSE, cex = 0.5)
```
In conclusion, we keep this last KNN model to predict the category to which each State belongs according to its level of air quality, using only the manufacturing GDP, the share of rural population and the quantity of precipitations.
# Data Analysis for Unsupervised Learning
## Unsupervised learning
We enrich the analysis by performing two unsupervised learning methods. In particular, we use PCA and hierarchical clustering in order to visualize and understand similarities between States.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#load dataset
dataset <- read.csv('usa_final.csv', sep=',')
#prepare dataset for both PCA and clustering
rownames(dataset)<-dataset[,1]
datanames <- dataset
datanames$cl <- cut(datanames$aqi, breaks = c(50,100,150,200),
labels = c('yellow', 'orange', 'red'))
datanames$cl <- as.factor(datanames$cl)
aqi_plot <- dataset[,2]
dataset <- dataset[,3:19]
dataset <- scale(dataset)
ds<-dist(dataset)
```
### PCA
PCA highlights the possibility of reducing the model to a much simpler one. As our main goal is to explain the variance in the data with a small number of components, we proceed by extracting the correlation matrix from the dataset and then its eigenvalues. As the table below shows, a first principal component explains more than 66% of the variance. Two further components derive from eigenvalues just above one.
```{r, message=FALSE, out.width="50%", fig.align="center"}
n <- nrow(dataset)
p <- ncol(dataset)
#create correlation matrix
rho <- cor(dataset)
autoval <- eigen(rho)$values
autovec <- eigen(rho)$vectors
pvarsp = autoval/p
pvarspcum = cumsum(pvarsp)
pvarsp
tab<-round(cbind(autoval,pvarsp*100,pvarspcum*100),3)
colnames(tab)<-c("eigenval","%var","%cumvar")
tab
```
An easy way to select components is a scree plot. Since the table shows a first principal component followed by two others lying an order of magnitude below, we prefer to avoid the elbow rule and instead keep every component with an eigenvalue above one. This choice leaves us with a model explaining 84% of the variance.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#scree diagram to select the components
plot(autoval,type="b",main="Scree Diagram", xlab="Number of Components", ylab="Eigenvalues")
abline(h=1,lwd=3,col="red")
```
As mentioned before, PCA is a powerful tool for data visualization. In this case, there is a tradeoff between visualization and reliability of the model. On the one hand, the fourth component has some explanatory importance, since its eigenvalue is just below 1 and it moves the cumulative variance explained from 84% to 90%. On the other hand, it is still below the threshold and can be neglected without a significant loss. To maximize explained variance one would select a four-dimensional model, while lower-dimensional plotting requires at most three components.
#### Three components analysis
Since in this analysis PCA is used for description and visualization, we focus on a three-component analysis, which is the most useful for reading the loadings and thus understanding how the variables are explained by the principal components. First of all, we select the corresponding eigenvectors.
We proceed by generating the loadings for each component. As shown in the following table, the first principal component has a heavy role in explaining almost every variable, peaking at waste, professional, and healthcare.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#we investigate what is explained by the components
comp<-round(cbind(
-eigen(rho)$vectors[,1]*sqrt(autoval[1]),
-eigen(rho)$vectors[,2]*sqrt(autoval[2]),
-eigen(rho)$vectors[,3]*sqrt(autoval[3])
),3)
rownames(comp)<-colnames(dataset)
colnames(comp)<-c("comp1","comp2","comp3")
communality<-comp[,1]^2+comp[,2]^2+comp[,3]^2
comp<-cbind(comp,communality)
comp
```
The scores of each State on the three components are reported below. Notably, California, a very polluted State, stands out as it is far from every other one. The sign of the components varies from State to State, so we cannot draw a global interpretation of them. In this case, we can interpret them as coordinates of points in a three-dimensional space and plot them.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#calculate the scores for the selected components
score <- dataset%*%autovec[,1:3]
round(score,3)
```
We scale the scores and plot them.
```{r, message=FALSE, out.width="80%", fig.align="center"}
#score plot
scorez<-round(cbind
(-score[,1]/sqrt(autoval[1]),
score[,2]/sqrt(autoval[2]),
score[,3]/sqrt(autoval[3])),2)
x <- scorez[,1]
y <- scorez[,2]
z <- scorez[,3]
#plots
scatter3D(x, y, z, colvar = aqi_plot,
xlab = "comp1", ylab = "comp2", zlab = "comp3",
main = "PCA - 3 components",
bty = "g", ticktype = "detailed", d = 2,
theta = 60, phi = 20, col = gg.col(100),
clab = "aqi",
pch = 19, cex = 0.8)
text3D(x,y,z,labels = rownames(dataset),add = TRUE,
cex = 0.5, col = gg.col(100),
adj = 0.5, font = 1)
```
### Clustering
The main goal of clustering in this analysis is, on the one hand, to group States which share common features; on the other hand, it is also useful for discovering peculiar States with markedly different behaviours, which could be the hint for further analysis. To add robustness, we perform and compare different clustering methods. In particular, we compare the **average linkage**, the **complete linkage** and the **Ward linkage** methods by plotting the respective dendrograms; the three merge criteria are recalled below.
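As a quick reference, for two clusters $A$ and $B$ with pairwise point distances $d(a,b)$ the first two linkages are defined as

$$d_{\text{complete}}(A,B)=\max_{a\in A,\,b\in B}d(a,b),\qquad d_{\text{average}}(A,B)=\frac{1}{|A||B|}\sum_{a\in A}\sum_{b\in B}d(a,b),$$

while Ward's method merges, at each step, the pair of clusters whose fusion produces the smallest increase in the total within-cluster sum of squares.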
```{r, message=FALSE, out.width="90%", fig.align="center", out.extra='angle=90'}
agglo <- function(hc){
data.frame(row.names=paste0("Cluster",seq_along(hc$height)),
height=hc$height,
components=ifelse(hc$merge<0,
hc$labels[abs(hc$merge)], paste0("Cluster",hc$merge))
)}
#first clustering: average linkage
h1 <- hclust(ds,method="average")
par(mfrow=c(1,3))
plot(h1, main="Average linkage")
#add robustness to the analysis trying other methods
h2 <- hclust(ds,method="complete")
plot(h2, main="Complete linkage")
#Ward method can help us in reducing distances of outliers such as California
h3 <- hclust(ds,method="ward.D2")
plot(h3, main="Ward linkage")
```
The results are quite similar, with Texas, New York and especially California disjoint from the main tree almost up to the root. This is consistent with the PCA, which already highlighted California as a strong outlier. We choose to go on with the Ward method for two reasons. First, by construction it reduces the distances of the outliers, as we can clearly confirm by comparing it with average linkage. Secondly, it is the one that subdivides the States most clearly. By visual inspection, the dendrogram suggests three clusters. To confirm this, we test it with the elbow method, which looks for the number of clusters (k) that minimizes the total within-cluster sum of squares (wss). We plot the curve of wss as a function of k and locate the bending point (or elbow), a good indicator for choosing k.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#choice of number of clusters - elbow method
set.seed(123)
#compute and plot wss
wss <- sapply(1:12,
function(k){kmeans(ds, k, nstart=50,iter.max = 12 )$tot.withinss})
wss
plot(1:12, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
```
The test confirms the presence of three main clusters.
```{r, message=FALSE, out.width="50%", fig.align="center"}
plot(h3, main="Clusters with Ward linkage method")
rect.hclust(h3,3)
```
It is worth mentioning that the Ward method differs from the other two, which share exactly the same clusters except for one State. The quick comparison below shows that the Ward method leads to a more balanced subdivision between clusters.
```{r, message=FALSE, out.width="50%", fig.align="center"}
cut_av <- cutree(h1, k = 3)
cut_compl <- cutree(h2, k = 3)
cut_ward <- cutree(h3, k = 3)
#compare the different clustering methods
table(cut_av,cut_compl)
table(cut_av,cut_ward)
table(cut_compl,cut_ward)
```
#### Comparison
We now ask whether the clusters resemble the pollution levels of the States. Since the dataset is subdivided into three clusters, the same number as the qualitative pollution labels ("red", "orange" and "yellow"), we can easily cross-tabulate them to see if they give similar groupings.
```{r, message=FALSE, out.width="50%", fig.align="center"}
# Lets compare the share of polluted groups within each of the 3 clusters
cut_ward <- as.data.frame(cut_ward)
cut_ward <- cbind(rownames(cut_ward), cut_ward)
names(cut_ward) <- c("state","cluster")
cut_ward <- merge(cut_ward, datanames[,c(1,22)], by = "state")
group_by(cut_ward, cluster, cl) %>%
summarise(
count = n(),
share = n()/51*100
)
tab_xtab(cut_ward$cluster, cut_ward$cl, show.row.prc = TRUE)
```
The table shows no correlation between the labels and the clusters. The correlation test below has a non-significant p-value of 0.708.
```{r, message=FALSE, out.width="50%", fig.align="center"}
#Correlation tests
cut_ward$cl2 <- as.numeric(cut_ward$cl)
cor(cut_ward$cluster, cut_ward$cl2, method = "pearson")
cor.test(cut_ward$cluster, cut_ward$cl2, method = "pearson")
```
# Theoretical Background
## Supervised learning
### Model selection
In order to perform *model selection*, we need to focus on the goals of a generic model: maximizing the accuracy of the prediction, especially by controlling the variance, and making the analysis as interpretable as possible by removing irrelevant features. To pursue these aims, we performed a subset selection, identifying the predictors most related to the response variable and then fitting a model using least squares on the reduced set of variables. There are different methods to obtain this best selection, each with its pros and cons, and the ones we used in our analysis are presented in the following sections.
### Best subset selection
The main idea of this method is that there exists a best subset of predictors, which we should identify.
The method proceeds in 3 steps:
1. We start with the *null model*, which contains no predictors and simply predicts the sample mean for each observation.
2. Considering *p* predictors, for each *k = 1,..., p* we fit all possible models containing exactly *k* predictors, whose number equals the binomial coefficient *(p choose k)*. Among all models with the same value of *k*, the one with the smallest residual sum of squares, or equivalently the highest R^2, is picked.
3. Having found the best model for each level of *k*, it is possible to select the overall best one using a criterion that adjusts the training error (C~p~, AIC, BIC, adjusted R^2) or by directly estimating the test error with a validation set or cross-validation.
Best subset selection searches over all possible subsets, as sketched below. Nevertheless, it is computationally heavy, especially with large values of *p*; furthermore, from a statistical point of view, this large search space can lead to an overfitted model.
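A minimal sketch of the procedure with the `leaps` package (reusing the `dt` data frame with response `ln_aqi` from the supervised section; the chunk is illustrative only, hence `eval=FALSE`):

```{r, eval=FALSE}
# Exhaustive search over all subsets of up to 10 predictors of ln_aqi
regfit.full <- regsubsets(ln_aqi ~ ., data = dt, nvmax = 10,
                          method = "exhaustive")
full.summary <- summary(regfit.full)
best.size <- which.min(full.summary$bic)  # model size minimizing the BIC
coef(regfit.full, best.size)              # coefficients of the chosen model
```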
For these reasons, stepwise approaches can seem appealing alternatives.
### Forward stepwise selection