-
Notifications
You must be signed in to change notification settings - Fork 2
/
Unsupervised_learning.Rmd
282 lines (213 loc) · 10.9 KB
/
Unsupervised_learning.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Unsupervised learning"
output: html_document
---
## Unsupervised learning
We enrich the analysis performing two unsupervised learning methods for classification. In particular, we use PCA and hierarchical clustering in order to visualize and understand similarities between States.
```{r setup, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r include=FALSE}
library(plot3D)
library(dplyr)
library(sjPlot)
#load dataset
dataset <- read.csv('usa_final.csv', sep=',')
#prepare dataset for both PCA and clustering
rownames(dataset)<-dataset[,1]
datanames <- dataset
datanames$cl <- cut(datanames$aqi, breaks = c(50,100,150,200),
labels = c('yellow', 'orange', 'red'))
datanames$cl <- as.factor(datanames$cl)
dataset <- dataset[,2:19]
dataset <- scale(dataset)
ds<-dist(dataset)
```
### PCA
PCA highlights the possibility to reduce the model to a much simple one. As our main goal is to explain the variance in the data within a small number of components, we proceed by extracting the correlation matrix from the dataset and then its eigenvalues. As the underlying table shows, there is a first principal component that explains over 64% of the variance. There are other three components deriving from an eigenvalue just above one.
```{r}
n <- nrow(dataset)
p <- ncol(dataset)
#create correlation matrix
rho <- cor(dataset)
autoval <- eigen(rho)$values
autovec <- eigen(rho)$vectors
pvarsp = autoval/p
pvarspcum = cumsum(pvarsp)
pvarsp
tab<-round(cbind(autoval,pvarsp*100,pvarspcum*100),3)
colnames(tab)<-c("eigenval","%var","%cumvar")
tab
```
An easy way to select components is a scree plot. Since the table witnesses the presence of a first principal component followed by three other ones lying an order of magnitude below, we prefer to avoid the elbow rule and instead select every component above one. This choice allows us to end up with a model explaining 88% of the variance.
```{r, fig.align="center", out.width="70%"}
#scree diagram to select the components
plot(autoval,type="b",main="Scree Diagram", xlab="Number of Components", ylab="Eigenvalues")
abline(h=1,lwd=3,col="red")
```
As mentioned before, PCA is a powerful tool for data visualization. In this case, there is a tradeoff between visualization and reliability of the model. On one hand, the forth component has a significant explanatory importance since the eigenvalue is above 1 and it moves the cumulative variance explained from 82% to 88%. On the other hand, it is slightly above the threshold and could be neglected without a significant loss. In order to have a higher explained variance one would select a four-dimensional model, while a lower-dimensionality fashion for plotting requires to have at most three components. We decide to split the analysis into two. The first one uses four components while the other focuses on visualization and elides the last component.
#### Four components model
The four component analysis is the most useful one to read loadings and thus understand how variables are explained by the principal components. First of all, we select the corresponding eigenvectors.
```{r}
eigen(rho)$vectors[,1:4]
```
We proceed generating the loadings for each component. As shown in the following table, the first principal component has a heavy role in explaining almost every variable, peaking at professional, utilities and waste. Small loadings in precipitation and numbe of factories are compensated by the ones respectively of component two and four. The most information about precipitation is retained in the fourth component: this will be a loss in the subsequent three components analysis.
```{r}
#we investigate what is explained by the components
comp<-round(cbind(
-eigen(rho)$vectors[,1]*sqrt(autoval[1]),
-eigen(rho)$vectors[,2]*sqrt(autoval[2]),
-eigen(rho)$vectors[,3]*sqrt(autoval[3]),
-eigen(rho)$vectors[,4]*sqrt(autoval[4])
),3)
rownames(comp)<-colnames(dataset)
colnames(comp)<-c("comp1","comp2","comp3","comp4")
communality<-comp[,1]^2+comp[,2]^2+comp[,3]^2+comp[,4]^2
comp<-cbind(comp,communality)
comp
```
The scores for each State in the four components are reported below. Noteworthy, California, a very polluted State, stand out as its far from every other one.
```{r}
#calculate the scores for the selected components
score <- dataset%*%autovec[,1:4]
round(score,3)
```
#### Three components analysis
Since in this analysis PCA is used for description and visualization, we focus more on the three components analyisis. We proceed in the same way as before and read the loadings.
```{r}
#we investigate what is explained by the components
comp<-round(cbind(
-eigen(rho)$vectors[,1]*sqrt(autoval[1]),
-eigen(rho)$vectors[,2]*sqrt(autoval[2]),
-eigen(rho)$vectors[,3]*sqrt(autoval[3])
),3)
rownames(comp)<-colnames(dataset)
colnames(comp)<-c("comp1","comp2","comp3")
communality<-comp[,1]^2+comp[,2]^2+comp[,3]^2
comp<-cbind(comp,communality)
comp
```
We then compute the scores for each State. In this case, we can interpret them as coordinates of points in a tridimensional space and plot them.
```{r}
#calculate the scores for the selected components
score <- dataset%*%autovec[,1:3]
round(score,3)
```
We scale the scores and plot them.
```{r, fig.align="center", out.width="70%"}
#score plot
scorez<-round(cbind
(-score[,1]/sqrt(autoval[1]),
score[,2]/sqrt(autoval[2]),
score[,3]/sqrt(autoval[3])),2)
x <- scorez[,1]
y <- scorez[,2]
z <- scorez[,3]
#plots
scatter3D(x, y, z, colvar = dataset[,1],
xlab = "comp1", ylab = "comp2", zlab = "comp3",
main = "PCA - 3 components",
bty = "g", ticktype = "detailed", d = 2,
theta = 60, phi = 20, col = gg.col(100),
clab = "aqi",
pch = 19, cex = 0.8)
text3D(x,y,z,labels = rownames(dataset),add = TRUE,
cex = 0.5, col = gg.col(100),
adj = 0.5, font = 1)
```
A further exploration of the data is possible by combining the previous plot with the initial variables loadings. Thanks to this, we can recognize similarities between States by visualizing them.
```{r, fig.align="center", out.width="70%"}
scatter3D(x, y, z, colvar = dataset[,1],
xlab = "comp1", ylab = "comp2", zlab = "comp3",
main = "PCA - 3 components",
bty = "g", ticktype = "detailed", d = 2,
theta = 60, phi = 20, col = gg.col(100),
clab = "aqi",
pch = 19, cex = 0.8)
text3D(x,y,z,labels = rownames(dataset),add = TRUE,
cex = 0.5, col = gg.col(100),
adj = 0.5, font = 1)
compz<-round(cbind
(-comp[,1]/sqrt(autoval[1]),
comp[,2]/sqrt(autoval[2]),
comp[,3]/sqrt(autoval[3])),2)
x <- compz[,1]
y <- compz[,2]
z <- compz[,3]
scatter3D(x,y,z,
col = "red",add = TRUE,
pch = 19, cex = 0.8)
text3D(x,y,z,labels = row.names(comp),add = TRUE,
cex = 0.5, col = "red",
adj = 0.5, font = 1)
```
### Clustering
The main goal of clustering in this analysis is on one hand subdivide States in groups that share common features. On the other hand, it results useful also to discover peculiar States which tend to have marked different behaviours. This could be the hint for a further analysis. To add robustness to the analysis, we perform and compare different clustering methods. In particular, we compare the average linkage, the complete linkage and the Ward linkage methods by plotting the respective dendrogram.
```{r, fig.align="center", out.width="70%"}
agglo <- function(hc){
data.frame(row.names=paste0("Cluster",seq_along(hc$height)),
height=hc$height,
compoments=ifelse(hc$merge<0,
hc$labels[abs(hc$merge)], paste0("Cluster",hc$merge))
)}
#first clustering: average linkage
h1 <- hclust(ds,method="average")
par(mfrow=c(1,3))
plot(h1, main="Average linkage")
#add robustness to the analysis trying other methods
h2 <- hclust(ds,method="complete")
plot(h2, main="Complete linkage")
#Ward method can help us in reducing distances of outliers such as California
h3 <- hclust(ds,method="ward.D2")
plot(h3, main="Ward linkage")
```
The results are pretty similar, with Texas, New York and especially California disjoint from the main tree almost to the root. The result is consistent with the PCA, that already highlighted California as a strong outlier. We choose to go on with Ward method for to reasons. First of all, it is in its implementation to reduce the distances of the outliers, as we can clearly confirm comparing it to the average link method. Secondly, it is the one that subdivides the States in the clearer way. We can see by sight a three clusters dendrogram. To confirm it, we test it with the elbow method. The method relies on the fact that one looks for a number of clusters (k) that minimizes the total within-cluster sum of squares (wss). We plot the curve of wss according to k and locate the bending point (or elbow), a good indicator to decide k.
```{r, fig.align="center", out.width="70%"}
#choice of number of clusters - elbow method
set.seed(123)
#compute and plot wss
wss <- sapply(1:12,
function(k){kmeans(ds, k, nstart=50,iter.max = 12 )$tot.withinss})
wss
plot(1:12, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
```
The test confirms the presence of three main clusters.
```{r, fig.align="center", out.width="70%"}
plot(h3, main="Clusters with Ward linkage method")
rect.hclust(h3,3)
```
Worth mentioning, we have to be aware that Ward method is different from the other two, that share the exact same clusters except from one State. The quick comparison below shows that Ward method leads to a more balanced subdivision between clusters.
```{r}
cut_av <- cutree(h1, k = 3)
cut_compl <- cutree(h2, k = 3)
cut_ward <- cutree(h3, k = 3)
#compare the different clustering methods
table(cut_av,cut_compl)
table(cut_av,cut_ward)
table(cut_compl,cut_ward)
```
#### Comparison
We question now if the clusters resemble the level of pollution for the States. Being the dataset subdivided into three clusters like the qualitative labels for the level of pollution ("red", "orange" and "yellow") we can easily group them to see if they have similar results.
```{r, fig.align="center", out.width="70%"}
# Lets compare the share of polluted groups within each of the 3 clusters
cut_ward <- as.data.frame(cut_ward)
cut_ward <- cbind(rownames(cut_ward), cut_ward)
names(cut_ward) <- c("state","cluster")
cut_ward <- merge(cut_ward, datanames[,c(1,22)], by = "state")
group_by(cut_ward, cluster, cl) %>%
summarise(
count = n(),
share = n()/51*100
)
tab_xtab(cut_ward$cluster, cut_ward$cl, show.row.prc = TRUE)
```
The table shows no correlation between the labels and the clusters. The correlation test below has a non-significative p-value of 0.708.
```{r}
#Correlation tests
cut_ward$cl2 <- as.numeric(cut_ward$cl)
cor(cut_ward$cluster, cut_ward$cl2, method = "pearson")
cor.test(cut_ward$cluster, cut_ward$cl2, method = "pearson")
```