Data Exploration by Rahul
===========================================================
# Loading the required packages
```{r packages, echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2) # plotting
library(dplyr)   # grouping and summarising
```
# Loading the data
```{r data, echo=FALSE, message=FALSE, warning=FALSE}
data <- read.csv("wineQualityReds.csv")
str(data)
```
We can see that there are 1599 rows and 13 variables. The data records the amounts of different chemical components in different red wines. The quality of the wines differs because these components are present in different quantities.
# Univariate Plots Section
```{r Univariate Plot1, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of fixed acidity on a log10 x-axis
ggplot(data, aes(x = fixed.acidity)) +
  geom_histogram(bins = 40) +
  scale_x_log10(breaks = seq(1, 16, 1))
summary(data$fixed.acidity)
```
On a logarithmic scale the distribution is close to a normal distribution. The summary shows that about half of the values lie between roughly 7 and 9.2 (the first and third quartiles). A few high values stretch the right side of the distribution.
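One way to put a number on "closer to normal" is to compare the skewness before and after the log transform. The following is only a sketch in base R: `skewness` is a helper defined here (not part of ggplot2 or dplyr), and the `toy` values are made up for illustration, not taken from the wine data.

```r
# Moment-based skewness: third central moment divided by variance^1.5.
# 0 means symmetric; a positive value means a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up right-skewed values (assumption, for illustration only).
toy <- c(5, 6, 7, 7, 8, 8, 9, 10, 12, 16)

skew_raw <- skewness(toy)
skew_log <- skewness(log10(toy))
print(c(raw = skew_raw, log10 = skew_log))

# With the wine data loaded as `data`, the same check would be:
# skewness(data$fixed.acidity)
# skewness(log10(data$fixed.acidity))
```

A skewness nearer to zero after the transform would support reading the log-scaled histogram as roughly symmetric.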
```{r Univariate Plot2, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of volatile acidity on a log10 x-axis, limited to 0.2-1.2
ggplot(data, aes(x = volatile.acidity)) +
  geom_histogram(bins = 20) +
  scale_x_log10(limits = c(0.2, 1.2), breaks = seq(0.2, 1.2, 0.1))
summary(data$volatile.acidity)
```
With 20 bins and a logarithmic scale the distribution looks much more symmetric.
```{r Univariate Plot3, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of citric acid on a square-root x-axis (keeps the zero values)
ggplot(data, aes(x = citric.acid)) +
  geom_histogram(bins = 40) +
  scale_x_sqrt()
summary(data$citric.acid)
```
The distribution looks skewed to the right, and the summary agrees: the mean is greater than the median.
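The mean-versus-median comparison can be run for every column at once. A minimal sketch, assuming a data frame of numeric columns; `mean_median_gap` is a hypothetical helper and the `toy` data frame is made up for illustration.

```r
# Positive gap (mean > median) hints at a right tail,
# negative gap at a left tail, zero at a symmetric distribution.
mean_median_gap <- function(x) {
  mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)
}

# Made-up columns (assumption): `a` has one large value, `b` is symmetric.
toy <- data.frame(
  a = c(0, 0.1, 0.2, 0.3, 0.9),
  b = c(1, 2, 3, 4, 5)
)

gaps <- sapply(toy, mean_median_gap)
print(gaps)

# With the wine data loaded as `data`:
# sapply(data[sapply(data, is.numeric)], mean_median_gap)
```

The gap is in the units of each variable, so it is only a quick indicator of direction, not a measure that can be compared across variables.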
```{r Univariate Plot4, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of residual sugar on a log10 x-axis, limited to 1-7
ggplot(data, aes(x = residual.sugar)) +
  geom_histogram(bins = 40) +
  scale_x_log10(limits = c(1, 7), breaks = seq(1, 7, 1))
summary(data$residual.sugar)
```
Due to the presence of some high values of residual sugar on the right side of the distribution, the mean is pulled towards the right.
```{r Univariate Plot5, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of chlorides on a log10 x-axis; the limits drop the outliers
ggplot(data, aes(x = chlorides)) +
  geom_histogram(bins = 60) +
  scale_x_log10(breaks = seq(0.02, 0.2, 0.02), limits = c(0.04, 0.2))
summary(data$chlorides)
```
After using a logarithmic scale and removing the outliers, this distribution looks normal. The summary is computed on the full data, though, so the outliers still show there: the mean is greater than the median.
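The limits passed to the scale are what remove the outliers here: ggplot2 drops values that fall outside a continuous scale's limits before binning. A minimal sketch, in base R, of counting how many observations a pair of limits would exclude; `count_outside` is a hypothetical helper and the `toy` values are made up.

```r
# Count and share of non-missing values outside c(lower, upper).
count_outside <- function(x, limits) {
  x <- x[!is.na(x)]
  outside <- x < limits[1] | x > limits[2]
  c(n_outside = sum(outside), share_outside = mean(outside))
}

# Made-up values (assumption), with one low and one high outlier.
toy <- c(0.03, 0.05, 0.07, 0.08, 0.09, 0.10, 0.12, 0.61)

res <- count_outside(toy, limits = c(0.04, 0.2))
print(res)

# With the wine data loaded as `data`:
# count_outside(data$chlorides, limits = c(0.04, 0.2))
```

Checking this share makes it clear how much of the data the histogram leaves out.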
```{r Univariate Plot6, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of free sulfur dioxide on a linear x-axis, limited to 4-60
ggplot(data, aes(x = free.sulfur.dioxide)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(breaks = c(1, 2, 5, 10, 15, 21, 30, 40, 50, 60, 80), limits = c(4, 60))
summary(data$free.sulfur.dioxide)
```
I used a continuous scale because the logarithmic and sqrt scales did not improve the distribution much. Many of the values are small: about 75% are 21 or lower.
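A statement like this can be checked directly rather than read off the summary. A minimal sketch in base R; `share_at_or_below` is a hypothetical helper and the `toy` values are made up for illustration.

```r
# Share of non-missing values that are at or below a threshold.
share_at_or_below <- function(x, threshold) {
  mean(x <= threshold, na.rm = TRUE)
}

# Made-up values (assumption, not the wine data).
toy <- c(3, 5, 7, 10, 14, 15, 18, 21, 30, 55)

share <- share_at_or_below(toy, 21)
quartiles <- quantile(toy, probs = c(0.25, 0.5, 0.75), names = FALSE)
print(share)
print(quartiles)

# With the wine data loaded as `data`:
# share_at_or_below(data$free.sulfur.dioxide, 21)
```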
```{r Univariate Plot7, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of density on a linear x-axis
ggplot(data, aes(x = density)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(breaks = seq(0.990, 1.005, 0.002))
summary(data$density)
```
Density is normally distributed, which we can also see from the mean and median values in the summary. The range of density values is very small.
```{r Univariate Plot8, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of pH on a linear x-axis
ggplot(data, aes(x = pH)) +
  geom_histogram(bins = 60) +
  scale_x_continuous(breaks = seq(2.9, 4, 0.1))
summary(data$pH)
```
From the summary we can see that most red wines have a pH below 3.4, which the distribution also shows.
```{r Univariate Plot9, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of sulphates on a log10 x-axis; the limits drop the outliers
ggplot(data, aes(x = sulphates)) +
  geom_histogram(bins = 60) +
  scale_x_log10(limits = c(0.3, 1.4), breaks = seq(0.3, 1.4, 0.2))
summary(data$sulphates)
```
After removing some outliers and using a logarithmic scale, this distribution looks normal.
```{r Univariate Plot10, echo=FALSE, message=FALSE, warning=FALSE}
# Histogram of alcohol content on a linear x-axis, limited to 8-14
ggplot(data, aes(x = alcohol)) +
  geom_histogram(bins = 40) +
  scale_x_continuous(limits = c(8, 14), breaks = seq(8.5, 13.5, 0.5))
summary(data$alcohol)
```
About 75% of the wines have an alcohol content of 11.1% or less.
```{r Univariate Plot11, echo=FALSE, message=FALSE, warning=FALSE}
# Bar chart of the number of wines at each quality level
ggplot(data, aes(x = factor(quality))) +
  geom_bar()
table(data$quality)
```
From the above plot we see that the largest number of wines have quality level 5. The table gives the exact count (681).
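The counts from `table` can also be turned into shares, which makes the imbalance between quality levels easier to read. A minimal sketch in base R; the `toy_quality` vector is made up for illustration and is not the wine data.

```r
# Made-up quality scores (assumption, for illustration only).
toy_quality <- c(5, 6, 5, 7, 5, 6, 4, 5, 6, 8)

counts <- table(toy_quality)              # number of wines per level
shares <- round(prop.table(counts), 2)    # same, as a share of all wines
most_common <- names(counts)[which.max(counts)]

print(counts)
print(shares)
print(most_common)

# With the wine data loaded as `data`:
# prop.table(table(data$quality))
```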
# Univariate Analysis
The plots above show the distribution of every variable. For some variables, changing the scale to log10 or sqrt and removing the outliers made the distribution look much closer to normal; for others it made no difference. For some variables the range of values is so small that they may not be of any use. Along with the plots, the summaries show the mean and median values.
# Bivariate Plots Section
My main objective is to find how quality varies with the other variables.
```{r Bivariate Plot1, echo=FALSE, message=FALSE, warning=FALSE}
# Jittered points plus a line through the mean at each quality level
ggplot(data, aes(x = quality, y = volatile.acidity)) +
  geom_jitter(alpha = 0.2) +
  geom_line(stat = "summary", fun = mean)
```
From the jittered points alone we can assume nothing, but the mean line shows that higher-quality wines have lower volatile acidity on average, so keeping volatile.acidity low may help improve wine quality.
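The mean line in these plots is simply the mean of the y variable within each quality level. A minimal sketch of that calculation in base R, so the values can be inspected as numbers; the `toy` data frame is made up for illustration.

```r
# Made-up data (assumption): two wines per quality level.
toy <- data.frame(
  quality = c(4, 4, 5, 5, 6, 6, 7, 7),
  volatile.acidity = c(0.9, 0.7, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3)
)

# Mean volatile acidity per quality level: the points the line connects.
group_means <- tapply(toy$volatile.acidity, toy$quality, mean)
print(group_means)

# With the wine data loaded as `data`:
# tapply(data$volatile.acidity, data$quality, mean)
```

The same call with `median` in place of `mean` gives a version that is less sensitive to outliers.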
```{r Bivariate Plot2, echo=FALSE, message=FALSE, warning=FALSE}
# Points jittered horizontally only, plus the mean at each quality level
ggplot(data, aes(x = quality, y = citric.acid)) +
  geom_point(alpha = 0.2, position = position_jitter(height = 0)) +
  geom_line(stat = "summary", fun = mean)
```
From the mean line we can see that wines with more citric acid tend to have higher quality, although the amounts involved are very small.
```{r Bivariate Plot3, echo=FALSE, message=FALSE, warning=FALSE}
# Jittered points plus a line through the mean at each quality level
ggplot(data, aes(x = quality, y = chlorides)) +
  geom_jitter(alpha = 0.1) +
  geom_line(stat = "summary", fun = mean)
```
On average, wines with a lower quantity of chlorides (amount of salt) have higher quality.
```{r Bivariate Plot4, echo=FALSE, message=FALSE, warning=FALSE}
# Mean and median pH, and number of wines, per quality level
data.group_by_quality <- data %>%
  group_by(quality) %>%
  summarise(mean_pH = mean(pH),
            median_pH = median(pH),
            count = n())
head(data.group_by_quality)
ggplot(data.group_by_quality, aes(x = quality, y = mean_pH)) +
  geom_line()
ggplot(data, aes(x = factor(quality), y = pH)) +
  geom_boxplot()
```
From both types of plots we can see that quality improves as pH decreases (that is, as acidity increases).
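```{r Bivariate Plot5, echo=FALSE, message=FALSE, warning=FALSE}
# Jittered points plus a line through the mean at each quality level
ggplot(data, aes(x = quality, y = total.sulfur.dioxide)) +
  geom_jitter(alpha = 0.2) +
  geom_line(stat = "summary", fun = mean)
```
No clear relationship between total sulfur dioxide and quality can be seen in the above plot.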
```{r Bivariate Plot6, echo=FALSE, message=FALSE, warning=FALSE}
# Jittered points plus a line through the mean at each quality level
ggplot(data, aes(x = quality, y = sulphates)) +
  geom_jitter(alpha = 0.2) +
  geom_line(stat = "summary", fun = mean)
```
The mean value of sulphates varies only a little across quality levels (from about 0.6 to 0.75), but since the quantities themselves are so small, we can assume that quality improves as sulphates increase.
```{r Bivariate Plot7, echo=FALSE, message=FALSE, warning=FALSE}
# Jittered points with the mean line, then a boxplot per quality level
ggplot(data, aes(x = quality, y = alcohol)) +
  geom_jitter(alpha = 0.1) +
  geom_line(stat = "summary", fun = mean)
ggplot(data, aes(x = factor(quality), y = alcohol)) +
  geom_boxplot()
```
From the above plots we can say that, in general, wine quality improves as alcohol content increases.
# Bivariate Analysis
From the above plots I found that the quality of wines does depend on a number of variables, each in a different way. The correlations are small, however, which we can see by using the cor function.
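A minimal sketch of the `cor` check, assuming every predictor is a numeric column in the same data frame as `quality`. The helper `cor_with_quality` is hypothetical and the `toy` data frame is made up, so its correlations are much stronger than real data would give.

```r
# Pearson correlation of every numeric column with the target column,
# sorted by absolute size.
cor_with_quality <- function(df, target = "quality") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up data (assumption, for illustration only).
toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.4, 9.8, 10.1, 10.5, 11.0, 11.8, 12.5),
  volatile.acidity = c(1.0, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.3)
)

r <- cor_with_quality(toy)
print(round(r, 2))

# With the wine data loaded as `data` (if the CSV has a row-index column,
# drop it first, since it carries no chemical information):
# cor_with_quality(data)
```

Since quality is an ordered score rather than a continuous measurement, `cor(..., method = "spearman")` may be a reasonable alternative to the default Pearson correlation.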
# Multivariate Plots Section
```{r Multivariate Plot1, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=alcohol, y= volatile.acidity, color=factor(quality)), data=data)+
geom_point()+
scale_color_brewer("Quality", palette = "Blues")+
geom_smooth(method = "lm", se=FALSE)+
theme_dark()
```
We can see that wines with lower alcohol and higher volatile.acidity tend to be of lower quality.
```{r Multivariate Plot2, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=alcohol, y=(volatile.acidity+citric.acid+fixed.acidity),
color=factor(quality)), data=data)+
geom_point()+
scale_color_brewer("Quality", palette = "Greens")+
geom_smooth(method = "lm", se=FALSE, size=1)+
theme_dark()
```
Although volatile acidity decreases the quality of the wine, the total acidity improves it. Since fixed acidity does not vary consistently with quality, we can assume that the quantity of citric acid is important for wine quality.
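The total acidity used on the y-axis is the sum of the three acid columns. A minimal sketch of storing it as its own column so that it can be reused in later plots and summaries; `add_total_acidity` and the column name `total.acidity` are hypothetical, and the `toy` values are made up.

```r
# Add total acidity as the sum of the fixed, volatile and citric acid columns.
add_total_acidity <- function(df) {
  df$total.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
  df
}

# Made-up data (assumption, for illustration only).
toy <- data.frame(
  fixed.acidity = c(7, 8, 11),
  volatile.acidity = c(0.7, 0.9, 0.3),
  citric.acid = c(0, 0.1, 0.5)
)

toy <- add_total_acidity(toy)
print(toy)

# With the wine data loaded as `data`:
# data <- add_total_acidity(data)
# tapply(data$total.acidity, data$quality, mean)
```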
```{r Multivariate Plot3, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=pH,y=alcohol,color=factor(quality)), data=data)+
geom_point()+
scale_color_brewer("Quality", palette = "Reds")+
geom_smooth(method = "lm", se=FALSE, size=1)+
theme_dark()
```
From this plot we can clearly see that most wines of lower quality have lower alcohol content and higher pH.
# Multivariate Analysis
I used the relationships found in the bivariate plots to choose pairs of variables for the multivariate plots. From the above observations, wine quality is good with high alcohol content, low pH, low volatile.acidity and high total acidity.
# Plot 1
```{r Plot one, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=pH), data=data)+
geom_histogram(bins=60, color="black", fill="blue")+
scale_x_continuous("pH",breaks=seq(2,4,0.2))+
ylab("Frequency")+
ggtitle("Distribution of pH for Red Wine")
summary(data$pH)
```
Most wines have a pH between 3 and 4. A lower pH means a more acidic wine and a higher pH a less acidic one. The distribution is close to a normal distribution. From the summary we can see that around 75% of the observations have a pH of 3.4 or below, which can also be seen in the plot.
# Plot 2
```{r Plot two, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=factor(quality), y=alcohol), data=data)+
geom_boxplot(color="black", fill="red")+
stat_summary(fun = mean, color = "blue", geom = "point", shape = 7, size = 4)+
xlab("Quality Of Wine")+
ylab("Percentage Content of Alcohol(by Volume)")+
ggtitle("BoxPlot of Percentage of Alcohol Content by Volume with Wine Quality")
```
From this plot we can see that the median alcohol content generally increases with wine quality, so we can assume that better wines tend to have a higher percentage of alcohol.
# Plot 3
```{r Plot three, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=alcohol, y=(volatile.acidity+citric.acid+fixed.acidity),
color=factor(quality)), data=data)+
geom_point()+
xlab("Percentage Alcohol Content by Volume")+
ylab("Total Acid Content(g/dm^3)")+
ggtitle("Wine Quality with respect of Total Acid Content and percentage Alcohol Content")+
scale_color_brewer("Quality", palette = "Blues")+
theme_dark()
```
From this plot we can see that wines with higher alcohol content and higher total acidity tend to have higher quality.
# Reflection
From the above observations, red wine quality generally improves with increasing alcohol content and decreasing pH, that is, with a more acidic wine. The earlier plots also showed that higher acid content goes with higher wine quality. Other variations are very small, and more data might be required before making any assumptions about them.

I faced a lot of problems while doing this project, as before this I was not familiar with R, and this was the first time I did something like this all by myself. For two days I kept thinking about what plots to create. After finally starting the project and creating some plots, I got lazy and started studying the next parts of the program because I felt more comfortable with Python; in fact, I have completed my second project. For me, the most difficult part of this project was choosing which variables to use for the plots.