project.Rmd

Data Exploration by Rahul
===========================================================

# Loading the required packages

```{r packages, echo=FALSE, message=FALSE, warning=FALSE}

library(ggplot2) # ggplot2
library(dplyr)   #dplyr
```

# Loading the data

```{r data, echo=FALSE, message=FALSE, warning=FALSE}

data=read.csv("wineQualityReds.csv")
str(data)
```

We can see that there are 1599 rows and 13 variables.This data contains the contents of different components in different red wines. The quality of the wines differ because of the presence of the components in different quantity.


#Univariate Plots Section

```{r Univariate Plot1, echo=FALSE, message=FALSE, warning=FALSE}

qplot(fixed.acidity, data=data, bins=40)+
  scale_x_log10(breaks=seq(1,16,1))

summary(data$fixed.acidity)
```

Using the logrithmic function we can see distribution which is close to normal distribution, and from the summary data we can see that more than 50% of the value lie between 7 and 9.2 which can also be seen in the distribution due to presence of few high values.


```{r Univariate Plot2, echo=FALSE, message=FALSE, warning=FALSE}

qplot(volatile.acidity, data=data, bins=20)+
  scale_x_log10(limits = c(0.2,1.2) , breaks = seq(0,1.2,0.1))

summary(data$volatile.acidity)
```

Using the bin size 20 and logrithmic scale we can see that the distribution looks much better.


```{r Univariate Plot3, echo=FALSE, message=FALSE, warning=FALSE}

qplot(citric.acid, data=data, bins=40)+
  scale_x_sqrt()

summary(data$citric.acid)

```

From the distribution it looks like much of the data is towards the right and even in the summary, it shows that the mean is greater than the median.


```{r Univariate Plot4, echo=FALSE, message=FALSE, warning=FALSE}

qplot(residual.sugar, data=data, bins=40)+
  scale_x_log10(limits = c(1,7), breaks=seq(1,7,1))

summary(data$residual.sugar)
```

Due to presence of some high values of residual sugar on the right side of the distribution the mean is pulled towards the right.


```{r Univariate Plot5, echo=FALSE, message=FALSE, warning=FALSE}

qplot(chlorides, data=data, bins=60)+
  scale_x_log10(breaks=seq(0.02,0.2,0.02), limits=c(0.04,0.2))


summary(data$chlorides)
```

This distribution looks normally distributed  after using logrithmic scale and removing the outliers, but due to the presence of outliers we can see from the summary data, the mean is greater than the median.


```{r Univariate Plot6, echo=FALSE, message=FALSE, warning=FALSE}

qplot(free.sulfur.dioxide, data=data, bins=30)+
  scale_x_continuous(breaks=c(1,2,5,10,15,21,30,40,50,60,80), limits=c(4,60))

summary(data$free.sulfur.dioxide)
```

Using continuous scale because there is not much improvement in the distribution in using the logrithmic or sqrt scale, we can see that lots of values in the distribution are very small, that is more than 75% have values lower than 21.


```{r Univariate Plot7, echo=FALSE, message=FALSE, warning=FALSE}

qplot(density, data=data, bins=50)+
  scale_x_continuous(breaks = seq(0.990,1.005,0.002))

summary(data$density)
```

This is normally distributed as we can see also see the values from the summary for mean and median. The range for density is very small.


```{r Univariate Plot8, echo=FALSE, message=FALSE, warning=FALSE}

qplot(pH, data=data, bins=60)+
  scale_x_continuous(breaks = seq(2.9,4,0.1))

summary(data$pH)

```

From the summary we can see that most red wines have a pH below 3.4 which is also shown from the distribution.


```{r Univariate Plot9, echo=FALSE, message=FALSE, warning=FALSE}

qplot(sulphates, data=data, bins=60)+
  scale_x_log10(limits = c(0.3,1.4), breaks=seq(0.3,1.4,0.2))

summary(data$sulphates)
```

After removing some outliers and using logrithmic scale this distribution is normal.


```{r Univariate Plot10, echo=FALSE, message=FALSE, warning=FALSE}

qplot(alcohol, data=data, bins=40)+
  scale_x_continuous(limits = c(8,14), breaks = seq(8.5,13.5,.5))

summary(data$alcohol)
```

More than 75% alcohol have alcohol content less than 11.10%


```{r Univariate Plot11, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(factor(quality)), data=data)+
  geom_bar()

table(data$quality)
```

From the above plot we see that the observation has most no of wines of level 5, for which we can get the value from the table(681).


# Univariate Analysis

From the above plots checked the distribution of all the variables. For some distributions changing the scale to log10 or sqrt and removing the outliers, the distributions became much better and for some there was no difference. For some variables the range of values is so small that they may not be of any use. Along with the plots the summary helps us to see the mean, median values.


# Bivariate Plots Section

My main objective is to find the variation of quality with different variables.

```{r Bivariate Plot1, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=quality, y=volatile.acidity), data=data)+
  geom_jitter(alpha=0.2)+
  geom_line(stat = "summary", fun.y=mean)
```

From the geom point we can assume nothing but from the mean we can see that if the volatile.acidity is kept to certain level it can be used to improve the wine quality.


```{r Bivariate Plot2, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=quality, y=citric.acid), data=data)+
  geom_point(alpha=0.2,position=position_jitter(h=0))+
  geom_line(stat = "summary", fun.y=mean)
```

From the mean line we can see that citric acid improves the quality of wine.Although the amount in which it is used is very small.


```{r Bivariate Plot3, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=quality, y=chlorides), data=data)+
  geom_jitter(alpha=0.1)+
  geom_line(stat = "summary", fun.y=mean)
```

On average with lower quantity of chlorides(amount of salt) the quality of the wine increase.


```{r Bivariate Plot4, echo=FALSE, message=FALSE, warning=FALSE}

data.group_by_quality=data%>%
  group_by(quality)%>%
  summarise(mean_pH=mean(pH),
            median_pH=median(pH),
            count=n())

head(data.group_by_quality)


ggplot(aes(x=quality, y=mean_pH), data=data.group_by_quality)+
  geom_line()

ggplot(aes(x=factor(quality),y=pH), data=data)+
  geom_boxplot()

```


From both types of plots we can see that the  quality improves with decrease in pH(Increase in acidity).


```{r Bivariate Plot5, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=quality, y=total.sulfur.dioxide), data=data)+
  geom_jitter(alpha=0.2)+
  geom_line(stat = "summary", fun.y=mean)
```

We cannot assume anything from the above plot.


```{r Bivariate Plot6, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=quality, y=sulphates), data=data)+
  geom_jitter(alpha=0.2)+
  geom_line(stat = "summary", fun.y=mean)
```

The variation in the mean values of sulphates is very low(from 0.6 to 0.75) but as the quantity of sulphates is so small we can assume that with increase in sulphates quality improves.


```{r Bivariate Plot7, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=quality, y=alcohol), data=data)+
  geom_jitter(alpha=0.1)+
  geom_line(stat = "summary", fun.y=mean)


ggplot(aes(x=factor(quality), y=alcohol), data=data)+
  geom_boxplot()
```

From the above plots we can say that in general with increase in alcohol content the wine quality improves.


#Bivariate Analysis

From the above plots I found that variation of quality of wines do depend on no. of variable in different way. Although the correlation is very small which we can see if we use the cor function.


# Multivariate Plots Section


```{r Multivariate Plot1, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=alcohol, y= volatile.acidity, color=factor(quality)), data=data)+
  geom_point()+
  scale_color_brewer("Quality", palette = "Blues")+
  geom_smooth(method = "lm", se=FALSE)+
  theme_dark()
```

We can see that for lower value of alcohol and higher value of volatile.acidity the quality of wine is lower.


```{r Multivariate Plot2, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=alcohol, y=(volatile.acidity+citric.acid+fixed.acidity),
                         color=factor(quality)), data=data)+
  geom_point()+
  scale_color_brewer("Quality", palette = "Greens")+
  geom_smooth(method = "lm", se=FALSE, size=1)+
  theme_dark()
```

Although volatile acidity decreases the quality of the wine, the total acidity improves the wine quality. Variation of fixed acidity is not fixed we can assume that quantity of citric acid is important for wine quality.


```{r Multivariate Plot3, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=pH,y=alcohol,color=factor(quality)), data=data)+
  geom_point()+
  scale_color_brewer("Quality", palette = "Reds")+
  geom_smooth(method = "lm", se=FALSE, size=1)+
  theme_dark()
```

From this plot we can clearly see that most wines of lower quality have lower alcohol content and higher pH.

# Multivariate Analysis

Used the variation from the bivariate plots for attaching the two variables to get the plots for multivariate plots. From the above observations it is seen that wine quality is good with high alcohol content, low pH, low volatile.acidity and high total acidity.


# Plot 1

```{r Plot one, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=pH), data=data)+
  geom_histogram(bins=60, color="black", fill="blue")+
  scale_x_continuous("pH",breaks=seq(2,4,0.2))+
  ylab("Frequency")+
  ggtitle("Distribution of pH for Red Wine")

summary(data$pH)
  
```

The pH for wine is between 3 and 4.Lower pH indicates acidic and higher indicates basic. The distribution is close to normal distribution. From the summary we can see that around 75% of the observations have pH below 3.4 which can be seen in the plot.


# Plot 2

```{r Plot two, echo=FALSE, message=FALSE, warning=FALSE}

ggplot(aes(x=factor(quality), y=alcohol), data=data)+
  geom_boxplot(color="black", fill="red")+
  stat_summary(fun.y="mean", color="blue",geom="point", shape=7, size=4)+
  xlab("Quality Of Wine")+
  ylab("Percentage Content of Alcohol(by Volume)")+
  ggtitle("BoxPlot of Percentage of Alcohol Content by Volume with Wine Quality")
```

From this plot we can see that the median value of alcohol content increases with wine quality, So we can assume that to improve wine quality higher percentage of alcohol content is required.


# Plot 3

```{r Plot three, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x=alcohol, y=(volatile.acidity+citric.acid+fixed.acidity),
                         color=factor(quality)), data=data)+
  geom_point()+
  xlab("Percentage Alcohol Content by Volume")+
  ylab("Total Acid Content(g/dm^3)")+
  ggtitle("Wine Quality with respect of Total Acid Content and percentage Alcohol Content")+
  scale_color_brewer("Quality", palette = "Blues")+
  theme_dark()
```

From this plot we can see that Higher alcohol content and higher total acidity have higher wine quality.


# Reflection

From the above observations it is seen that in general red wine quality can be improved with increasing alcohol content, decreasing pH, which will also cause the wine to be more acidic.This has also been shown from the previous plots that increasing acidic content increases the wine quality. Other variations are very small and to make any assumptions more data might be required. I faced lots of problem while doing this project as befor this I was not familier with Rand this was the first time I did something like this all by my self. For two days I kept thinking about what plots to create and finally when after starting this project and creating some plots I got lazy and started studing the next parts of the program because I felt more comfortable with Python infact I have completed my second project. For me most difficult part of this project was to choose which variables to use for the plots.