Commit 4acbc5c

Add files via upload
fderyckel authored Nov 19, 2017
1 parent 3416010 commit 4acbc5c
Showing 11 changed files with 1,597 additions and 38 deletions.
34 changes: 30 additions & 4 deletions 01-intro.Rmd
@@ -1,11 +1,37 @@
# Tests and inferences {#intro}
# Tests and inferences {#testinference}

The first thing to be familiar with when doing machine learning work is the basics of statistical inference.
In this chapter, we go over some of the few important topics and r-ways to do them.
In this chapter, we go over some of these important concepts and the ways to carry them out in R.

Let's get started.

## T-tests
## Assumption of normality {#normality}
Copied from [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/)

Many statistical procedures, including correlation, regression, t tests, and analysis of variance (the so-called parametric tests), are based on the assumption that the data follow a normal distribution or a Gaussian distribution (after Johann Carl Friedrich Gauss, 1777–1855); that is, it is assumed that the populations from which the samples are taken are normally distributed. The assumption of normality is especially critical when constructing reference intervals for variables. Normality and other assumptions should be taken seriously, for when these assumptions do not hold, it is impossible to draw accurate and reliable conclusions about reality.

\index{Test of normality}

With large enough sample sizes (> 30 or 40), the violation of the normality assumption should not cause major problems; this implies that we can use parametric procedures even when the data are not normally distributed (8). If we have samples consisting of hundreds of observations, we can ignore the distribution of the data (3). According to the central limit theorem,

* if the sample data are approximately normal, then the sampling distribution too will be normal;
* in large samples (> 30 or 40), the sampling distribution tends to be normal, regardless of the shape of the data;
* means of random samples from any distribution will themselves have a normal distribution, as the short simulation below illustrates.
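
A quick simulation (illustrative only, not part of the original text) of that last point:

```{r clt_simulation}
# Means of 1000 samples (each of size 40) drawn from a skewed exponential
# distribution are themselves approximately normally distributed.
set.seed(123)
sample_means <- replicate(1000, mean(rexp(40, rate = 1)))
hist(sample_means, breaks = 30, main = "Sampling distribution of the mean")
```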

Although true normality is considered to be a myth, we can look for normality visually by using normal plots or by significance tests, that is, comparing the sample distribution to a normal one. It is important to ascertain whether data show a serious deviation from normality.

### Visual check of normality

Visual inspection of the distribution may be used for assessing normality, although this approach is usually unreliable and does not guarantee that the distribution is normal. However, when data are presented visually, readers of an article can judge the distribution assumption by themselves. The frequency distribution (histogram), stem-and-leaf plot, boxplot, P-P plot (probability-probability plot), and Q-Q plot (quantile-quantile plot) \index{Q-Q plot} are used for checking normality visually. The frequency distribution, which plots the observed values against their frequency, provides both a visual judgment about whether the distribution is bell shaped and insights about gaps in the data and outlying values. A Q-Q plot is very similar to the P-P plot except that it plots the quantiles (values that split a data set into equal portions) of the data set instead of every individual score in the data. Moreover, Q-Q plots are easier to interpret in the case of large sample sizes. The boxplot shows the median as a horizontal line inside the box and the interquartile range (range between the 25th and 75th percentiles) as the length of the box. The whiskers (lines extending from the top and bottom of the box) represent the minimum and maximum values when they are within 1.5 times the interquartile range from either end of the box. Scores greater than 1.5 times the interquartile range fall outside the boxplot and are considered outliers, and those greater than 3 times the interquartile range are extreme outliers. A boxplot that is symmetric with the median line at approximately the center of the box and with symmetric whiskers that are slightly longer than the subsections of the center box suggests that the data may have come from a normal distribution.
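
As a minimal sketch (not part of the original text), such a visual check can be done in base R on the `mpg` variable of the `mtcars` data set, which we use later in this chapter:

```{r normality_visual_sketch}
# Histogram: is the distribution of mpg roughly bell shaped?
hist(mtcars$mpg, breaks = 10, col = "steelblue", main = "Histogram of mpg")
# Q-Q plot: do the points follow the reference line?
qqnorm(mtcars$mpg)
qqline(mtcars$mpg, col = "red")
```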

### Normality tests

The various normality tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation; the null hypothesis is that the “sample distribution is normal.” If the test is significant, the distribution is non-normal. For small sample sizes, normality tests have little power to reject the null hypothesis and therefore small samples most often pass normality tests. For large sample sizes, significant results would be derived even in the case of a small deviation from normality, although this small deviation will not affect the results of a parametric test. It has been reported that the Kolmogorov-Smirnov (K-S) test has low power and should not be seriously considered for testing normality (11). Moreover, it is not recommended when parameters are estimated from the data, regardless of sample size (12).

The Shapiro-Wilk test \index{Shapiro-Wilk test} is based on the correlation between the data and the corresponding normal scores and provides better power than the K-S test even after the Lilliefors correction. Power is the most frequent measure of the value of a test for normality. Some researchers recommend the Shapiro-Wilk test as the best choice for testing the normality of data.
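
As a quick illustration (a sketch, not from the original text), the Shapiro-Wilk test is available in base R:

```{r normality_shapiro_sketch}
# Shapiro-Wilk test on mpg; a p-value above 0.05 means we cannot reject
# the null hypothesis that the data come from a normal distribution.
shapiro.test(mtcars$mpg)
```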

## T-tests {#ttest}
\index{T test}
The __independent t test__ is used to test if there is any statistically *significant difference between two means*. Use of an independent t test requires several assumptions to be satisfied.

1. The variables are continuous and independent
@@ -17,7 +43,7 @@ When these assumptions are satisfied the results of the t test are valid. Otherw
Using the `mtcars` data set, we check if there is any difference in miles per gallon (mpg) between the automatic and the manual group.

Check the data and mark the transmission type as a factor.

\index{mtcars}
```{r intro01, message=FALSE}
library(tidyverse)
glimpse(mtcars)
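# --- A sketch of the next steps (not part of the original chunk): code the
# transmission (am: 0 = automatic, 1 = manual) as a factor and run the
# independent t test on mpg by transmission group. ---
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
t.test(mpg ~ am, data = mtcars)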
32 changes: 15 additions & 17 deletions 04-linear_regressions.Rmd
@@ -35,18 +35,17 @@ Some ways to assess how good our model is to:
The wine.csv file is used.

Let's load it and then have a quick look at its structure.
```{r linreg01}
```{r linreg01, message=FALSE, warning=FALSE}
library(tidyverse)
library(skimr)
wine = read.csv("../datasets/Wine.csv")
glimpse(wine)
skim(wine)
df = read_csv("dataset/Wine.csv")
skim(df)
```

We use the `lm` function to find our linear regression model. We use *AGST* as the independent variable while the *price* is the dependent variable.
```{r linreg02_model}
model1 = lm(Price ~ AGST, data = wine)
summary(model1)
model_lm_df = lm(Price ~ AGST, data = df)
summary(model_lm_df)
```

The `summary` function applied on the model gives us a lot of important information
@@ -56,16 +55,15 @@ The `summary` function applied on the model is giving us a bunch of important in

We could have calculated the $R^2$ value ourselves:
```{r linreg03_ssquare}
SSE = sum(model1$residuals^2)
SST = sum((wine$Price - mean(wine$Price))^2)
SSE = sum(model_lm_df$residuals^2)
SST = sum((df$Price - mean(df$Price))^2)
r_squared = 1 - SSE/SST
r_squared
```
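
In formula form (a standard definition, added here for reference): $R^2 = 1 - \frac{SSE}{SST}$, where $SSE$ is the sum of squared residuals and $SST$ is the total sum of squares of *Price* around its mean.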

We can now plot the observations and the regression line to see how the linear model fits the data.
```{r linreg04_graph}
library(ggplot2)
ggplot(wine, aes(AGST, Price)) +
ggplot(df, aes(AGST, Price)) +
geom_point(shape = 1, col = "blue") +
geom_smooth(method = "lm", col = "red")
```
@@ -74,7 +72,7 @@ By default, the `geom_smooth()` will use a 95% confidence interval (which is the
It is always nice to see how our residuals are distributed.
We use the `ggplot2` library and the `fortify` function, which transforms the model object into a data frame usable for plotting.
```{r linreg05_residuals}
model1 <- fortify(model1)
model1 <- fortify(model_lm_df)
p <- ggplot(model1, aes(.fitted, .resid)) + geom_point()
p <- p + geom_hline(yintercept = 0, col = "red", linetype = "dashed")
p <- p + xlab("Fitted values") + ylab("Residuals") + ggtitle("Plot of the residuals in function of the fitted values")
@@ -100,24 +98,24 @@ There are a bit of trials and errors to make while trying to fit multiple variab
We continue here with the same dataset, *wine.csv*.
First, we can see how each variable is correlated with the others, using
```{r linreg06_wine}
cor(wine)
cor(df)
```
By default, R uses the Pearson coefficient of correlation.
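If a rank-based measure is preferred (a side note, not in the original file), the same function accepts a `method` argument:

```{r}
cor(df, method = "spearman")
```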
So let's start by using all variables.
```{r}
model2 <- lm(Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop, data = wine)
summary(model2)
model2_lm_df <- lm(Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop, data = df)
summary(model2_lm_df)
```
While doing so, we notice that the variable *Age* has NA (issues with missing data?) and that the variable *FrancePop* isn't very predictive of the price of wine. So we can refine our model by taking out these 2 variables, and, as we'll see, it won't affect our $R^2$ value much. Note that with multiple-variable regression, it is important to look at the **Adjusted R-squared**, as it takes into consideration the number of variables in the model.
```{r}
model3 <- lm(Price ~ Year + WinterRain + AGST + HarvestRain, data = wine)
summary(model3)
model3_lm_df <- lm(Price ~ Year + WinterRain + AGST + HarvestRain, data = df)
summary(model3_lm_df)
```

Although it is no longer feasible to graph *Price* as a function of the other variables in 2D, we can still graph our residuals.

```{r}
model3 <- fortify(model3)
model3 <- fortify(model3_lm_df)
p <- ggplot(model3, aes(.fitted, .resid)) + geom_point()
p <- p + geom_hline(yintercept = 0, col = "red", linetype = "dashed") + xlab("Fitted values")
p <- p + ylab("Residuals") + ggtitle("Plot of the residuals in function of the fitted values (multiple variables)")
32 changes: 15 additions & 17 deletions 05-logistic_regression.Rmd
@@ -1,4 +1,4 @@
# Logistic Regression
# Logistic Regression {#logistic}

## Introduction
Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables. You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
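
In formula form (a standard statement of the model, not taken from the original text), logistic regression models the log of the odds as a linear function of the predictors:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = \Pr(y = 1)$$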
@@ -366,10 +36,8 @@ prediction_lgr_df4 <- if_else(prediction_lgr_df4 > 0.5, "positive", "negative")
prediction_lgr_df4 <- factor(prediction_lgr_df4)
levels(prediction_lgr_df4) <- c("negative", "positive")
table(df4$test, prediction_lgr_df4)
table(df4$test)
#table(df4$test, prediction_lgr_df4)
#table(df4$test)
########
#confusionMatrix(data = accuracy_model_lr3,
@@ -380,17 +378,17 @@ table(df4$test)
### ROC and AUC
```{r roc_model3_pic1}
prediction_lgr_df4 <- predict(model_lgr_df4, data = df4, type="response")
pr <- prediction(prediction_lgr_df4, df4$test)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
#pr <- prediction(prediction_lgr_df4, df4$test)
#prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
```


Let's go back to the ideal cut off point that would balance the sensitivity and specificity.
```{r}
cost_diabetes_perf <- performance(pr, "cost")
cutoff <- pr@cutoffs[[1]][which.min([email protected][[1]])]
#cost_diabetes_perf <- performance(pr, "cost")
#cutoff <- pr@cutoffs[[1]][which.min([email protected][[1]])]
```

So for maximum accuracy, the ideal cutoff point is `0.487194`.
@@ -409,23 +407,23 @@ Another cost measure that is popular is overall accuracy. This measure optimizes

Actually the `ROCR` package can also give us a plot of accuracy for various cutoff points
```{r roc_model3_pic2}
prediction_lgr_df4 <- performance(pr, measure = "acc")
plot(prediction_lgr_df4)
#prediction_lgr_df4 <- performance(pr, measure = "acc")
#plot(prediction_lgr_df4)
```


Often in medical research, for instance, the cost of a false negative is quite a bit higher than the cost of a false positive.
Let's say the cost of missing someone having diabetes is 3 times the cost of telling someone that he has diabetes when in reality he/she doesn't.
```{r}
cost_diabetes_perf <- performance(pr, "cost", cost.fp = 1, cost.fn = 3)
cutoff <- pr@cutoffs[[1]][which.min([email protected][[1]])]
#cost_diabetes_perf <- performance(pr, "cost", cost.fp = 1, cost.fn = 3)
#cutoff <- pr@cutoffs[[1]][which.min([email protected][[1]])]
```

Lastly, with regard to the AUC:
```{r}
auc <- performance(pr, measure = "auc")
auc <- [email protected][[1]]
auc
#auc <- performance(pr, measure = "auc")
#auc <- [email protected][[1]]
#auc
```


8 changes: 8 additions & 0 deletions 06-softmax_multinomial.Rmd
@@ -0,0 +1,8 @@
# Softmax and multinomial regressions


## Multinomial Logistic Regression


## References
If
160 changes: 160 additions & 0 deletions 07-KNN.Rmd
@@ -0,0 +1,160 @@
# KNN - K Nearest Neighbour {#knnchapter}

The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications.

The KNN classifier is also a non-parametric, instance-based learning algorithm.

**Non-parametric** means it makes no explicit assumptions about the functional form of the mapping $h$ from features to labels, avoiding the dangers of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.

**Instance-based** learning means that our algorithm doesn’t explicitly learn a model (lazy learner). Instead, it chooses to memorize the training instances which are subsequently used as “knowledge” for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input), will the algorithm use the training instances to spit out an answer.

It is worth noting that the minimal training phase of KNN comes both at a memory cost, since we must store a potentially huge data set, and at a computational cost during test time, since classifying a given observation requires a run through the whole data set. Practically speaking, this is undesirable since we usually want fast responses.

The principle behind the KNN (K-Nearest Neighbour) classifier is to find a predefined number K of training samples that are closest in distance to a new point and to predict the label for that new point using these samples.
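
Closeness is usually measured with a distance metric; the Euclidean distance is the most common default (an assumption here, as the text does not name a particular metric):

$$d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{p} \left(x_j - x'_j\right)^2}$$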

When K is small, we are restraining the region of a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value for K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged.
![KNN with k = 1](otherpics/knn01.png)

On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.
![KNN with k = 20](otherpics/knn20.png)

## Example 1. Prostate Cancer dataset
\index{Prostate cancer dataset}

```{r knn01, message=FALSE, warning=FALSE}
library(tidyverse)
df <- read_csv("dataset/prostate_cancer.csv")
glimpse(df)
```

Change the diagnosis result into a factor, then remove the `ID` variable as it does not bring anything.
```{r knn02}
df$diagnosis_result <- factor(df$diagnosis_result, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
df2 <- df %>% select(-id)
# Checking how balanced the dependent variable is
prop.table(table(df2$diagnosis_result))
```

It is quite typical for such medical datasets to be unbalanced. We'll have to deal with it.

Like with PCA, KNN is quite sensitive to the scale of the variables. So it is important to first standardize them. This time we'll do this using the `preProcess` function of the `caret` package.
\index{Normalisation}
\index{caret}
```{r kn03, message=FALSE, warning=FALSE}
library(caret)
param_preproc_df2 <- preProcess(df2[,2:9], method = c("scale", "center"))
df3_stdize <- predict(param_preproc_df2, df2[, 2:9])
summary(df3_stdize)
```

We can now see that all means are centered around 0. Now we reconstruct our data frame with the response variable and split it into a training and a testing set.
\index{Splitting dataset}
```{r}
df3_stdize <- bind_cols(diagnosis = df2$diagnosis_result, df3_stdize)
param_split <- createDataPartition(df3_stdize$diagnosis, times = 1, p = 0.8,
list = FALSE)
train_df3 <- df3_stdize[param_split, ]
test_df3 <- df3_stdize[-param_split, ]
#We can check that we still have the same kind of split
prop.table(table(train_df3$diagnosis))
```

Nice to see that the proportion of *Malignant* vs *Benign* has been preserved.
\index{KNN}
\index{Cross validation}
We use KNN with cross-validation (discussed in more detail in section \@ref(crossvalidation)) to train our model.
```{r knn04}
trnctrl_df3 <- trainControl(method = "cv", number = 10)
model_knn_df3 <- train(diagnosis ~., data = train_df3, method = "knn",
trControl = trnctrl_df3,
tuneLength = 10)
model_knn_df3
```

\index{KNN model}
```{r knn05}
plot(model_knn_df3)
```

```{r knn06}
predict_knn_df3 <- predict(model_knn_df3, test_df3)
confusionMatrix(predict_knn_df3, test_df3$diagnosis, positive = "Malignant")
```

## Example 2. Wine dataset
\index{Wine Quality dataset}
We load the dataset and do some quick cleaning
```{r knn07, message=FALSE, warning=FALSE}
df <- read_csv("dataset/Wine_UCI.csv", col_names = FALSE)
colnames(df) <- c("Origin", "Alcohol", "Malic_acid", "Ash", "Alkalinity_of_ash",
"Magnesium", "Total_phenols", "Flavanoids", "Nonflavonoids_phenols",
"Proanthocyanins", "Color_intensity", "Hue", "OD280_OD315_diluted_wines",
"Proline")
glimpse(df)
```

The origin is our dependent variable. Let's make it a factor.
```{r knn08}
df$Origin <- as.factor(df$Origin)
#Let's check the distribution of our explained variable, Origin
round(prop.table(table(df$Origin)), 2)
```
That's nice: our explained variable is almost equally distributed across the 3 origins.

```{r knn09}
# Let's also check if we have any NA values
summary(df)
```
Here we notice that the range of values of our variables is quite wide. It means our data will need to be standardized. We also note that we have no "NA" values. That's quite a nice surprise!

### Understand the data
We first split our data into a training and a testing set.
```{r knn10}
df2 <- df
param_split_df2 <- createDataPartition(df2$Origin, p = 0.75, list = FALSE)
train_df2 <- df2[param_split_df2, ]
test_df2 <- df2[-param_split_df2, ]
```

The great thing with caret is that we can standardize our data in the training phase.

#### Model the data
Let's keep using `caret` for our training.
\index{KNN}
```{r knn11}
trnctrl_df2 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_knn_df2 <- train(Origin ~., data = train_df2, method = "knn",
trControl = trnctrl_df2,
preProcess = c("center", "scale"),
tuneLength = 10)
```

\index{KNN model}
```{r plot01_knn}
model_knn_df2
plot(model_knn_df2)
```

Let's use our model to make our predictions
```{r knn12}
prediction_knn_df2 <- predict(model_knn_df2, newdata = test_df2)
confusionMatrix(prediction_knn_df2, reference = test_df2$Origin)
```


## References

* KNN R, K-Nearest neighbor implementation in R using caret package. [Here](http://dataaspirant.com/2017/01/09/knn-implementation-r-using-caret-package/)
* A complete guide to KNN. [Here](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
