Commit

Add files via upload

fderyckel authored Feb 23, 2019
1 parent ba4abbb commit 132b66f
Showing 17 changed files with 1,516 additions and 13 deletions.
21 changes: 18 additions & 3 deletions 01-intro.Rmd
@@ -1,5 +1,13 @@
# Tests and inferences {#testinference}

```{r message=FALSE}
library(knitr)
library(kableExtra)
library(tidyverse)
```


One of the first things to be familiar with when doing machine learning work is the basics of statistical inference.
In this chapter, we go over some of these important concepts and the "R-ways" to do them.

@@ -26,7 +34,7 @@ Visual inspection of the distribution may be used for assessing normality, altho

The various normality tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation; the null hypothesis is that “sample distribution is normal.” If the test is significant, the distribution is non-normal. For small sample sizes, normality tests have little power to reject the null hypothesis and therefore small samples most often pass normality tests. For large sample sizes, significant results would be derived even in the case of a small deviation from normality, although this small deviation will not affect the results of a parametric test. It has been reported that the K-S test has low power and it should not be seriously considered for testing normality (11). Moreover, it is not recommended when parameters are estimated from the data, regardless of sample size (12).

The Shapiro-Wilk test \index{Shapiro-Wilk test} is based on the correlation between the data and the corresponding normal scores and provides better power than the K-S test even after the Lilliefors correction. Power is the most frequent measure of the value of a test for normality. Some researchers recommend the Shapiro-Wilk test as the best choice for testing the normality of data.
The Shapiro-Wilk test \index{Shapiro-Wilk test} is based on the correlation between the data and the corresponding normal scores, and it provides better power than the K-S test, even after the Lilliefors correction. Power is the most frequent measure of the value of a test for normality. Some researchers recommend the Shapiro-Wilk test as the best choice for testing the normality of data.
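
As a quick illustration (a sketch not part of the original text), the Shapiro-Wilk test is available in base R as `shapiro.test()`; on simulated data it behaves as described above:

```{r}
# Illustrative sketch: Shapiro-Wilk test on simulated data
set.seed(42)
x_normal <- rnorm(100)  # sample from a normal distribution
x_skewed <- rexp(100)   # sample from a skewed (exponential) distribution

shapiro.test(x_normal)  # usually a large p-value: no evidence against normality
shapiro.test(x_skewed)  # very small p-value: normality is rejected
```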

## T-tests {#ttest}
\index{T test}
@@ -40,7 +48,7 @@ When these assumptions are satisfied the results of the t test are valid. Otherw

Using the `mtcars` data set, we check if there is any difference in miles per gallon (mpg) between the automatic and manual transmission groups.

First things, first, let's check the data.
First things first, let's check the data.
\index{mtcars dataset}
```{r intro01}
glimpse(mtcars)
@@ -238,4 +246,11 @@ The conclusion above can be supported by the Shapiro-Wilk test on the ANOVA res
shapiro.test(residuals(model_aov_df))
```

Again the p-value indicate no violation from normality.
Again, the p-value indicates no violation of normality.

## Covariance
The correlation coefficient between two variables can be calculated as
$r = \frac{Cov(x, y)}{\sigma_x \cdot \sigma_y}$

The covariance is defined as $Cov(x, y) = \frac{\sum(x - \overline x) \cdot (y - \overline y)}{n-1}$
and the standard deviation is defined as $\sigma_x = \sqrt{\frac{\sum(x - \overline x)^2}{n-1}}$.
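
As a quick, illustrative check of these formulas (a sketch not present in the original text, using the `mtcars` data loaded above), we can verify that dividing the covariance by the product of the two standard deviations reproduces the value returned by `cor()`:

```{r}
# Illustrative check of the covariance / correlation formulas on mtcars
x <- mtcars$mpg
y <- mtcars$wt

cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # covariance by hand
r_manual <- cov_xy / (sd(x) * sd(y))                            # correlation from the formula

r_manual
cor(x, y)  # the built-in function gives the same value
```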
18 changes: 17 additions & 1 deletion 03-linear_regressions.Rmd
@@ -1,5 +1,21 @@
---
output:
pdf_document: default
html_document: default
---
# Single & Multiple Linear Regression {#mlr}

```{r message=FALSE}
library(skimr)
library(kableExtra) # for the kable_styling function
library(tibble)
library(dplyr)
library(readr)
library(ggplot2)
```


## Single variable regression

The general equation for a linear regression model
@@ -285,7 +301,7 @@ plot(model_bwe_df, scale = "adjr2")

Ideally, the model should consider the following variables
```{r linreg12}
model2_mlr_df <- lm(MEDV ~ Crime + NOX + RM + DIS + RAD + PTRATIO + B + LSTAT, data = df)
model2_mlr_df <- lm(MEDV ~ CRIM + NOX + RM + DIS + RAD + PTRATIO + BLACK + LSTAT, data = df)
summary(model2_mlr_df)
```

13 changes: 6 additions & 7 deletions 04-logistic_regression.Rmd
@@ -40,7 +40,6 @@ As usual we will use the `tidyverse` and `caret` package
```{r message=FALSE, warning=FALSE}
library(caret) # For confusion matrix
library(ROCR) # For the ROC curve
library(tidyverse)
```

We can now get straight to business and see how to model logistic regression with R, and then have the more interesting discussion on its performance.
@@ -109,8 +108,8 @@ To check the accuracy of the model, we need a confusion matrix with a cut off va

```{r}
prediction_lgr_df2 <- if_else(prediction_lgr_df2 > 0.5 , 1, 0)
confusionMatrix(data = prediction_lgr_df2,
reference = df2$admit, positive = "1")
confusionMatrix(data = factor(prediction_lgr_df2),
reference = factor(df2$admit), positive = "1")
```

We have an interesting situation here. Although all our variables were significant in our model, the accuracy of our model, `71%`, is just a little bit higher than the basic benchmark, the no-information model (i.e. we always predict the most frequent class), which in this case gives `68.25%`.
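
As a side note (a hypothetical check, not in the original text, assuming `df2` is the admission data frame used above), the no-information rate is simply the proportion of the most frequent class and can be computed by hand:

```{r}
# Hypothetical check: the no-information rate is the share of the most frequent class
# (assumes df2 is the admission data frame used above)
max(prop.table(table(df2$admit)))
```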
@@ -144,8 +143,8 @@ Let's try
```{r}
prediction_lgr_df2 <- predict(model_lgr_df2, data = df2, type = "response")
prediction_lgr_df2 <- if_else(prediction_lgr_df2 > cutoff , 1, 0)
confusionMatrix(data = prediction_lgr_df2,
reference = df2$admit,
confusionMatrix(data = factor(prediction_lgr_df2),
reference = factor(df2$admit),
positive = "1")
```

@@ -249,8 +248,8 @@ prediction_lgr_df2 <- if_else(prediction_lgr_df2 > 0.5, 1, 0)
table(df2$test)
confusionMatrix(data = prediction_lgr_df2,
reference = df2$test,
confusionMatrix(data = factor(prediction_lgr_df2),
reference = factor(df2$test),
positive = "1")
```
12 changes: 10 additions & 2 deletions 07-KNN.Rmd
@@ -1,5 +1,7 @@
# KNN - K Nearest Neighbour {#knnchapter}

Clustering is an unsupervised learning technique. It is the task of grouping together a set of objects in such a way that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is a measure that reflects the strength of the relationship between two data objects. Clustering is mainly used for exploratory data mining.

The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications.

The KNN classifier is also a non-parametric and instance-based learning algorithm.
@@ -22,11 +24,16 @@ On the other hand, a higher K averages more voters in each prediction and hence

What we are observing here is that increasing k will decrease variance and increase bias, while decreasing k will increase variance and decrease bias. Take a look at how variable the predictions are for different data sets at low k. As k increases, this variability is reduced. But if we increase k too much, then we no longer follow the true boundary line and we observe high bias. This is the nature of the bias-variance tradeoff.
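
To make this concrete, here is a small, hypothetical sketch (not part of the original chapter) that uses the built-in `iris` data and the `class` package to fit KNN for several values of k and record the test accuracy:

```{r}
# Hypothetical sketch: effect of k on KNN test accuracy (iris data, class package)
library(class)

set.seed(123)
idx <- sample(nrow(iris), size = 0.7 * nrow(iris))   # 70/30 train/test split
train_x <- scale(iris[idx, 1:4])                     # scale the predictors
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

accuracy <- sapply(c(1, 5, 15, 45), function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
accuracy   # small k: low bias / high variance; large k: higher bias / lower variance
```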

Clustering can be broadly divided into two subgroups:

* Hard clustering: in hard clustering, each data object or point either belongs to a cluster completely or not. For example in the Uber dataset, each location belongs to either one borough or the other.
* Soft clustering: in soft clustering, a data point can belong to more than one cluster with some probability or likelihood value. For example, you could identify some locations as the border points belonging to two or more boroughs.


## Example 1. Prostate Cancer dataset
\index{Prostate cancer dataset}

```{r knn01, message=FALSE, warning=FALSE}
library(tidyverse)
df <- read_csv("dataset/prostate_cancer.csv")
glimpse(df)
```
@@ -161,4 +168,5 @@ confusionMatrix(prediction_knn_df2, reference = test_df2$Origin)
## References

* KNN R, K-Nearest neighbor implementation in R using caret package. [Here](http://dataaspirant.com/2017/01/09/knn-implementation-r-using-caret-package/)
* A complete guide to KNN. [Here](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
* A complete guide to KNN. [Here](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
* K-Means Clustering in R Tutorial. [Here](https://www.datacamp.com/community/tutorials/k-means-clustering-r?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com)
26 changes: 26 additions & 0 deletions 08-KMeans_Clustering.Rmd
@@ -0,0 +1,26 @@
# K-means clustering {#kmeans}

## Distance calculation for clustering

With quantitative variables, distance calculations are highly influenced by variable units and magnitude. For example, clustering the variable height (in feet) together with salary (in rupees), which have different units and distributions (salary is skewed), will invariably return biased results. Hence, always make sure to standardize (mean = 0, sd = 1) the variables. Standardization results in unit-less variables.
The choice of a particular distance measure depends on the variable types; i.e., the formula for calculating the distance between numerical variables is different from the one for categorical variables.
Suppose we are given two p-dimensional numeric observations $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$. We can calculate various distances as follows (a short R sketch follows the list):

1. **Euclidean Distance**: It is used to calculate the distance between quantitative (numeric) variables. As it involves square terms, it is also known as L2 distance (because it squares the difference in coordinates).

2. **Manhattan Distance**: It is calculated as the sum of the absolute values of the differences in the given coordinates. This is known as L1 distance, and it is a special case of the Minkowski distance (with p = 1). An interesting fact about this distance is that it only calculates the horizontal and vertical distances; it doesn't calculate the diagonal distance.

3. **Hamming Distance**: It is used to calculate the distance between categorical variables. It uses a contingency table to count the number of mismatches among the observations. If a categorical variable is binary (say, male or female), it encodes the variable as male = 0, female = 1.
In case a categorical variable has more than two levels, the Hamming distance is calculated based on dummy encoding.

4. **Gower Distance**: It is used to calculate the distance between mixed (numeric, categorical) variables. It works this way: it computes the distance between observations weighted by its variable type, and then takes the mean across all variables. Technically, the above-mentioned distance measures are a form of Gower distances; i.e. if all the variables are numeric in nature, Gower distance takes the form of Euclidean. If all the values are categorical, it takes the form of Manhattan or Jaccard distance. In R, ClusterOfVar package handles mixed data very well.

5. **Cosine Similarity**: It is the most commonly used similarity metric in text analysis. The closeness of two text vectors is measured by the angle between them. The angle ($\theta$) is assumed to be between 0° and 90°; therefore, the maximum dissimilarity between two vectors is reached at $\cos 90^{\circ} = 0$ (perpendicular vectors), and two vectors are most similar at $\cos 0^{\circ} = 1$ (parallel vectors).
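
Below is a small, hypothetical sketch (not from the original text) showing how a few of these distances can be computed in R: `dist()` is base R, cosine similarity is computed by hand, and the Gower distance uses `daisy()` from the `cluster` package:

```{r}
# Hypothetical sketch: computing a few of the distances described above
x <- matrix(c(1, 2, 3,
              4, 6, 8), nrow = 2, byrow = TRUE)   # two observations, three variables

dist(x, method = "euclidean")   # L2 distance
dist(x, method = "manhattan")   # L1 distance

# Cosine similarity computed by hand
sum(x[1, ] * x[2, ]) / (sqrt(sum(x[1, ]^2)) * sqrt(sum(x[2, ]^2)))

# Gower distance for mixed (numeric + categorical) data, using cluster::daisy
library(cluster)
df_mixed <- data.frame(height = c(5.1, 6.0, 5.7),
                       gender = factor(c("male", "female", "female")))
daisy(df_mixed, metric = "gower")
```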



## Multinomial Logistic Regression


## References
General overview. [Here](https://www.hackerearth.com/blog/machine-learning/practical-guide-to-clustering-algorithms-evaluation-in-r/)
96 changes: 96 additions & 0 deletions 09-hierarichal_clustering.Rmd
@@ -0,0 +1,96 @@
# Hierarchical Clustering {#hierclust}

In simple words, hierarchical clustering tries to create a sequence of nested clusters to explore deeper insights from the data. For example, this technique is being popularly used to explore the standard plant taxonomy which would classify plants by family, genus, species, and so on.

Hierarchical clustering technique is of two types:

1. **Agglomerative Clustering** – It starts with treating every observation as a cluster. Then, it merges the most similar observations into a new cluster. This process continues until all the observations are merged into one cluster. It uses a bottom-up approach (think of an inverted tree).

2. **Divisive Clustering** – In this technique, initially all the observations are partitioned into one cluster (irrespective of their similarities). Then, the cluster splits into two sub-clusters carrying similar observations. These sub-clusters are intrinsically homogeneous. Then, we continue to split the clusters until the leaf cluster contains exactly one observation. It uses a top-down approach.


This technique creates a hierarchy (in a recursive fashion) to partition the data set into clusters. For agglomerative clustering, this partitioning is done in a bottom-up fashion. This hierarchy of clusters is graphically presented using a dendrogram (shown below).
![Terminology of dendrograms](otherpics/dendogram01.png)

Let’s understand how to study a dendrogram.

As you know, every leaf in the dendrogram carries one observation. As we move up the leaves, the leaf observations begin to merge into nodes (carrying observations which are similar to each other). As we move further up, these nodes again merge further.

Always remember: the lower the merging happens (towards the bottom of the tree), the more similar the observations are; the higher the merging happens (towards the top of the tree), the less similar the observations are.

To determine clusters, we make horizontal cuts across the branches of the dendrogram. The number of clusters is then given by the number of vertical lines of the dendrogram that lie below the horizontal cut.

![Cutting a dendrogram into clusters](otherpics/dendogram_cut.png)

As seen above, the horizontal line cuts the dendrogram into three clusters, since it crosses three vertical lines. In a way, selecting the height at which to make the horizontal cut is similar to finding k in k-means, since it also controls the number of clusters.

But how do we decide where to cut a dendrogram? In practice, analysts do it based on their judgement and business needs. More rigorously, there are several methods (described below) with which you can calculate the accuracy of your model for different cuts; finally, select the cut with the best accuracy.

The advantage of using hierarchical clustering over k-means is that it doesn't require prior knowledge of the number of clusters. However, some of the advantages which k-means has over hierarchical clustering are as follows:

* It uses less memory.
* It converges faster.
* Unlike hierarchical clustering, k-means doesn't get trapped in mistakes made at a previous level; it improves iteratively.
* k-means is non-deterministic in nature, i.e. every time you initialize it, it can produce different clusters (see the short sketch after this list). On the contrary, hierarchical clustering is deterministic.
* Note: k-means is preferred when the data is numeric. Hierarchical clustering is preferred when the data is categorical.
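
As a small, hypothetical illustration of the non-determinism point above (not part of the original chapter), running `kmeans()` with different random initializations can yield different partitions, whereas `hclust()` on the same data always produces the same tree:

```{r}
# Hypothetical illustration: k-means depends on its random initialization
data <- scale(iris[, 1:4])

set.seed(1)
km1 <- kmeans(data, centers = 3, nstart = 1)
set.seed(2)
km2 <- kmeans(data, centers = 3, nstart = 1)
table(km1$cluster, km2$cluster)   # labels, and possibly memberships, can differ between runs

# Hierarchical clustering is deterministic: same data, same tree
hc1 <- hclust(dist(data), method = "complete")
hc2 <- hclust(dist(data), method = "complete")
identical(cutree(hc1, k = 3), cutree(hc2, k = 3))   # TRUE
```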


## Example on the Pokemon dataset
For our first example, we are using the Pokemon dataset. It is available on the Kaggle website [here](https://www.kaggle.com/abcsds/pokemon).
Let's load the data and check what variables are there.
```{r message=FALSE}
library(tidyverse)  # provides read_csv() and glimpse()
df <- read_csv("dataset/pokemon.csv")
glimpse(df)
```

For this example, we are just concerned with a few explanatory variables of the data set: hit points, attack, defense, special attack, special defense and speed.
```{r}
df2 <- df %>% select(name = Name, hit_point = HP, attack = Attack, defense = Defense,
sp_attack = `Sp. Atk`, sp_defense = `Sp. Def`, speed = Speed) %>%
as_tibble()
glimpse(df2)
```

The first step with hierarchical clustering is always to scale the data we are dealing with. We use the `caret` package and its `preProcess` function.
```{r}
pokemon_preprocess <- caret::preProcess(df2, method = c("center", "scale"))
df_scaled <- predict(pokemon_preprocess, df2)
```

We can now use our standardized data in our hierarchical clustering algorithm.
```{r}
# Create the cluster
hclust_pokemon <- hclust(dist(df_scaled), method = "complete")
# Create the plot of the cluster
plot(hclust_pokemon)
```

Although we do not see any of the terminal leaves (they are all crowded together), all the leaves at the bottom carry one observation each; these are then merged with similar observations as we move upward. We can see that there are, in this instance, four main branches. With the `cutree` function, we can assign each observation to its cluster.
```{r}
# create a df using the cutree function
df2_clust <- cutree(hclust_pokemon, k = 4)
# visual on defense vs attack
ggplot(df2, aes(x = defense, y = attack, col = as.factor(df2_clust))) +
geom_point()
```

As a comment on this graph of defense vs attack abilities of the Pokemon, one can see that the clustering has worked more or less well. A few questions remain about cluster 4: the purple dot at the bottom right corner is a Pokemon with high defense and low attack abilities. Cluster 3 is also not very clear: the turquoise points are scattered underneath the green and red ones.

Using the `rect.hclust` function, we can also visualize the height at which to cut the branches.
```{r}
plot(hclust_pokemon)
rect.hclust(hclust_pokemon, k = 4, border = "red")
```




## Example on regressions


## References
On the general idea of hierarchical clustering. [Here](https://www.hackerearth.com/blog/machine-learning/practical-guide-to-clustering-algorithms-evaluation-in-r/)
