0 parents
commit dc94d1a
Showing 2 changed files with 1,021 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,306 @@ | ||
--- | ||
editor_options: | ||
markdown: | ||
wrap: 72 | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
``` | ||
|
||
# Probability and Statistics | ||
|
||
# Lab Assignment 1: Naive Bayes Classifier | ||
|
||
## Work breakdown | ||
|
||
- Viktoria Prokhorova: Data and metrics visualization | ||
- Dmytro Shumskyi: Predict and fit methods | ||
- Mykola Vysotskyi: Predict method, metrics functions, data | ||
pre-processing | ||
|
||
## Data description | ||
|
||
- **4 - spam** This last data set contains SMS messages classified as | ||
spam or non-spam (ham in the data set). The task is to determine | ||
whether a given message is spam or non-spam. | ||
|
||
Each data set consists of two files: *train.csv* and *test.csv*. The | ||
first is used to estimate the probability distributions of each of the | ||
features, while the second is used to check how well your classifier | ||
works. | ||
|
||
## **Outline of the work** | ||
|
||
1. **Data pre-processing** (includes removing punctuation marks and | ||
stop words, representing each message as a bag-of-words) | ||
|
||
2. **Data visualization** | ||
|
||
3. **Classifier implementation** (using the training set, calculate all | ||
the conditional probabilities in formula (1) and then use those to | ||
predict classes for messages in the testing set) | ||
|
||
4. **Measurements of the effectiveness of your classifier** (accuracy, | ||
precision and recall curves, F1 score, etc.) | ||
|
||
5. **Conclusions** | ||
|
||
## Data pre-processing | ||
|
||
```{r} | ||
# Include all necessary libraries | ||
library(tidytext) | ||
library(readr) | ||
library(dplyr) | ||
library(ggplot2) | ||
library(wordcloud) | ||
library(tm) | ||
``` | ||
|
||
```{r} | ||
# Loading stop words | ||
stop_words <- read_file("data/stop_words.txt") | ||
splittedStopWords <- strsplit(stop_words, split='\n') | ||
splittedStopWords <- splittedStopWords[[1]] | ||
# Loading datasets | ||
test <- read_csv("data/test.csv", show_col_types = FALSE) | ||
train <- read_csv("data/train.csv", show_col_types = FALSE) | ||
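# Function to calculate word frequencies in the given data frame. | ||
# `column` holds the column name as a string, so it is converted to a | ||
# symbol before being passed to unnest_tokens(). | ||
freqDataframe <- function(dataframe, column, stop_words = NULL) | ||
{ | ||
dataframe %>% | ||
unnest_tokens(word, !!rlang::sym(column)) %>% | ||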
count(word, sort = TRUE) %>% | ||
filter(!word %in% stop_words) | ||
} | ||
``` | ||
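
To make the bag-of-words step concrete, here is a minimal base-R sketch of the same tokenise, count and filter idea. It is only an approximation: `bagOfWords` and the two sample messages are made up for illustration, and the regular expression is a simplification that will not match the tokenisation rules of `unnest_tokens()` in every detail.

```r
# Base-R approximation of the tokenise -> count -> filter pipeline.
# `bagOfWords` and the sample inputs are hypothetical, for illustration only.
bagOfWords <- function(messages, stopWords = character(0)) {
  # Lower-case, then replace everything except letters, digits and spaces
  cleaned <- gsub("[^a-z0-9 ]", " ", tolower(messages))
  # Split on runs of spaces and flatten into one token vector
  tokens <- unlist(strsplit(cleaned, " +"))
  # Drop empty strings and stop words
  tokens <- tokens[nzchar(tokens) & !tokens %in% stopWords]
  # Count occurrences and sort by decreasing frequency
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(word = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

msgs <- c("Win a FREE prize now!", "Free entry: win cash, win big")
freq <- bagOfWords(msgs, stopWords = c("a", "now"))
print(freq)
```

In this toy input "win" occurs three times and "free" twice, so they end up at the top of the frequency table, which is the kind of signal the classifier relies on.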
|
||
## Data visualization | ||
|
||
```{r} | ||
# Words for visualization | ||
hamFreqVis <- freqDataframe(test[test$Category=="ham", ], "Message", splittedStopWords) | ||
spamFreqVis <- freqDataframe(test[test$Category=="spam", ], "Message", splittedStopWords) | ||
par(mfrow = c(1, 3)) | ||
wordcloud(words = hamFreqVis$word, freq=hamFreqVis$n, max.words=30, random.order = FALSE, | ||
colors = 'green') | ||
plot.new() | ||
wordcloud(words = spamFreqVis$word, freq=spamFreqVis$n, max.words=30, random.order=FALSE, | ||
colors = 'red') | ||
``` | ||
|
||
## Classifier implementation | ||
|
||
```{r} | ||
# Implementations of all metrics used; TRUE (ham) is the positive class | ||
recall <- function(y_true, y_pred) { | ||
tp <- sum(y_true == y_pred & y_true == TRUE) | ||
fn <- sum(y_true != y_pred & y_true == TRUE) | ||
return (tp / (tp + fn)) | ||
} | ||
precision <- function(y_true, y_pred) { | ||
tp <- sum(y_true == y_pred & y_true == TRUE) | ||
fp <- sum(y_true != y_pred & y_true == FALSE) | ||
return (tp / (tp + fp)) | ||
} | ||
f1Score <- function(y_true, y_pred) { | ||
recallV <- recall(y_true, y_pred) | ||
precisionV <- precision(y_true, y_pred) | ||
return (2 * precisionV * recallV / (precisionV + recallV)) | ||
} | ||
accuracy <- function(y_true, y_pred) | ||
{ | ||
return (sum(y_true == y_pred) / length(y_true)) | ||
} | ||
``` | ||
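
As a quick sanity check of these definitions, the sketch below computes the same four metrics on a tiny hand-made example where the confusion counts can be verified by eye. The label vectors are invented for illustration; `TRUE` stands for ham, matching how `score` is called later with `test$Category == "ham"`.

```r
# Toy labels, chosen so the counts are easy to verify by hand (TRUE = ham)
y_true <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
y_pred <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

tp <- sum(y_pred & y_true)     # 2 ham messages predicted as ham
fp <- sum(y_pred & !y_true)    # 1 spam message predicted as ham
fn <- sum(!y_pred & y_true)    # 1 ham message predicted as spam
tn <- sum(!y_pred & !y_true)   # 1 spam message predicted as spam

precisionV <- tp / (tp + fp)
recallV <- tp / (tp + fn)
f1V <- 2 * precisionV * recallV / (precisionV + recallV)
accuracyV <- (tp + tn) / length(y_true)

print(c(precision = precisionV, recall = recallV,
        f1 = f1V, accuracy = accuracyV))
```

Because ham is the positive class here, a false positive is a spam message that slips through as ham, so precision is the metric to watch if missed spam is the main concern.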
|
||
```{r} | ||
naiveBayes <- setRefClass("naiveBayes", | ||
# here it would be wise to have some vars to store intermediate result | ||
# frequency dict etc. Though pay attention to bag of words! | ||
fields = list( | ||
hamFreq = "data.frame", | ||
spamFreq = "data.frame", | ||
hamWordsCount = "numeric", | ||
spamWordsCount = "numeric", | ||
hamClassProb = "numeric", | ||
spamClassProb = "numeric" | ||
), | ||
methods = list( | ||
# prepare your training data as X - bag of words for each of your | ||
# messages and corresponding label for the message encoded as 0 or 1 | ||
# (binary classification task) | ||
fit = function(X, y) | ||
{ | ||
hamFreq <<- freqDataframe(X[X$Category=="ham", ], "Message", splittedStopWords) | ||
spamFreq <<- freqDataframe(X[X$Category=="spam", ], "Message", splittedStopWords) | ||
hamWordsCount <<- sum(hamFreq$n) | ||
spamWordsCount <<- sum(spamFreq$n) | ||
hamClassProb <<- nrow(X[X$Category=="ham", ]) / nrow(X) | ||
spamClassProb <<- nrow(X[X$Category=="spam", ]) / nrow(X) | ||
}, | ||
# return prediction for a single message | ||
predict = function(message) | ||
{ | ||
# Check if message is not empty | ||
if(nchar(message) == 0) { | ||
return (NULL) | ||
} | ||
# Convert message to bag of words | ||
wrapperDf <- data.frame(word = message) | ||
tokens <- unnest_tokens(wrapperDf, word, word) %>% filter(!word %in% splittedStopWords) | ||
hamProb <- hamClassProb | ||
spamProb <- spamClassProb | ||
# seq_len() yields an empty loop when every token was a stop word | ||
for (i in seq_len(nrow(tokens))) | ||
{ | ||
word <- tokens[i, ] | ||
inHamCount <- ifelse(!is.na(any(hamFreq$word == word)) && any(hamFreq$word == word), | ||
hamFreq[hamFreq$word == word, "n"]$n, | ||
0) | ||
inSpamCount <- ifelse(!is.na(any(spamFreq$word == word)) && any(spamFreq$word == word), | ||
spamFreq[spamFreq$word == word, "n"]$n, | ||
0) | ||
hamProb <- hamProb * (inHamCount + 1) / (hamWordsCount + 2) | ||
spamProb <- spamProb * (inSpamCount + 1) / (spamWordsCount + 2) | ||
} | ||
return (hamProb > spamProb) | ||
}, | ||
# score your test set to see how well your model | ||
# works. | ||
# look at f1 score or precision and recall | ||
# visualize them | ||
# try how well your model generalizes to real world data! | ||
score = function(X_test, y_test) | ||
{ | ||
y_pred <- lapply(X_test$Message, function(message) { predict(message) }) | ||
scores = c( | ||
"accuracy" = accuracy(y_test, y_pred), | ||
"precision" = precision(y_test, y_pred), | ||
"recall" = recall(y_test, y_pred), | ||
"f1Score" = f1Score(y_test, y_pred) | ||
) | ||
return (scores) | ||
} | ||
)) | ||
# Create and fit model | ||
model = naiveBayes() | ||
model$fit(train, train$Category) | ||
``` | ||
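
The `predict` method multiplies many small probabilities, which can underflow to 0 for long messages. A common alternative is to add log-probabilities instead. The sketch below shows that variant on invented toy counts; `logScore`, the counts and the priors are hypothetical and not part of the notebook. It also smooths with the vocabulary size in the denominator, which is an assumption that differs from the `+ 2` used in `predict` above.

```r
# Hypothetical toy counts; in the notebook these would come from fit()
hamCounts  <- c(hello = 5, free = 1, meeting = 4)
spamCounts <- c(free = 6, winner = 3, hello = 1)
vocabSize  <- length(union(names(hamCounts), names(spamCounts)))

# Log-score of a token vector under one class: log prior plus the sum of
# log smoothed likelihoods (add-one smoothing over the whole vocabulary)
logScore <- function(tokens, counts, prior, vocabSize) {
  wordCounts <- counts[tokens]          # NA for words unseen in this class
  wordCounts[is.na(wordCounts)] <- 0
  log(prior) + sum(log((wordCounts + 1) / (sum(counts) + vocabSize)))
}

tokens  <- c("free", "winner")
hamLog  <- logScore(tokens, hamCounts, prior = 0.8, vocabSize)
spamLog <- logScore(tokens, spamCounts, prior = 0.2, vocabSize)
isHam   <- hamLog > spamLog
print(c(ham = hamLog, spam = spamLog))
```

Since the logarithm is monotonic, comparing the two log-scores gives the same decision as comparing the products, and an empty token vector simply falls back to comparing the priors.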
|
||
## Measure effectiveness of your classifier | ||
|
||
```{r} | ||
# Example of score results | ||
model$score(test, test$Category == "ham") | ||
``` | ||
|
||
```{r} | ||
# Get scores for different sizes | ||
trainSizes <- c(0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1.0) | ||
accuracyScores <- c() | ||
precisionScores <- c() | ||
recallScores <- c() | ||
f1Scores <- c() | ||
tmodel = naiveBayes() | ||
for (trainSize in trainSizes) | ||
{ | ||
trainSample <- train[sample(nrow(train), floor(nrow(train) * trainSize)), ] | ||
tmodel$fit(trainSample, trainSample$Category) | ||
scores <- tmodel$score(test, test$Category == "ham") | ||
accuracyScores <- c(accuracyScores, scores["accuracy"]) | ||
precisionScores <- c(precisionScores, scores["precision"]) | ||
recallScores <- c(recallScores, scores["recall"]) | ||
f1Scores <- c(f1Scores, scores["f1Score"]) | ||
} | ||
``` | ||
|
||
```{r} | ||
df_reshaped <- data.frame(x = trainSizes, | ||
y = c(accuracyScores, precisionScores, recallScores, f1Scores), | ||
group = c(rep("Accuracy", length(trainSizes)), | ||
rep("Precision", length(trainSizes)), | ||
rep("Recall", length(trainSizes)), | ||
rep("F1", length(trainSizes)) | ||
)) | ||
ggplot(df_reshaped, aes(x, y, col = group)) + geom_line() | ||
``` | ||
|
||
#### Failure cases | ||
|
||
```{r} | ||
convert_label <- function(x) { ifelse(x == "ham", TRUE, FALSE) } | ||
y_pred <- lapply(test$Message, function(message) { model$predict(message) }) | ||
failureCases <- test[y_pred != convert_label(test$Category), ] | ||
failureCases | ||
``` | ||
|
||
## Check on real world data | ||
|
||
```{r} | ||
model$predict("") # should return NULL | ||
model$predict("Hello, how are you?") # should return TRUE | ||
model$predict("WINNER!! This is the secret code to unlock the money: C3421.") # should return FALSE | ||
``` | ||
|
||
## Conclusions | ||
|
||
Summarize your work by explaining in a few sentences the points listed | ||
below. | ||
|
||
- Describe the method implemented in general. Show the mathematical | ||
foundations your solution is based on. | ||
- List pros and cons of the method. This should include the | ||
limitations of your method, all the assumptions you make about the | ||
nature of your data etc. | ||
- The method is called the **Naive Bayes classifier**. In our case we | ||
use it to separate **spam** messages from **non-spam** (ham). First, | ||
we count how often each word occurs in spam and in ham messages, | ||
which gives a bag-of-words for each class. Then, for a given text | ||
message, we estimate the probability that it is spam or ham using | ||
the **Bayes formula**. We assume that all features (words) are | ||
**independent** given the class. To find the probability that a | ||
message belongs to a class, we multiply the prior probability of the | ||
class by the product of the probabilities of each word given the | ||
class and divide by the product of the probabilities of each word | ||
(in practice we skip this division, because the denominator is the | ||
same for both classes and does not affect the comparison). We also | ||
use **Laplace smoothing** so that a word unseen in one class does | ||
not make the whole product 0. | ||
- **Pros** of this method: it is easy to implement and computationally | ||
light. **Cons**: the *naivety* of the method (we assume that all | ||
words are independent and ignore word order); words that are | ||
frequent in one class can also appear in the other and lead to | ||
incorrect classification results. |
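
The decision rule described above can be written compactly as follows. The notation is introduced here for illustration and is not taken from the assignment: $w_1, \dots, w_n$ are the words of the message after stop-word removal, $N_{w,c}$ is the number of times word $w$ occurs in training messages of class $c$, $N_c$ is the total number of word occurrences in class $c$, and $|V|$ is the vocabulary size.

$$
\hat{c} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{N_{w,c} + 1}{N_c + |V|}
$$

The right-hand formula is the textbook form of Laplace (add-one) smoothing. The `predict` method above uses $N_c + 2$ in the denominator instead of $N_c + |V|$, so its per-word values do not sum to 1 over the vocabulary, although the comparison between the two classes is still well defined.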