---
editor_options:
markdown:
wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Probability and Statistics

# Lab Assignment 1: Naive Bayes Classifier

## Work breakdown

- Viktoria Prokhorova: Data and metrics visualization
- Dmytro Shumskyi: Predict and fit methods
- Mykola Vysotskyi: Predict method, metrics functions, data
pre-processing

## Data description

- **4 - spam.** This data set contains SMS messages classified as spam
  or non-spam (ham in the data set). The task is to determine whether a
  given message is spam or not.

Each data set consists of two files: *train.csv* and *test.csv*. You
will need the first one to find the probability distributions for each
of the features, while the second one is needed to check how well your
classifier works.

## **Outline of the work**

1. **Data pre-processing** (includes removing punctuation marks and
stop words, representing each message as a bag-of-words)

2. **Data visualization**

3. **Classifier implementation** (using the training set, calculate all
   the conditional probabilities in formula (1), sketched right after
   this outline, and then use those to predict classes for messages in
   the testing set)

4. **Measurements of effectiveness of your classifier** (accuracy,
   precision and recall curves, F1 score metric, etc.)

5. **Conclusions**
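
Formula (1) refers to the Naive Bayes decision rule given in the
assignment statement and is not reproduced in this file. For reference,
here is a sketch of the rule our classifier implements: for a message
with (non-stop-word) tokens $w_1, \dots, w_n$ and a class
$c \in \{\text{ham}, \text{spam}\}$,

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w_i \mid c) \approx \frac{\text{count}(w_i, c) + 1}{N_c + 2},
$$

where $N_c$ is the total number of non-stop words observed in class $c$
and the $+1$/$+2$ terms are the simple Laplace-style smoothing used in
the code below (textbook Laplace smoothing would add the vocabulary
size to the denominator instead of 2). A message is labelled ham when
its ham score exceeds its spam score.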

## Data pre-processing

```{r}
# Include all necessary libraries
library(tidytext)
library(readr)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(tm)
```

```{r}
# Loading stop words
stop_words <- read_file("data/stop_words.txt")
splittedStopWords <- strsplit(stop_words, split='\n')
splittedStopWords <- splittedStopWords[[1]]
# Loading datasets
test <- read_csv("data/test.csv", show_col_types = FALSE)
train <- read_csv("data/train.csv", show_col_types = FALSE)
# Calculate word frequencies for a text column of the given dataframe, dropping stop words
freqDataframe <- function(dataframe, column, stop_words = NULL)
{
dataframe %>%
unnest_tokens(word, column) %>%
count(word, sort = TRUE) %>%
filter(!word %in% stop_words)
}
```
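
For a quick, purely illustrative sanity check of what was just loaded
(the `Category` and `Message` column names are assumed from how they
are used throughout the rest of the code):

```{r}
# Expected structure: Category ("ham"/"spam") and Message (raw SMS text)
head(train, 3)
table(train$Category)
```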

## Data visualization

```{r}
# Word frequencies for visualization (built from the test split)
hamFreqVis <- freqDataframe(test[test$Category=="ham", ], "Message", splittedStopWords)
spamFreqVis <- freqDataframe(test[test$Category=="spam", ], "Message", splittedStopWords)
# Ham cloud (green) and spam cloud (red) side by side; the middle panel
# is left empty as a spacer
par(mfrow = c(1, 3))
wordcloud(words = hamFreqVis$word, freq = hamFreqVis$n, max.words = 30, random.order = FALSE,
          colors = 'green')
plot.new()
wordcloud(words = spamFreqVis$word, freq = spamFreqVis$n, max.words = 30, random.order = FALSE,
          colors = 'red')
```
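
As a complementary, optional view of the same frequency tables, the
most frequent words per class can also be plotted as bar charts. This
is only a sketch that reuses the `hamFreqVis`/`spamFreqVis` data frames
built above together with the already-loaded dplyr and ggplot2:

```{r}
# Top-10 most frequent non-stop words per class
topWords <- bind_rows(
  mutate(head(hamFreqVis, 10), Category = "ham"),
  mutate(head(spamFreqVis, 10), Category = "spam")
)
ggplot(topWords, aes(x = n, y = reorder(word, n), fill = Category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Category, scales = "free") +
  labs(x = "Count", y = NULL)
```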

## Classifier implementation

```{r}
# Implementations of all metrics used below; TRUE (i.e. ham) is treated as the positive class
recall <- function(y_true, y_pred) {
tp <- sum(y_true == y_pred & y_true == TRUE)
fn <- sum(y_true != y_pred & y_true == TRUE)
return (tp / (tp + fn))
}
precision <- function(y_true, y_pred) {
tp <- sum(y_true == y_pred & y_true == TRUE)
fp <- sum(y_true != y_pred & y_true == FALSE)
return (tp / (tp + fp))
}
f1Score <- function(y_true, y_pred) {
recallV <- recall(y_true, y_pred)
precisionV <- precision(y_true, y_pred)
return (2 * precisionV * recallV / (precisionV + recallV))
}
accuracy <- function(y_true, y_pred)
{
return (sum(y_true == y_pred) / length(y_true))
}
```
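
A quick sanity check of these helpers on small hand-made vectors
(purely illustrative values; `TRUE`, which corresponds to ham in how
the labels are encoded later, is treated as the positive class):

```{r}
# Illustrative check of the metric helpers (TRUE = positive class)
y_true_demo <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
y_pred_demo <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
accuracy(y_true_demo, y_pred_demo)   # 3 of 5 correct -> 0.6
precision(y_true_demo, y_pred_demo)  # 2 TP, 1 FP -> 2/3
recall(y_true_demo, y_pred_demo)     # 2 TP, 1 FN -> 2/3
f1Score(y_true_demo, y_pred_demo)    # harmonic mean -> 2/3
```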

```{r}
naiveBayes <- setRefClass("naiveBayes",
# here it would be wise to have some vars to store intermediate result
# frequency dict etc. Though pay attention to bag of words!
fields = list(
hamFreq = "data.frame",
spamFreq = "data.frame",
hamWordsCount = "numeric",
spamWordsCount = "numeric",
hamClassProb = "numeric",
spamClassProb = "numeric"
),
methods = list(
# prepare your training data as X - bag of words for each of your
# messages and corresponding label for the message encoded as 0 or 1
# (binary classification task)
fit = function(X, y)
{
hamFreq <<- freqDataframe(X[X$Category=="ham", ], "Message", splittedStopWords)
spamFreq <<- freqDataframe(X[X$Category=="spam", ], "Message", splittedStopWords)
hamWordsCount <<- sum(hamFreq$n)
spamWordsCount <<- sum(spamFreq$n)
hamClassProb <<- nrow(X[X$Category=="ham", ]) / nrow(X)
spamClassProb <<- nrow(X[X$Category=="spam", ]) / nrow(X)
},
# return prediction for a single message
predict = function(message)
{
# Check if message is not empty
if(nchar(message) == 0) {
return (NULL)
}
# Convert message to bag of words
wrapperDf <- data.frame(word = message)
tokens <- unnest_tokens(wrapperDf, word, word) %>% filter(!word %in% splittedStopWords)
hamProb <- hamClassProb
spamProb <- spamClassProb
for (i in seq_len(nrow(tokens)))  # seq_len() is safe when every token was a stop word
{
word <- tokens[i, ]
inHamCount <- ifelse(!is.na(any(hamFreq$word == word)) && any(hamFreq$word == word),
hamFreq[hamFreq$word == word, "n"]$n,
0)
inSpamCount <- ifelse(!is.na(any(spamFreq$word == word)) && any(spamFreq$word == word),
spamFreq[spamFreq$word == word, "n"]$n,
0)
# Laplace-style smoothing (+1/+2) so an unseen word never zeroes the product
hamProb <- hamProb * (inHamCount + 1) / (hamWordsCount + 2)
spamProb <- spamProb * (inSpamCount + 1) / (spamWordsCount + 2)
}
return (hamProb > spamProb)
},
# score you test set so to get the understanding how well you model
# works.
# look at f1 score or precision and recall
# visualize them
# try how well your model generalizes to real world data!
score = function(X_test, y_test)
{
y_pred <- lapply(X_test$Message, function(message) { predict(message) })
scores = c(
"accuracy" = accuracy(y_test, y_pred),
"precision" = precision(y_test, y_pred),
"recall" = recall(y_test, y_pred),
"f1Score" = f1Score(y_test, y_pred)
)
return (scores)
}
))
# Create the model and fit it on the training set
model <- naiveBayes()
model$fit(train, train$Category)
```
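
As an informal check of the fit (the exact numbers depend on the
provided training split), the learned class priors and the most
frequent words per class can be inspected directly from the model's
fields:

```{r}
# Class priors learned from the training data and the top words per class
c(ham = model$hamClassProb, spam = model$spamClassProb)
head(model$hamFreq, 5)
head(model$spamFreq, 5)
```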

## Measure effectiveness of your classifier

```{r}
# Score the fitted model on the test set (labels encoded as TRUE = ham)
model$score(test, test$Category == "ham")
```

```{r}
# Re-fit the model on random subsamples of the training set and score each fit on the test set
trainSizes <- c(0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1.0)
accuracyScores <- c()
precisionScores <- c()
recallScores <- c()
f1Scores <- c()
tmodel <- naiveBayes()
for (trainSize in trainSizes)
{
trainSample <- train[sample(nrow(train), floor(nrow(train) * trainSize)), ]
tmodel$fit(trainSample, trainSample$Category)
scores <- tmodel$score(test, test$Category == "ham")
accuracyScores <- c(accuracyScores, scores["accuracy"])
precisionScores <- c(precisionScores, scores["precision"])
recallScores <- c(recallScores, scores["recall"])
f1Scores <- c(f1Scores, scores["f1Score"])
}
```

```{r}
# Reshape the scores into long format so all four metrics share one plot
df_reshaped <- data.frame(x = trainSizes,
                          y = c(accuracyScores, precisionScores, recallScores, f1Scores),
                          group = c(rep("Accuracy", length(trainSizes)),
                                    rep("Precision", length(trainSizes)),
                                    rep("Recall", length(trainSizes)),
                                    rep("F1", length(trainSizes))))
ggplot(df_reshaped, aes(x, y, col = group)) +
  geom_line() +
  labs(x = "Fraction of the training set used", y = "Score", colour = "Metric")
```

#### Failure cases

```{r}
convert_label <- function(x) { ifelse(x == "ham", TRUE, FALSE) }
y_pred <- lapply(test$Message, function(message) { model$predict(message) })
failureCases <- test[y_pred != convert_label(test$Category), ]
failureCases
```

## Check on real world data

```{r}
model$predict("") # should return NULL
model$predict("Hello, how are you?") # should return TRUE
model$predict("WINNER!! This is the secret code to unlock the money: C3421.") # should return FALSE
```

## Conclusions

Summarize your work by explaining in a few sentences the points listed
below.

- Describe the method implemented in general. Show what mathematical
  foundations you are basing your solution on.
- List pros and cons of the method. This should include the
  limitations of your method, all the assumptions you make about the
  nature of your data, etc.
- The method is called the **Naive Bayes classifier**. In our case we
  used it to filter **spam** messages from **non-spam** (ham) ones.
  First, we compute how often each word occurs in spam and in ham
  messages, forming a bag-of-words representation. Then, for a given
  text message, we calculate the probability that it is spam or ham
  using **Bayes' formula**. We assume that all features (words) are
  **independent**. To find the probability that a message belongs to a
  class, we multiply the prior probability of the class by the product
  of the probabilities of each word given that class; the division by
  the probability of the words themselves can be skipped, because it
  is the same for both classes and does not affect the comparison. We
  also used **Laplace smoothing** to prevent the probability of an
  unseen word from being 0.
- **Pros** of this method: it is easy to implement and computationally
  light. **Cons**: the *naivety* of the method (we assume that all
  words are independent and ignore word order), and words that often
  appear in one class may also appear in the other, which can lead to
  incorrect classification results.