Email Spam Classifier

Classifying emails as spam or ham using the Naive Bayes and Logistic Regression algorithms

This project aims to develop a reliable and accurate classifier that marks an email as spam or ham (non-spam) using only a small dataset of 948 labeled emails.

Here, I implemented the multinomial Naive Bayes algorithm for text classification; a more detailed description can be found here.
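As a rough illustration of how multinomial Naive Bayes scores tokenized emails, here is a minimal sketch with Laplace (add-one) smoothing. It is not the code in main.py: the function names train_multinomial_nb and predict, the alpha parameter, and the toy emails are all assumptions made for this example.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods.

    docs: list of token lists; labels: one class label per document.
    alpha: Laplace smoothing constant (hypothetical default of 1.0).
    """
    vocab = {tok for doc in docs for tok in doc}
    docs_per_class = Counter(labels)
    word_counts = {c: Counter() for c in docs_per_class}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_likelihood = {}
    for c in docs_per_class:
        # Denominator: all tokens seen in class c, plus alpha per vocabulary word.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, doc):
    """Return the class with the highest log posterior for a token list."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
docs = [
    ["win", "money", "now"],
    ["meeting", "at", "noon"],
    ["win", "prize", "money"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "ham", "spam", "ham"]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["win", "money"]))  # → spam
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, and words never seen in training are simply skipped at prediction time in this sketch.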

Results and further improvements

An accuracy of 96.65% is achieved on the test dataset.

In an attempt to improve the results further, commonly occurring words (called stopwords, such as 'the', 'do', 'each', and 'for') are removed. They add little context to the emails: both spam and ham contain many of them, so they provide little useful information for telling the two apart. A good list of common stopwords can be found here and has been included in stopwords.txt.

Stopwords for other languages can also be found at that link.

Removing the stopwords led to a slightly higher accuracy of 96.86%.
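The stopword filtering described above can be sketched as follows. This is an illustration rather than the repository's implementation: it assumes stopwords.txt holds one stopword per line, and the helper names parse_stopwords and remove_stopwords are hypothetical.

```python
def parse_stopwords(lines):
    """Build a lowercase stopword set from an iterable of lines.

    Assumes one stopword per line; blank lines are ignored.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop every token whose lowercase form is a stopword."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# In-memory stand-in for the contents of stopwords.txt.
stopwords = parse_stopwords(["the\n", "do\n", "each\n", "for\n", "\n"])
tokens = "Claim the prize for each winner".split()
print(remove_stopwords(tokens, stopwords))  # → ['Claim', 'prize', 'winner']
```

To read the real file, the same function could be called as parse_stopwords(open("stopwords.txt", encoding="utf-8")), with the path adjusted to wherever the file lives in your checkout.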

Instructions for use

To train your Naive Bayes Classifier:

Clone this repository

$ git clone https://github.com/SuvanshKumar/spam-classifier.git

or

$ git clone git@github.com:SuvanshKumar/spam-classifier.git

Change into the cloned directory

$ cd spam-classifier

Go to the src folder

$ cd src

Run the main.py file.

$ python3 main.py

Naive Baye's classifier:
Including stop words, the accuracy is: 0.9665271966527197
After removing stop words, the accuracy is: 0.9686192468619247

There it is. You have successfully run a classifier that labels emails as spam or ham with over 96% accuracy.

(Optional) Adding your own data for testing/training

The dataset consists of emails stored as .txt files. The initial training and testing data are stored in the data folder, sorted into hams and spams. You can add your own email text files for training or testing in the appropriate folders. The stopwords.txt file may be edited to suit your needs.
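If you want to check how your added files would be picked up, reading a folder of .txt emails could look roughly like the sketch below. The helper name load_labeled_emails is hypothetical and the folder paths in the usage comment are placeholders; the exact directory layout and loading code used by main.py may differ.

```python
from pathlib import Path


def load_labeled_emails(folder, label):
    """Read every .txt file directly inside `folder` and pair its text with `label`.

    Undecodable bytes are ignored so that oddly encoded emails do not
    stop the whole load.
    """
    emails = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        emails.append((text, label))
    return emails


# Example usage with placeholder paths (adjust to the real folder names):
# training = (load_labeled_emails("path/to/hams", "ham")
#             + load_labeled_emails("path/to/spams", "spam"))
```

Files with other extensions in the same folder are skipped, so a new email only needs to be saved as a .txt file to be included.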

Tip: You can also classify emails in other languages (French? Hindi? Spanish?) using the same classifier. Add your email text file into the dataset and run main.py. The more data you have in the language of your choice, the better the results.

You can also add stopwords of your language to stopwords.txt.
