
A linear classifier built on a Support Vector Machine (SVM) that determines whether an email is spam, with an accuracy of 98.7%. Regularization is used to prevent overfitting. Emails are pre-processed with the Porter stemmer algorithm, and a spam vocabulary is used to build a feature vector for each email. The program also prints the top 15 predictors of spam.

williamcfrancis/Email-Spam-Classifier-using-SVM

Email Spam Classifier using SVM

Run the code

  1. Download all the files into a single folder
  2. Open Octave and change the working directory to that folder
  3. Run the "Main.m" file

Technical Details

This project implements the following email preprocessing and normalization steps:

• Lower-casing: The entire email is converted to lowercase.

• Stripping HTML: All HTML tags are removed from the emails.

• Normalizing URLs: All URLs are replaced with the text "httpaddr".

• Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".

• Normalizing Numbers: All numbers are replaced with the text "number".

• Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".

• Word Stemming: Words are reduced to their stemmed form.

• Removal of non-words: Non-words and punctuation are removed.
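
The normalization steps above can be sketched as a chain of regular-expression substitutions. This is a minimal Python illustration, not the repository's Octave code: the function name `normalize_email` and the specific patterns are assumptions, and Porter stemming is left out because it needs a separate stemmer implementation.

```python
import re


def normalize_email(text):
    """Return a list of normalized tokens for one email body (no stemming)."""
    # Lower-casing
    text = text.lower()
    # Stripping HTML: drop anything that looks like a tag
    text = re.sub(r"<[^<>]+>", " ", text)
    # Normalizing URLs and email addresses (done before numbers so that
    # digits inside an address do not get rewritten first)
    text = re.sub(r"(http|https)://\S*", "httpaddr", text)
    text = re.sub(r"\S+@\S+", "emailaddr", text)
    # Normalizing numbers and dollar signs
    text = re.sub(r"[0-9]+", "number", text)
    text = re.sub(r"[$]+", "dollar", text)
    # Removal of non-words: split on anything that is not a letter or digit
    tokens = re.split(r"[^a-z0-9]+", text)
    return [tok for tok in tokens if tok]


if __name__ == "__main__":
    sample = "<b>Visit</b> http://example.com or mail Bob@Example.com, only $5!"
    print(normalize_email(sample))
```

With these patterns, a price such as "$5" collapses into the single token "dollarnumber", since both substitutions apply to adjacent characters.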

The vocabulary list was built by selecting every word that occurs at least 100 times in the spam corpus, which yields a list of 1899 words.
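
As a rough illustration of how such a vocabulary and the per-email feature vector might be built, here is a small Python sketch. The helper names `build_vocabulary` and `email_features` are hypothetical, and the binary "word present or not" encoding is an assumption; the repository's Octave code may differ in both respects.

```python
from collections import Counter


def build_vocabulary(tokenized_emails, min_count=100):
    """Keep every word that occurs at least min_count times in the corpus."""
    counts = Counter(tok for email in tokenized_emails for tok in email)
    # Sort so that each word gets a stable index in the feature vector
    return sorted(word for word, n in counts.items() if n >= min_count)


def email_features(tokens, vocabulary):
    """Binary vector: entry i is 1 if vocabulary[i] appears in the email."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


if __name__ == "__main__":
    # Tiny toy corpus, so a threshold of 2 stands in for the real 100
    corpus = [["buy", "now", "buy"], ["buy", "cheap"], ["now"]]
    vocab = build_vocabulary(corpus, min_count=2)
    print(vocab)
    print(email_features(["cheap", "now", "now"], vocab))
```

With the 1899-word vocabulary described above, each email would map to a vector of length 1899 under this encoding.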
