# gator-ryan/Spam-or-not-a-spam-Message-detection
# Spam or Ham Classification using Word2Vec & Random Forest

## 📄 Project Overview

This project implements a spam detection system for SMS messages using Word2Vec embeddings and a Random Forest classifier. It is designed to be leakage-safe: the embedding model learns only from the training data, and the classifier is evaluated on unseen test data.

The model converts each SMS into a fixed-size numerical vector by averaging Word2Vec embeddings of its words and then classifies it as either spam or ham.
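The averaging step can be sketched as follows. The function name `message_to_vec` and the toy two-dimensional vector table are hypothetical; in the real pipeline the table would be a trained gensim model's `model.wv` lookup:

```python
import numpy as np

def message_to_vec(tokens, word_vectors, dim):
    """Average the embeddings of all in-vocabulary tokens.

    Out-of-vocabulary tokens are skipped; a message with no known
    tokens maps to the zero vector, keeping the output size fixed.
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy 2-dimensional "embedding table" standing in for model.wv
wv = {"free": np.array([1.0, 0.0]), "win": np.array([0.0, 1.0])}
vec = message_to_vec(["free", "win", "unseen"], wv, dim=2)  # -> [0.5, 0.5]
```

Because every message maps to a vector of the same dimension, messages of any length can be fed to the same classifier.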

๐Ÿ› ๏ธ Features

- Converts raw text messages to Word2Vec embeddings.
- Handles messages of varying lengths by averaging word embeddings.
- Uses a Random Forest for robust classification.
- Prevents data leakage by splitting the data before training embeddings.
- Evaluates model performance using accuracy, precision, recall, and F1-score.
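A minimal sketch of the leakage-safe ordering, using toy messages in place of the real dataset: the split happens before any vocabulary or embedding model is built.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real messages and labels (1 = spam, 0 = ham).
messages = ["free prize now", "see you at lunch", "win cash today", "meeting at 5"]
labels = [1, 0, 1, 0]

# Split FIRST, so the embedding model never sees test-set text.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.5, random_state=42, stratify=labels
)

# Word2Vec (or any other vocabulary-building step) is then fit on X_train only,
# e.g. gensim.models.Word2Vec([m.split() for m in X_train], min_count=1)
```

Fitting the embeddings before splitting would let test-set vocabulary and co-occurrence statistics leak into the model, inflating the reported scores.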

๐Ÿ“ Dataset

- Uses the SMSSpamCollection dataset.
- Format: tab-separated file with two columns:
  - `label`: "spam" or "ham"
  - `message`: the SMS text

Example:

```
ham	Go until jurong point, crazy.. Available only ...
spam	Free entry in 2 a wkly comp to win FA Cup ...
```
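Loading this format with pandas might look like the sketch below; the in-memory string stands in for the real file path, and the column names are the ones used above:

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for the SMSSpamCollection file (tab-separated, no header).
raw = "ham\tGo until jurong point, crazy..\nspam\tFree entry in 2 a wkly comp\n"

df = pd.read_csv(StringIO(raw), sep="\t", header=None, names=["label", "message"])
df["label_num"] = (df["label"] == "spam").astype(int)  # ham -> 0, spam -> 1
```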

โš™๏ธ Dependencies

- Python 3.8+
- pandas
- numpy
- scikit-learn
- gensim

Install dependencies:

```bash
pip install pandas numpy scikit-learn gensim
```

## 📊 Results

Classification Report (Leakage-Safe):

| Class    | Precision | Recall | F1-score | Support |
|----------|-----------|--------|----------|---------|
| 0 (ham)  | 0.95      | 0.99   | 0.97     | 966     |
| 1 (spam) | 0.94      | 0.66   | 0.77     | 149     |

- Accuracy: 0.95
- Macro avg: precision = 0.95, recall = 0.83, F1-score = 0.87
- Weighted avg: precision = 0.95, recall = 0.95, F1-score = 0.94
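A report in this format comes directly from scikit-learn. The labels and predictions below are hypothetical placeholders standing in for the Random Forest's output on the real test set:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical test labels/predictions (the real ones come from the model).
y_test = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0]

report = classification_report(y_test, y_pred, target_names=["ham", "spam"])
acc = accuracy_score(y_test, y_pred)
print(report)
print(f"Accuracy: {acc:.2f}")
```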

## ✅ Observations

- Overall accuracy is realistic: 95% on a genuinely held-out test set.
- The ham class (0) is classified very well: recall of 0.99.
- The spam class (1) is harder to detect: recall of 0.66, meaning the model misses ~34% of spam messages.
- The F1-score for spam is lower (0.77), so the model handles the minority class less well.

โš ๏ธ Notes

- Data leakage prevention is crucial: Word2Vec is trained only on the training set.
- Random Forest does not require feature scaling, but other models such as SVM or Logistic Regression may.
- Check for duplicates between the train and test sets to avoid artificially high scores.
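For scale-sensitive models, a scikit-learn pipeline keeps the scaler fit on training data only, which avoids another subtle form of leakage. The random feature matrix below is a stand-in for the averaged message embeddings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))          # stand-in for averaged embeddings
y_train = (X_train[:, 0] > 0).astype(int)   # toy labels

# The scaler is fit only on the data passed to .fit(), never on test data.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
```

Calling `clf.predict(X_test)` then applies the training-set scaling parameters to the test features automatically.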

## 📈 Next Steps / Improvements

- Use TF-IDF-weighted Word2Vec averaging for a better embedding representation.
- Experiment with other classifiers: Logistic Regression, SVM, or Gradient Boosting.
- Implement k-fold cross-validation (refitting Word2Vec inside each fold) for more robust evaluation.
- Explore deep learning models such as LSTMs or BERT for even better performance.
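The TF-IDF-weighted averaging idea can be sketched as below. The function `tfidf_weighted_vec` and the toy vector/IDF tables are hypothetical; real IDF scores could come from, e.g., scikit-learn's `TfidfVectorizer`:

```python
import numpy as np

def tfidf_weighted_vec(tokens, word_vectors, idf, dim):
    """Average word vectors weighted by each token's IDF score,
    so rarer (more informative) words contribute more."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for t in tokens:
        if t in word_vectors and t in idf:
            total += idf[t] * word_vectors[t]
            weight_sum += idf[t]
    return total / weight_sum if weight_sum > 0 else total

# Toy embeddings and IDF scores ("free" is rarer, so it weighs more).
wv = {"free": np.array([1.0, 0.0]), "hello": np.array([0.0, 1.0])}
idf = {"free": 2.0, "hello": 1.0}
v = tfidf_weighted_vec(["free", "hello"], wv, idf, dim=2)  # -> [2/3, 1/3]
```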
