# Spam or Ham Classification using Word2Vec & Random Forest

## Project Overview
This project implements a spam detection system for SMS messages using Word2Vec embeddings and a Random Forest classifier. It is designed to be data leakage-safe, ensuring the model only learns from the training data and is properly evaluated on unseen test data.
The model converts each SMS into a fixed-size numerical vector by averaging Word2Vec embeddings of its words and then classifies it as either spam or ham.
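As a minimal sketch of that averaging step (the helper name `message_vector` and the 100-dimensional embedding size are illustrative assumptions, not fixed by this README):

```python
import numpy as np

def message_vector(tokens, w2v_model, dim=100):
    """Average the Word2Vec vectors of a tokenized message.

    Words missing from the embedding vocabulary are skipped; a message
    with no known words falls back to a zero vector of the embedding size.
    """
    vectors = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    if not vectors:
        return np.zeros(dim)  # fallback for empty or all-unknown messages
    return np.mean(vectors, axis=0)
```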
## Features
- Converts raw text messages to Word2Vec embeddings.
- Handles messages of varying lengths by averaging word embeddings.
- Uses a Random Forest classifier for robust classification.
- Prevents data leakage by splitting the data before training the embeddings (see the sketch after this list).
- Evaluates model performance using accuracy, precision, recall, and F1-score.
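A minimal sketch of the leakage-safe pipeline, assuming `tokenized_messages` (a list of token lists), `labels`, and the `message_vector` helper from above; the hyperparameters shown are illustrative, not the project's exact settings:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split FIRST, so the embedding model never sees the test messages.
X_train, X_test, y_train, y_test = train_test_split(
    tokenized_messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Train Word2Vec on the training split only (this is the leakage-safe step).
w2v = Word2Vec(sentences=X_train, vector_size=100, window=5, min_count=2, workers=4)

# Vectorize both splits with the train-only embeddings, then fit the classifier.
X_train_vec = np.array([message_vector(m, w2v) for m in X_train])
X_test_vec = np.array([message_vector(m, w2v) for m in X_test])

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train_vec, y_train)
```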
## Dataset
Uses the SMSSpamCollection dataset.

Dataset format: tab-separated file with two columns:

- `label`: "spam" or "ham"
- `message`: the SMS text message

Example:

```
ham	Go until jurong point, crazy.. Available only ...
spam	Free entry in 2 a wkly comp to win FA Cup ...
```
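Loading the file with pandas could look like this (the file path and the 0/1 label encoding are assumptions for illustration):

```python
import pandas as pd

# SMSSpamCollection is tab-separated with no header row.
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                 names=["label", "message"])
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # assumed 0/1 encoding
```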
## Dependencies
- Python 3.8+
- pandas
- numpy
- scikit-learn
- gensim

Install dependencies:

```bash
pip install pandas numpy scikit-learn gensim
```
## Classification Report (Leakage-Safe)

| Class    | Precision | Recall | F1-score | Support |
|----------|-----------|--------|----------|---------|
| 0 (ham)  | 0.95      | 0.99   | 0.97     | 966     |
| 1 (spam) | 0.94      | 0.66   | 0.77     | 149     |

- Accuracy: 0.95
- Macro avg: precision = 0.95, recall = 0.83, F1-score = 0.87
- Weighted avg: precision = 0.95, recall = 0.95, F1-score = 0.94
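A report like the one above can be produced with scikit-learn; this sketch assumes the `clf`, `X_test_vec`, and `y_test` names from the pipeline sketch earlier:

```python
from sklearn.metrics import accuracy_score, classification_report

# Predict on the held-out test set and print per-class metrics.
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
print("Accuracy:", accuracy_score(y_test, y_pred))
```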
## Observations
- Overall accuracy is realistic: 95% on a genuinely held-out test set.
- The ham class (0) is classified very well: recall of 0.99.
- The spam class (1) is harder to detect: recall of 0.66 means the model misses about 34% of spam messages.
- The F1-score for spam is lower (0.77), so the model handles the minority class less well.
- Data leakage prevention is crucial: Word2Vec is trained only on the training set.
- Random Forest does not require feature scaling, but other models such as SVM or Logistic Regression may need it.
- Check for duplicate messages shared between the train and test sets to avoid artificially high scores (see the snippet below).
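One way to run that duplicate check, assuming the tokenized `X_train`/`X_test` splits from the earlier sketch:

```python
# Messages that appear verbatim in both splits inflate test scores,
# since the model has effectively already seen them during training.
train_texts = {" ".join(tokens) for tokens in X_train}
overlap = [tokens for tokens in X_test if " ".join(tokens) in train_texts]
print(f"{len(overlap)} test messages also appear in the training set")
```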
## Next Steps / Improvements
- Use TF-IDF-weighted Word2Vec averaging for a better embedding representation (see the sketch after this list).
- Experiment with other classifiers: Logistic Regression, SVM, or Gradient Boosting.
- Implement k-fold cross-validation, retraining the Word2Vec embeddings inside each fold, for a more robust evaluation.
- Explore deep learning models such as LSTM or BERT for potentially better performance.
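As a sketch of the TF-IDF weighting idea: fit a TF-IDF model on the training text only (to stay leakage-safe), then weight each word vector by its IDF when averaging. The helper `weighted_message_vector` is hypothetical, and the sketch assumes lowercase tokens compatible with `TfidfVectorizer`'s default tokenization:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the training split only, then look up per-word IDF weights.
tfidf = TfidfVectorizer()
tfidf.fit(" ".join(tokens) for tokens in X_train)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_message_vector(tokens, w2v_model, idf, dim=100):
    """Average word vectors weighted by each word's IDF (hypothetical helper)."""
    pairs = [(w2v_model.wv[t], idf.get(t, 1.0))
             for t in tokens if t in w2v_model.wv]
    if not pairs:
        return np.zeros(dim)
    vectors, weights = zip(*pairs)
    return np.average(np.array(vectors), axis=0, weights=np.array(weights))
```

Weighting by IDF down-weights very common words ("the", "to") that carry little spam signal, so rarer, more discriminative words dominate the message vector.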