# Spam or Ham Classification using Word2Vec & Random Forest

## Project Overview
This project implements a spam detection system for SMS messages using Word2Vec embeddings and a Random Forest classifier. It is designed to be data leakage-safe, ensuring the model only learns from the training data and is properly evaluated on unseen test data.
The model converts each SMS into a fixed-size numerical vector by averaging Word2Vec embeddings of its words and then classifies it as either spam or ham.
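As a minimal sketch of that averaging step (the helper name `message_vector` and the 100-dimensional embedding size are illustrative assumptions, not fixed by this README):

```python
import numpy as np

def message_vector(tokens, w2v_model, dim=100):
    """Average the Word2Vec vectors of a tokenized message.

    Words missing from the embedding vocabulary are skipped; a message
    with no known words falls back to a zero vector of the embedding size.
    """
    vectors = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    if not vectors:
        return np.zeros(dim)  # fallback for empty or all-unknown messages
    return np.mean(vectors, axis=0)
```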
## Features
- Converts raw text messages to Word2Vec embeddings.
- Handles messages of varying lengths by averaging word embeddings.
- Uses a Random Forest classifier for robust classification.
- Prevents data leakage by splitting the data before training the embeddings (see the sketch after this list).
- Evaluates model performance using accuracy, precision, recall, and F1-score.
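A minimal sketch of the leakage-safe pipeline, assuming `tokenized_messages` (a list of token lists), `labels`, and the `message_vector` helper from above; the hyperparameters shown are illustrative, not the project's exact settings:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split FIRST, so the embedding model never sees the test messages.
X_train, X_test, y_train, y_test = train_test_split(
    tokenized_messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Train Word2Vec on the training split only (this is the leakage-safe step).
w2v = Word2Vec(sentences=X_train, vector_size=100, window=5, min_count=2, workers=4)

# Vectorize both splits with the train-only embeddings, then fit the classifier.
X_train_vec = np.array([message_vector(m, w2v) for m in X_train])
X_test_vec = np.array([message_vector(m, w2v) for m in X_test])

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train_vec, y_train)
```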
## Dataset
Uses the SMSSpamCollection dataset.

Dataset format: tab-separated file with two columns:

- `label`: "spam" or "ham"
- `message`: the SMS text message

Example:

```
ham	Go until jurong point, crazy.. Available only ...
spam	Free entry in 2 a wkly comp to win FA Cup ...
```
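Loading the file with pandas could look like this (the file path and the 0/1 label encoding are assumptions for illustration):

```python
import pandas as pd

# SMSSpamCollection is tab-separated with no header row.
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                 names=["label", "message"])
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # assumed 0/1 encoding
```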
## Dependencies
- Python 3.8+
- pandas
- numpy
- scikit-learn
- gensim

Install dependencies:

```bash
pip install pandas numpy scikit-learn gensim
```
## Classification Report (Leakage-Safe)

| Class    | Precision | Recall | F1-score | Support |
|----------|-----------|--------|----------|---------|
| 0 (ham)  | 0.95      | 0.99   | 0.97     | 966     |
| 1 (spam) | 0.94      | 0.66   | 0.77     | 149     |

- Accuracy: 0.95
- Macro avg: precision = 0.95, recall = 0.83, F1-score = 0.87
- Weighted avg: precision = 0.95, recall = 0.95, F1-score = 0.94
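A report like the one above can be produced with scikit-learn; this sketch assumes the `clf`, `X_test_vec`, and `y_test` names from the pipeline sketch earlier:

```python
from sklearn.metrics import accuracy_score, classification_report

# Predict on the held-out test set and print per-class metrics.
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
print("Accuracy:", accuracy_score(y_test, y_pred))
```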
## Observations
- Overall accuracy is realistic: 95% on a genuinely held-out test set.
- The ham class (0) is classified very well: recall of 0.99.
- The spam class (1) is harder to detect: recall of 0.66 means the model misses about 34% of spam messages.
- The F1-score for spam is lower (0.77), so the model handles the minority class less well.
- Data leakage prevention is crucial: Word2Vec is trained only on the training set.
- Random Forest does not require feature scaling, but other models such as SVM or Logistic Regression may need it.
- Check for duplicate messages shared between the train and test sets to avoid artificially high scores (see the snippet below).
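One way to run that duplicate check, assuming the tokenized `X_train`/`X_test` splits from the earlier sketch:

```python
# Messages that appear verbatim in both splits inflate test scores,
# since the model has effectively already seen them during training.
train_texts = {" ".join(tokens) for tokens in X_train}
overlap = [tokens for tokens in X_test if " ".join(tokens) in train_texts]
print(f"{len(overlap)} test messages also appear in the training set")
```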
## Next Steps / Improvements
- Use TF-IDF-weighted Word2Vec averaging for a better embedding representation (see the sketch after this list).
- Experiment with other classifiers: Logistic Regression, SVM, or Gradient Boosting.
- Implement k-fold cross-validation, retraining the Word2Vec embeddings inside each fold, for a more robust evaluation.
- Explore deep learning models such as LSTM or BERT for potentially better performance.
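As a sketch of the TF-IDF weighting idea: fit a TF-IDF model on the training text only (to stay leakage-safe), then weight each word vector by its IDF when averaging. The helper `weighted_message_vector` is hypothetical, and the sketch assumes lowercase tokens compatible with `TfidfVectorizer`'s default tokenization:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the training split only, then look up per-word IDF weights.
tfidf = TfidfVectorizer()
tfidf.fit(" ".join(tokens) for tokens in X_train)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_message_vector(tokens, w2v_model, idf, dim=100):
    """Average word vectors weighted by each word's IDF (hypothetical helper)."""
    pairs = [(w2v_model.wv[t], idf.get(t, 1.0))
             for t in tokens if t in w2v_model.wv]
    if not pairs:
        return np.zeros(dim)
    vectors, weights = zip(*pairs)
    return np.average(np.array(vectors), axis=0, weights=np.array(weights))
```

Weighting by IDF down-weights very common words ("the", "to") that carry little spam signal, so rarer, more discriminative words dominate the message vector.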