msaadx/Text-Classification-with-Word2Vec-and-Logistic-Regression-for-Spam-Detection

Spam Detection using Word2Vec and Logistic Regression

This repository contains an implementation of a spam detection model using Word2Vec for text representation and Logistic Regression for classification. The model is trained to classify messages as spam or not spam based on their content.

Features

  • Implements spam detection using a machine learning approach.
  • Uses Word2Vec embeddings to convert text into numerical vectors.
  • Applies Logistic Regression for binary classification.
  • Utilizes gensim and scikit-learn for text processing and model training.
  • Evaluates model performance using accuracy, precision, recall, F1-score, and a confusion matrix.

Dataset

The project works with a dataset of labeled messages in two classes:

  • Spam messages: Unwanted or fraudulent messages.
  • Non-spam messages: Normal, legitimate messages.

Preprocessing steps include tokenization, stopword removal, and conversion of the remaining tokens to embeddings.
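A minimal sketch of the tokenization and stopword-removal steps, using only the standard library. The `preprocess` helper and the short `STOPWORDS` set are illustrative assumptions, not code taken from this repository.

```python
import re
from typing import List

# Illustrative stopword list (an assumption); a real project would more
# likely use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "of", "and", "for", "in"}


def preprocess(message: str) -> List[str]:
    """Lowercase the message, keep alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("Congratulations! You have WON a free prize, call now to claim."))
# ['congratulations', 'have', 'won', 'free', 'prize', 'call', 'now', 'claim']
```

The resulting token lists are what a Word2Vec model would consume in the embedding step.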

Model Architecture

Text Embedding using Word2Vec

  • Tokenization & Preprocessing: Messages are cleaned and tokenized.
  • Word2Vec Embeddings: Each word is mapped to a dense vector learned from its surrounding context, so words used in similar contexts receive similar vectors.
  • Feature Extraction: Sentence embeddings are generated by averaging word vectors.
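The averaging step can be sketched as follows. The `average_vector` helper and the toy 3-dimensional vectors are hypothetical and only illustrate the arithmetic. The function needs nothing more than `in` and `[]` lookups, which a trained gensim model's `wv` attribute also supports, so real embeddings could be passed in place of the dictionary.

```python
from typing import List, Mapping, Sequence


def average_vector(tokens: Sequence[str],
                   vectors: Mapping[str, Sequence[float]],
                   dim: int) -> List[float]:
    """Mean of the vectors of all known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


# Toy embeddings, invented for illustration only.
toy_vectors = {
    "free":  [1.0, 0.0, 2.0],
    "prize": [3.0, 2.0, 0.0],
    "hello": [0.0, 4.0, 4.0],
}

print(average_vector(["free", "prize", "unknownword"], toy_vectors, dim=3))
# [2.0, 1.0, 1.0]
```

Out-of-vocabulary tokens are skipped here. Returning a zero vector for a message with no known words is one possible choice, not necessarily the one this project makes.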

Classification Model using Logistic Regression

  • Input Features: Sentence-level Word2Vec embeddings.
  • Logistic Regression Classifier: A simple yet effective model for binary classification.
  • Probability Estimation: Uses the sigmoid function to predict the likelihood of a message being spam.
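As a rough illustration of the probability estimate, the sketch below applies the sigmoid to a weighted sum of the sentence embedding. The weights, bias, and 0.5 decision threshold are made-up values chosen for the example, not parameters from a trained model.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(x: Sequence[float],
                     weights: Sequence[float],
                     bias: float) -> float:
    """P(spam | x) = sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical parameters; here w . x + b = 1.0 - 0.25 + 0.25 - 1.0 = 0.0.
p = spam_probability([2.0, 1.0, 1.0], weights=[0.5, -0.25, 0.25], bias=-1.0)
print(p)  # 0.5
print("spam" if p >= 0.5 else "not spam")
```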

Training Process

  1. Data Preprocessing:
    • Tokenization and text cleaning.
    • Training Word2Vec embeddings on the corpus, or loading pre-trained ones.
    • Averaging word vectors to represent sentences.
  2. Model Training:
    • Logistic Regression trained with cross-entropy loss.
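    • Optimization using Stochastic Gradient Descent (SGD) or another solver (scikit-learn's LogisticRegression defaults to L-BFGS; SGD-based training is available through SGDClassifier).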
  3. Validation & Evaluation:
    • Accuracy, precision, recall, and F1-score on held-out data.
    • Confusion matrix for performance analysis.
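To make the model-training step concrete, here is a from-scratch sketch of logistic regression trained with SGD on the cross-entropy loss. It is an illustration only: the repository itself relies on scikit-learn, and the toy 2-dimensional "embeddings", learning rate, and epoch count below are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x: Sequence[float], w: Sequence[float], b: float) -> float:
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def cross_entropy(X, y, w, b) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(xi, w, b), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def train_logreg_sgd(X, y, lr: float = 0.1, epochs: int = 200,
                     seed: int = 0) -> Tuple[List[float], float]:
    """One parameter update per example, visiting examples in shuffled order."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # For cross-entropy with a sigmoid output, dLoss/dz = p - y.
            err = predict_proba(X[i], w, b) - y[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


# Toy, linearly separable data standing in for averaged sentence embeddings.
X = [[2.0, 0.5], [1.5, 1.0], [2.5, 0.2], [0.2, 2.0], [0.5, 1.5], [0.1, 2.5]]
y = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

loss_before = cross_entropy(X, y, [0.0, 0.0], 0.0)  # ln(2) with zero weights
w, b = train_logreg_sgd(X, y)
loss_after = cross_entropy(X, y, w, b)
predictions = [1 if predict_proba(xi, w, b) >= 0.5 else 0 for xi in X]
print(loss_before, loss_after, predictions)
```

The loss starts at ln 2 (about 0.693) because zero weights give every message a probability of 0.5; training should drive it well below that on this separable toy set.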

Evaluation

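  • Accuracy: The fraction of all messages that are classified correctly. It can look deceptively high when spam is rare.
  • Precision & Recall: Precision is the share of messages flagged as spam that really are spam; recall is the share of actual spam that gets flagged.
  • Confusion Matrix: Tabulates true positives, false positives, true negatives, and false negatives on the test data.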
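The metrics above can be computed directly from the confusion-matrix counts, as in this small sketch. The helper names and the example labels are invented for illustration; scikit-learn's `sklearn.metrics` module offers equivalent functions such as `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix`.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating label 1 (spam) as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 spam and 5 non-spam messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))        # (2, 1, 1, 4)
print(classification_metrics(y_true, y_pred))  # accuracy 0.75; precision, recall, F1 all 2/3
```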

Usage

  • The model takes an input message and classifies it as spam or not spam.
  • It can be integrated into email filtering, SMS classification, or chatbot moderation systems.
