Spam Detection using Word2Vec and Logistic Regression

This repository contains an implementation of a spam detection model using Word2Vec for text representation and Logistic Regression for classification. The model is trained to classify messages as spam or not spam based on their content.

Features

Implements spam detection using a machine learning approach.
Uses Word2Vec embeddings to convert text into numerical vectors.
Applies Logistic Regression for binary classification.
Utilizes gensim and scikit-learn for text processing and model training.
Evaluates model performance using accuracy, precision, recall, F1-score, and a confusion matrix.

Dataset

The project works with a dataset containing labeled messages:

Spam messages: Unwanted or fraudulent messages.
Non-spam messages: Normal, legitimate messages.

Preprocessing steps include tokenization, stopword removal, and conversion of the remaining tokens into Word2Vec embeddings.
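
As a rough illustration of the tokenization and stopword-removal steps, the sketch below lowercases a message, splits it into alphanumeric tokens with a regular expression, and drops stopwords. The `STOPWORDS` set and the `preprocess` function are hypothetical names chosen for this example; the notebook may use a different tokenizer or a fuller stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption; a real project would use a fuller one).
STOPWORDS = {"a", "an", "the", "to", "is", "you", "and", "of", "for"}


def preprocess(message: str) -> list[str]:
    """Lowercase the message, tokenize on alphanumeric runs, and remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("Congratulations! You have won a FREE prize, call now to claim."))
```

Punctuation is discarded by the regular expression, so "FREE" and "free" end up as the same token before embedding.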

Model Architecture

Text Embedding using Word2Vec

Tokenization & Preprocessing: Messages are cleaned and tokenized.
Word2Vec Embeddings: Maps each word to a dense vector learned from its surrounding context, so that words used in similar contexts receive similar vectors.
Feature Extraction: Sentence embeddings are generated by averaging word vectors.

Classification Model using Logistic Regression

Input Features: Sentence-level Word2Vec embeddings.
Logistic Regression Classifier: A linear model for binary classification that learns one weight per embedding dimension plus a bias term.
Probability Estimation: Applies the sigmoid function to the linear score to estimate the probability that a message is spam.
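
To make the probability estimate concrete, here is a minimal sketch of the computation P(spam | x) = sigmoid(w . x + b) for a single sentence embedding. The three-dimensional embedding, the weights, and the bias are made-up numbers, and `spam_probability` is a hypothetical helper name.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def spam_probability(features, weights, bias):
    """Return sigmoid(w . x + b) for one sentence embedding x."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Made-up embedding and parameters, for illustration only.
p = spam_probability([0.5, -1.0, 2.0], [1.0, 0.5, 0.25], bias=-0.5)
print(p)
```

With these numbers the linear score is exactly 0, so the estimated probability is 0.5, which sits on the usual decision boundary.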

Training Process

Data Preprocessing:
- Tokenization and text cleaning.
- Training or loading pre-trained Word2Vec embeddings.
- Averaging word vectors to represent sentences.
Model Training:
- Logistic Regression trained with cross-entropy loss.
- Optimization using Stochastic Gradient Descent (SGD) or another solver (scikit-learn's LogisticRegression defaults to L-BFGS).
Validation & Evaluation:
- Accuracy, precision, recall, and F1-score on held-out test data.
- Confusion matrix for performance analysis.
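
The training and evaluation steps above could be sketched as follows with scikit-learn. Synthetic Gaussian features stand in for the averaged Word2Vec sentence embeddings, so the data, the 75/25 split, and the `max_iter` value are all assumptions for illustration rather than the notebook's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for sentence embeddings (not real data):
# two Gaussian clusters in 20 dimensions, label 1 = spam, 0 = not spam.
X = np.vstack([rng.normal(1.0, 1.0, size=(200, 20)),
               rng.normal(-1.0, 1.0, size=(200, 20))])
y = np.array([1] * 200 + [0] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# LogisticRegression minimizes a regularized cross-entropy (log) loss.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print(cm)
```

Stratifying the split keeps the spam ratio the same in the training and test sets, which matters more on real spam data, where the classes are usually imbalanced.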

Evaluation

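Accuracy: The fraction of all messages that are classified correctly.
Precision & Recall: Precision is the fraction of messages flagged as spam that really are spam; recall is the fraction of actual spam messages that the classifier catches.
Confusion Matrix: Shows the counts of true positives, false positives, true negatives, and false negatives on test data.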
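
As a worked example of how these metrics relate to the confusion matrix, the sketch below derives them from the four cell counts, treating spam as the positive class. The counts are invented for illustration and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts: 40 spam caught, 10 non-spam wrongly flagged,
# 10 spam missed, 140 non-spam correctly passed.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=140)
print(metrics)
```

In this example accuracy is 0.9 while precision and recall are both 0.8, which shows how accuracy alone can look better than spam detection actually is when most messages are non-spam.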

Usage

The model takes an input message and classifies it as spam or not spam.
It can be integrated into email filtering, SMS classification, or chatbot moderation systems.
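
One way to wire the pieces together for a single message is sketched below. The `classify_message` function, its parameters, and the 0.5 threshold are hypothetical; the toy tokenizer, embedder, and scorer exist only so the example runs on its own, and a real integration would pass the trained Word2Vec averaging step and the fitted classifier instead.

```python
from typing import Callable, Sequence


def classify_message(
    message: str,
    tokenize: Callable[[str], list[str]],
    embed: Callable[[list[str]], Sequence[float]],
    spam_probability: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> str:
    """Run tokenize -> embed -> score, then apply a decision threshold."""
    tokens = tokenize(message)
    features = embed(tokens)
    return "spam" if spam_probability(features) >= threshold else "not spam"


# Toy stand-ins (assumptions for illustration, not the trained model).
SPAMMY = {"free", "prize", "winner"}


def toy_tokenize(text: str) -> list[str]:
    return text.lower().split()


def toy_embed(tokens: list[str]) -> list[float]:
    # One feature: the share of tokens that look spammy.
    return [sum(tok in SPAMMY for tok in tokens) / max(len(tokens), 1)]


def toy_probability(features: Sequence[float]) -> float:
    return min(1.0, 2.0 * features[0])


print(classify_message("FREE prize inside", toy_tokenize, toy_embed, toy_probability))
print(classify_message("see you at lunch", toy_tokenize, toy_embed, toy_probability))
```

Raising `threshold` trades recall for precision, which may be preferable in settings such as email filtering where flagging a legitimate message is costly.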

Repository files: LICENSE, README.md, spam-detection-w2v-logisticRegression.ipynb

Repository: msaadx/Text-Classification-with-Word2Vec-and-Logistic-Regression-for-Spam-Detection
