This repository contains an implementation of a spam detection model using Word2Vec for text representation and Logistic Regression for classification. The model is trained to classify messages as spam or not spam based on their content.
- Implements spam detection using a machine learning approach.
- Uses Word2Vec embeddings to convert text into numerical vectors.
- Applies Logistic Regression for binary classification.
- Utilizes `gensim` and `scikit-learn` for text processing and model training.
- Evaluates model performance using accuracy, precision, and recall.
The project works with a dataset containing labeled messages:
- Spam messages: Unwanted or fraudulent messages.
- Non-spam messages: Normal, legitimate messages.
- Preprocessing steps include tokenization, stopword removal, and embedding conversion.
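
The tokenization and stopword-removal steps can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `preprocess` function, the regular expression, and the tiny `STOPWORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real project would more likely use a fuller list (for example from NLTK).
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "and", "of"}


def preprocess(message):
    """Lowercase a message, split it into word tokens, and drop stopwords."""
    # Keep runs of letters, digits, and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("Claim your FREE prize now!"))
# ['claim', 'free', 'prize', 'now']
```

The resulting token lists are the input to the embedding step.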
- Tokenization & Preprocessing: Messages are cleaned and tokenized.
- Word2Vec Embeddings: Words are mapped to dense vectors learned from their surrounding context, so words used in similar contexts receive similar vectors.
- Feature Extraction: Sentence embeddings are generated by averaging word vectors.
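
Averaging word vectors into a sentence embedding might look like the sketch below. It uses a plain dictionary of toy 3-dimensional vectors in place of a trained model, and the `sentence_embedding` helper is a hypothetical name. With `gensim`, the lookup would typically go through a trained model's `wv` attribute instead, which supports the same `in` test and indexing.

```python
def sentence_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No token has a vector (e.g. empty or fully out-of-vocabulary message).
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy vectors standing in for trained Word2Vec output (assumed values).
toy_vectors = {
    "free": [1.0, 0.0, 2.0],
    "prize": [3.0, 2.0, 0.0],
}

print(sentence_embedding(["free", "prize", "unknownword"], toy_vectors, dim=3))
# [2.0, 1.0, 1.0]
```

Out-of-vocabulary tokens are skipped here rather than treated as zeros, so they do not pull the average toward the origin. That is one reasonable choice among several.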
- Input Features: Sentence-level Word2Vec embeddings.
- Logistic Regression Classifier: A simple yet effective model for binary classification.
- Probability Estimation: Applies the sigmoid function to a weighted sum of the input features to estimate the probability that a message is spam.
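
As a rough illustration of the probability estimate, the sketch below applies the sigmoid to a weighted sum of a sentence embedding. The weights, bias, and 0.5 decision threshold are made-up values for the example, not parameters from a trained model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to (0, 1), avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(embedding, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(z)


# Hypothetical weights and bias for a 3-dimensional embedding.
weights, bias = [0.8, -0.5, 0.3], -0.2
p = spam_probability([2.0, 1.0, 1.0], weights, bias)
label = "spam" if p >= 0.5 else "not spam"
print(round(p, 3), label)
# 0.769 spam
```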
- Data Preprocessing:
- Tokenization and text cleaning.
- Training or loading pre-trained Word2Vec embeddings.
- Averaging word vectors to represent sentences.
- Model Training:
- Logistic Regression trained with cross-entropy loss.
- Optimization using Stochastic Gradient Descent (SGD) or other solvers (scikit-learn's `LogisticRegression`, for example, defaults to L-BFGS).
- Validation & Evaluation:
- Accuracy, precision, recall, and F1-score for classification tasks.
- Confusion matrix for performance analysis.
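
To make the training step concrete, here is a minimal pure-Python sketch of logistic regression fitted with per-sample SGD on the cross-entropy loss. It is an illustration under stated assumptions, not the repository's training code: the function names, learning rate, epoch count, and the toy 2-dimensional "embeddings" are all invented for the example. A `scikit-learn` estimator would normally replace this loop.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    """Estimated probability that feature vector x belongs to class 1 (spam)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def cross_entropy(X, y, weights, bias):
    """Mean binary cross-entropy loss over the dataset."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, target in zip(X, y):
        p = min(max(predict_proba(x, weights, bias), eps), 1.0 - eps)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(X)


def train_logistic_sgd(X, y, lr=0.5, epochs=200, seed=0):
    """Fit weights and bias with one gradient step per training sample."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new random order each epoch
        for i in indices:
            # For cross-entropy loss, d(loss)/d(score) is simply (p - y).
            error = predict_proba(X[i], weights, bias) - y[i]
            weights = [w - lr * error * xi for w, xi in zip(weights, X[i])]
            bias -= lr * error
    return weights, bias


# Toy 2-D "sentence embeddings" (assumed data): 1 = spam, 0 = not spam.
X = [[2.0, 1.0], [1.5, 0.5], [-1.0, -0.5], [-2.0, -1.5]]
y = [1, 1, 0, 0]

loss_before = cross_entropy(X, y, [0.0, 0.0], 0.0)
weights, bias = train_logistic_sgd(X, y)
loss_after = cross_entropy(X, y, weights, bias)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

With all-zero weights every prediction is 0.5, so the starting loss is ln 2 (about 0.693). Training should drive it well below that on this linearly separable toy set.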
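- Accuracy: The fraction of all messages, spam and non-spam, that the classifier labels correctly.
- Precision & Recall: Precision is the share of messages flagged as spam that really are spam; recall is the share of actual spam messages that get flagged.
- Confusion Matrix: Tabulates true positives, false positives, false negatives, and true negatives on the test data.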
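
The metrics above can all be derived from the four confusion-matrix counts. The sketch below shows one way to compute them by hand. The helper names and the example labels are assumptions for illustration; in practice `scikit-learn`'s metrics module provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (spam) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Fall back to 0.0 when a denominator is zero (no predicted or actual spam).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 spam and 5 non-spam messages, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
# (2, 1, 1, 4)
for name, value in metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

On spam datasets, which are often imbalanced, precision and recall tend to be more informative than accuracy alone.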
- The model takes an input message and classifies it as spam or not spam.
- Can be integrated into email filtering, SMS classification, or chatbot moderation systems.