This project explores Natural Language Processing (NLP) techniques and Machine Learning models to classify sentiment in text data. It combines feature extraction methods, oversampling, and multiple classifiers to achieve robust and accurate sentiment classification. 🧠📊
- Type: Sentiment Classification
- Goal: Classify positive and negative sentiments with high accuracy.
- Techniques Used: NLP, Feature Engineering, Machine Learning, Oversampling.
The following methods were implemented to represent textual data effectively:
📚 Bag of Words (BoW):
- Focuses on word frequencies, representing text as vectors of word occurrences without considering grammar or structure.
🔍 TF-IDF (Term Frequency-Inverse Document Frequency):
- Combines Term Frequency (TF) and Inverse Document Frequency (IDF) using logarithmic scaling to emphasize the importance of words relative to both the document and the entire dataset.
🤖 Word2Vec:
- Creates dense word vectors by learning semantic relationships and associations from a large corpus of text.
🧠 GloVe (Global Vectors):
- Captures global word co-occurrences using pre-trained 100-dimensional word vectors to enhance the understanding of word meanings and contexts.
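A GloVe file stores one token per line followed by its vector components. The sketch below parses a tiny inline 3-dimensional example into a lookup table; the project uses the pre-trained 100-dimensional vectors in the same format:

```python
import io
import numpy as np

# Inline stand-in for a GloVe .txt file (values are made up for illustration)
glove_txt = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
)

embeddings = {}
for line in glove_txt:
    token, *values = line.split()
    embeddings[token] = np.asarray(values, dtype="float32")

vec = embeddings["good"]  # 3-dimensional vector in this toy example
```

With the real file, each text can be embedded by averaging the vectors of its in-vocabulary tokens.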
- To address class imbalance, SMOTE was used to generate synthetic samples for the underrepresented class.
- This ensures a more balanced dataset, improving the model's ability to classify minority sentiments accurately.
Three models were evaluated for sentiment classification:
- 📚 Naive Bayes
- 🌲 Random Forest
- ⚡ XGBoost (Extreme Gradient Boosting)
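As a minimal sketch of the training loop, here are two of the three classifiers wired into scikit-learn pipelines on a toy labelled corpus (the texts and labels are illustrative; XGBoost would slot in the same way via the separate `xgboost` package's `XGBClassifier`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled corpus purely for illustration (1 = positive, 0 = negative)
texts = ["great movie", "awful film", "loved it", "hated it"]
labels = [1, 0, 1, 0]

for clf in (MultinomialNB(), RandomForestClassifier(random_state=42)):
    pipe = make_pipeline(TfidfVectorizer(), clf)  # vectorise, then classify
    pipe.fit(texts, labels)
    print(type(clf).__name__, pipe.predict(["great film"]))
```

Keeping vectoriser and classifier in one pipeline ensures the same vocabulary is used at train and predict time.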
Baseline Performance:
- Accuracy: 91.16%
- Precision: 0.88
- Recall: 0.78
- F1-Score: 0.82
TF-IDF + SMOTE with XGBoost:
- Achieved the highest performance:
- Accuracy: 92.71%
- Precision, Recall, and F1-Score: 0.93
- ROC-AUC: 0.98 🎉
Other configurations (lower performance):
- Accuracy: 87.51%
- Highlighted limitations in capturing nuanced sentiment for the underrepresented class.
Standard classification metrics were used:
✅ Accuracy
✅ Precision
✅ Recall
✅ F1-Score
✅ ROC-AUC
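All of these metrics are available in `sklearn.metrics`; the labels and scores below are hypothetical, just to show the calls:

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

# Hypothetical ground truth and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

# Threshold the probabilities at 0.5 to get hard class predictions
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```

Note that ROC-AUC is computed from the continuous scores, while the other four metrics use the thresholded predictions.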
- Explore deep learning models like BERT for enhanced text representation and sentiment analysis.
- Investigate techniques such as SHAP and LIME for better model interpretability.
- Experiment with additional NLP techniques and embeddings to refine performance further.
- Clone the repository:
git clone https://github.com/RanaPrince/sentiment-classification.git
- Install the dependencies:
pip install -r requirements.txt
Run the model scripts and experiment with feature extraction techniques.
Feel free to reach out for questions, collaborations, or feedback!
- GitHub: RanaPrince
- LinkedIn: Prince Rana
- Email: [email protected]