A comprehensive sentiment analysis system for YouTube comments on Indian employment topics. Processes 131,608 comments using classical machine learning and transformer models, achieving 96.47% accuracy with DistilBERT.
Analyze public sentiment on Indian employment through YouTube comments, handling Hinglish (English-Hindi code-mixed) text with production-ready deployment.
- Source: YouTube comments on employment-related videos
- Size: 131,608 comments
- Language: Hinglish (English-Hindi code-mixed)
- Classes: Negative, Neutral, Positive
- Split: 80% train, 10% validation, 10% test
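The 80/10/10 split can be reproduced with a stratified scikit-learn split along the lines below; the file paths and column names (`data/comments_labeled.csv`, `text`, `label`) are assumptions for illustration, and the repository's actual logic lives in `phase1_data_engineering/05_train_test_split.py`.

```python
# Hypothetical sketch of the 80/10/10 stratified split; file and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/comments_labeled.csv")  # assumed: columns "text" and "label"

# Hold out 20%, then cut that holdout in half for validation and test (10% each).
# Stratifying on the label keeps the Negative/Neutral/Positive proportions in every split.
train_df, holdout_df = train_test_split(
    df, test_size=0.20, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.50, stratify=holdout_df["label"], random_state=42
)

for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    part.to_csv(f"data/{name}.csv", index=False)

print(len(train_df), len(val_df), len(test_df))  # ~105,286 / 13,161 / 13,161 for 131,608 rows
```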
Model Performance Comparison
| Model | Accuracy | F1-Macro | F1-Weighted | Configuration |
|---|---|---|---|---|
| Naive Bayes | 71.39% | 67.23% | 70.92% | Multinomial |
| Random Forest | 73.16% | 68.45% | 72.54% | 100 trees |
| Gradient Boosting | 75.43% | 71.25% | 74.89% | 100 estimators |
| Logistic Regression | 87.25% | 85.12% | 86.98% | 10K features |
| Linear SVM | 88.57% | 86.63% | 88.34% | 10K features |
| DistilBERT | 96.47% | 95.63% | 96.47% | 66.9M params |
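The strongest classical baseline pairs a 10K-feature TF-IDF representation with a linear SVM. A minimal sketch of that configuration follows; it reuses the assumed `data/*.csv` splits from the sketch above, and everything beyond `max_features=10000` (n-gram range, regularization) is an assumption rather than the repository's exact setup in `phase2_baseline_models/`.

```python
# Sketch of the TF-IDF + Linear SVM baseline; split files and most hyperparameters are assumed.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train = pd.read_csv("data/train.csv")  # assumed columns: text, label
test = pd.read_csv("data/test.csv")

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))),  # 10K features as in the table
    ("svm", LinearSVC(C=1.0)),
])
baseline.fit(train["text"], train["label"])

pred = baseline.predict(test["text"])
print("accuracy:", accuracy_score(test["label"], pred))
print("macro F1:", f1_score(test["label"], pred, average="macro"))
```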
Research Contributions
- Performance Improvement: 7.90 percentage-point accuracy gain over the best classical baseline (Linear SVM)
- Multilingual Handling: Effective processing of code-mixed Hinglish text
- Error Analysis: Identified confusion patterns between Neutral/Positive classes
- Production System: Complete deployment pipeline with Docker and FastAPI
- Explainability: SHAP analysis revealing key sentiment indicators
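A minimal sketch of how the SHAP analysis can be wired up: wrap the fine-tuned checkpoint in a transformers text-classification pipeline and hand it to `shap.Explainer`. The checkpoint path and example sentence are illustrative; the repository's actual analysis is `phase4_explainability/02_shap_analysis.py`.

```python
# Illustrative SHAP sketch; the checkpoint path and example text are assumptions.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="phase3_transformer_models/checkpoints/distilbert",  # assumed local checkpoint path
    top_k=None,  # return scores for all three classes, which SHAP needs
)

explainer = shap.Explainer(clf)  # SHAP auto-selects a text masker for the pipeline
shap_values = explainer(["Job market me koi improvement nahi hai, bahut frustrating"])
shap.plots.text(shap_values[0])  # token-level contributions toward each sentiment class
```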
The system follows a five-phase pipeline:
- Data Engineering: Collection, cleaning, language detection, sentiment labeling
- Baseline Models: Classical ML with TF-IDF features
- Transformer Models: DistilBERT fine-tuning (see the sketch after this list)
- Explainability: SHAP analysis and error patterns
- Production Deployment: REST API with monitoring
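Phase 3's DistilBERT fine-tuning uses the HuggingFace Trainer. A condensed, hedged sketch of that setup follows; the base model, hyperparameters, and the assumed `data/*.csv` splits are illustrative, with the real script in `phase3_transformer_models/05_distilbert_hf_trainer.py`.

```python
# Condensed DistilBERT fine-tuning sketch; hyperparameters and data paths are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed CSV splits with a "text" column and an integer "label" column
# (0 = Negative, 1 = Neutral, 2 = Positive).
ds = load_dataset("csv", data_files={"train": "data/train.csv", "validation": "data/val.csv"})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="phase3_transformer_models/checkpoints/distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
).train()
```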
Repository Structure

```
Youtube-Sentiment-Analysis/
├── phase1_data_engineering/          # Data processing pipeline
│   ├── 01_export_data.py
│   ├── 02_clean_data.py
│   ├── 03_language_detection.py
│   ├── 04_sentiment_labeling.py
│   ├── 05_train_test_split.py
│   ├── 06_eda_analysis.py
│   └── figures/                      # 5 EDA visualizations
├── phase2_baseline_models/           # Classical ML models
│   ├── 01_tfidf_baseline.py
│   ├── 02_classical_models.py
│   ├── evaluation/
│   └── figures/                      # 11 performance plots
├── phase3_transformer_models/        # Deep learning models
│   ├── 01_data_preparation.py
│   ├── 05_distilbert_hf_trainer.py
│   ├── checkpoints/distilbert/       # Trained model (255MB)
│   └── figures/                      # 4 training visualizations
├── phase4_explainability/            # Model interpretation
│   ├── 00_generate_predictions.py
│   ├── 01_error_analysis.py
│   ├── 02_shap_analysis.py
│   ├── 03_model_comparison.py
│   ├── figures/                      # 7 SHAP plots
│   └── reports/                      # 4 analysis reports
└── phase5_production_deployment/     # Production API
    ├── 01_fastapi_server.py
    ├── 02_api_client.py
    ├── 03_load_testing.py
    ├── Dockerfile
    └── docker-compose.yml
```
DistilBERT Per-Class Performance (Test Set)

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Negative | 92.40% | 92.53% | 92.46% | 2,088 |
| Neutral | 98.56% | 97.05% | 97.80% | 5,721 |
| Positive | 95.86% | 97.38% | 96.62% | 5,352 |
- Total misclassifications: 465 (3.53%)
- Negative errors: 156 (7.47% of class)
- Neutral errors: 169 (2.95% of class)
- Positive errors: 140 (2.62% of class)
- Neutral class achieves highest precision (98.56%)
- Positive class has highest recall (97.38%)
- Negative class shows most confusion with Positive (5.70%)
- Model performs consistently across all sentiment categories
- Code-mixed text handled effectively without language-specific preprocessing
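The per-class table and the error counts above come from comparing saved test-set predictions against the gold labels; a hedged sketch of that computation follows (the predictions file and its columns are assumptions, with the repository's version in `phase4_explainability/01_error_analysis.py`).

```python
# Sketch of the per-class report and confusion matrix; the predictions file layout is assumed.
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

preds = pd.read_csv("predictions.csv")  # assumed columns: true_label, pred_label
classes = ["Negative", "Neutral", "Positive"]

print(classification_report(preds["true_label"], preds["pred_label"], digits=4))

# Rows are true classes, columns are predicted classes; the off-diagonal
# cells sum to the 465 misclassifications reported above.
print(confusion_matrix(preds["true_label"], preds["pred_label"], labels=classes))
```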
API Usage

Single prediction:

```python
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={
        "text": "The job market is improving significantly!",
        "return_confidence": True
    }
)
print(response.json())
# {
#   "sentiment": "Positive",
#   "confidence": {"Negative": 0.01, "Neutral": 0.04, "Positive": 0.95},
#   "inference_time_ms": 45.2
# }
```

Batch prediction:

```python
response = requests.post(
    "http://localhost:8000/predict/batch",
    json={
        "texts": [
            "Great employment opportunities!",
            "High unemployment is concerning",
            "The situation is stable"
        ]
    }
)
print(response.json())
```
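On the server side, `/predict` can be a thin FastAPI wrapper around the fine-tuned checkpoint. The sketch below is an assumption about the general shape of `phase5_production_deployment/01_fastapi_server.py`, not a copy of it; only the request and response fields mirror the client example above.

```python
# Hedged sketch of the /predict endpoint; the actual server is 01_fastapi_server.py.
import time

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "phase3_transformer_models/checkpoints/distilbert"  # assumed local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

app = FastAPI()

class PredictRequest(BaseModel):
    text: str
    return_confidence: bool = False

@app.post("/predict")
def predict(req: PredictRequest):
    start = time.perf_counter()
    inputs = tokenizer(req.text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]

    labels = [model.config.id2label[i] for i in range(probs.shape[0])]
    best = int(probs.argmax())
    result = {
        "sentiment": labels[best],
        "inference_time_ms": round((time.perf_counter() - start) * 1000, 1),
    }
    if req.return_confidence:
        result["confidence"] = {label: round(float(p), 2) for label, p in zip(labels, probs)}
    return result
```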
Accuracy Improvement Over Baseline
- vs Naive Bayes: +25.08%
- vs Random Forest: +23.31%
- vs Gradient Boosting: +21.04%
- vs Logistic Regression: +9.22%
- vs Linear SVM: +7.90%
F1-Score Improvement
- Macro F1: +9.00% over Linear SVM
- Weighted F1: +8.13% over Linear SVM
Inference Performance
- CPU: ~50ms per prediction
- GPU: ~10ms per prediction
- Batch (100 texts): ~2.5s on CPU
- Throughput: 50+ requests/second
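The latency and throughput figures can be sanity-checked with a small concurrent load test against a running instance of the API; a minimal sketch follows (request count and concurrency are arbitrary, and the repository's own harness is `phase5_production_deployment/03_load_testing.py`).

```python
# Minimal concurrent load-test sketch for the /predict endpoint; volumes are illustrative.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/predict"
PAYLOAD = {"text": "The job market is improving significantly!"}

def one_request(_: int) -> float:
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10).raise_for_status()
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(one_request, range(200)))
elapsed = time.perf_counter() - start

print(f"throughput  : {len(latencies) / elapsed:.1f} req/s")
print(f"mean latency: {1000 * sum(latencies) / len(latencies):.1f} ms")
```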
Technology Stack

Machine Learning
- PyTorch 2.1+
- HuggingFace Transformers
- Scikit-learn
- SHAP
Data Processing
- Pandas
- NumPy
- NLTK
- langdetect
API & Deployment
- FastAPI
- Uvicorn
- Docker
- Docker Compose
Visualization
- Matplotlib
- Seaborn
- Plotly
All experiments are reproducible with provided scripts:
```bash
# Phase 1: Data processing
python phase1_data_engineering/01_export_data.py
python phase1_data_engineering/02_clean_data.py
python phase1_data_engineering/03_language_detection.py
python phase1_data_engineering/04_sentiment_labeling.py
python phase1_data_engineering/05_train_test_split.py
python phase1_data_engineering/06_eda_analysis.py

# Phase 2: Baseline models
python phase2_baseline_models/01_tfidf_baseline.py
python phase2_baseline_models/02_classical_models.py

# Phase 3: Transformer training (requires GPU)
python phase3_transformer_models/01_data_preparation.py
python phase3_transformer_models/05_distilbert_hf_trainer.py

# Phase 4: Analysis
python phase4_explainability/00_generate_predictions.py
python phase4_explainability/01_error_analysis.py
python phase4_explainability/02_shap_analysis.py
python phase4_explainability/03_model_comparison.py
```

Future Work

- Multilingual Models: Test XLM-RoBERTa and mBERT for better Hinglish support
- Active Learning: Iterative labeling for hard examples
- Aspect-Based Analysis: Extract specific employment topics (wages, opportunities, policies)
- Temporal Analysis: Track sentiment trends over time
- Multi-modal: Incorporate video metadata and engagement metrics
- Model Compression: Quantization and distillation for faster inference (see the sketch after this list)
- Real-time Processing: Stream processing for live comments
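For the model-compression direction listed above, post-training dynamic quantization of the fine-tuned checkpoint is one low-effort starting point for faster CPU inference; a hedged sketch follows (this is not something the current pipeline does, and the checkpoint path is assumed).

```python
# Sketch of dynamic int8 quantization for CPU inference; the checkpoint path is assumed.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "phase3_transformer_models/checkpoints/distilbert"
)

# Replace nn.Linear weights with int8; activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "distilbert_dynamic_int8.pt")
```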
If you use this work in your research, please cite:
```bibtex
@software{youtube_sentiment_analysis_2025,
  author = {Tiwari, Rudra},
  title = {YouTube Sentiment Analysis: Indian Employment Discourse},
  year = {2025},
  url = {https://github.com/Rudra-Tiwari-codes/Youtube-Sentiment-Analysis}
}
```

MIT License - see LICENSE file for details.

Acknowledgments
- HuggingFace for transformer models and infrastructure
- YouTube Data API for data access
- Open-source ML community for tools and libraries
For questions or collaboration:
- GitHub: @Rudra-Tiwari-codes
- Repository: Youtube-Sentiment-Analysis
Last Updated: December 11, 2025
Status: Production-Ready
Model Version: 1.0.0