bhumikakadbe/sentiment-analysis-on-IMDB-Dataset

🎬 IMDB Sentiment Analysis System

A complete sentiment analysis system for IMDB reviews with NLP preprocessing, machine learning models, API, and interactive dashboard.

📋 Project Overview

This system provides:

  • ✅ Text Preprocessing: Lowercase, remove stopwords, tokenize, lemmatize, clean symbols
  • ✅ Feature Extraction: TF-IDF, Word2Vec, BERT embeddings
  • ✅ ML Models: Logistic Regression, SVM, Naive Bayes (LSTM and BERT are planned; see Future Enhancements)
  • ✅ Model Evaluation: Accuracy, Precision, Recall, F1-score, Confusion Matrix
  • ✅ Topic Modeling: LDA and BERTopic
  • ✅ REST API: FastAPI with endpoints for predictions and analytics
  • ✅ Database: SQLite for storing predictions and reviews
  • ✅ Dashboard: Streamlit app for visualization and insights

πŸ“ Project Structure

task1/
├── dataset/
│   ├── Train.csv           # Training data (5000 samples)
│   ├── Valid.csv           # Validation data (5000 samples)
│   └── Test.csv            # Test data (5000 samples)
├── models/                 # Saved trained models (auto-generated)
│   ├── logistic_regression_model.pkl
│   ├── svm_model.pkl
│   ├── naive_bayes_model.pkl
│   ├── tfidf_vectorizer.pkl
│   ├── count_vectorizer.pkl
│   ├── lda_model.pkl
│   └── preprocess_function.pkl
├── logs/                   # Application logs (auto-generated)
├── sentiment_analysis_detailed.ipynb  # Jupyter notebook with comprehensive analysis
├── sentiment_analysis_utils.py        # Core utilities, analyzer, and database management
├── api.py                  # FastAPI REST API server
├── streamlit_app.py        # Streamlit interactive dashboard
├── config.py               # Configuration settings and model paths
├── run_training.py         # Script to train and save models
├── requirements.txt        # Python dependencies
└── README.md               # This file

🚀 Quick Start

1. Install Dependencies

# Create virtual environment (recommended)
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install requirements
pip install -r requirements.txt

2. Train Models

Option A: Using run_training.py (Recommended)

# Run the training script
python run_training.py

# This will:
# - Load and preprocess data
# - Train all models (Logistic Regression, SVM, Naive Bayes)
# - Extract TF-IDF features
# - Train LDA topic model
# - Evaluate and save models to /models folder

Option B: Using Jupyter Notebook

# Start Jupyter
jupyter notebook

# Open sentiment_analysis_detailed.ipynb and run all cells
# Provides detailed analysis and visualizations

3. Run FastAPI Server

# Start the API server
python api.py

# Or using uvicorn directly
uvicorn api:app --reload --host 0.0.0.0 --port 8000

# API will be available at:
# http://localhost:8000
# Interactive docs: http://localhost:8000/docs

4. Launch Streamlit Dashboard

# In a new terminal
streamlit run streamlit_app.py

# Dashboard will open at:
# http://localhost:8501

📊 Dataset

IMDB Reviews Dataset

  • Format: CSV with columns text and label
  • Labels: 0 (Negative), 1 (Positive)
  • Train set: 5,000 reviews
  • Validation set: 5,000 reviews
  • Test set: 5,000 reviews

🔧 Key Features

Text Preprocessing Pipeline

from sentiment_analysis_utils import preprocess_text

text = "This movie is AMAZING!!! <br/> I loved it..."
processed = preprocess_text(text)
# Output: "movie amazing loved"
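
The pipeline steps listed in the overview (lowercasing, HTML stripping, symbol cleaning, stopword removal) can be approximated without the project's utilities. This standalone sketch reproduces the example output; the real preprocess_text in sentiment_analysis_utils.py also lemmatizes with NLTK, and its stopword list is larger than the illustrative subset used here:

```python
import re

# Small illustrative stopword subset; the project presumably uses NLTK's full list.
STOPWORDS = {"this", "is", "i", "it", "the", "a", "an", "and"}

def simple_preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags like <br/>
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # lowercase, drop symbols/digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(simple_preprocess("This movie is AMAZING!!! <br/> I loved it..."))
# → "movie amazing loved"
```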

Making Predictions

from sentiment_analysis_utils import SentimentAnalyzer

analyzer = SentimentAnalyzer()
result = analyzer.predict("Great movie!")

# Output:
# {
#     'sentiment': 'Positive',
#     'confidence': 0.95,
#     'probabilities': {'negative': 0.05, 'positive': 0.95},
#     'timestamp': '2024-01-25T...'
# }

Database Operations

from sentiment_analysis_utils import DatabaseManager

db = DatabaseManager()

# Store prediction
db.insert_prediction(
    review_text="Amazing film!",
    sentiment="Positive",
    confidence=0.95,
    prob_negative=0.05,
    prob_positive=0.95
)

# Get statistics
stats = db.get_sentiment_stats()

# Get trending keywords
keywords = db.get_trending_keywords(limit=20)

🌐 API Endpoints

Health Check

GET /health

Returns API status and loaded models.

Predict Single Review

POST /predict_review
Content-Type: application/json

{
    "text": "This movie was excellent!",
    "save_to_db": true
}

Response:
{
    "text": "This movie was excellent!",
    "sentiment": "Positive",
    "confidence": 0.92,
    "probabilities": {
        "negative": 0.08,
        "positive": 0.92
    },
    "timestamp": "2024-01-25T10:30:00"
}

Batch Predictions

POST /batch_predict
Content-Type: application/json

{
    "requests": [
        {"text": "Great movie!", "save_to_db": true},
        {"text": "Terrible film", "save_to_db": true}
    ]
}

Get Statistics

GET /stats

Response:
{
    "total_predictions": 150,
    "sentiment_distribution": {
        "Positive": 95,
        "Negative": 55
    },
    "avg_confidence": {
        "Positive": 0.92,
        "Negative": 0.88
    }
}

Get Trending Keywords

GET /trending_keywords?limit=20

Response:
{
    "keywords": {
        "good": 145,
        "movie": 132,
        "great": 118,
        ...
    },
    "count": 20
}

Get Analysis Report

GET /report

Response:
{
    "timestamp": "2024-01-25T...",
    "stats": {...},
    "keywords": {...}
}

📊 Dashboard Features

🏠 Home Page

  • Quick statistics overview
  • How the system works
  • Feature explanations

📈 Analytics

  • Sentiment distribution charts
  • Confidence score analysis
  • Trend visualization

💬 Predictions

  • Single review prediction
  • Real-time confidence display
  • Prediction history

🔍 Trending Topics

  • Top keywords visualization
  • Keyword frequency analysis
  • Word cloud (optional)

📋 Batch Analysis

  • CSV file upload
  • Bulk processing
  • Results download

📊 Reports

  • Comprehensive analysis report
  • Statistics summary
  • Export functionality

🤖 Machine Learning Models

Trained Models

1. Logistic Regression

  • Best for: Balanced performance, interpretability
  • Training time: ~30 seconds
  • Typical accuracy: 88-91%

2. Support Vector Machine (SVM)

  • Best for: High accuracy
  • Training time: ~2 minutes
  • Typical accuracy: 89-92%

3. Naive Bayes

  • Best for: Fast inference
  • Training time: ~5 seconds
  • Typical accuracy: 85-88%
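
The training flow that run_training.py implements can be sketched with scikit-learn directly. A tiny inline corpus stands in for dataset/Train.csv so the snippet runs standalone; the model names mirror the pickle files in the project structure, but the hyperparameters are assumptions, not the project's actual settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Tiny inline corpus standing in for dataset/Train.csv (columns: text, label).
texts = ["great movie loved it", "terrible boring film",
         "amazing cast wonderful story", "awful plot worst acting"]
labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative

# TF-IDF settings taken from the Feature Extraction section below.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(texts)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
}
for name, model in models.items():
    model.fit(X, labels)  # the real script would evaluate, then pickle to models/
    print(name, "trained")
```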

Feature Extraction Methods

1. TF-IDF (Primary)

  • Max features: 5,000
  • N-gram range: (1, 2)
  • Sublinear TF scaling

2. Word2Vec (Optional)

  • Vector size: 100
  • Window: 5
  • Min count: 2

3. BERT Embeddings (Advanced)

  • Using sentence-transformers
  • Pre-trained models available

📊 Evaluation Metrics

Metrics Used

  • Accuracy: Overall correctness
  • Precision: True positives / All predicted positives
  • Recall: True positives / All actual positives
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: TP, TN, FP, FN visualization
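
All five metrics are available in scikit-learn; a small worked example (hypothetical labels, not project results) makes the definitions concrete:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1]  # actual labels (1 = Positive)
y_pred = [1, 0, 0, 1, 0, 1]  # one positive review missed (a false negative)

print("accuracy: ", accuracy_score(y_true, y_pred))   # 5/6 correct
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/3 = 1.0
print("recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean ≈ 0.857
print(confusion_matrix(y_true, y_pred))               # [[TN, FP], [FN, TP]]
```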

Performance Thresholds

  • Good: > 0.85
  • Excellent: > 0.90
  • Outstanding: > 0.95

🗂️ Topic Modeling

LDA (Latent Dirichlet Allocation)

  • Number of topics: 5
  • Max iterations: 20
  • Output: Top keywords per topic

BERTopic

  • Language: English
  • Based on BERT embeddings
  • Automatic topic detection

🛠️ Configuration

Config File (config.py)

Key configuration settings:

# Model paths
MODELS_DIR = 'models/'
DEFAULT_MODEL = 'logistic_regression'  # Can be 'svm' or 'naive_bayes'

# Database
DATABASE_PATH = 'sentiment_analysis.db'

# API Server
API_HOST = '0.0.0.0'
API_PORT = 8000
API_DEBUG = False

Environment Variables

Create .env file (optional):

DATABASE_PATH=sentiment_analysis.db
LOG_LEVEL=INFO
API_PORT=8000
API_HOST=0.0.0.0
API_DEBUG=False
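
A sketch of how config.py could consume these variables using only the standard library (loading the .env file itself is often delegated to python-dotenv). The parsing below is an assumption for illustration, not the project's actual code:

```python
import os

# Simulate values that the .env file would provide
os.environ["API_PORT"] = "8000"
os.environ["API_DEBUG"] = "False"

API_HOST = os.environ.get("API_HOST", "0.0.0.0")
API_PORT = int(os.environ.get("API_PORT", "8000"))  # env vars are always strings
API_DEBUG = os.environ.get("API_DEBUG", "False").lower() == "true"

print(API_PORT, API_DEBUG)
```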

Model Selection

To use a different model, edit config.py:

DEFAULT_MODEL = 'svm'  # Change to 'logistic_regression', 'svm', or 'naive_bayes'

Or pass model parameter when initializing analyzer:

from sentiment_analysis_utils import SentimentAnalyzer
from config import MODEL_CONFIGS

analyzer = SentimentAnalyzer(
    model_config=MODEL_CONFIGS['svm']
)

🧪 Testing

Unit Tests

pytest tests/

Quick Manual Tests

# Test preprocessing
python -c "from sentiment_analysis_utils import preprocess_text; print(preprocess_text('Amazing movie!!!'))"

# Test prediction
python -c "from sentiment_analysis_utils import SentimentAnalyzer; a = SentimentAnalyzer(); print(a.predict('Great film!'))"

# Test database
python -c "from sentiment_analysis_utils import DatabaseManager; db = DatabaseManager(); print(db.get_sentiment_stats())"

API Testing with cURL

# Health check
curl http://localhost:8000/health

# Single prediction
curl -X POST http://localhost:8000/predict_review \
  -H "Content-Type: application/json" \
  -d '{"text":"This movie is amazing!","save_to_db":true}'

# Get stats
curl http://localhost:8000/stats

📈 Performance Benchmarks

Model Comparison (on Test Set)

| Model               | Accuracy | Precision | Recall | F1-Score | Training Time |
|---------------------|----------|-----------|--------|----------|---------------|
| Logistic Regression | 90.2%    | 0.901     | 0.902  | 0.901    | 30 s          |
| SVM                 | 91.1%    | 0.910     | 0.911  | 0.910    | 120 s         |
| Naive Bayes         | 87.5%    | 0.874     | 0.876  | 0.875    | 5 s           |

Processing Speed

| Operation                  | Time       |
|----------------------------|------------|
| Preprocess review          | 10-50 ms   |
| Prediction (single)        | 50-100 ms  |
| Batch (100 reviews)        | 5-10 s     |
| Database query (1000 rows) | 100-200 ms |

🔒 Security Considerations

  • Input validation on all API endpoints
  • SQL injection prevention (using parameterized queries)
  • CORS enabled for dashboard access
  • Rate limiting (optional, add middleware)
  • Input length limits (500 chars for storage)
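
The parameterized-query point can be demonstrated with sqlite3 directly. The table and column names below echo insert_prediction's arguments, but the real schema in DatabaseManager may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (review_text TEXT, sentiment TEXT, confidence REAL)"
)

# A hostile input: with naive string formatting this would terminate the statement.
malicious = "nice'); DROP TABLE predictions; --"

# ? placeholders bind the value as data, so the payload is stored, never executed.
conn.execute(
    "INSERT INTO predictions (review_text, sentiment, confidence) VALUES (?, ?, ?)",
    (malicious[:500], "Positive", 0.95),  # [:500] enforces the storage length limit
)

count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
print(count)  # the table survives, with one row
```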

πŸ› Troubleshooting

Models not found error

Solution: Run python run_training.py (or the sentiment_analysis_detailed.ipynb notebook) to train and save models

NLTK data missing

Solution: Automatically downloaded on first use, or:
python -c "import nltk; nltk.download('all')"

Database locked error

Solution: Close other connections or delete sentiment_analysis.db and restart

Out of memory with large batch

Solution: Reduce batch size or process in chunks of 100

📚 File Descriptions

Core Application Files

| File | Purpose |
|------|---------|
| config.py | Centralized configuration (paths, model names, database settings) |
| sentiment_analysis_utils.py | Core classes: SentimentAnalyzer, DatabaseManager, utility functions |
| api.py | FastAPI REST server with endpoints for predictions, stats, and reports |
| streamlit_app.py | Interactive Streamlit dashboard for visualization and analysis |
| run_training.py | Script to train models and save them to the models/ directory |

Data & Models

| Directory | Purpose |
|-----------|---------|
| dataset/ | CSV files with IMDB reviews (Train, Validation, Test sets) |
| models/ | Serialized trained models (pickle files) |
| logs/ | Application logs (auto-generated) |

Notebooks

| File | Purpose |
|------|---------|
| sentiment_analysis_detailed.ipynb | Comprehensive analysis, feature extraction, model training, and evaluation |

📚 References

Libraries Used

  • scikit-learn: ML models and evaluation
  • NLTK: NLP preprocessing
  • Gensim: Word2Vec embeddings
  • BERTopic: Advanced topic modeling
  • FastAPI: REST API framework
  • Streamlit: Dashboard framework
  • SQLite: Local database


🚨 Important Notes

  • First Run: Execute python run_training.py to train and save models before running API/Dashboard
  • Database: SQLite file (sentiment_analysis.db) is auto-created on first prediction
  • Logs: Check logs/ directory if issues occur
  • NLTK Data: Automatically downloaded on first use
  • Windows: Some paths may need adjustments - ensure dataset CSV files exist

🎯 Future Enhancements

  • LSTM/RNN models for sequence learning
  • BERT fine-tuning for better accuracy
  • Multi-class sentiment (Negative/Neutral/Positive)
  • Aspect-based sentiment analysis
  • Real-time streaming predictions
  • Docker containerization
  • MongoDB integration
  • Advanced visualizations (3D topics)
  • Model versioning and tracking
  • A/B testing framework

📝 License

This project is open source and available under the MIT License.

👤 Author

Created for IMDB review sentiment analysis task.

📞 Support

For issues or questions:

  1. Check troubleshooting section
  2. Review Jupyter notebook examples
  3. Check API documentation at /docs
  4. Review logs for error details

Happy Analyzing! 🎉
