This project combines rich data storytelling and graph machine learning to analyze and recommend movies using the TMDB 5000 Movie Dataset from Kaggle.
It was built as a research portfolio project for an application to the DICE Group at the University of Paderborn.
Explore patterns in global movie production with 10 interactive visualizations:
- Movie Releases Per Year
- Top 10 Genres
- Top 10 Directors
- Top 10 Actors
- Movies by Production Country (Choropleth)
- Genre Distribution (Treemap)
- Spoken Languages (Bar)
- Language Distribution (Donut)
- Budget vs Revenue (Raw)
- Budget vs Revenue (Log)
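Several of these charts (genres, production countries, spoken languages) rely on columns that the TMDB 5000 CSV stores as JSON-encoded lists of objects rather than plain values. The sketch below shows one way such a column could be unpacked and counted before plotting. The sample rows and the helper name `count_genres` are illustrative assumptions, not code taken from this repository.

```python
import json
from collections import Counter

# Toy rows mimicking the `genres` column of the TMDB 5000 movies CSV,
# where each cell is a JSON-encoded list of {"id", "name"} objects.
# These rows are made up for illustration.
sample_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[]',  # some movies have no genre entries
]


def count_genres(cells):
    """Count how often each genre name appears across all cells."""
    counts = Counter()
    for cell in cells:
        for entry in json.loads(cell):
            counts[entry["name"]] += 1
    return counts


genre_counts = count_genres(sample_genres)
top_genres = genre_counts.most_common(10)  # input for a "Top 10 Genres" chart
print(top_genres)
```

In the project itself the cells would presumably come from a pandas DataFrame loaded from the Kaggle CSV, and the resulting counts would be passed to plotly for rendering.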
- Movies, directors, genres, and keywords form a heterogeneous knowledge graph
- Node embeddings learned using Node2Vec or KarateClub
- Recommendations generated via cosine similarity in embedding space
- Fully interactive search using a Streamlit dropdown
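The recommendation step can be illustrated with a minimal sketch: given one embedding vector per movie, rank every other movie by cosine similarity to the query. The three-dimensional vectors, the movie titles, and the `recommend` helper below are hypothetical stand-ins. Real embeddings would be learned from the graph, and the project may well use numpy or scikit-learn's `cosine_similarity` instead of this pure-Python version.

```python
import math

# Hypothetical 3-dimensional embeddings for illustration only; vectors
# learned from the knowledge graph would have many more dimensions.
embeddings = {
    "Avatar": [0.9, 0.1, 0.0],
    "Aliens": [0.8, 0.2, 0.1],
    "Titanic": [0.1, 0.9, 0.2],
    "The Notebook": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, embeddings, top_k=2):
    """Return the top_k (title, score) pairs most similar to `title`."""
    query = embeddings[title]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != title  # never recommend the query movie itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


recommendations = recommend("Avatar", embeddings)
print(recommendations)
```

With these toy vectors, "Aliens" ranks first for "Avatar" because the two vectors point in nearly the same direction. In a Streamlit app, the query title would come from the dropdown selection.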
- Python 3.10
- pandas, numpy, networkx, matplotlib, scikit-learn
- plotly, streamlit
- node2vec / karateclub
- TMDB 5000 dataset (Kaggle)
1. Clone the repo:
   git clone https://github.com/yourusername/tmdb-recommender-graphml.git
   cd tmdb-recommender-graphml
2. Create a virtual environment:
   python3 -m venv graphenv
   source graphenv/bin/activate
3. Install dependencies:
   pip install -r requirements.txt
4. Run the app:
   streamlit run app/app.py
TMDB 5000 Movie Dataset: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata