A personalized movie database for my friend Juan.
- The original source data was a .txt file containing a list of movies/TV shows, the order they were watched that year, and a rating out of 10
- The .txt file was parsed in `create_silver_jpmdb.py`, including parsing the ratings, seasons, watch order, year specifiers and other metadata
- IMDb data was downloaded from IMDb Datasets, and the .gz files were converted into `silver/imdb/title_basics` and `silver/imdb/title_ratings` using `create_silver_imdb.py`
- The jpmdb and IMDb datasets were initially joined into `stg_jpmdb_combined` with standard string cleaning and fuzzy matching approaches, using `create_silver_stg_jpmdb_combined.py`
- Entries were manually reviewed using a small CLI tool, `review_combined_jpmdb.py`, giving an opportunity to correct fuzzy matching errors and manually add missing entries
- After all entries were validated, the data was moved to the gold table `gold/jpmdb` in `create_gold_jpmdb.py`

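The string cleaning and fuzzy matching step can be illustrated with a minimal sketch. It is not the code in the repository: the function names `clean_title` and `best_match`, the `0.85` cutoff, and the sample titles are all hypothetical, and the standard-library `difflib` stands in for whatever matching library the real script uses.

```python
import difflib
import re


def clean_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return re.sub(r"\s+", " ", title).strip()


def best_match(title, candidates, cutoff=0.85):
    """Return (candidate, score) for the closest cleaned title.

    Returns None when no candidate reaches the (assumed) cutoff, which is
    the case a manual review step would need to handle.
    """
    cleaned = clean_title(title)
    best, best_score = None, 0.0
    for candidate in candidates:
        matcher = difflib.SequenceMatcher(None, cleaned, clean_title(candidate))
        score = matcher.ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= cutoff:
        return best, best_score
    return None


# Hypothetical IMDb titles, used only to exercise the sketch.
imdb_titles = [
    "The Matrix",
    "The Matrix Reloaded",
    "Spider-Man: Into the Spider-Verse",
]

print(best_match("the matrix", imdb_titles))
print(best_match("Spiderman Into the Spiderverse", imdb_titles))
print(best_match("Some Home Video", imdb_titles))
```

A cutoff this strict favors missed matches over wrong ones, which fits a workflow where unmatched or doubtful entries are reviewed by hand afterwards.
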
The dashboard is built using Dash and Plotly. It currently includes 4 visualizations:
- A virtualized table of all entries in the database
- A scatter plot of ratings over the watched order to show ratings over time
- A scatter plot comparing ratings to IMDb ratings
- A box plot showing distribution of ratings per IMDb genre
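
As a rough illustration of the data preparation behind the per-genre box plot, the sketch below groups ratings by genre. It assumes the gold table keeps IMDb's comma-separated `genres` string and a numeric `rating` per row. The row layout, the function name `ratings_by_genre`, and the sample rows are made up for the example and are not taken from the repository.

```python
from collections import defaultdict
from statistics import median


def ratings_by_genre(rows):
    """Group ratings by genre; a title with several genres counts once per genre."""
    grouped = defaultdict(list)
    for row in rows:
        genres = row.get("genres")
        # IMDb datasets mark missing values with the literal string \N.
        if not genres or genres == "\\N":
            continue
        for genre in genres.split(","):
            grouped[genre.strip()].append(row["rating"])
    return dict(grouped)


# Made-up rows standing in for entries of the gold table.
rows = [
    {"title": "Example A", "rating": 8, "genres": "Action,Sci-Fi"},
    {"title": "Example B", "rating": 6, "genres": "Action"},
    {"title": "Example C", "rating": 9, "genres": "\\N"},
]

grouped = ratings_by_genre(rows)
for genre, ratings in sorted(grouped.items()):
    print(genre, ratings, median(ratings))
```

Each list in `grouped` could then feed one box per genre, for example one `plotly.graph_objects.Box(y=ratings, name=genre)` trace each. Counting a multi-genre title in every one of its genres is a design choice made here for illustration; the dashboard may handle it differently.
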

Planned improvements:
- [ ] incorporate scraped poster images into the dashboard
- [ ] cross-visualization filtering by genre
- [ ] short summary of top 10 titles per year