Merge pull request #22 from Eva-Claire/claire_branch

Claire branch
AtomHarris · Jul 29, 2024 · c01ac1b
2 parents ca06744 + cd98dcd
Showing 2 changed files with 30 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -6,34 +6,44 @@
 Evaclaire Wamitu - [Email]([email protected])
 Simon Makumi - [Email]([email protected])
 
-## BUSINESS UNDERSTANDING
-
 ### Project Overview
-StreamFlix is developing a personalized movie recommendation system to enhance user experience and retention. The system uses collaborative filtering and content based filtering on the MovieLens dataset to provide tailored top 5 movie suggestions for each user. The system will analyze user ratings to generate recommendations using collaborative filtering techniques. New users will be onboarded through various rating collection methods to quickly build their preference profiles. By implementing this system, StreamFlix aims to create a more engaging and personalized viewing experience, ultimately leading to increased user satisfaction and improved business metrics.
+StreamFlix, a leading online streaming platform, is embarking on a project to revolutionize its user experience by implementing a hybrid movie recommendation system. Faced with challenges in user retention and engagement, and with users overwhelmed by an extensive movie library, StreamFlix aims to use data science to provide personalized movie suggestions. This initiative follows the CRISP-DM methodology and applies both collaborative filtering and content-based filtering to the MovieLens dataset. The primary objective of this project is to develop and deploy an accurate hybrid recommendation system that predicts user preferences and outputs a tailored list of the top 5 movie recommendations. StreamFlix expects not only to enhance user engagement and retention but also to set a new standard in the streaming industry for tailored content delivery.
 
-### Business Problem
+### Problem Statement
 StreamFlix is facing challenges with user retention and engagement. Users are also overwhelmed by the vast library of movies available and often spend a considerable amount of time searching for movies they would enjoy. StreamFlix is, therefore, looking for a way to provide personalized movie recommendations to its users to improve their viewing experience and increase platform usage.
 
-### Objectives
-
-#### Main Objective
-To develop and deploy a collaborative filtering-based recommendation system that accurately predicts user preferences and provides relevant movie suggestions.
+This project follows the CRISP-DM methodology, which entails:
 
-#### Specific Objectives
-1. To build a collaborative filtering model that uses user ratings to generate top 5 movie recommendations.
-2. To address the cold start problem using content-based filtering for new users.
-3. To evaluate the recommendation system using appropriate metrics like RMSE and MAP.
+## 1. Business Understanding
+Online streaming services such as StreamFlix aim to enhance user experience and retention by providing personalized movie recommendations. By delivering tailored content, these services strive to meet individual user preferences, making it easier for users, both new and old, to discover movies they are likely to enjoy. This personalized approach is crucial because it increases user engagement, satisfaction and loyalty, leading to higher user retention rates. For the business, this means improved subscription renewals, reduced churn rates and a more competitive edge in the crowded streaming market. Ultimately, providing well-targeted personalized recommendations will help StreamFlix build a stronger relationship with its users, fostering long-term growth and success.
 
-## DATA UNDERSTANDING
+## 2. Data Understanding
 The data used in this project is the MovieLens dataset from the GroupLens Research Lab, covering ratings of movies released from 1902 to 2018. The dataset contains 100,836 ratings and 3,683 tag applications across 9,742 movies, with each user having rated at least 20 films. While the full dataset contains 1.9 million ratings, we focused on a subset of about 100,000 for our current model due to time and resource constraints. This sample size balances computational efficiency with statistical relevance for our recommendation engine development. The datasets include `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. The following features were used in the development of our recommendation system: `movieId`, `userId_x`, `rating`, `title` and `genres`.
 The datasets were merged on the `movieId` column, resulting in a DataFrame with 285,783 rows and 11 columns. The data was sufficient to fulfil our objectives, although additional information such as actors, directors, production studio, runtime and user demographics would have provided more context and insight into user preferences and movie characteristics, leading to better recommendations.
 
-## Exploratory Data Analysis
-This involves systematically examining and summarizing the datasets to uncover patterns, trends and anomalies that can inform model development. This process includes visualizing the distribution of ratings, movie genres, and user activity to understand the dataset's characteristics. EDA also involves analyzing the relationships between features, such as the correlation between movie ratings and user demographics, to identify potential biases or data quality issues. By exploring these aspects, EDA helps in refining feature engineering, selecting relevant attributes, and ensuring the data is suitable for building and evaluating recommendation models. Overall, EDA is crucial for gaining insights that guide the development of more accurate and effective recommendation systems.
-We decided  to investigate the distribution of ratings and genres and the following
+## 3. Exploratory Data Analysis
+In this exploratory data analysis (EDA), we aim to uncover patterns, trends and relationships within our dataset using visualizations and statistical techniques. 
 
-![attachment:logo.png](logo.png)
+Distribution of Ratings: The most common rating is 4.0, with about 25,000 ratings, and the distribution is concentrated at the higher ratings (a left, or negative, skew).
+
+![png](project_images/distribution_of_ratings.png)
+
+Distribution of Genres: Drama is the most frequent genre, with over 40,000 occurrences, followed closely by Comedy, while niche genres such as Animation, War, IMAX, Musical and Western each appear fewer than 10,000 times.
+
+![png](project_images/distribution_of_movie_genres.png)
+
+Top 20 Rated Titles by Number of Ratings: 'Forrest Gump' has the highest number of ratings, followed by 'The Shawshank Redemption' and 'Pulp Fiction'. The most-rated films have between 200 and nearly 300 ratings each.
+
+![png](project_images/top_20_rated_titles_by_num_of_ratings.png)
+
+## 4. Data Preparation
+We correct errors in the dataset by handling missing values, outliers, duplicates and data type inconsistencies. Columns that are least relevant to our analysis are dropped, and the `userId_x` column is renamed to `user_id` for ease of reference. The release year is then extracted from the movie title column, and values in the genres column are split into lists for easier analysis. We then examine the properties of the DataFrame, drop missing values and duplicate rows, and finally convert the `user_id` column from float64 to int64 for consistency. This ensures that the data is clean and properly formatted, setting the stage for accurate modeling.
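The year extraction and genre splitting described in Data Preparation could look roughly like the sketch below. It is not the project's code: the sample rows, the `extract_year` and `split_genres` helpers, and the regular expression are hypothetical, and it assumes the MovieLens conventions of a trailing `(YYYY)` in the title and pipe-separated genres.

```python
import re

# Hypothetical sample rows in the MovieLens style: the year sits in
# parentheses at the end of the title and genres are pipe-separated.
rows = [
    {"title": "Toy Story (1995)", "genres": "Adventure|Animation|Children"},
    {"title": "Heat (1995)", "genres": "Action|Crime|Thriller"},
    {"title": "Untitled Film", "genres": "Drama"},
]

# Assumed pattern: a four-digit year in parentheses at the end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing four-digit year as an int, or None if absent."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def split_genres(genres):
    """Split a pipe-separated genre string into a list of genre names."""
    return [g.strip() for g in genres.split("|") if g.strip()]


# Build new rows with an added "year" key and "genres" converted to a list.
prepared = [
    {**row, "year": extract_year(row["title"]), "genres": split_genres(row["genres"])}
    for row in rows
]
```

Titles without a recognisable year yield `None` here, so they can be dropped together with the other missing values.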
+
+## 5. Modeling
+A dummy model is first evaluated with the Surprise library as a baseline for the collaborative filtering system; it generates random predictions based on the observed distribution of ratings. Trained and tested, this dummy model yields an RMSE of `1.43`. Grid search cross-validation gives an RMSE of `0.862` for the Singular Value Decomposition (SVD) model against `0.975` for the K-Nearest Neighbors (KNN) model, identifying `SVD` as the better choice for the collaborative filtering system. The `SVD` model is then assessed using cross-validation, achieving a mean RMSE of approximately `0.857` and a very low standard deviation of `0.005` across 5 folds.
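To make the RMSE figures above concrete, here is a standard-library sketch of the metric and of a random baseline in the same spirit as the dummy model. It does not use Surprise and is not the project's implementation: `rmse` and `random_baseline` are hypothetical helpers, and drawing from a normal distribution fitted to the training ratings, clipped to the 0.5 to 5.0 scale, is an assumption about how such a baseline might behave.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def random_baseline(train_ratings, n, seed=0):
    """Draw n predictions from a normal distribution fitted to the training
    ratings, clipped to the assumed 0.5..5.0 rating scale."""
    mean = sum(train_ratings) / len(train_ratings)
    variance = sum((r - mean) ** 2 for r in train_ratings) / len(train_ratings)
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return [min(5.0, max(0.5, rng.gauss(mean, std))) for _ in range(n)]
```

A model is only worth keeping if its RMSE on held-out ratings is clearly below what `random_baseline` achieves on the same ratings, which is the comparison the `1.43` baseline figure supports.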
+A `CollabBasedModel` class trains the SVD model, collects user ratings, generates movie recommendations (with optional genre filtering) and prints the top recommendations. 
+TF-IDF vectorization and cosine similarity are then encapsulated within a `ContentBasedModel` class in order to train a content-based movie recommendation system and provide genre-based movie recommendations based on similarity to a specified movie title.
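The idea behind genre-based TF-IDF and cosine similarity can be sketched from scratch as follows. This is an illustration only, not the `ContentBasedModel` class: the toy catalogue, the function names and the smoothed IDF formula are assumptions, and a real implementation would more likely rely on a vectorizer from an existing library.

```python
import math
from collections import Counter

# Hypothetical catalogue: title -> list of genre tokens.
movies = {
    "Toy Story (1995)": ["Adventure", "Animation", "Children", "Comedy"],
    "Jumanji (1995)": ["Adventure", "Children", "Fantasy"],
    "Heat (1995)": ["Action", "Crime", "Thriller"],
}


def tfidf_vectors(docs):
    """Map each title to a sparse {term: tf-idf weight} dict."""
    n_docs = len(docs)
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for title, terms in docs.items():
        if not terms:
            vectors[title] = {}
            continue
        counts = Counter(terms)
        vectors[title] = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(terms))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_titles(title, vectors, top_n=5):
    """Rank every other title by cosine similarity to the given title."""
    target = vectors[title]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Calling `similar_titles("Toy Story (1995)", tfidf_vectors(movies))` ranks "Jumanji (1995)" first, because it shares two genres with the query title, while "Heat (1995)" shares none.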
+The `HybridModel` class integrates collaborative filtering and content-based filtering using a weighted average to provide hybrid recommendations, with user ratings guiding the process. After training both models, the system generates and prints a list of recommended movies based on user input. The hybrid model with a collaborative weight of 0.5 produces an RMSE of `1.25`.
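The weighted-average blending can be illustrated with a short sketch. It is not the `HybridModel` class itself: `hybrid_scores`, `top_n` and the sample scores are hypothetical, and the sketch assumes both models' scores have already been put on a common 0 to 1 scale.

```python
def hybrid_scores(collab, content, collab_weight=0.5):
    """Blend two {title: score} dicts with a weighted average.

    Titles missing from one source are scored 0.0 for that source
    (an assumption made for this sketch).
    """
    if not 0.0 <= collab_weight <= 1.0:
        raise ValueError("collab_weight must lie between 0 and 1")
    titles = set(collab) | set(content)
    return {
        title: collab_weight * collab.get(title, 0.0)
        + (1.0 - collab_weight) * content.get(title, 0.0)
        for title in titles
    }


def top_n(scores, n=5):
    """Return the n best (title, score) pairs, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical scores from the two models, both assumed to be on a 0..1 scale.
collab = {"A": 0.9, "B": 0.4, "C": 0.7}
content = {"A": 0.2, "B": 0.8, "D": 0.6}
blended = hybrid_scores(collab, content, collab_weight=0.5)
best = top_n(blended, n=2)
```

With equal weights, title "B" overtakes "A" because its strong content score offsets its weaker collaborative score. Raising `collab_weight` shifts the ranking back toward the collaborative model.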
+## 6. Recommendations
 
- ## Modeling
- A collaborative based model was first developed using the Singular Value Decomposition algorithm
- a content based model and a hybrid model were developed 
+## 7. Conclusions
diff --git a/project_images/top_20_rated_titles_by_num_of_ratings.png b/project_images/top_20_rated_titles_by_num_of_ratings.png