From 14f6ed26982d15ee5dc676e1e056f25fd9f6bc93 Mon Sep 17 00:00:00 2001 From: Evaclaire Wamitu Date: Tue, 30 Jul 2024 00:02:09 +0300 Subject: [PATCH] Finish README.md --- README.md | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 4498071..c00bb9f 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,11 @@ Online streaming services such as StreamFlix aim to enhance user experience and The data utilised in this project is the Movielens dataset from GroupLens Research Lab covering movie ratings from 1902 to 2018. The dataset contains 100836 ratings and 3683 tag applications across 9742 movies with each user rating at least 20 films. While the full dataset contains 1.9 million ratings, we focussed on a subset of about 100,000 for our current model due to time and resource constraints. This sample size balances computational efficiency with statistical relevance for our recommendation engine development. The datasets include `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. The following features were utilized in the development of our recommendation system: `movieId`, `userId_x`, `rating`, `title` and `genres`. The datasets were merged on the movieId column resulting in a DataFrame with 285783 rows and 11 columns. The data was sufficient in fulfilling our objectives although additional information such as actors, directors, production studio, runtime and user demographics would have provided more context and insights into user preferences and movie characteristics leading to better recommendations. -## 3. Exploratory Data Analysis + ## 3. Data Preparation +We address and correct errors in the dataset by handling missing values, outliers, duplicates and data type inconsistencies. Columns that are least relevant in our analysis are dropped and the userId_x column is then renamed to user_id for ease of reference. 
Movie release year is then extracted from the movie title column and values in the genres column are split into lists for easier analysis. We then examine the properties of the DataFrame and drop missing values and duplicate rows and finally convert the user_id column data type from float64 to int64 for consistency. This ensures that the data is clean and properly formatted setting the stage for accurate modeling. + +### Exploratory Data Analysis In this exploratory data analysis (EDA), we aim to uncover patterns, trends and relationships within our dataset using visualizations and statistical techniques. Distribution of Ratings: The most common rating is 4.0 with about 25,000 movies, the distribution is positively skewed towards higher ratings. @@ -36,14 +40,23 @@ Top 20 Rated Titles by Number of Ratings Plot: 'Forrest Gump' has the highest nu ![png](project_images/top_20_rated_titles_by_num_of_ratings.png) - ## 4. Data Preparation -We address and correct errors in the dataset by handling missing values, outliers, duplicates and data type inconsistencies. Columns that are least relevant in our analysis are dropped and the userId_x column is then renamed to user_id for ease of reference. Movie release year is then extracted from the movie title column and values in the genres column are split into lists for easier analysis. We then examine the properties of the DataFrame and drop missing values and duplicate rows and finally convert the user_id column data type from float64 to int64 for consistency. This ensures that the data is clean and properly formatted setting the stage for accurate modeling. - - ## 5. Modeling + ## 4. Modeling A dummy model is first evaluated using the Surprise library as a baseline for the collaborative filtering system generating random predictions based on the observed distribution of ratings. This dummy model is trained and tested providing an RMSE of `1.43`. 
Gridsearch Cross Validation of the Singular Value Decomposition (SVD) model outputs an RMSE of `0.862` against an RMSE of `0.975` for the K-Nearest Neighbors (KNN) model identifying the `SVD` model as more superior for creating the collaborative filtering system. The `SVD` model is then assessed using cross-validation achieving a mean RMSE of approximately 0.857 and a very low standard deviation of `0.005` across 5 folds. A `CollabBasedModel` class trains the SVD model, collects user ratings, generates movie recommendations (with optional genre filtering) and prints the top recommendations. TF-IDF vectorization and cosine similarity are then encapsulated within a `ContentBasedModel` class inorder to train a content-based movie recommendation system and provide genre-based movie recommendations based on similarity to a specified movie title. -The `HybridModel` class integrates collaborative filtering and content-based filtering using a weighted average to provide hybrid recommendations with user ratings guiding the process. After training both models, the system generates and prints a list of recommended movies based on user input. The hybrid model with a collab weight of 0.5 produces an RMSE of `1.25` -### 6. Recommendations +The `HybridModel` class integrates collaborative filtering and content-based filtering using a weighted average to provide hybrid recommendations with user ratings guiding the process. After training both models, the system generates and prints a list of recommended movies based on user input. The hybrid model with a collab weight of 0.5 produces an RMSE of `1.25`. +Different collaborative filtering weights are tested to determine their impact on the hybrid model's performance. Starting from 0.2 up to 0.8, the RMSE values decrease, indicating improved prediction accuracy. With a weight of `0.2` the RMSE is `1.2559` and with a weight of `0.4` it slightly improves to `1.2523`. 
The RMSE significantly drops to `1.1263` with a weight of `0.6` and further decreases to `1.1221` with a weight of `0.8`. This suggests that a higher collaborative filtering weight tends to enhance the model's accuracy, leading to lower prediction errors, and helps in identifying the most effective balance between collaborative and content-based filtering. + +### Conclusion +The collaborative filtering model with an RMSE of 0.86 outperforms the hybrid model, which shows a higher RMSE, suggesting that emphasizing collaborative filtering in a hybrid approach yields better accuracy and recommendation quality. + +### Recommendations +1. **Refine Collaborative Filtering Weights**: Experiment with smaller increments around the optimal collaborative filtering weight (e.g., between 0.6 and 0.8) to fine-tune performance. + +2. **Implement Cross-Validation**: Use cross-validation to confirm that improvements in RMSE are consistent and not due to random variations or overfitting. + +3. **Enhance Content-Based Model**: Incorporate detailed item features like plot summaries in content-based filtering to improve recommendations especially for users with limited interaction history. + +4. **Explore Advanced Techniques**: Integrate deep learning-based models and other advanced methods to further enhance the system's capabilities and address remaining limitations. -### 7. Conclusions \ No newline at end of file +5. **Incorporate User Feedback and Regular Updates**: Integrate user feedback and conduct real-world testing to validate the model's effectiveness ensuring it meets user preferences and expectations. Regularly update the recommendation system with new data and metrics to adapt to evolving user preferences and content. \ No newline at end of file
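
The weighted-average blending described for the `HybridModel`, and the weight sweep from 0.2 to 0.8, could look roughly like the sketch below. This is a minimal illustration and not the project's code: the function names `hybrid_score` and `rmse`, the assumption that both models emit predicted ratings on the same scale, and the toy ratings are all hypothetical.

```python
import math


def hybrid_score(collab_pred, content_pred, collab_weight=0.5):
    """Blend two predicted ratings with a weighted average.

    Assumption: both predictions are on the same rating scale.
    """
    if not 0.0 <= collab_weight <= 1.0:
        raise ValueError("collab_weight must be between 0 and 1")
    return collab_weight * collab_pred + (1.0 - collab_weight) * content_pred


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical toy ratings, not taken from the MovieLens data.
    actual = [4.0, 3.0, 5.0]
    collab_preds = [3.8, 3.2, 4.6]
    content_preds = [3.0, 3.5, 3.9]

    # Sweep the collaborative weight, mirroring the 0.2 to 0.8 experiment.
    for weight in (0.2, 0.4, 0.6, 0.8):
        blended = [
            hybrid_score(c, t, weight)
            for c, t in zip(collab_preds, content_preds)
        ]
        print(f"weight={weight:.1f} rmse={rmse(actual, blended):.4f}")
```

A finer sweep, such as steps of 0.05 between 0.6 and 0.8, would only need a different tuple of weights in the loop.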