This project aims to build a recommendation system for movies based solely on their scripts. Unlike traditional recommendation algorithms that rely on user data, this system utilizes the content of the movies to suggest similar films.
To get started, you'll need to obtain the movie scripts; the source of this data is listed at the end of this README. Once you have the scripts, place them in a directory named "scripts" inside the project directory.
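As a rough illustration of how the "scripts" directory might be read, here is a minimal sketch. The `load_scripts` helper is hypothetical and not part of the project; it assumes each script is a plain-text `.txt` file whose file name (without the extension) is the movie title.

```python
from pathlib import Path


def load_scripts(directory="scripts"):
    """Return {movie title: script text} for every .txt file in the directory."""
    scripts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # Assumption: the file name (without extension) is the movie title.
        scripts[path.stem] = path.read_text(encoding="utf-8", errors="ignore")
    return scripts


# Example call (left commented out so the sketch has no side effects):
# scripts = load_scripts("scripts")
```

Decoding with `errors="ignore"` is a design choice for this sketch: scraped scripts may contain stray bytes, and skipping them keeps the loader from failing on a single bad file.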
This project requires the following dependencies:
- pandas
- numpy
- tqdm
- gensim
- scikit-learn
You can install these dependencies using pip:
pip install pandas numpy tqdm gensim scikit-learn
- Run the script to import the movie scripts and convert them into vectors using Doc2Vec.
- Choose between two formats of the dataset: dfscript, where each column holds one vector feature, or dfvectors, where a single column holds the entire vector.
- Define functions for calculating cosine similarity and finding similar movies.
- Optionally, perform clustering using KMeans to group similar movies.
- Utilize the recommendation function to suggest movies based on the input.
recommend(['No Country for Old Men', 'Mean Girls'])
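The similarity and recommendation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `recommend_from_vectors` is a hypothetical helper with its own signature, the query is taken to be the mean of the input movies' vectors, and the toy three-dimensional vectors are made-up stand-ins for Doc2Vec output.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against zero vectors, for which the cosine is undefined.
    return float(a @ b / denom) if denom else 0.0


def recommend_from_vectors(titles, vectors, top_n=3):
    """Hypothetical helper: rank movies by similarity to the mean input vector."""
    # Assumption: several input movies are combined by averaging their vectors.
    query = np.mean([vectors[t] for t in titles], axis=0)
    scores = {
        title: cosine_similarity(query, vec)
        for title, vec in vectors.items()
        if title not in titles  # do not recommend the inputs themselves
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy vectors standing in for Doc2Vec output (made-up values).
toy_vectors = {
    "Movie A": [1.0, 0.0, 0.0],
    "Movie B": [0.9, 0.1, 0.0],
    "Movie C": [0.0, 1.0, 0.0],
    "Movie D": [0.0, 0.0, 1.0],
}
print(recommend_from_vectors(["Movie A"], toy_vectors, top_n=2))
```

In this toy data, "Movie B" points in nearly the same direction as "Movie A", so it ranks first. Restricting the candidates to the query's KMeans cluster, as the optional clustering step suggests, would be a filter applied before the scoring.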
By combining Doc2Vec script vectors, cosine similarity, and KMeans clustering, this project demonstrates that a recommendation system can be built without relying on user data. The algorithm works, but there are areas for improvement, such as recommending movies from multiple clusters and handling duplicates. Future iterations may address these issues.
Scraper:
- Saha, Aveek (ORCID: https://orcid.org/0000-0002-6112-3843). "Movie Script Database", version 1.0.0, released 2021-07-05. https://github.com/Aveek-Saha/Movie-Script-Database