JY2014/MovieGenres

Predicting movie genres using movie metadata and posters

- by JY, AD and CH

The purpose of this project is to predict movie genres to better assist targeted advertising. An advertising company can advertise more effectively based on the genre type that matches clients’ preference or basic demographic information. Movie metadata from IMDB and TMDB together with the movie posters were used as predictors, and the genre labels from TMDB were used as the response.

A major challenge of this study is that each movie may have multiple genre labels and so belong to several classes. This type of problem is multi-label classification, where the classes are not mutually exclusive. To handle it, the movie genres were binarized (encoded as one 0/1 indicator per genre), and several modeling methods were applied: traditional machine learning models such as logistic regression, random forest and boosting, as well as a convolutional neural network.
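
As a minimal sketch of the binarization step, the snippet below turns per-movie genre lists into a 0/1 indicator matrix using only the standard library. The function name `binarize_genres` and the example movies are hypothetical and are not taken from the project notebooks, which may well rely on a library routine such as scikit-learn's `MultiLabelBinarizer` instead.

```python
def binarize_genres(genre_lists):
    """Turn per-movie genre lists into a 0/1 indicator matrix (one column per genre)."""
    # Sorted vocabulary of every genre seen in the data.
    classes = sorted({g for genres in genre_lists for g in genres})
    index = {g: i for i, g in enumerate(classes)}
    matrix = []
    for genres in genre_lists:
        row = [0] * len(classes)
        for g in genres:
            row[index[g]] = 1  # a movie can switch on several columns
        matrix.append(row)
    return classes, matrix


# Hypothetical example movies, not rows from the dataset.
movies = [["Drama", "Romance"], ["Animation"], ["Animation", "Comedy", "Drama"]]
classes, Y = binarize_genres(movies)

for genres, row in zip(movies, Y):
    print(genres, "->", row)
```

Each column of `Y` can then be treated as its own binary target, which is one common way to fit a separate classifier per genre.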

Data

The dataset includes the 1000 most popular US-produced English-language movies with genre labels for each primary release year from 2011 to 2016, resulting in a dataset of about 6000 entries.

  • Metadata

    Most of the movie features are from TMDB, including: Release year, Release month, Vote count, Vote average. Additional information such as runtime and aspect ratio was downloaded from IMDB.
  • Text data

    Text analysis was performed on movie titles and TMDB reviews to obtain insights from high frequency words. Given the high dimensionality of the text data and the large number of words with only a few occurrences, words appearing more than 30 times in movie titles and more than 200 times in movie reviews were selected as features. This resulted in 124 features in total.
  • Posters

    Movie posters were downloaded from TMDB for genre prediction in the deep learning models.
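
The frequency-based word selection described under "Text data" could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's code: the regex tokenizer, the helper names `select_frequent_words` and `bag_of_words`, and the toy titles are all hypothetical, and no stop-word handling is attempted because the README does not say whether any was used.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # assumed tokenizer: lowercase alphabetic runs


def select_frequent_words(texts, min_count):
    """Return the words that occur more than `min_count` times across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return sorted(w for w, c in counts.items() if c > min_count)


def bag_of_words(texts, vocabulary):
    """Count how often each vocabulary word occurs in each text."""
    rows = []
    for text in texts:
        counts = Counter(TOKEN.findall(text.lower()))
        rows.append([counts[w] for w in vocabulary])
    return rows


# Toy titles with a toy threshold; the project used 30 (titles) and 200 (reviews).
titles = ["The Dark Night", "Night of the Hunter", "The Last Night"]
vocab = select_frequent_words(titles, min_count=2)
features = bag_of_words(titles, vocab)
print(vocab)
print(features)
```

Thresholding on total occurrences keeps the feature matrix small and drops words too rare to carry a reliable signal.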

Files

Data

  • data/top1000_movies_2011_2016_tmdb: Movie metadata downloaded from TMDB, generated by download_TMDB_data.py
  • data/top1000_movies_2011_2016_tmdb_imdb: data from TMDB plus additional features from IMDB, generated by add_IMDB_data.py
  • data/genre_list.csv: List of genres from TMDB, generated by download_TMDB_data.py
  • data/posters/: posters of the movies downloaded from TMDB, generated by download_posters.py
  • data/cleaned_data_for_traditional_models.p: features and response variables organized for fitting the traditional models, including the predictor array, the standardized predictor array, and the response variable, in that order. Generated by Text_Analysis_and_Data_Cleaning.ipynb
  • data/binarizer_genre_list.p: genre label for the binarized response variable, generated by Text_Analysis_and_Data_Cleaning.ipynb

Code

  • download_TMDB_data.py: download movie metadata from TMDB
  • add_IMDB_data.py: download additional information for each movie from IMDB
  • download_posters.py: download movie posters from TMDB
  • Text_Analysis_and_Data_Cleaning.ipynb: perform bag-of-words and PCA on movie title and posters; prepare movie metadata for model fitting
  • fit_traditional_models.ipynb: perform multi-label classification using weighted logistic regression, decision tree, random forest and boosting
  • NeuralNet_Drama.ipynb: fine-tune pre-trained VGG-16 in two steps, and perform classification on the movie genre "Drama"
  • NeuralNet_Animation.ipynb: balance classes by subsampling, fine-tune pre-trained VGG-16 in two steps, and perform classification on the movie genre "Animation"

Results

  • results/NNet/: predictions of the final neural network model on the testing data for the Drama and Animation genres
  • results/traditional/: multi-label prediction of traditional models

Model performance

Traditional models

The overall performance of the traditional models is summarized by the average F1 score across all genres. Performance on individual genres varies because of class imbalance. Detailed results can be found in fit_traditional_models.ipynb.

Model               Average F1 score
Weighted logistic   0.364
Decision Tree       0.168
Random Forest       0.215
Ada Boosting        0.265
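
For readers unfamiliar with the metric, here is a minimal sketch of an average F1 score over genres, assuming an unweighted (macro) mean of per-genre F1 scores. The README does not state the exact averaging used in the notebook, so treat that choice, the helper names, and the toy labels as assumptions.

```python
def f1_for_label(y_true, y_pred):
    """F1 score for one binary label; returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_f1(Y_true, Y_pred):
    """Unweighted mean of the per-label F1 scores (macro average)."""
    n_labels = len(Y_true[0])
    scores = [
        f1_for_label([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels, scores


# Toy indicator matrices: rows are movies, columns are two hypothetical genres.
Y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
Y_pred = [[1, 0], [0, 1], [0, 0], [1, 1]]
mean_f1, per_genre = average_f1(Y_true, Y_pred)
print(per_genre, mean_f1)
```

Because every genre counts equally in a macro average, rare genres with low F1 scores pull the overall number down, which is consistent with the class-imbalance caveat above.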

Neural Nets

The results are compared for two genres: Drama and Animation. The neural network performs far better than the traditional models in predicting Drama movies (F1 score of 0.94, versus 0.58 for the best traditional model).

Model               Drama (F1 score)   Animation (F1 score)
Weighted logistic   0.58               0.18
Decision Tree       0.37               0.09
Random Forest       0.43               0.07
Ada Boosting        0.47               0.15
Neural Network      0.94               0.11

Note: for Animation, the neural network model was trained on balanced classes. Even so, its F1 score (0.11) was below that of weighted logistic regression (0.18) and Ada Boosting (0.15). A possible reason is that there was not enough data to train the neural net after subsampling the majority class.
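
The class balancing mentioned in the note can be sketched as random undersampling of the majority class. This is an illustration under assumptions rather than the notebook's implementation: the function name `subsample_majority`, the fixed seed, and the poster file names are hypothetical.

```python
import random


def subsample_majority(samples, labels, seed=0):
    """Balance a binary dataset by randomly dropping majority-class samples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority sample and an equally sized random draw from the majority.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical poster file names; 2 of 10 movies are labeled Animation.
posters = [f"poster_{i}.jpg" for i in range(10)]
is_animation = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
balanced_posters, balanced_labels = subsample_majority(posters, is_animation)
print(len(balanced_posters), sum(balanced_labels))
```

In this toy case only 4 of the 10 samples survive, which shows how quickly undersampling shrinks the training set when the positive class is rare.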

About

Tuned convolutional neural network to predict movie genres using movie posters
