Predicting movie genres using movie metadata and posters

- by JY, AD and CH

The purpose of this project is to predict movie genres to better assist targeted advertising. An advertising company can advertise more effectively when it knows which genres match its clients’ preferences or basic demographic information. Movie metadata from IMDB and TMDB, together with the movie posters, were used as predictors, and the genre labels from TMDB were used as the response.

A major challenge of this study is that each movie may have multiple genre labels and belong to several classes. This type of problem is multi-label classification, where the classes are not mutually exclusive. To handle it, the movie genres were binarized (one 0/1 indicator column per genre), and various modeling methods were applied, including traditional machine learning models such as logistic regression, random forest and boosting, as well as a convolutional neural network.
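
The binarization step can be illustrated with a minimal, dependency-free sketch. The function name `binarize_genres` and the example movies are hypothetical and not taken from the project code; scikit-learn's `MultiLabelBinarizer` provides an equivalent transformation.

```python
def binarize_genres(genre_lists):
    """Turn per-movie genre lists into a 0/1 indicator matrix.

    Returns the sorted list of genre names and one row per movie,
    with a 1 in every column whose genre applies to that movie.
    """
    classes = sorted({g for genres in genre_lists for g in genres})
    index = {g: i for i, g in enumerate(classes)}
    rows = []
    for genres in genre_lists:
        row = [0] * len(classes)
        for g in genres:
            row[index[g]] = 1
        rows.append(row)
    return classes, rows


# Hypothetical example movies (not from the project data)
movies = [["Drama", "Romance"], ["Animation", "Comedy"], ["Drama"]]
classes, Y = binarize_genres(movies)
print(classes)
for row in Y:
    print(row)
```

Each column of `Y` can then be treated as its own binary classification target, which is why a movie with two genres simply has two ones in its row.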

Data

The dataset includes the 1000 most popular US-produced, English-language movies with genre labels for each primary release year from 2011 to 2016, resulting in a dataset with about 6000 entries.

Metadata
Most of the movie features are from TMDB, including release year, release month, vote count and vote average. Additional information, such as runtime and aspect ratio, was downloaded from IMDB.
Text data
Text analysis was performed on movie titles and TMDB reviews to obtain insights from high-frequency words. Given the high dimensionality of the text data and the large number of words with only a few occurrences, only words appearing more than 30 times in movie titles or more than 200 times in movie reviews were selected as features. This resulted in 124 features in total.
Posters
Movie posters were downloaded from TMDB for genre prediction in deep learning models.
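
The word-frequency selection described under Text data can be sketched as follows. This is an illustration only: the tokenizer, the function names `frequent_words` and `bag_of_words`, and the toy titles are assumptions, and the toy threshold of 1 stands in for the thresholds of 30 (titles) and 200 (reviews) used in the project.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # assumed tokenizer: lowercase letter runs


def frequent_words(texts, min_count):
    """Return the words occurring more than `min_count` times across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return sorted(w for w, c in counts.items() if c > min_count)


def bag_of_words(texts, vocabulary):
    """Count how often each vocabulary word appears in each text."""
    rows = []
    for text in texts:
        counts = Counter(TOKEN.findall(text.lower()))
        rows.append([counts[w] for w in vocabulary])
    return rows


# Toy titles with a toy threshold (hypothetical data)
titles = ["The Dark Night", "Night of the Living", "Dark Waters", "The Last Night"]
vocab = frequent_words(titles, min_count=1)
X = bag_of_words(titles, vocab)
print(vocab)
print(X)
```

Dropping rare words this way keeps the feature matrix small; any stop-word handling the project applied is not shown here.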

Files

Data

data/top1000_movies_2011_2016_tmdb: movie metadata downloaded from TMDB, generated by download_TMDB_data.py
data/top1000_movies_2011_2016_tmdb_imdb: TMDB metadata plus additional features from IMDB, generated by add_IMDB_data.py
data/genre_list.csv: List of genres from TMDB, generated by download_TMDB_data.py
data/posters/: posters of the movies downloaded from TMDB, generated by download_posters.py
data/cleaned_data_for_traditional_models.p: features and response variables organized for fitting the traditional models; includes the predictor array, the standardized predictor array, and the response variable, all in the same row order. Generated by Text_Analysis_and_Data_Cleaning.ipynb
data/binarizer_genre_list.p: genre label for the binarized response variable, generated by Text_Analysis_and_Data_Cleaning.ipynb

Code

download_TMDB_data.py: download movie metadata from TMDB
add_IMDB_data.py: download additional information for each movie from IMDB
download_posters.py: download movie posters from TMDB
Text_Analysis_and_Data_Cleaning.ipynb: perform bag-of-words and PCA on movie title and posters; prepare movie metadata for model fitting
fit_traditional_models.ipynb: perform multi-label classification using weighted logistic regression, decision tree, random forest and boosting
NeuralNet_Drama.ipynb: fine-tune pre-trained VGG-16 in two steps, and perform classification on the movie genre "Drama"
NeuralNet_Animation.ipynb: balance classes by subsampling, fine-tune pre-trained VGG-16 in two steps, and perform classification on the movie genre "Animation"

Results

results/NNet/: predictions of the final neural network model on the testing data for the Drama and Animation genres
results/traditional/: multi-label prediction of traditional models

Model performance

Traditional models

The overall performance of the traditional models is summarized by the average F1 score across all genres. The model performance on each genre varies due to class imbalance. Detailed results can be found in fit_traditional_models.ipynb.

Model	Average F1 score
Weighted logistic	0.364
Decision Tree	0.168
Random Forest	0.215
Ada Boosting	0.265
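
As an illustration of how an average F1 score over genres can be computed, the sketch below calculates F1 per genre column and takes the unweighted mean. Unweighted (macro) averaging is an assumption here; the exact averaging used in fit_traditional_models.ipynb may differ. The function names and toy labels are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for one binary label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(Y_true, Y_pred):
    """Unweighted mean of per-label F1 over the columns of two 0/1 matrices."""
    n_labels = len(Y_true[0])
    scores = [
        f1_score([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels, scores


# Toy data: 4 movies, 2 genres (hypothetical)
Y_true = [[1, 0], [1, 1], [0, 0], [1, 0]]
Y_pred = [[1, 0], [0, 1], [0, 0], [1, 1]]
mean_f1, per_genre = average_f1(Y_true, Y_pred)
print(per_genre, mean_f1)
```

Because every genre counts equally in an unweighted mean, rare genres with low F1 pull the average down, which is consistent with the class-imbalance effect noted above.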

Neural Nets

The results are compared for two genres: Drama and Animation. The neural network model performs substantially better than the traditional models in predicting Drama movies (F1 score of 0.94 versus 0.58 for the best traditional model).

Model	Drama (F1 score)	Animation (F1 score)
Weighted logistic	0.58	0.18
Decision Tree	0.37	0.09
Random Forest	0.43	0.07
Ada Boosting	0.47	0.15
Neural Network	0.94	0.11

Note: for Animation, the neural network model was trained on balanced classes. However, its performance was not better than that of the traditional models. A possible reason is that there were not enough data to train the neural net after subsampling the majority class.
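
The class-balancing idea can be sketched as random subsampling of the majority class. This is a minimal sketch, not the code from NeuralNet_Animation.ipynb: the 1:1 target ratio, the function name `subsample_majority`, the seed, and the poster file names are all assumptions.

```python
import random


def subsample_majority(samples, labels, seed=0):
    """Keep all minority-class samples and an equally sized random
    subset of the majority class (binary labels 0/1)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)  # avoid an ordered block of one class
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical poster file names; label 1 marks an Animation movie
posters = [f"poster_{i}.jpg" for i in range(10)]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
X_bal, y_bal = subsample_majority(posters, labels)
print(len(X_bal), sum(y_bal))
```

With a 1:1 ratio the balanced set is only twice the size of the minority class, which shows how quickly subsampling shrinks the training data for a rare genre.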
