Book Rating Prediction Project

This repository contains a project focused on predicting the average rating of books based on various features present in the dataset. The goal of this study is to build models that can accurately predict book ratings using the provided attributes. The project involves data preprocessing, feature selection, regression and classification modeling, and an exploration of results.

Table of Contents

Introduction
Data Preprocessing
Feature Selection
Data Distribution and Imbalance
Data Augmentation with SMOGN
Regression and Classification Approaches
Model Evaluation
Conclusion
Future Steps
GitHub Repository

Introduction

The main objective of this project is to predict the average rating of books using a dataset with 12 attributes, ranging from authors to publishers. The dataset was preprocessed to handle outliers, missing values, and categorical features.

Data Preprocessing

Outliers and suspicious data points were removed, including instances with zero average ratings, unknown publication dates, and extreme values in attributes like number of pages and rating count. Categorical features were label encoded for model compatibility.
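The filtering and encoding steps described above could look roughly like the following sketch. The column names (`authors`, `publisher`, `num_pages`, `ratings_count`, `publication_date`, `average_rating`), the toy rows, and the `max_pages` and `max_ratings` thresholds are assumptions made for illustration; the notebook's actual names and cut-offs may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real dataset; column names and values are assumed.
books = pd.DataFrame({
    "authors": ["A", "B", "A", "C", "B"],
    "publisher": ["P1", "P2", "P1", "P3", "P2"],
    "num_pages": [320, 0, 250, 4100, 180],
    "ratings_count": [1200, 35, 0, 560, 89],
    "publication_date": ["2006-09-16", "2004-11-01", None, "1999-05-03", "2010-01-01"],
    "average_rating": [4.1, 0.0, 3.7, 3.9, 3.2],
})

def preprocess(df, max_pages=2000, max_ratings=1_000_000):
    """Drop suspicious rows, then label encode the categorical columns."""
    df = df.copy()
    # Unparseable or missing dates become NaT and are filtered out below.
    df["publication_date"] = pd.to_datetime(df["publication_date"], errors="coerce")
    keep = (
        (df["average_rating"] > 0)                # zero ratings are treated as missing
        & df["publication_date"].notna()          # unknown publication date
        & df["num_pages"].between(1, max_pages)   # implausible page counts
        & (df["ratings_count"] <= max_ratings)    # extreme rating counts
    )
    df = df[keep].reset_index(drop=True)
    # One independent encoder per categorical column.
    for col in ["authors", "publisher"]:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df

clean = preprocess(books)
print(clean)
```

Label encoding assigns arbitrary integers, which tree-based models tolerate well; linear models may read the integers as an ordering that does not exist.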

Feature Selection

Exploratory data analysis revealed limited correlations between the features and the target variable. The highest observed correlation was just 0.17, indicating that feature selection and engineering were necessary. Some redundant features were dropped to improve model efficiency.
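A correlation check of this kind can be sketched as follows. The data here is synthetic and the column names are assumed; the snippet only shows how features might be ranked by absolute Pearson correlation with the target, not the project's actual figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features; the real dataset's columns may be named differently.
df = pd.DataFrame({
    "num_pages": rng.integers(50, 1000, n),
    "ratings_count": rng.integers(1, 10000, n),
})
# Target with only a weak dependence on one feature, plus noise.
df["average_rating"] = 3.9 + 0.0002 * (df["num_pages"] - 500) + rng.normal(0, 0.3, n)

# Pearson correlation of every feature with the target, strongest first.
corr = df.corr()["average_rating"].drop("average_rating")
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a non-linear signal that a tree-based model could still exploit.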

Data Distribution and Imbalance

The distribution of average ratings was roughly Gaussian, with most books concentrated between ratings 3 and 4. This imbalance could affect model performance by biasing predictions toward the densely populated range at the expense of rarer low and high ratings.

Data Augmentation with SMOGN

To address the imbalance issue, the Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise (SMOGN) was employed to augment the under-represented rating ranges. This aimed to improve how well the algorithms learn from the less represented values.
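The idea behind SMOGN can be illustrated with a heavily simplified sketch. This is not the `smogn` package and not necessarily how the notebook applies it: the function name `oversample_rare`, the fixed `low`/`high` bounds that define "rare" targets, and the `noise` scale are all hypothetical. Each synthetic row is either an interpolation between a rare row and its nearest rare neighbour (when the two are close) or a copy of the rare row perturbed with Gaussian noise (when they are far apart).

```python
import numpy as np

def oversample_rare(X, y, low, high, n_new, noise=0.02, seed=0):
    """Simplified SMOGN-style oversampling for a regression target.

    Rows whose target lies outside [low, high] are treated as rare.
    """
    rng = np.random.default_rng(seed)
    rare = np.flatnonzero((y < low) | (y > high))
    if len(rare) < 2:
        raise ValueError("need at least two rare rows")
    Xr, yr = X[rare], y[rare]

    # Pairwise distances between rare rows; ignore self-distances.
    dist = np.linalg.norm(Xr[:, None, :] - Xr[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    # Neighbours closer than the median nearest-neighbour distance count as "close".
    cutoff = np.median(dist[np.arange(len(rare)), nearest])

    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.integers(len(rare))
        j = nearest[i]
        if dist[i, j] <= cutoff:
            # Interpolate features and target between the two rare rows.
            t = rng.random()
            X_new.append(Xr[i] + t * (Xr[j] - Xr[i]))
            y_new.append(yr[i] + t * (yr[j] - yr[i]))
        else:
            # Neighbour is too far away: jitter the row with Gaussian noise.
            X_new.append(Xr[i] + rng.normal(0, noise * Xr.std(axis=0)))
            y_new.append(yr[i] + rng.normal(0, noise * yr.std()))
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Synthetic ratings clustered around 3.9, mimicking the imbalance described above.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.clip(rng.normal(3.9, 0.35, 200), 0, 5)
X_aug, y_aug = oversample_rare(X, y, low=3.5, high=4.3, n_new=50)
print(X.shape, "->", X_aug.shape)
```

Synthetic rows should be added to the training split only; oversampling before the train/test split leaks information into the evaluation.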

Regression and Classification Approaches

Both regression and classification approaches were explored. For regression, metrics like Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and R-squared were used. The classification task involved binning ratings into categories like bad, good, and excellent.
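As a minimal sketch of both setups, the snippet below computes the three regression metrics with scikit-learn and bins ratings into categories with `pandas.cut`. The toy values and the bin edges (3.5 and 4.2) are assumptions chosen for illustration; the source does not state which thresholds the project used.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Toy true and predicted ratings (made up for the example).
y_true = np.array([3.8, 4.1, 3.5, 4.4, 3.9])
y_pred = np.array([3.9, 4.0, 3.7, 4.2, 3.9])

# Regression view: the three metrics named in the text.
mse = mean_squared_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # a fraction, not a percentage
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f}  MAPE={mape:.3f}  R2={r2:.3f}")

# Classification view: turn the continuous rating into ordered categories.
# The bin edges below are hypothetical.
labels = pd.cut(
    y_true,
    bins=[0, 3.5, 4.2, 5],
    labels=["bad", "good", "excellent"],
    include_lowest=True,
)
print(list(labels))
```

Intervals are right-inclusive by default, so a rating of exactly 3.5 falls in the lowest bin here.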

Model Evaluation

Despite the low correlation observed between features and the target variable, models were built and evaluated. The results indicated moderate performance. Regression models showed low MSE and MAPE due to the limited rating range (0 to 5), but R-squared was notably low. Classification achieved better results, with an accuracy score of 0.603 using the Random Forest Classifier.
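The combination of low MSE and MAPE with a low R-squared can be reproduced with a trivial baseline. The sketch below uses synthetic ratings (the mean of 3.9 and spread of 0.35 are assumptions, not values taken from the dataset) and predicts the mean rating for every book: the error metrics look small because the ratings sit in a narrow band, while R-squared is zero because the baseline explains none of the variance.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

rng = np.random.default_rng(0)
# Synthetic ratings concentrated in a narrow band.
y = np.clip(rng.normal(3.9, 0.35, 1000), 1, 5)

# Baseline "model": always predict the mean rating.
baseline = np.full_like(y, y.mean())

mse = mean_squared_error(y, baseline)
mape = mean_absolute_percentage_error(y, baseline)
r2 = r2_score(y, baseline)
print(f"MSE={mse:.3f}  MAPE={mape:.3f}  R2={r2:.3f}")
```

Comparing any regression model against this baseline gives a clearer picture than MSE or MAPE alone. The same applies to the classification accuracy, which is best read against the share of the most frequent class.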

Conclusion

In this project, we addressed the challenge of predicting book ratings by approaching it as both a regression and classification problem. Although the feature-target correlation was limited, the models achieved moderate results, with classification outperforming regression. The weak correlations suggest that incorporating additional features like book price and sales could lead to better predictions.

Future Steps

To enhance prediction accuracy, several avenues can be explored:

Incorporate additional features like book price, sales data, and user reviews.
Experiment with different regression and classification algorithms.
Utilize more advanced techniques for feature selection and engineering.
Investigate ensemble methods to combine predictions from multiple models.
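As one possible starting point for the algorithm comparison and ensemble ideas listed above, the sketch below cross-validates two classifiers and a soft-voting ensemble of both. The data comes from scikit-learn's `make_classification` rather than the book dataset, and the choice of models and hyperparameters is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem standing in for the binned ratings.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
}
# Soft voting averages the predicted class probabilities of the base models.
models["voting"] = VotingClassifier(estimators=list(models.items()), voting="soft")

# Mean 3-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation gives a more stable comparison than a single train/test split, which matters when the differences between models are small.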

GitHub Repository

The code and materials related to this project can be found in the GitHub repository. Feel free to explore the code, datasets, and results.

For any questions or collaborations, please reach out to:

Christy Lazar: GitHub Profile
Shawna Roseaulin: GitHub Profile
Neelam Patkar: GitHub Profile
Chin Vergara: GitHub Profile


lazarchris/Book_Rating_Prediction
