Skip to content

Models to classify customer sentiment given a product review, which was trained on Amazon Reviews using NLTK and Scikit.

Notifications You must be signed in to change notification settings

Sanavesa/Sentiment-Analysis-for-Amazon-Reviews

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Sentiment Analysis for Amazon Reviews

The goal of this project was to train a few models to classify customer sentiment given a product review, which was trained on Amazon Reviews here. This dataset is extremely huge, which helped out in the training phase.

Data Pre-processing

As is known, models have difficulty learning meaningful patterns from raw, unprocessed data. Thus, I cleaned the data and ensured the data was balanced so that the learning has no problems. The following steps were taken to clean the data:

  • Lowercase all the reviews
  • Remove all HTML tags and URLs
  • Remove non-English alphanumeric characters and extra spaces
  • Expand contractions (won't became will not)
  • Remove stop words
  • Perform lemmatization

The average character length of the reviews were reduced from 309 to 183 characters (down by 40%).

Next, I randomly sampled 100k instances of each class (positive and negative sentiment) to be the final processed dataset. By doing this, the model will be fitted on a high-quality dataset that is well-balanced.

Feature Extraction

So far the reviews are still in their textual format. By utilizing Sklearn, I was able to extract features from the reviews using TF-IDF, which is a value that is intended to reflect how important a word is to a dataset. This step converted each review from its textual format into an array of floats.

Training

A total of 4 models were trained using Sklearn: Perceptron, Linear SVM, Logistic Regression, and Multinomial Naive Bayes. The following are the prediction results against the testing set after training the models:

Model Precision Recall Accuracy F1 score
Perceptron 85% 89% 85% 84
Multinomial Naive Bayes 87% 87% 87% 87
Linear SVC 89% 90% 89% 89
Logistic Regression 90% 90% 90% 90

About

Models to classify customer sentiment given a product review, which was trained on Amazon Reviews using NLTK and Scikit.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published