SentimentAnalyzer on Amazon Reviews Dataset #191

chinglamchoi · 2020-01-25T11:39:59Z

I used the Sentiment Analyzer model to perform binary classification on the Amazon Reviews Dataset. Before training, I performed the following pre-processing steps:

truncate input at 500 chars
strip stopwords
strip corrupt utf8 chars (iso-8859-1 chars)
stemming to root words
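For readers who want to reproduce the pre-processing, here is a minimal Python sketch of the four steps above. It is not the code used for the reported results: the stopword list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), the choice to drop all non-ASCII characters, and the order of the steps after truncation are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "and", "or", "of", "to", "it", "this", "i"}

# Suffixes tried in order by the toy stemmer below (assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(review, max_chars=500):
    """Truncate, drop non-ASCII characters, remove stopwords, and stem."""
    text = review[:max_chars]  # truncate input at 500 chars
    # Assumption: "corrupt" characters are approximated as anything outside ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("This is a GREAT product, I loved it \u2013 amazing!")
print(tokens)
```

Truncating first keeps the later steps cheap on long reviews; a production pipeline would likely swap `crude_stem` for an established stemmer.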

The following are inference results:
Accuracy: 49.64325%
Precision: 0.497469903015904
Recall: 0.701445
F1 Score: 0.5821059947510917
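As a quick consistency check on the numbers above, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below only uses the values quoted in this issue; the function name `f1_score` is my own and is not taken from TextAnalysis.jl or sklearn.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the inference results reported above.
reported_precision = 0.497469903015904
reported_recall = 0.701445
reported_f1 = 0.5821059947510917

computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.6f}, reported F1 = {reported_f1:.6f}")
```

The recomputed value agrees with the reported F1 to within rounding of the recall figure, so the three metrics are at least internally consistent.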

I also compared the accuracy of TextAnalysis' model (pretrained on the IMDB dataset) with that of a logistic model in sklearn, trained on 12000 reviews from the Amazon Reviews training set. The sklearn model scored 46.47175 in accuracy.

To improve Sentiment Analyzer's accuracy, I think part-of-speech tagging could be implemented. However, it is currently very time-consuming: pre-processing 10000 reviews took up to 24 hours (the entire test set has 400000 samples), which made it infeasible to test within Google Code-in!

aviks · 2020-11-02T15:10:13Z

@chinglamchoi do you have the code for this exercise available somewhere?

chinglamchoi · 2020-11-03T10:27:56Z

Yes here it is: https://github.com/chinglamchoi/GCI_With_Julia/tree/master/Machine_Learning/sentiment_analysis

Due to GCI's time limits, I believe the accuracy values reported in the issue compare sklearn and TextAnalysis.jl at 4000 and <4000 epochs respectively (there was not enough time to train further). Sklearn achieved accuracies higher than 0.4647 with >4000 training samples.
