Skip to content

Commit

Permalink
Upload readme
Browse files Browse the repository at this point in the history
  • Loading branch information
jacquessham committed Jun 9, 2019
1 parent 0b13f8d commit 68da152
Show file tree
Hide file tree
Showing 3 changed files with 32 additions and 0 deletions.
32 changes: 32 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Japanese Whisky Reviews

There is a Japanese Whisky Review data set available in Kaggle, that the data set is originated from Master of Malt. I am interested in doing some NLP works on this data set. <br>
I will be making some analysis on the sentiment of the reviews and try to summarize the individual review.
<br>

## Tools
In this project, I will be using packages like SKlearn, vaderSentiment, ntlk for sentiment scores and TF-IDF. <br>

## Data set
The data set could be found in <a href="https://www.kaggle.com/koki25ando/japanese-whisky-review">Kaggle</a>. It consists of 4 columns including, bottle label, brand name, title of the review and the review content. The data set only covers 4 Japanese whisky brands -- Yamazaki, Hibiki, Hakushu, and Nikka.

## Sentiment Analysis
The first task is to understand the sentiment scores across brands.
<br>
First, I have used vaderSentiment to calculate the sentiment score for each review. Then, I used Plotly to visualize the range of sentiment score of each brand with a boxplot. It looks like this. <br><br>
![Screenshot](sentiment_score_boxplot.png)
<br>
From the boxplot, we can learn that reviewers in general have a positive view on the Japanese whiskies, while they have better impression on Nikka and Hibiki. Interestingly, the median sentiment score on Yamazaki is 0, which means neutral.
<br>
You may find the code <a href="jpwhisky_review_sentiment.py">here</a>

## TF-IDF
The second task is to build a model that shows the summary by displaying the top 5 key words in the review. To do this, I use TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the texts, I used the same package to remove English stop words and nltk to stem the words.
<br>
There are 2 files of code for this task: <a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and <a href="driver.py">driver.py</a> will provoke the implementation and display result.
<br><br>
I don't like the first version of the displaying system because it shows the stemmed word as summary, like this:<br><br>
![Screenshot](display_before.png)
<br><br>
I add a feature that once the model produce the result for a review post, driver.py will grab the original post and pair the word from original post with the stemmed word in a dictionary, then it will display the word from original post, like this:<br><br>
![Screenshot](display_after.png)
Binary file added display_after.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added display_before.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 68da152

Please sign in to comment.