Merge pull request #2 from jacquessham/v2_1_0
Upgrade to v2.1.0

Showing 28 changed files with 1,639 additions and 44 deletions.
@@ -3,7 +3,6 @@
# Repository-specific
*.html
*.csv

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -1,32 +1,7 @@
# (Deprecated) Japanese Whisky Reviews - Version 1.0
# Archive

There is a Japanese Whisky Review data set available on Kaggle, originally sourced from Master of Malt. I am interested in doing some NLP work on this data set. <br>
I will be doing some analysis on the sentiment of the reviews and trying to summarize each individual review.
<br>
## Version 1
Released in 2019. You may find the scripts in the [Version 1](/v1_0_0) folder.

## Tools
In this project, I will be using packages such as scikit-learn, vaderSentiment, and nltk for sentiment scores and TF-IDF. <br>

## Data set
The data set can be found on <a href="https://www.kaggle.com/koki25ando/japanese-whisky-review">Kaggle</a>. It consists of 4 columns: bottle label, brand name, review title, and review content. The data set covers only 4 Japanese whisky brands -- Yamazaki, Hibiki, Hakushu, and Nikka.

## Sentiment Analysis
The first task is to understand the sentiment scores across brands.
<br>
First, I used vaderSentiment to calculate the sentiment score for each review. Then, I used Plotly to visualize the range of sentiment scores for each brand with a boxplot. It looks like this. <br><br>
![Screenshot](sentiment_score_boxplot.png)
<br>
From the boxplot, we can see that reviewers in general have a positive view of Japanese whiskies, with Nikka and Hibiki leaving the best impressions. Interestingly, the median sentiment score for Yamazaki is 0, which means neutral.
<br>
You may find the code <a href="jpwhisky_review_sentiment.py">here</a>.

## TF-IDF
The second task is to build a model that summarizes a review by displaying its top 5 keywords. To do this, I use TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the texts, I used the same package to remove English stop words and nltk to stem the words.
<br>
There are 2 code files for this task: <a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and <a href="driver.py">driver.py</a> invokes the implementation and displays the result.
<br><br>
I did not like the first version of the display because it shows the stemmed word as the summary, like this:<br><br>
![Screenshot](display_before.png)
<br><br>
I added a feature so that once the model produces the result for a review post, driver.py grabs the original post, pairs each word from the original post with its stemmed form in a dictionary, and then displays the word from the original post, like this:<br><br>
![Screenshot](display_after.png)
## Version 2.0.0
Released on 14 July 2023. This version upgrades the project with a dashboard powered by Plotly Dash. You may find the scripts in the [Version 2.0](v2_0_0) folder.
@@ -0,0 +1,32 @@
# (Deprecated) Japanese Whisky Reviews - Version 1.0

There is a Japanese Whisky Review data set available on Kaggle, originally sourced from Master of Malt. I am interested in doing some NLP work on this data set. <br>
I will be doing some analysis on the sentiment of the reviews and trying to summarize each individual review.
<br>

## Tools
In this project, I will be using packages such as scikit-learn, vaderSentiment, and nltk for sentiment scores and TF-IDF. <br>

## Data set
The data set can be found on <a href="https://www.kaggle.com/koki25ando/japanese-whisky-review">Kaggle</a>. It consists of 4 columns: bottle label, brand name, review title, and review content. The data set covers only 4 Japanese whisky brands -- Yamazaki, Hibiki, Hakushu, and Nikka.

## Sentiment Analysis
The first task is to understand the sentiment scores across brands.
<br>
First, I used vaderSentiment to calculate the sentiment score for each review. Then, I used Plotly to visualize the range of sentiment scores for each brand with a boxplot. It looks like this. <br><br>
![Screenshot](sentiment_score_boxplot.png)
<br>
From the boxplot, we can see that reviewers in general have a positive view of Japanese whiskies, with Nikka and Hibiki leaving the best impressions. Interestingly, the median sentiment score for Yamazaki is 0, which means neutral.
<br>
You may find the code <a href="jpwhisky_review_sentiment.py">here</a>.
## TF-IDF
The second task is to build a model that summarizes a review by displaying its top 5 keywords. To do this, I use TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the texts, I used the same package to remove English stop words and nltk to stem the words.
<br>
There are 2 code files for this task: <a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and <a href="driver.py">driver.py</a> invokes the implementation and displays the result.
<br><br>
I did not like the first version of the display because it shows the stemmed word as the summary, like this:<br><br>
![Screenshot](display_before.png)
<br><br>
I added a feature so that once the model produces the result for a review post, driver.py grabs the original post, pairs each word from the original post with its stemmed form in a dictionary, and then displays the word from the original post, like this:<br><br>
![Screenshot](display_after.png)
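The stem-to-original pairing described above can be sketched as follows. `toy_stem` is a hypothetical stand-in for nltk's PorterStemmer so the sketch is self-contained, and the review text is made up.

```python
# Pair each stemmed form with the original word(s) that produced it,
# so the display can show the original word instead of the stem.
# toy_stem is a hypothetical stand-in for nltk's PorterStemmer.
def toy_stem(word):
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def pair_words(comment):
    word_dict = {}
    for word in comment.lower().split():
        word_dict.setdefault(toy_stem(word), []).append(word)
    return word_dict

pairs = pair_words("smooth finish lingering notes")
# e.g. the stem 'linger' maps back to the original word 'lingering'
```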
6 files renamed without changes.
@@ -0,0 +1,54 @@
# Japanese Whisky Reviews

There is a Japanese Whisky Review data set available on Kaggle, originally sourced from Master of Malt. I am interested in doing some NLP work on this data set. <br><br>
I will be doing some analysis on the sentiment of the reviews and trying to summarize each individual review.

## Tools
In this project, I will be using packages such as scikit-learn, vaderSentiment, and nltk for sentiment scores and TF-IDF. Then, we will display the results on a dashboard via Plotly Dash.

## Data set
The data set can be found on <a href="https://www.kaggle.com/koki25ando/japanese-whisky-review">Kaggle</a>. It consists of 4 columns: bottle label, brand name, review title, and review content. The data set covers only 4 Japanese whisky brands -- Yamazaki, Hibiki, Hakushu, and Nikka.

## Dashboard
The dashboard consists of two parts: <b>Sentiment Analysis</b> and <b>TF-IDF Analysis</b> (the core meaning of a posted comment). The Sentiment Analysis section shows a static box plot of the sentiment score distribution by whisky brand. The bottom has four tabs, one per whisky brand. You may click on a whisky, and the dashboard will randomly pick a comment and display its core meaning.
<br><br>
The dashboard looks like this:

<img src=jp_whisky_dashboard.png>

### How to Run the Dashboard?
After installing all the dependencies, simply run:

```
python viz.py
```

Once the dashboard is ready, you may access it at <b>127.0.0.1:9000</b>.
## Technical Explanation
### Sentiment Analysis
We use vaderSentiment to calculate the sentiment score for each review. Then, Plotly visualizes the range of sentiment scores for each brand with a boxplot and renders it on the dashboard. It looks like this. <br><br>
<img src=jp_whisky_boxplot.png>
<br>
From the boxplot, we can see that reviewers in general have a positive view of Japanese whiskies, with Nikka and Hibiki leaving the best impressions. Interestingly, the median sentiment score for Yamazaki is 0, which means neutral.
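What the boxplot summarizes is a per-brand distribution of compound scores, whose medians drive the reading above. A minimal stdlib sketch, using invented toy scores rather than real review data:

```python
from statistics import median

# Toy per-brand compound scores, invented for illustration only
scores_by_brand = {
    'Nikka':    [0.6, 0.8, 0.7, 0.5],
    'Hibiki':   [0.5, 0.7, 0.6, 0.4],
    'Yamazaki': [-0.2, 0.0, 0.0, 0.3],
}
# The median line in each box: a median of 0 reads as neutral overall
medians = {brand: median(s) for brand, s in scores_by_brand.items()}
```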
### TF-IDF Analysis
The second task is to build a model that summarizes a review by displaying its top 5 keywords. The script uses TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the texts, I used the same package to remove English stop words and nltk to stem the words.
<br>
<a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and the dashboard script <a href="viz.py">viz.py</a> invokes the implementation and displays the result.
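The idea behind the keyword summary can be illustrated with a toy TF-IDF computed by hand: score a term by its frequency within one review, discounted by how many reviews contain it. This is a simplified formula for intuition, not sklearn's exact smoothed variant, and the reviews are invented.

```python
import math

# Invented toy reviews, pre-tokenized for simplicity
reviews = ["smooth sweet finish", "smooth peaty smoke", "sweet fruity nose"]
docs = [r.split() for r in reviews]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency in this review
    df = sum(1 for d in docs if term in d)   # number of reviews with the term
    return tf * math.log(len(docs) / df)     # rarer terms score higher

# Rank the first review's terms: 'finish' appears in one review only,
# while 'smooth' and 'sweet' each appear in two, so 'finish' wins.
ranked = sorted(docs[0], key=lambda t: tfidf(t, docs[0], docs), reverse=True)
```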
## Files
Here are the files to run the dashboard:

### viz.py
The driver file that constructs the dashboard and wires up the backend.

### jpwhisky_review_sentiment.py
The helper script that calculates sentiment scores in the backend.

### jpwhisky_review_tfidf.py
The helper script that calculates TF-IDF scores in the backend.

### viz_helper Folder
The framework to render a Plotly visualization; the blueprint comes from the <a href="https://github.com/jacquessham/DashExamples/tree/master/PlotlyTemplateFramework">DashExamples Repository</a> with some modifications.
@@ -0,0 +1,13 @@
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


# Calculate the sentiment score of one text element
def get_score(x):
    judge = SentimentIntensityAnalyzer()
    return judge.polarity_scores(x)['compound']

# Calculate sentiment scores for every row of a pandas dataframe
def calculate_sentiment_scores(df):
    df['score'] = df.apply(lambda x: get_score(x['Review_Content']), axis=1)
    return df
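The row-by-row `apply` pattern used above can be exercised like this. `fake_score` is a hypothetical stand-in for `get_score` so the sketch runs without vaderSentiment, and the reviews are invented.

```python
import pandas as pd

# Hypothetical stand-in for get_score (avoids the vaderSentiment dependency)
def fake_score(text):
    return 0.5 if 'smooth' in text else 0.0

df = pd.DataFrame({'Review_Content': ['very smooth', 'quite harsh']})
# Same shape as calculate_sentiment_scores: one score per review row
df['score'] = df.apply(lambda x: fake_score(x['Review_Content']), axis=1)
```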
@@ -0,0 +1,64 @@
import re
import string
import pandas as pd
import numpy as np
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction import _stop_words
from sklearn.feature_extraction.text import TfidfVectorizer


# Preprocess the comment: remove punctuation and digits, split into a list
def clean_words(post):
    regex = re.compile('[' + re.escape(string.punctuation) + ' 0-9\\r\\t\\n]')
    sentence = regex.sub(' ', post.lower())
    return sentence.split(' ')

# Define tokenizer for the Tfidf model
def tokenizer(post):
    temp_list = clean_words(post)
    stemmer = PorterStemmer()
    temp_list = [stemmer.stem(word) for word in temp_list if
                 word != '' and word not in _stop_words.ENGLISH_STOP_WORDS]
    return temp_list

# Read the file and extract the comments as a list
def read_reviews(filename, colname):
    jp_whisky = pd.read_csv(filename, encoding='ISO-8859-1')
    comments_list = jp_whisky[colname].tolist()
    return jp_whisky, comments_list

# Declare, fit, and apply the model
def tfidf_fit_trans(comments_list):
    tfidf = TfidfVectorizer(input='content',
                            analyzer='word',
                            tokenizer=tokenizer,
                            stop_words='english',
                            decode_error='ignore')
    # Fit and transform the model
    scorer = tfidf.fit(comments_list)
    result = scorer.transform(comments_list)
    # Collect the vocabulary and the score matrix
    # (on scikit-learn < 1.0, use get_feature_names() instead)
    features = np.array(scorer.get_feature_names_out())
    scores = result.toarray()
    return features, scores

# Get the top_n features and scores for one comment
def get_results(features, scores, article_index, top_n):
    index = np.argsort(scores[article_index])[::-1]
    return features[index[:top_n]], scores[article_index, index[:top_n]]
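The top-n selection in `get_results` works by sorting ascending with `argsort`, reversing, then slicing. A small sketch with a toy score matrix (rows are reviews, columns are features; all values invented):

```python
import numpy as np

features = np.array(['finish', 'peat', 'smooth', 'sweet'])
scores = np.array([[0.1, 0.0, 0.7, 0.3],
                   [0.0, 0.9, 0.2, 0.0]])
article_index, top_n = 0, 2

# argsort gives ascending order; [::-1] flips it to descending
index = np.argsort(scores[article_index])[::-1]
top_features = features[index[:top_n]]
top_scores = scores[article_index, index[:top_n]]
```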
# Designed for translating 1 stemmed word to its words, not a dictionary of all words
# This is not working too well
def get_words(comment):
    temp_list = clean_words(comment)
    stemmer = PorterStemmer()
    word_dict = {}
    for word in temp_list:
        word_stemmed = stemmer.stem(word)
        if word != '' and word not in _stop_words.ENGLISH_STOP_WORDS \
                and word_stemmed not in word_dict:
            word_dict[word_stemmed] = [word]
        elif word != '' and word not in _stop_words.ENGLISH_STOP_WORDS \
                and word_stemmed in word_dict:
            # list.append returns None, so append in place
            # instead of assigning its return value
            word_dict[word_stemmed].append(word)
    return word_dict
@@ -0,0 +1,101 @@
import json
from random import choice
import pandas as pd
import dash
# On Dash < 2.0: import dash_core_components as dcc
# and import dash_html_components as html
from dash import dcc, html
from dash.dependencies import Input, Output, State
from jpwhisky_review_tfidf import *
from jpwhisky_review_sentiment import *
from viz_helper.generate_plotly import *
from viz_helper.table import *


##### Data and Predictions #####
# These are hardcoded because the dataset is always static
filename = 'japanese_whisky_review.csv'
colname = 'Review_Content'
whiskies = ['Hibiki', 'Yamazaki', 'Hakushu', 'Nikka']
df, comments_list = read_reviews(filename, colname)
df.columns = ['index', 'Bottle_name', 'Brand', 'Title', 'Review_Content']
df['index'] = df['index'].astype(int)

# Calculate sentiment scores
df = calculate_sentiment_scores(df)

# Calculate TF-IDF scores
features, scores = tfidf_fit_trans(comments_list)

# Prepare boxplot for sentiment scores
with open('viz_helper/arguements.json') as f:
    args = json.load(f)
fig = generate_plotly_viz(
    df, args['metadata'], args['viz_type'], args['viz_name'])

##### Dashboard layout #####
# Dash set up
app = dash.Dash()

# Base layout
app.layout = html.Div([
    html.Div([html.H1('Japanese Whisky Review Analysis')],
             style={'width': '90%', 'margin': 'auto', 'text-align': 'center'}
             ),  # Headline
    html.Div(
        dcc.Graph(
            id='sentiment_boxplot',
            figure=fig
        )
    ),  # Boxplot
    html.Div([html.H3('Please choose from the following whiskies:')],
             style={'width': '90%', 'margin': 'auto', 'text-align': 'center'}
             ),
    dcc.Tabs(id='whisky-tabs', value=whiskies[0], children=[
        dcc.Tab(
            label=whiskies[0],
            value=whiskies[0]
        ),  # Tab 1, end whisky1-tab
        dcc.Tab(
            label=whiskies[1],
            value=whiskies[1]
        ),  # Tab 2, end whisky2-tab
        dcc.Tab(
            label=whiskies[2],
            value=whiskies[2]
        ),  # Tab 3, end whisky3-tab
        dcc.Tab(
            label=whiskies[3],
            value=whiskies[3]
        )  # Tab 4, end whisky4-tab
    ]),  # End Tabs
    html.Div(id='whisky-table')
])  # End base Div

##### Dashboard callback function #####
# For Tab1 only for now
@app.callback(Output('whisky-table', 'children'),
              [Input('whisky-tabs', 'value')])
def render_table(tab):
    # Filter to the selected brand
    df_temp = df[df['Brand'] == tab]
    article_index = choice(df_temp['index'].tolist())
    top_n = 5
    # Obtain scores for a randomly selected comment
    result_features, result_scores = get_results(features, scores,
                                                 article_index, top_n)

    sub_headline = 'This post is about ' + \
        df_temp[df_temp['index'] == article_index].iloc[0]['Brand'] + "'s " + \
        df_temp[df_temp['index'] == article_index].iloc[0]['Bottle_name']

    content = html.Div([
        html.H3(sub_headline),
        generate_table_html(result_features, result_scores)
    ],
        style={'width': '90%', 'margin': 'auto', 'text-align': 'center'})
    return content


if __name__ == '__main__':
    app.run_server(debug=True, host='0.0.0.0', port=9000)
@@ -0,0 +1,9 @@
{
    "df_directory": "../japanese_whisky_review.csv",
    "viz_type": "boxplot",
    "viz_name": "Review Sentiment Score Distribution by Whisky",
    "metadata": {
        "x": "Brand",
        "y": "score"
    }
}
@@ -0,0 +1,40 @@
# Universal viz arg
def check_text(metadata):
    if 'text' in metadata:
        return metadata['text']
    return None

# Universal viz arg
def check_textposition(metadata):
    if 'textposition' in metadata:
        return metadata['textposition']
    elif 'text_position' in metadata:
        return metadata['text_position']
    return None

# Universal viz arg
def check_textfont(metadata):
    if 'textfont' in metadata:
        return metadata['textfont']
    if 'text_font' in metadata:
        return metadata['text_font']
    return None

# Bar, boxplot arg
def check_barcolour(metadata):
    # Accept British and American spellings, with or without an underscore
    if 'barcolour' in metadata:
        return metadata['barcolour']
    elif 'bar_colour' in metadata:
        return metadata['bar_colour']
    elif 'barcolor' in metadata:
        return metadata['barcolor']
    elif 'bar_color' in metadata:
        return metadata['bar_color']
    return None

# Boxplot arg
def check_boxmean(metadata):
    if 'boxmean' in metadata:
        return metadata['boxmean']
    return None
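The alias-lookup pattern above can be exercised as follows; the helper is redefined here (as a compact loop equivalent to the if/elif chain, preserving the same precedence) so the sketch is self-contained.

```python
# Accept several spellings of the same option and fall back to None;
# earlier keys in the tuple win, matching the original if/elif order.
def check_barcolour(metadata):
    for key in ('barcolour', 'bar_colour', 'barcolor', 'bar_color'):
        if key in metadata:
            return metadata[key]
    return None

found = check_barcolour({'bar_color': 'navy'})
missing = check_barcolour({'x': 'Brand'})
```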