Merge pull request #1 from jacquessham/v2
Release v2.0
Showing 20 changed files with 620 additions and 28 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
# macOS related | |
.DS_Store | ||
|
||
# Repository-specific | ||
*.html | ||
*.csv | ||
|
||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
cover/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
.pybuilder/ | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# pyenv | ||
# For a library or package, you might want to ignore these files since the code is | ||
# intended to run in multiple environments; otherwise, check them in: | ||
# .python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# poetry | ||
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. | ||
# This is especially recommended for binary packages to ensure reproducibility, and is more | ||
# commonly ignored for libraries. | ||
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control | ||
#poetry.lock | ||
|
||
# pdm | ||
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. | ||
#pdm.lock | ||
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it | ||
# in version control. | ||
# https://pdm.fming.dev/#use-with-ide | ||
.pdm.toml | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# pytype static type analyzer | ||
.pytype/ | ||
|
||
# Cython debug symbols | ||
cython_debug/ | ||
|
||
# PyCharm | ||
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can | ||
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore | ||
# and can be added to the global gitignore or merged into this file. For a more nuclear | ||
# option (not recommended) you can uncomment the following to ignore the entire idea folder. | ||
#.idea/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# (Deprecated) Japanese Whisky Reviews - Version 1.0 | ||
|
||
A Japanese Whisky Review data set is available on Kaggle; the reviews originate from Master of Malt. I am interested in doing some NLP work on this data set. <br> | |
I will analyze the sentiment of the reviews and try to summarize each individual review. | |
<br> | ||
|
||
## Tools | ||
In this project, I will be using packages such as scikit-learn, vaderSentiment, and nltk for sentiment scores and TF-IDF. <br> | |
|
||
## Data set | ||
The data set can be found on <a href="https://www.kaggle.com/koki25ando/japanese-whisky-review">Kaggle</a>. It consists of 4 columns: bottle label, brand name, title of the review, and the review content. The data set only covers 4 Japanese whisky brands -- Yamazaki, Hibiki, Hakushu, and Nikka. | |
|
||
## Sentiment Analysis | ||
The first task is to understand the sentiment scores across brands. | ||
<br> | ||
First, I used vaderSentiment to calculate the sentiment score for each review. Then, I used Plotly to visualize the distribution of sentiment scores for each brand with a boxplot. It looks like this: <br><br> | |
![Screenshot](sentiment_score_boxplot.png) | ||
<br> | ||
From the boxplot, we can see that reviewers in general have a positive view of the Japanese whiskies, with a better impression of Nikka and Hibiki. Interestingly, the median sentiment score for Yamazaki is 0, which means neutral. | |
<br> | |
You may find the code <a href="jpwhisky_review_sentiment.py">here</a>. | |
|
||
## TF-IDF | ||
The second task is to build a model that summarizes a review by displaying its top 5 keywords. To do this, I use TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the text, I use scikit-learn's English stop-word list to remove stop words and nltk to stem the words. | |
<br> | |
There are 2 files of code for this task: <a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and <a href="driver.py">driver.py</a> invokes the implementation and displays the result. | |
<br><br> | |
I don't like the first version of the display because it shows the stemmed words as the summary, like this:<br><br> | |
![Screenshot](display_before.png) | |
<br><br> | |
I added a feature: once the model produces the result for a review post, driver.py grabs the original post, pairs each word from the original post with its stemmed form in a dictionary, and then displays the word from the original post, like this:<br><br> | |
![Screenshot](display_after.png) |
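
The stem-to-original pairing described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in driver.py: `toy_stem`, `stem_to_originals`, and `display_keywords` are hypothetical names, and `toy_stem` is a crude stand-in for nltk's PorterStemmer.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_to_originals(post, stem=toy_stem):
    """Map each stem to the distinct original words in the post that produce it."""
    mapping = defaultdict(list)
    for word in re.findall(r"[a-z]+", post.lower()):
        originals = mapping[stem(word)]
        if word not in originals:
            originals.append(word)
    return dict(mapping)


def display_keywords(stemmed_keywords, post):
    """Swap each stemmed keyword for the first original word seen for that stem."""
    mapping = stem_to_originals(post)
    # Fall back to the stem itself if the post has no word for it
    return [mapping.get(stem, [stem])[0] for stem in stemmed_keywords]
```

Picking the first original word per stem is only one possible choice; the most frequent original word would be another reasonable option.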
File renamed without changes
File renamed without changes
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
import pandas as pd | ||
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer | ||
|
||
|
||
# Build the analyzer once; constructing it on every call reloads the lexicon | |
judge = SentimentIntensityAnalyzer() | |
|
||
|
||
def get_score(text): | |
    # Return the VADER compound score (between -1 and 1) for the given text | |
    return judge.polarity_scores(text)['compound'] | |
|
||
|
||
# Cols: 'Unnamed: 0','Bottle_name','Brand','Title','Review_Content' | |
jp_whisky = pd.read_csv('japanese_whisky_review.csv') | |
# Demonstrate how to get a score for one row (column index 3 is 'Title') | |
score = get_score(jp_whisky.iloc[1, 3]) | |
print(score) | |
|
||
# Calculate score for each row, using the same column (index 3) as above | |
jp_whisky['score'] = jp_whisky.iloc[:, 3].apply(get_score) | |
print(jp_whisky) | |
|
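
To read the compound scores this script produces, it can help to bucket them and compare medians per brand, which is what the box plot summarizes. The sketch below uses only the standard library; `label_sentiment` and `median_score_by_brand` are hypothetical helpers, the ±0.05 cut-offs are the convention suggested in the vaderSentiment documentation, and any rows passed in for illustration are not values from the data set.

```python
from collections import defaultdict
from statistics import median


def label_sentiment(compound, threshold=0.05):
    """Bucket a VADER compound score using the conventional +/-0.05 cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def median_score_by_brand(rows):
    """Group (brand, score) pairs by brand and return each brand's median score."""
    grouped = defaultdict(list)
    for brand, score in rows:
        grouped[brand].append(score)
    return {brand: median(scores) for brand, scores in grouped.items()}
```

A median of 0 therefore falls in the neutral bucket, which matches how the README reads the Yamazaki box.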
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
import re | |
import string | |
import pandas as pd | |
import numpy as np | |
from nltk.stem.porter import PorterStemmer | |
# sklearn.feature_extraction.stop_words was removed; the list now lives in .text | |
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer | |
|
||
|
||
# Preprocess the comment: replace punctuation, digits and whitespace with spaces, then split into a list | |
def clean_words(post): | |
    regex = re.compile('[' + re.escape(string.punctuation) + ' 0-9\\r\\t\\n]') | |
    sentence = regex.sub(' ', post.lower()) | |
    return sentence.split(' ') | |
|
||
# Define tokenizer for Tfidf model: drop empty strings and stop words, then stem | |
def tokenizer(post): | |
    temp_list = clean_words(post) | |
    stemmer = PorterStemmer() | |
    temp_list = [stemmer.stem(word) for word in temp_list if | |
                 word != '' and word not in ENGLISH_STOP_WORDS] | |
    return temp_list | |
|
||
# Read file and return the DataFrame together with the comments as a list | |
def read_reviews(filename, colname): | |
    jp_whisky = pd.read_csv(filename, encoding = 'ISO-8859-1') | |
    comments_list = jp_whisky[colname].tolist() | |
    return jp_whisky, comments_list | |
|
||
# Declare model | |
def tfidf_fit_trans(comments_list): | |
    tfidf = TfidfVectorizer(input='content', | |
                            analyzer='word', | |
                            tokenizer=tokenizer, | |
                            stop_words='english', | |
                            decode_error='ignore') | |
    # Fit and transform the model | |
    scorer = tfidf.fit(comments_list) | |
    result = scorer.transform(comments_list) | |
    # Return the vocabulary and the dense score matrix (one row per comment) | |
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() | |
    features = np.array(scorer.get_feature_names_out()) | |
    scores = result.toarray() | |
    return features, scores | |
|
||
# Return the top_n terms and their scores for one comment | |
def get_results(features, scores, article_index, top_n): | |
    index = np.argsort(scores[article_index])[::-1] | |
    return features[index[:top_n]], scores[article_index, index[:top_n]] | |
|
||
# Design for translating 1 stemword to 1 word, not a dictionary of all words | |
# Maps each stemmed word to the list of original words in the comment that produce it | |
def get_words(comment): | |
    temp_list = clean_words(comment) | |
    stemmer = PorterStemmer() | |
    word_dict = {} | |
    for word in temp_list: | |
        if word == '' or word in ENGLISH_STOP_WORDS: | |
            continue | |
        word_stemmed = stemmer.stem(word) | |
        # list.append() returns None, so append in place instead of reassigning | |
        word_dict.setdefault(word_stemmed, []).append(word) | |
    return word_dict | |
|
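
For readers who want to see the weighting behind the top-keyword ranking, here is a from-scratch sketch using only the standard library. It is not scikit-learn's implementation; it follows the smoothed formula documented for TfidfVectorizer's defaults, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The function name `tfidf_top_n` is hypothetical, and it assumes the documents are already tokenized and stemmed.

```python
import math
from collections import Counter


def tfidf_top_n(docs, top_n=5):
    """Return the top_n (term, score) pairs for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term count times smoothed idf
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # L2-normalize so scores are comparable across documents of different length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        # Sort by descending weight, breaking ties alphabetically
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([(term, w / norm) for term, w in ranked[:top_n]])
    return results
```

Terms that appear in every document get the lowest idf, so a word shared by all reviews ranks below a word that is specific to one review.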
File renamed without changes
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,32 +1,56 @@ | ||
# Japanese Whisky Reviews | ||
|
||
A Japanese Whisky Review data set is available on Kaggle; the reviews originate from Master of Malt. I am interested in doing some NLP work on this data set. <br> | |
A Japanese Whisky Review data set is available on Kaggle; the reviews originate from Master of Malt. I am interested in doing some NLP work on this data set. <br><br> | |
I will analyze the sentiment of the reviews and try to summarize each individual review. | |
<br> | ||
<br><br> | ||
<b>The current version is 2.0.</b> You may find the previous versions in the [Archive](/Archive) folder. | ||
|
||
## Tools | ||
In this project, I will be using packages such as scikit-learn, vaderSentiment, and nltk for sentiment scores and TF-IDF. <br> | |
In this project, I will be using packages such as scikit-learn, vaderSentiment, and nltk for sentiment scores and TF-IDF. Then, we will display the result on a dashboard via Plotly Dash. | |
|
||
## Data set | ||
The data set can be found on <a href="https://www.kaggle.com/koki25ando/japanese-whisky-review">Kaggle</a>. It consists of 4 columns: bottle label, brand name, title of the review, and the review content. The data set only covers 4 Japanese whisky brands -- Yamazaki, Hibiki, Hakushu, and Nikka. | |
|
||
## Sentiment Analysis | ||
The first task is to understand the sentiment scores across brands. | ||
<br> | ||
First, I used vaderSentiment to calculate the sentiment score for each review. Then, I used Plotly to visualize the distribution of sentiment scores for each brand with a boxplot. It looks like this: <br><br> | |
|
||
## Dashboard | ||
The dashboard consists of two parts: <b>Sentiment Analysis</b> and <b>TF-IDF Analysis</b> (the core meaning of a posted comment). The Sentiment Analysis part is a static box plot of the sentiment score distribution by whisky brand. The bottom part has 4 tabs, one for each whisky brand. When you click on a brand, the dashboard randomly picks a comment and displays its core meaning. | |
<br><br> | ||
The dashboard looks like this: | ||
|
||
<img src=jp_whisky_dashboard.png> | ||
|
||
### How to Run the Dashboard? | ||
After installing all the dependencies, simply execute: | |
|
||
``` | ||
python viz.py | ||
``` | ||
|
||
Once the dashboard is ready, you may access it at <b>127.0.0.1:9000</b> | ||
|
||
## Technical Explanation | ||
### Sentiment Analysis | ||
We use vaderSentiment to calculate the sentiment score for each review. Then, Plotly visualizes the distribution of sentiment scores for each brand with a boxplot and renders it on the dashboard. It looks like this: <br><br> | |
![Screenshot](sentiment_score_boxplot.png) | |
<br> | |
From the boxplot, we can see that reviewers in general have a positive view of the Japanese whiskies, with a better impression of Nikka and Hibiki. Interestingly, the median sentiment score for Yamazaki is 0, which means neutral. | |
<br> | |
You may find the code <a href="jpwhisky_review_sentiment.py">here</a>. | |
|
||
## TF-IDF | ||
The second task is to build a model that summarizes a review by displaying its top 5 keywords. To do this, I use TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the text, I use scikit-learn's English stop-word list to remove stop words and nltk to stem the words. | |
### TF-IDF Analysis | |
The second task is to build a model that summarizes a review by displaying its top 5 keywords. The script uses TfidfVectorizer from sklearn.feature_extraction.text to build the model. To preprocess the text, I use scikit-learn's English stop-word list to remove stop words and nltk to stem the words. | |
<br> | |
There are 2 files of code for this task: <a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and <a href="driver.py">driver.py</a> invokes the implementation and displays the result. | |
<br><br> | |
I don't like the first version of the display because it shows the stemmed words as the summary, like this:<br><br> | |
![Screenshot](display_before.png) | |
<br><br> | |
I added a feature: once the model produces the result for a review post, driver.py grabs the original post, pairs each word from the original post with its stemmed form in a dictionary, and then displays the word from the original post, like this:<br><br> | |
![Screenshot](display_after.png) | |
<a href="jpwhisky_review_tfidf.py">jpwhisky_review_tfidf.py</a> is the backend, and the dashboard script <a href="">viz.py</a> invokes the implementation and displays the result. | |
|
||
## Files | ||
Here are the files to run the dashboards: | ||
|
||
### viz.py | ||
The driver file to construct the dashboard and backend. | ||
|
||
### jpwhisky_review_sentiment.py | ||
The helper script to calculate sentiment scores in the backend. | ||
|
||
### jpwhisky_review_tfidf.py | |
The helper script to calculate TF-IDF scores in the backend. | ||
|
||
### viz_helper Folder | ||
The framework to render a Plotly visualization; the blueprint comes from the <a href="https://github.com/jacquessham/DashExamples/tree/master/PlotlyTemplateFramework">DashExamples Repository</a>, with some modifications. |
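
The tab behaviour described in the Dashboard section (click a brand, get a randomly chosen comment) comes down to a small selection step. viz.py is not shown in this diff, so the following is only a sketch of that step under assumed inputs: `pick_random_comment` is a hypothetical helper, and `reviews` is assumed to be a list of `(brand, comment)` pairs.

```python
import random


def pick_random_comment(reviews, brand, rng=random):
    """Return one randomly chosen comment for the brand, or None if it has none."""
    candidates = [text for review_brand, text in reviews if review_brand == brand]
    return rng.choice(candidates) if candidates else None
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when testing a dashboard callback.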