---
title: 'Text analysis workshop: Basic sentiment analysis'
author: "Casey O'Hara"
format:
  html:
    toc: true
    number-sections: true
    embed-resources: true
    code-fold: true
execute:
  warning: false
  message: false
---
```{r load packages}
library(tidyverse)
library(tidytext)
library(textdata)
library(pdftools)
library(ggwordcloud)
```
# Overview
Sentiment analysis is a fairly basic way to get a sense of the mood of a piece of text. In an eco-data-science sense, we can use sentiment analysis to understand perceptions of topics in environmental policy.
A good example is "Public Perceptions of Aquaculture: Evaluating Spatiotemporal Patterns of Sentiment around the World" by local celebrities Halley Froehlich, Becca Gentry, and Ben Halpern, in which they examine public perceptions of aquaculture by performing sentiment analyses on newspaper headlines from around the globe and on government-solicited public comments on aquaculture policy and development. This paper is included in the 'pdfs' folder on GitHub, or [available here.](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0169281)
Another popular use of sentiment analysis is to determine the mood of Twitter comments. One excellent example is an examination of Trump tweets, which noted that tweets from an iPhone and an Android phone were markedly different in tone; the thought was that the Android account (with generally far more negative tweets) was run by Trump himself, while the iPhone account (with generally more positive tweets) was run by a staffer. [See here for details.](http://varianceexplained.org/r/trump-tweets/)
# Prep data: Harry Potter and the Sorcerer's Stone
## Read in text from pdf
```{r}
hp_text <- pdf_text(here::here('pdfs', 'harry_potter.pdf'))
```
- Each element is a page of the PDF (i.e., this is a vector of strings, one per page)
- `pdf_text()` only sees text that is "selectable" in the PDF (it does not OCR scanned images)
Example: Just want to get text from a single page (e.g. Page 34)?
```{r}
hp_p34 <- hp_text[34]
```
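A couple of quick checks can confirm what came back (an optional sketch, not required for the rest of the workflow):
```{r}
### How many pages did pdf_text() find?
length(hp_text)

### Print a single page with its line breaks rendered:
cat(hp_p34)
```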
`pdf_text()` returns a vector of strings, one for each page of the PDF. To work with it in tidyverse style, let's turn it into a data frame, keeping track of the page number. Then we can use `stringr::str_split()` to break each page up into individual lines. Each line of the PDF ends with a newline character (`\n`), so we split on that.
Let's first get it into a data frame; then we'll do some wrangling with the tidyverse, break it up by chapter, and do some analyses.
```{r}
hp_lines <- data.frame(hp_text) %>%
mutate(page = 1:n()) %>%
mutate(text_full = str_split(hp_text, pattern = '\\n')) %>%
unnest(text_full) %>%
mutate(text_full = str_trim(text_full))
# Why '\\n' instead of '\n'? In a regular expression the backslash is an escape character, so to pass a backslash through to the regex engine you double it in the R string: the string '\\n' represents the regex \n, which matches a newline character.
# More information: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
```
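As a tiny illustration of that escaping (a made-up string, not from the book):
```{r}
### the R string '\\n' is the two-character regex \n, which matches a newline
str_split("first line\nsecond line", pattern = '\\n')
```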
## Do some tidying
Now we'll add a new column that contains the chapter number (so we can use it as a grouping variable later on).
We will use `str_detect()` to look for any cells in the "text_full" column that contain the string "CHAPTER"; if a cell does, the new column gets that chapter heading (the chapter number spelled out as a word). We then fill the chapter value down to the following lines, strip off the "CHAPTER " prefix, and convert the chapter word from all caps to lower case.
```{r}
hp_chapts <- hp_lines %>%
slice(-(1:2)) %>%
mutate(chapter = ifelse(str_detect(text_full, "CHAPTER"), text_full, NA)) %>%
fill(chapter, .direction = 'down') %>%
mutate(chapter = str_remove_all(chapter, 'CHAPTER ')) %>%
mutate(chapter = str_to_lower(chapter),
chapter = fct_inorder(chapter))
```
## Get some word counts by Chapter!
```{r}
hp_words <- hp_chapts %>%
unnest_tokens(word, text_full) %>%
select(-hp_text)
```
```{r}
hp_wordcount <- hp_words %>%
group_by(chapter, word) %>%
summarize(n = n())
```
...OK, but check out which words show up the most. They're probably not words we're super interested in (like "a", "the", "and"). How can we limit those?
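For a quick peek at what dominates those counts (an optional sketch):
```{r}
### total counts across the whole book; the top of this list is mostly
### common "stop words" rather than interesting content words
hp_wordcount %>%
  group_by(word) %>%
  summarize(n = sum(n)) %>%
  arrange(desc(n)) %>%
  head(10)
```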
## Remove stop words
Those very common (and often uninteresting) words are called "stop words." See `?stop_words` and `View(stop_words)` to look at the documentation and the stop word lexicons included in the `tidytext` package.
We will *remove* stop words using `dplyr::anti_join()`, which will *omit* any words in `stop_words` from `hp_words`.
```{r}
head(stop_words)
hp_words_clean <- hp_words %>%
anti_join(stop_words, by = 'word')
```
Then let's try counting them again:
```{r}
nonstop_counts <- hp_words_clean %>%
group_by(chapter, word) %>%
summarize(n = n()) %>%
ungroup()
```
## Find the top 5 words from each chapter
```{r}
top_5_words <- nonstop_counts %>%
group_by(chapter) %>%
arrange(-n) %>%
slice(1:5) %>%
ungroup()
# Make some graphs:
ggplot(data = top_5_words, aes(x = n, y = word)) +
geom_col(fill = "blue") +
facet_wrap(~chapter, scales = "free")
```
## Let's make a word cloud for Chapter 1
```{r}
ch1_top100 <- nonstop_counts %>%
filter(chapter == 'one') %>%   ### chapter names are lower case after str_to_lower()
arrange(-n) %>%
slice(1:100)
```
```{r}
ch1_cloud <- ggplot(data = ch1_top100, aes(label = word)) +
geom_text_wordcloud(aes(color = n, size = n), shape = "diamond") +
scale_size_area(max_size = 6) +
scale_color_gradientn(colors = c("darkgreen","blue","purple")) +
theme_minimal()
ch1_cloud
```
# Sentiment analysis: Harry Potter and the Sorcerer's Stone
First, check out the ‘sentiments’ lexicons. From Julia Silge and David Robinson (https://www.tidytextmining.com/sentiment.html):
“The three general-purpose lexicons are
- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc (National Research Council Canada) from Saif Mohammad and Peter Turney
All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides a function `get_sentiments()` to get specific sentiment lexicons without the columns that are not used in that lexicon."
Let's explore the sentiment lexicons. The "bing" lexicon is included with `tidytext`; for the others ("afinn", "nrc", "loughran") you'll be prompted to download them the first time you call `get_sentiments()`.
**WARNING:** These collections include the most offensive words you can think of.
"afinn": Words ranked from -5 (very negative) to +5 (very positive)
```{r}
afinn_lex <- get_sentiments(lexicon = "afinn")
### you may be prompted to download an updated lexicon - say yes!
# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>%
filter(value >= 4)
# Check them out:
DT::datatable(afinn_pos)
```
For comparison, check out the bing lexicon:
```{r}
bing_lex <- get_sentiments(lexicon = "bing")
```
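A quick look at its structure (optional sketch):
```{r}
head(bing_lex)

### bing is binary: each word is tagged either "negative" or "positive"
bing_lex %>%
  count(sentiment)
```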
And the nrc lexicon: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Includes bins for 8 emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and positive / negative.
**Citation for NRC lexicon**: Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.
Now nrc:
```{r}
nrc_lex <- get_sentiments(lexicon = "nrc")
```
## Sentiment analysis with bing:
First, bind words in `hp_words_clean` to the `bing` lexicon:
```{r}
hp_bing <- hp_words_clean %>%
inner_join(bing_lex, by = 'word')
```
Let's find some counts of positive vs negative:
```{r}
bing_counts <- hp_bing %>%
group_by(chapter, sentiment) %>%
summarize(n = n())
# Plot them:
ggplot(data = bing_counts, aes(x = sentiment, y = n)) +
geom_col() +
facet_wrap(~chapter)
```
Taking the ratio of positive to negative words, rather than the total counts per chapter, adjusts for some chapters simply being longer or shorter. But a raw ratio is asymmetric: highly negative chapters would be squeezed between 0 and 1, while highly positive chapters could range from 1 to infinity. Plotting the log ratio, i.e., $\ln\left(\frac{positive}{negative}\right)$, balances that, so a chapter with 10:1 positive:negative words has the same absolute value as a chapter with 1:10 positive:negative.
We might also want to account for the overall tone of the author's prose being darker or lighter, so let's find the *overall* log ratio for the entire book and subtract that out.
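For intuition, a tiny numeric check (not part of the analysis itself):
```{r}
log(10 / 1)   ### ~ +2.3: a strongly positive chapter
log(1 / 10)   ### ~ -2.3: the mirror-image strongly negative chapter
```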
```{r}
# find log ratio score overall:
bing_log_ratio_book <- hp_bing %>%
summarize(n_pos = sum(sentiment == 'positive'),
n_neg = sum(sentiment == 'negative'),
log_ratio = log(n_pos / n_neg))
# Find the log ratio score by chapter:
bing_log_ratio_ch <- hp_bing %>%
group_by(chapter) %>%
summarize(n_pos = sum(sentiment == 'positive'),
n_neg = sum(sentiment == 'negative'),
log_ratio = log(n_pos / n_neg)) %>%
mutate(log_ratio_adjust = log_ratio - bing_log_ratio_book$log_ratio) %>%
mutate(pos_neg = ifelse(log_ratio_adjust > 0, 'pos', 'neg'))
ggplot(data = bing_log_ratio_ch,
aes(x = log_ratio_adjust,
y = fct_rev(factor(chapter)),
fill = pos_neg)) +
geom_col() +
labs(x = 'Adjusted log(positive/negative)',
y = 'Chapter number') +
scale_fill_manual(values = c('pos' = 'slateblue', 'neg' = 'darkred')) +
theme_minimal() +
theme(legend.position = 'none')
```
## Sentiment analysis with afinn (not run in workshop):
First, bind words in `hp_words_clean` to the `afinn` lexicon:
```{r}
hp_afinn <- hp_words_clean %>%
inner_join(afinn_lex, by = 'word')
```
Let's find some counts (by sentiment ranking):
```{r}
afinn_counts <- hp_afinn %>%
group_by(chapter, value) %>%
summarize(n = n())
# Plot them:
ggplot(data = afinn_counts, aes(x = value, y = n)) +
geom_col() +
facet_wrap(~chapter)
# Find the mean afinn score by chapter:
afinn_means <- hp_afinn %>%
group_by(chapter) %>%
summarize(mean_afinn = mean(value))
ggplot(data = afinn_means,
       aes(x = fct_rev(factor(chapter)),
           y = mean_afinn)) +
  geom_col() +
  coord_flip() +
  labs(x = 'Chapter', y = 'Mean AFINN score')
```
### Now with NRC lexicon (not run in workshop)
Recall, this assigns words to sentiment bins. Let's bind our hp data to the NRC lexicon:
```{r}
hp_nrc <- hp_words_clean %>%
inner_join(get_sentiments("nrc"), by = 'word')
```
Let's find the count of words by chapter and sentiment bin:
```{r}
hp_nrc_counts <- hp_nrc %>%
group_by(chapter, sentiment) %>%
summarize(n = n()) %>%
ungroup()
ggplot(data = hp_nrc_counts, aes(x = n, y = sentiment)) +
geom_col() +
facet_wrap(~chapter)
### perhaps order or color the sentiments by positive/negative
ggplot(data = hp_nrc_counts, aes(x = n,
y = factor(chapter) %>% fct_rev())) +
geom_col() +
facet_wrap(~sentiment) +
labs(y = 'chapter')
```
### NOTE:
This is a very simple, word-by-word sentiment analysis. The `sentimentr` package (https://cran.r-project.org/web/packages/sentimentr/index.html) can score text at the sentence level, accounting for negations and other valence shifters (e.g. "I am not having a good day."), as sketched below.
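A minimal sketch of what that might look like (not run here; assumes `sentimentr` is installed, and the exact output columns may differ by version):
```{r}
#| eval: false
library(sentimentr)

### sentiment() scores whole sentences, so negators like "not" flip the score
sentiment("I am having a good day.")
sentiment("I am not having a good day.")
```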