---
title: "Tidy Text Mining with R"
output:
  html_document:
    fig_width: 8.5
---
```{r init, include=F}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
options(digits=3)
options(max.print=200)
.ex <- 1 # Track ex numbers w/ hidden var. Increment each ex: `r .ex``r .ex=.ex+1`
library(ggplot2)
theme_set(theme_minimal(base_size=12) + theme(strip.background = element_blank()))
```
## Class overview
<small>
> **_Attribution_:** This workshop was inspired by and/or modified in part from [_Text Mining with R_](https://www.tidytextmining.com/) by Julia Silge and David Robinson.
</small>
Most of the data we've dealt with so far in this course has been rectangular, in the form of a data frame or tibble, and mostly numeric. But lots of data these days comes in the form of unstructured text. This workshop provides an overview of fundamental principles in text mining, and introduces the **[tidytext](https://cran.r-project.org/web/packages/tidytext/index.html)** package that allows you to apply to text data the same "tidy" methods you're familiar with for wrangling and visualizing data.
Course pre-requisites:
- [R Basics](r-basics.html)
- [Data frames](r-dataframes.html)
- [Manipulating and analyzing data with **dplyr**](r-dplyr-yeast.html) (and the associated [homework](r-dplyr-homework.html))
- [Tidying data with **tidyr**](r-tidy.html)
- [Data visualization with **ggplot2**](r-viz-gapminder.html) (and the associated [homework](r-viz-homework.html))
This course is _not_ an extensive deep dive into natural language processing (NLP). For that check out the [CRAN task view on NLP](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) for a long list of packages that will aid you in computational linguistics.
Before we get started, let's load the packages we'll need.
```{r}
library(tidyverse)
library(tidytext)
library(gutenbergr)
library(topicmodels)
```
## The Tidy Text Format
In the previous lessons linked above we discussed the three features of Tidy Data, as outlined in Hadley Wickham's [Tidy Data paper](http://vita.had.co.nz/papers/tidy-data.html):
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
Tidy text format can be defined as **a table with one-token-per-row.** A **token** is any meaningful unit of text, such as a word, that we are interested in using for analysis. **Tokenization** is the process of splitting text into tokens. This is in contrast to storing text in strings or in a document-term matrix (discussed later). Here, the token stored in a single row is most often a single word. The **[tidytext](https://cran.r-project.org/web/packages/tidytext/index.html)** package provides functionality to tokenize strings by words (or n-grams, or sentences) and convert to a one-term-per-row format. By keeping text in "tidy" tables, you can use the normal tools you're familiar with, including dplyr, tidyr, ggplot2, etc., for manipulation, analysis, and visualization. The tidytext package also includes functions to convert to and from other data structures for text processing, such as a _corpus_[^corpus] or a _document-term matrix_.[^dtm]
[^corpus]: Corpus objects contain strings annotated with additional metadata.
[^dtm]: This is a (sparse) matrix describing a collection (corpus) of documents with one row for each document and one column for each term. The value in the matrix is typically word count or tf-idf for the document in that row for the term in that column.
![Workflow for text analysis using tidy principles.](img/tidytext-1-overview.png)
### The `unnest_tokens` function
We briefly mentioned before how to create vectors using the `c()` function. Let's create a simple character vector.
```{r}
text <- c("a", "banana", "crouton")
```
Let's extend that to create another character vector, this time with sentences:
```{r, results='hide'}
text <- c("It was the best of times,",
"it was the worse of times,",
"It was the spring of hope, it was the winter of despair.")
text
```
Before we can turn this into a tidy text dataset, we first have to put it in a data frame.
```{r}
text_df <- tibble(line = 1:3, text = text)
text_df
```
This data isn't yet "tidy." We can't perform operations like filtering out particular words or counting which words occur most frequently, since each row is made up of multiple combined words. We need to convert this so that it has **one-token-per-document-per-row**. Here we only have a single document, but later we'll have multiple documents.
We need to break the text into individual tokens (a process called *tokenization*) and transform it into a tidy data structure. To do this, we use tidytext's `unnest_tokens()` function.
```{r}
text_df %>%
unnest_tokens(output=word, input=text)
```
The `unnest_tokens` function takes a data frame (or tibble), and two additional parameters, the `output` and `input` column names. If you specify them in the correct order, you don't have to specify `output=` or `input=`. You can pipe to `print(n=Inf)` to print them all.
```{r, results='hide'}
text_df %>%
unnest_tokens(word, text) %>%
print(n=Inf)
```
First you give it the output column name that will be created as the text is unnested into it (`word`, in this example). This is a column name that you choose -- you could call it anything, but `word` usually makes sense. Then you give it the input column that the text comes from in the data frame you're passing to it (`text`, in this case). Our `text_df` dataset has a column called `text` that contains the data of interest.
The `unnest_tokens` function splits each row so that there is one word per row of the new data frame; the default tokenization in `unnest_tokens()` is for single words, as shown here. Also notice:
- Other columns, such as the line number each word came from, are retained.
- Punctuation has been stripped.
- By default, `unnest_tokens()` converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the `to_lower = FALSE` argument to turn off this behavior).
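For example, here's a small illustration (an aside, not part of the main analysis) of keeping the original case or tokenizing by sentences instead of single words:
```{r, eval=FALSE}
# Keep the original case instead of converting to lowercase
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)

# Tokenize into sentences rather than single words
text_df %>%
  unnest_tokens(sentence, text, token = "sentences")
```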
Now our data is in a tidy format, and we can easily use all the normal dplyr, tidyr, and ggplot2 tools.
### Example: Jane Austen Novels
```{r, echo=FALSE, include=FALSE}
janeaustenr::austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
filter(text!="") %>%
write_csv("data/austen.csv")
```
Let's load the **[austen.csv](data/austen.csv)** data.
```{r, results='hide'}
jaorig <- read_csv("data/austen.csv")
jaorig
```
Click the `jaorig` dataset in the environment pane or use `View(jaorig)` to see what's being read in here. Before we can do anything else we'll need to tidy this up by unnesting the `text` column into words.
```{r, results='hide'}
jatidy <- jaorig %>%
unnest_tokens(word, text)
jatidy
```
Let's use the dplyr `count()` function to count how many occurrences we have of each word in the entire corpus. The `sort=TRUE` option puts the most common results on top.
```{r, results='hide'}
jatidy %>%
count(word, sort = TRUE)
```
Not surprisingly the most common words are some of the most commonly used words in the English language. These are known as [stop words](https://en.wikipedia.org/wiki/Stop_words). They're words you'll want to filter out before doing any text mining. There are lists of stop words online, but the tidytext package comes with a `stop_words` built-in dataset with some of the most common stop words across three different lexicons. See `?stop_words` for more information.
```{r, results='hide'}
data(stop_words)
stop_words
```
In a previous lesson we used `inner_join()` to link information across two tables by a common key. There's also an [`anti_join()`](https://dplyr.tidyverse.org/reference/join.html#join-types), which takes two tibbles, _x_ and _y_, and returns all rows from _x_ that have no matching values in _y_, keeping just the columns from _x_. Let's `anti_join` the data to the stop words. Because we chose "word" as the output variable to `unnest_tokens()`, and "word" is the variable in the `stop_words` dataset, we don't have to be specific about which columns we're joining.
```{r, results='hide'}
jatidy %>%
anti_join(stop_words)
```
Now there are _far_ fewer rows than initially present. Let's run that count again, now with the stop words removed.
```{r}
jatidy %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
```
That's _much_ more in line with what we want. We have this data in a tibble. Let's keep piping to other operations!
```{r}
jatidy %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
coord_flip()
```
## Sentiment Analysis
Let's start to do some high-level analysis of the text we have. Sentiment analysis^[<https://en.wikipedia.org/wiki/Sentiment_analysis>], also called opinion mining, is the use of text mining to "systematically identify, extract, quantify, and study affective states and subjective information." It's a way to try to understand the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.
If you make the simplifying assumption that a text is just the combination of its individual words, you can treat the sentiment content of the whole text as the sum of the sentiment content of the individual words. It's a simplification, and it isn't the only way to approach sentiment analysis, but it's simple and easy to do with tidy principles.
To get started you'll need a _sentiment lexicon_, which attempts to evaluate the opinion or emotion in text. The tidytext package contains several sentiment lexicons in the `sentiments` dataset. All three of the lexicons below are based on single words in the English language, assigning scores for positive/negative sentiment, or assigning emotions like joy, anger, sadness, etc.
- **`nrc`** from Saif Mohammad and Peter Turney^[<http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm>] categorizes words in a binary fashion ("yes"/"no") into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
- **`bing`** from Bing Liu and collaborators^[<https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html>] categorizes words in a binary fashion into positive and negative categories.
- **`AFINN`** from Finn Årup Nielsen^[<http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010>] assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
The built-in `sentiments` dataset available when you load the tidytext package contains all of this information. You could filter it to a single lexicon with the dplyr `filter()` function, or use tidytext's `get_sentiments()` to get specific sentiment lexicons containing only the data used for that lexicon.
```{r fixme1, results="hide", eval=FALSE}
# Look at the sentiments data
data(sentiments)
sentiments
sentiments %>% filter(lexicon=="nrc")
sentiments %>% filter(lexicon=="bing")
sentiments %>% filter(lexicon=="AFINN")
# Use the built-in get_sentiments() function
get_sentiments("nrc")
get_sentiments("bing")
get_sentiments("afinn")
```
There are a few major caveats to be aware of.
1. The sentiment lexicons we're using here were constructed either via crowdsourcing or by the work of the authors, and validated using crowdsourcing, movie/restaurant reviews, or Twitter data. It's unknown how useful it is to apply these lexicons to text from a completely different time and place (e.g., 200-year old fiction novels). Further, there are other domain-specific lexicons available, e.g., for finance data, that are better used in that context.
1. Many words in the English language are fairly neutral and aren't included in any sentiment lexicon.
1. Methods based on unigrams (single words) do not take into account qualifiers before a word, such as in "no good" or "not true". If you have sustained sections of sarcasm or negated text, this could be problematic.
1. The size of the chunk of text that we use to add up single-word sentiment scores matters. Sentiment across many paragraphs often has positive and negative sentiment averaging out to about zero, but sentence-sized or paragraph-sized text might be better.
![Workflow for sentiment analysis using tidy principles.](img/tidytext-2-sentiment.png)
### Sentiment analysis with tidy tools
Let's look at the most common joy words in _Emma_. To do this we will:
1. Start with the unnested Jane Austen text data.
1. Join it to the NRC sentiment lexicon.
1. Filter it to only include "joy" words.
1. Filter for only words in _Emma_.
1. Count the number of occurrences of each word, sorting the output with the highest counts on top.
```{r}
jatidy %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment=="joy") %>%
filter(book=="Emma") %>%
count(word, sort=TRUE)
```
Try running the same code but replacing "joy" with "anger" or "trust."
```{r, results='hide'}
jatidy %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment=="anger") %>%
filter(book=="Emma") %>%
count(word, sort=TRUE)
```
Let's look at how sentiment changes over time throughout each novel.
1. Start with the unnested Jane Austen text data.
1. Join it to the 'bing' sentiment lexicon (positive vs negative).
1. Create a new variable that counts up each 80-line section. First note that the `%/%` operator does integer division. It tells you the integer quotient without the remainder. This is a way for us to keep track of which 80-line section of text we are counting up negative and positive sentiment in.
1. Count the number of occurrences of each sentiment (positive vs. negative) in each section, for each book.
1. Spread the sentiment column into new columns, and fill in missing values with zeros.
1. Create your own summary sentiment score that's the total number of positive words minus the total number of negative words.
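As a quick aside to illustrate the `%/%` integer division used in step 3:
```{r, eval=FALSE}
# Line numbers 0-79 fall in section 0, 80-159 in section 1, and so on
c(0, 79, 80, 159, 160) %/% 80
```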
```{r}
jatidy %>%
inner_join(get_sentiments("bing")) %>%
mutate(section=linenumber %/% 80) %>%
count(book, section, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sentiment=positive-negative)
```
Now let's pipe that whole thing to ggplot2 to see how the sentiment changes over the course of each novel. Facet by book, and pass `scales="free_x"` so the x-axis is filled for each panel.
```{r, fig.height=8}
jatidy %>%
inner_join(get_sentiments("bing")) %>%
mutate(section=linenumber %/% 80) %>%
count(book, section, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sentiment=positive-negative) %>%
ggplot(aes(section, sentiment)) +
geom_col() +
facet_wrap(~book, ncol = 2, scales = "free_x")
```
Try comparing different sentiment lexicons. You might see different results! Different lexicons contain different ratios of positive to negative sentiment words, and thus will give you different results. You would probably want to try a few different lexicons using a known dataset to see what lexicon is most appropriate for your purpose. For more information on this topic, see [section 2.3 of the Tidy Text Mining book](https://www.tidytextmining.com/sentiment.html#comparing-the-three-sentiment-dictionaries).
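If you want a concrete starting point, here's one possible sketch comparing the AFINN and bing lexicons for a single book (an illustration added here, not from the original analysis; note that the name of the AFINN score column, `value`, is an assumption that depends on your tidytext/textdata version -- older versions called it `score`):
```{r, eval=FALSE}
# AFINN assigns numeric scores; sum them per 80-line section
afinn_emma <- jatidy %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("afinn")) %>%
  mutate(section = linenumber %/% 80) %>%
  group_by(section) %>%
  summarize(sentiment = sum(value))  # `value` may be `score` in older versions

# bing assigns positive/negative labels; take the difference per section
bing_emma <- jatidy %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("bing")) %>%
  mutate(section = linenumber %/% 80) %>%
  count(section, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
```
You could then bind the two results together and plot them side by side to see how the lexicons differ.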
### Measuring contribution to sentiment
We could also analyze word counts that contribute to each sentiment. This first joins Jane Austen's tidy text data to the bing lexicon and counts how many times each word-sentiment linkage exists.
```{r}
jatidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE)
```
Look at the help for `?top_n`. It's similar to arranging a dataset then using `head` to get the first few rows. But if we want the top _n_ from each group, we need the `top_n` function. Let's continue the pipeline above.
1. First group by sentiment.
1. Next get the top 10 observations in each group. By default, it uses the last column here as a ranking metric.
1. The `top_n` function leaves the dataset grouped. In this case we want to ungroup the data.
1. Let's plot a bar plot showing the n for each word separately for positive and negative words.
1. We could mutate word to reorder it as a factor by n.
```{r}
jatidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()
```
Notice that "miss" is probably erroneous here. It's used as a title for unmarried women in Jane Austen's works, and should probably be excluded from analysis. You could filter this, or you could create a custom stop words lexicon and add this to it. You could also unnest the corpus using bigrams instead of single words, then filter to look for bigrams that start with "miss," counting to show the most common ones.
```{r}
jaorig %>%
unnest_tokens(bigram, text, token="ngrams", n=2) %>%
filter(str_detect(bigram, "^miss")) %>%
count(bigram, sort=TRUE)
```
## Word and Document Frequencies
### TF, IDF, and TF-IDF
In text mining we're trying to answer the question "what is this text about?" We can start to get a sense of this by looking at the words that make up the text, and we can start to measure how important a word is by its **term frequency** (tf): how frequently a word occurs in a document. When we did this earlier, the most frequent words were some of the most common words in the English language, so we first filtered the data against a list of common stop words.
```{r, results="hide"}
jatidy %>%
anti_join(stop_words) %>%
count(word, sort=TRUE)
```
Another way is to look at a term's **inverse document frequency** (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. It's defined as:
$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$
If you multiply the two values together, you get the **tf-idf**^[<https://en.wikipedia.org/wiki/Tf%E2%80%93idf>], which is the frequency of a term adjusted for how rarely it is used. The tf-idf measures how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
We want to use tf-idf to find the important words for the content of each document by decreasing the weight for common words and increasing the weight for words that are not used very much in a corpus of documents (in this case, the group of Jane Austen's novels). Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not *too* common.
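For intuition, here's a quick by-hand calculation (an illustrative aside): with six novels in the collection, a word that appears in only two of them gets a much higher idf than a word that appears in all six.
```{r, eval=FALSE}
# idf for a 6-document collection
log(6 / 2)   # word found in 2 of 6 novels: ~1.10
log(6 / 6)   # word found in all 6 novels: 0, so its tf-idf is also 0
```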
You could do this all manually, but there's a nice function in the tidytext package called `bind_tf_idf` that does this for you. It takes a tidy text dataset as input with one row per word, per document. One column (`word` here) contains the terms/tokens, one column contains the documents (`book` in this case), and the last necessary column contains the counts, how many times each document contains each term (`n` in this example).
Let's start by counting the number of occurrences of each word in each book:
```{r, results='hide'}
jatidy %>%
count(book, word, sort=TRUE)
```
Then we simply pipe that to the `bind_tf_idf` function, giving it the column names for the word, document, and count column (`word`, `book`, and `n` here):
```{r, results="hide"}
jatidy %>%
count(word, book, sort=TRUE) %>%
bind_tf_idf(word, book, n)
```
You'll see that the idf (and the tf-idf) are zero for really common words. These are all words that appear in all six of Jane Austen's novels, so the idf is zero. This is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection. Let's arrange descending by tf-idf (`tf_idf` with an underscore).
```{r}
jatidy %>%
count(word, book, sort=TRUE) %>%
bind_tf_idf(word, book, n) %>%
arrange(desc(tf_idf))
```
No surprise -- we see all proper nouns, names that are important for each novel. None of them occur in all of the novels, and they are important, characteristic words for each text within the entire corpus of Jane Austen's novels. Let's visualize this data!
```{r fig.height=8}
jatidy %>%
count(word, book, sort=TRUE) %>%
bind_tf_idf(word, book, n) %>%
arrange(desc(tf_idf)) %>%
group_by(book) %>%
top_n(15) %>%
ungroup %>%
mutate(word=reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf)) +
geom_col() +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~book, ncol = 2, scales = "free") +
coord_flip()
```
### Project Gutenberg
Project Gutenberg (<https://www.gutenberg.org/>) is a collection of freely available books that are in the public domain. You can get most books in all kinds of different formats (plain text, HTML, epub/kindle, etc.). The **[gutenbergr](https://cran.r-project.org/web/packages/gutenbergr/vignettes/intro.html)** package includes tools for downloading books (and stripping header/footer information), and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. The package includes:
* A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein.
* Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
* `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc
* `gutenberg_authors` contains information about each author, such as aliases and birth/death year
* `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics
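For example, here's a brief sketch of searching the metadata for works of interest (the `gutenberg_works()` call follows the gutenbergr vignette):
```{r, eval=FALSE}
# gutenberg_works() filters gutenberg_metadata to English-language,
# downloadable, non-duplicated works by default
gutenberg_works(author == "Austen, Jane")

# Or filter the raw metadata directly, e.g., by title
gutenberg_metadata %>%
  filter(str_detect(title, "Frankenstein"))
```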
Let's use a different corpus of documents to see what terms are important in a different set of works. We'll download some classic science texts from Project Gutenberg and see which terms are important in each, as measured by tf-idf. We'll use three classic physics texts and a classic Darwin text:
- _Discourse on Floating Bodies_ by Galileo Galilei: <http://www.gutenberg.org/ebooks/37729>
- _Treatise on Light_ by Christiaan Huygens: <http://www.gutenberg.org/ebooks/14725>
- _Experiments with Alternate Currents of High Potential and High Frequency_ by Nikola Tesla: <http://www.gutenberg.org/ebooks/13476>
- _On the Origin of Species By Means of Natural Selection_ by Charles Darwin: <http://www.gutenberg.org/ebooks/1228>
These are all scientific classics, but they were written across a 300-year timespan, and some of them were first written in other languages and then translated to English.
```{r, eval = FALSE}
library(gutenbergr)
sci <- gutenberg_download(c(37729, 14725, 13476, 1228), meta_fields = "author")
```
```{r, echo = FALSE}
# saveRDS(sci, "data/_sci.rds")
sci <- readRDS("data/_sci.rds")
```
Now that we have the texts, let's use `unnest_tokens()` and `count()` to find out how many times each word was used in each text. Let's assign this to an object called `sciwords`. Let's go ahead and add the tf-idf also.
```{r}
scitidy <- sci %>%
unnest_tokens(word, text)
sciwords <- scitidy %>%
count(word, author, sort = TRUE) %>%
bind_tf_idf(word, author, n)
sciwords
```
Now let's do the same thing we did before with Jane Austen's novels:
```{r, fig.height=8}
sciwords %>%
group_by(author) %>%
top_n(15) %>%
ungroup() %>%
mutate(word=reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf)) +
geom_col() +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~author, ncol = 2, scales = "free") +
coord_flip()
```
We see some weird things here. We see "fig" for Tesla, but I doubt he was writing about a fruit tree. We see things like ab, ac, rc, etc for Huygens -- these are names of rays and angles, etc. We could create a custom stop words dictionary to remove these. Let's create a stop words data frame, then anti join that before plotting.
```{r, eval=FALSE}
mystopwords <- tibble(word=c("ab", "ac", "rc", "cm", "cg", "cb", "ak", "bn", "fig"))
sciwords %>%
anti_join(mystopwords) %>%
group_by(author) %>%
top_n(15) %>%
ungroup() %>%
mutate(word=reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf)) +
geom_col() +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~author, ncol = 2, scales = "free") +
coord_flip()
```
## Topic Modeling
**Topic modeling**^[<https://en.wikipedia.org/wiki/Topic_model>] is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we're not sure what we're looking for. It's a way to find abstract "topics" that occur in a collection of documents, and it's frequently used to discover hidden semantic structures in a text body. Topic models can help us understand large collections of unstructured text. In addition to text mining tasks like the one we'll do here, topic models have been used to detect useful structures in data such as genetic information, images, and networks, and have also been used in bioinformatics.^[Blei, David (April 2012). "Probabilistic Topic Models". _Communications of the ACM_. 55 (4): 77-84. doi:10.1145/2133806.2133826]
**Latent Dirichlet Allocation**^[<https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>] is one of the most common algorithms used in topic modeling. LDA treats each document as a mixture of topics, and each topic as a mixture of words:
1. _Each document is a mixture of topics._ Each document contains words from several topics in particular proportions. For example, in a two-topic model we could say "Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B."
2. _Every topic is a mixture of words._ Imagine a two-topic model of American news, with one topic for "politics" and one for "entertainment." Common words in the politics topic might be "President", "Congress", and "government", while the entertainment topic may be made up of words such as "movies", "television", and "actor". Words can be shared between topics; a word like "budget" might appear in both equally.
LDA attempts to estimate both of these at the same time: finding the words associated with each topic, while simultaneously determining the mixture of topics that describes each document.
### Document-term matrix
Before we can get started in topic modeling we need to take a look at another common format for storing text data that's not the tidy one-token-per-document-per-row format we've used so far (what we get from `unnest_tokens`). Another very common structure that's used by other text mining packages (such as **tm** or **quanteda**) is the **[document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix)**^[<https://en.wikipedia.org/wiki/Document-term_matrix>] (DTM). This is a matrix where:
- Each row represents one document (such as a book or article).
- Each column represents one term.
- Each value contains the number of appearances of that term in that document.
Since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices. These objects can be treated as though they were matrices (for example, accessing particular rows and columns), but are stored in a more efficient format. DTM objects can't be used directly with tidy tools, just as tidy data frames can't be used as input for other text mining packages. The tidytext package provides two verbs that convert between the two formats.
- `tidy()` turns a DTM into a tidy data frame. This verb comes from the broom package.
- `cast()` turns a tidy one-term-per-row data frame into a matrix. tidytext provides three variations of this verb, each converting to a different type of matrix:
- `cast_sparse()`: converts to a sparse matrix from the Matrix package.
- `cast_dtm()`: converts to a `DocumentTermMatrix` object from tm.
- `cast_dfm()`: converts to a `dfm` object from quanteda.
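As a small sketch of the round trip (assuming the `jatidy` object created earlier), we can cast the tidy Jane Austen word counts into a DTM and then melt it back into a tidy data frame:
```{r, eval=FALSE}
# Cast tidy word counts into a DocumentTermMatrix (one row per book)
ja_dtm <- jatidy %>%
  count(book, word) %>%
  cast_dtm(book, word, n)
ja_dtm

# And melt it back into a tidy one-term-per-row data frame
tidy(ja_dtm)
```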
First let's load the `AssociatedPress` data from the topicmodels package and take a look. We can see that the `AssociatedPress` data is a DTM with 2,246 documents (AP articles from 1988) and 10,473 terms. The 99% sparsity indicates that the matrix is almost completely made of zeros, i.e., almost all the document-term pairs are zero -- most terms are not used in most documents. If we want to use the typical tidy tools we've used above, we'll use the `tidy()` function to melt this matrix into a tidy one-token-per-document-per-row format. Notice that this only returns the 302,031 non-zero entries.
```{r, results="hide"}
library(topicmodels)
data(AssociatedPress)
AssociatedPress
tidy(AssociatedPress)
```
First let's use the `LDA()` function from the topicmodels package, setting `k = 2`, to create a two-topic LDA model. This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents. Fitting the model is easy. For the rest of this section we'll be exploring and interpreting the model.
```{r ap_lda, cache=TRUE, results="hide"}
# set a seed so that the output of the model is predictable
ap_lda <- LDA(AssociatedPress, k = 2, control=list(seed=1234))
ap_lda
```
### Word-topic probabilities
Displaying the model object itself, `ap_lda`, isn't that interesting. The tidytext package provides a `tidy` method for extracting the per-topic-per-word probabilities, called $\beta$ ("beta"), from the model.
```{r}
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
```
This returns a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term "aaron" has a $`r ap_topics$beta[1]`$ probability of being generated from topic 1, but a $`r ap_topics$beta[2]`$ probability of being generated from topic 2.
We could use dplyr's `top_n()` to find the 10 terms that are most common within each topic. Because this returns a tidy data frame, we could easily continue piping to ggplot2.
```{r, results="hide"}
# What are the top words for each topic?
ap_topics %>%
group_by(topic) %>%
top_n(10) %>%
ungroup() %>%
arrange(topic, desc(beta))
```
```{r}
# Continue piping to ggplot2
ap_topics %>%
group_by(topic) %>%
top_n(10) %>%
ungroup() %>%
arrange(topic, desc(beta)) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta)) +
geom_col() +
facet_wrap(~topic, scales = "free") +
coord_flip()
```
This visualization lets us understand the two topics that were extracted from the articles. Common words in topic 1 include "percent", "million", "billion", and "company". Perhaps topic 1 represents business or financial news. Common words in topic 2 include "president", "government", and "soviet", suggesting that this topic represents political news. Note that some words, such as "new" and "people", are common within both topics. This is an advantage over "hard clustering" methods: topics used in natural language can have some overlap in terms of words.
Let's look at the terms that had the *greatest difference* in $\beta$ between topic 1 and topic 2. This can be estimated based on the log ratio of the two: $\log_2(\frac{\beta_2}{\beta_1})$ (a log ratio is useful because it makes the difference symmetrical: $\beta_2$ being twice as large leads to a log ratio of 1, while $\beta_1$ being twice as large results in -1). To constrain it to a set of especially relevant words, we can filter for relatively common words, such as those that have a $\beta$ greater than 1/1000 in at least one topic.
First let's turn `1` and `2` into `topic1` and `topic2` so that after the `spread` we'll easily be able to work with those columns.
```{r beta_spread, results='hide'}
ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
```
We could continue piping to ggplot2. First, let's create a new variable we'll group on, which is the direction of imbalance. We'll also create a variable showing the absolute value of the log ratio, which is a directionless value indicating the magnitude of the effect. This lets us select the top 10 terms most associated with either topic1 or topic2.
```{r, fig.height=8}
ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1)) %>%
mutate(direction = (log_ratio>0)) %>%
mutate(absratio=abs(log_ratio)) %>%
group_by(direction) %>%
top_n(10) %>%
ungroup() %>%
mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(term, log_ratio)) +
geom_col() +
labs(y = "Log2 ratio of beta in topic 2 / topic 1") +
coord_flip()
```
We can see that the words more common in topic 2 include political parties such as "democratic" and "republican", as well as politicians' names such as "dukakis" and "gorbachev". Topic 1 was more characterized by currencies like "yen" and "dollar", as well as financial terms such as "index", "prices", and "rates". This helps confirm that the two topics the algorithm identified were political and financial news.
### Document-topic probabilities
Above we estimated the per-topic-per-word probabilities, $\beta$ ("beta"). LDA also models each document as a mixture of topics. Let's look at the per-document-per-topic probabilities, $\gamma$ ("gamma"), with the `matrix = "gamma"` argument to `tidy()`.
```{r}
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
```
These values represent the estimated proportion of words from that document that are generated from that topic. For example, the model estimates that only about `r scales::percent(ap_documents$gamma[1])` of the words in document 1 were generated from topic 1.
Most of these documents were drawn from a mix of the two topics, but document 6 was drawn almost entirely from topic 2, having a $\gamma$ from topic 1 close to zero. To check this answer, we could `tidy()` the document-term matrix.
```{r ap_document_6}
tidy(AssociatedPress) %>%
filter(document == 6) %>%
arrange(desc(count))
```
Based on the most common words, this looks like an article about the relationship between the American government and Panamanian dictator [Manuel Noriega](https://en.wikipedia.org/wiki/Manuel_Noriega), which means the algorithm was right to place it in topic 2 (as political/national news).
## Case Studies & Examples
### The Great Library Heist
>From [section 6.2 of Tidy Text Mining](https://www.tidytextmining.com/topicmodeling.html#library-heist).
When examining a statistical method, it can be useful to try it on a very simple case where you know the "right answer". For example, we could collect a set of documents that definitely relate to four separate topics, then perform topic modeling to see whether the algorithm can correctly distinguish the four groups. This lets us double-check that the method is useful, and gain a sense of how and when it can go wrong. We'll try this with some data from classic literature.
Suppose a vandal has broken into your study and torn apart four of your books:
- *Great Expectations* by Charles Dickens
- *The War of the Worlds* by H.G. Wells
- *Twenty Thousand Leagues Under the Sea* by Jules Verne
- *Pride and Prejudice* by Jane Austen
This vandal has torn the books into individual chapters, and left them in one large pile. How can we restore these disorganized chapters to their original books? This is a challenging problem since the individual chapters are **unlabeled**: we don't know what words might distinguish them into groups. We'll thus use topic modeling to discover how chapters cluster into distinct topics, each of them (presumably) representing one of the books.
We'll retrieve the text of these four books using the gutenbergr package:
```{r titles}
library(gutenbergr)
titles <- c("Twenty Thousand Leagues under the Sea",
"The War of the Worlds",
"Pride and Prejudice",
"Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
```
You'll want to start by dividing these books into chapters, then use tidytext's `unnest_tokens()` to separate them into words and remove `stop_words`. You'll treat every chapter as a separate "document", each with a name like `Great Expectations_1` or `Pride and Prejudice_11`. You'll cast this into a DTM, then run LDA. You'll look at the word-topic probabilities to try to get a sense of which topic represents which book, and you'll use the document-topic probabilities to assign chapters to their books. See [section 6.2 of Tidy Text Mining](https://www.tidytextmining.com/topicmodeling.html#library-heist) for code and a walk-through; a rough sketch of the first steps follows.
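Here's a rough sketch of those first steps, closely following the approach in the linked chapter (treat it as a starting point rather than a full solution):
```{r, eval=FALSE}
# Split each book into chapter "documents", tokenize, and remove stop words
by_chapter_word <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

# Cast chapter-level word counts into a DTM and fit a four-topic LDA model
chapters_dtm <- by_chapter_word %>%
  count(document, word) %>%
  cast_dtm(document, word, n)
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
```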
### Happy Galentine's Day!
Source: **<https://suzan.rbind.io/2018/02/happy-galentines-day/>**
This analysis does a tidy text mining analysis of several scripts from _Parks and Recreation_. In addition to the kinds of analyses we've performed here, it also illustrates some additional functionality for extracting text from PDF documents (the scripts were only available as PDFs).
![](img/tm-cs1-prec-1.png)
![](img/tm-cs1-prec-2.png)
### Who wrote the anti-Trump New York Times op-ed?
Source: **<http://varianceexplained.org/r/op-ed-text-analysis/>**
In September 2018 the New York Times published an anonymous op-ed, ["I Am Part of the Resistance Inside the Trump Administration"](https://www.nytimes.com/2018/09/05/opinion/trump-white-house-anonymous-resistance.html), written by a "senior official in the Trump administration". Lots of data scientists tried to use text-mining techniques to determine who wrote this op-ed. This analysis compares the text of the op-ed to the set of documents representing "senior officials." In addition to what we've covered here, this also covers scraping text from Twitter accounts, and methods for comparing TF-IDF vectors using cosine similarity, which was touched on in [section 4.2 of Tidy Text Mining](https://www.tidytextmining.com/ngrams.html#counting-and-correlating-pairs-of-words-with-the-widyr-package).
![](img/tm-cs2-trumpoped.png)
### Seinfeld dialogues
Source: **<https://pradeepadhokshaja.wordpress.com/2018/08/06/looking-at-seinfeld-dialogues-using-tidytext/>**
Data: **<https://www.kaggle.com/thec03u5/seinfeld-chronicles>**
This analysis uses the tidytext package to analyze the full text of the entire _Seinfeld_ series that ran 1989-1998.
![Sentiment across time for Seinfeld main characters.](img/tm-cs3-seinfeld.gif)
### Sentiment analysis in Shakespeare tragedies
Source: **<https://peerchristensen.netlify.com/post/fair-is-foul-and-foul-is-fair-a-tidytext-entiment-analysis-of-shakespeare-s-tragedies/>**
This analysis illustrates a tidytext approach to examine the use of sentiment words in the tragedies written by William Shakespeare.
![Plays ranked by ratio of negative sentiment words](img/tm-cs4-shake-1.png)
![Sentiment over time for Romeo & Juliet](img/tm-cs4-shake-2.png)
### Authorship of the Federalist Papers
Source: **<https://kanishka.xyz/2018/my-first-few-open-source-contributions-authorship-attribution-of-the-federalist-papers/>**
The Federalist Papers were written as essays between 1787 and 1788 by Alexander Hamilton, John Jay, and James Madison to promote the ratification of the Constitution. They were all published under the pseudonym 'Publius', a tribute to the founder of the Roman Republic. Authorship was later established for most of them: Hamilton wrote 51 essays, Jay wrote 5, Madison wrote 14, and Hamilton and Madison co-authored 3. The authorship of the remaining 12 has been in dispute. This post uses tidy text mining and some additional functionality to try to determine who authored the 12 in dispute.
![](img/tm-cs5-fed.png)
<!--
Search https://www.gutenberg.org/ for "H G Wells" and click on one of the results. You'll see the book ID in the URL. One thing we might do in text mining is to compare word frequencies across different texts. We loaded Jane Austen's novels from a file. Let's use gutenbergr to get two more sets of texts. First, let's look at some science fiction and fantasy novels by H.G. Wells, who lived in the late 19th and early 20th centuries. Let's get [*The Time Machine*](https://www.gutenberg.org/ebooks/35), [*The War of the Worlds*](https://www.gutenberg.org/ebooks/36), [*The Invisible Man*](https://www.gutenberg.org/ebooks/5230), and [*The Island of Doctor Moreau*](https://www.gutenberg.org/ebooks/159). We can access these works using `gutenberg_download()` and the Project Gutenberg ID numbers for each novel.
```{r, include=FALSE}
# saveRDS(hgwells, "data/_hgwells.rds")
hgwells <- readRDS("data/_hgwells.rds")
# saveRDS(bronte, "data/_bronte.rds")
bronte <- readRDS("data/_bronte.rds")
```
```{r, eval=FALSE}
library(gutenbergr)
hgwells <- gutenberg_download(c(35, 36, 5230, 159), meta_fields="author")
hgwells
```
Let's get some well-known works of the Bronte sisters, whose lives overlapped with Jane Austen's somewhat but who wrote in a rather different style. Let's get [*Jane Eyre*](https://www.gutenberg.org/ebooks/1260), [*Wuthering Heights*](https://www.gutenberg.org/ebooks/768), [*The Tenant of Wildfell Hall*](https://www.gutenberg.org/ebooks/969), [*Villette*](https://www.gutenberg.org/ebooks/9182), and [*Agnes Grey*](https://www.gutenberg.org/ebooks/767).
```{r, eval=FALSE}
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767), meta_fields="author")
bronte
```
Now let's tidy each dataset:
```{r, results="hide"}
hgtidy <- hgwells %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
hgtidy
brtidy <- bronte %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
brtidy
```
Now let's look at the top words in each:
```{r}
hgtidy %>%
count(word, sort=TRUE)
brtidy %>%
count(word, sort=TRUE)
```
-->
---