-
Notifications
You must be signed in to change notification settings - Fork 13
/
part1-lda.Rmd
149 lines (106 loc) · 4.35 KB
/
part1-lda.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: "Topic Model Workshop - Part 1 (LDA)"
author: "Ryan Wesslen"
date: "Nov 27, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
## Load the data
Let's start by loading our packages and loading the dataset.
```{r Load Data}
library(quanteda); library(tidyverse); library(RColorBrewer)
dataset <- read_csv("./articles-sample.csv")
#create corpus
myCorpus <- corpus(dataset$Abstract)
```
## Create the dfm (pre-processing)
First, we need to create a dfm (document-feature matrix) and remove a standard list of English stop words.
We'll also remove sparse terms using the `trim` function.
```{r Pre-Processing}
dfm <- dfm(myCorpus,
remove = c(stopwords("english")),
ngrams=1L,
stem = F,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE)
vdfm <- dfm_trim(dfm, min_count = 10, min_docfreq = 5)
# min_count = remove words used less than x
# min_docfreq = remove words used in less than x docs
```
Let's explore the top 50 words.
```{r Top Features}
topfeatures(vdfm, n = 50)
```
Let's plot two word clouds: one with the raw term frequencies and one with TF-IDF.
```{r Word Clouds, warning=FALSE}
textplot_wordcloud(vdfm, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"),
random.order = F, rot.per=0.1, max.words=250, main = "Raw Counts")
textplot_wordcloud(tfidf(vdfm), scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"),
random.order = F, rot.per=0.1, max.words=250, main = "TF-IDF")
```
Let's now create a dendogram to get an idea of how the words are clustering.
```{r Clustering}
numWords <- 50
wordDfm <- dfm_sort(dfm_weight(vdfm, "tfidf"))
wordDfm <- t(wordDfm)[1:numWords,] # keep the top numWords words
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="TF-IDF Frequency weighting (First 50 Words)")
```
## Topic Modeling (LDA)
For the first part, we're going to use the `topicmodels` package to run LDA.
We're going to run Gibbs sampling which is a simulation based approach to LDA. There are multiple parameters we need to set.
The most important parameter is the number of topics. Usually, for your first time running topic modeling, there isn't a perfect number to start with. This is ok! Usually starting with 10 (hundred of documents) to 50 (tens of thousands of documents). Let's start with 20.
The second important parameter is the number of iterations. We'll set this as 500.
```{r LDA}
library(topicmodels)
# we now export to a format that we can run the topic model with
dtm <- convert(vdfm, to="topicmodels")
# estimate LDA with K topics
K <- 20
lda <- LDA(dtm, k = K, method = "Gibbs",
control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))
```
## Visualizations Example: LDAVis
To explore our results, we'll use a Shiny-based interactive visualization called LDAvis. This has been prebuilt as a R package (FYI it's also available in Python).
In order to use it, we'll need to convert our model results (in the `lda` object) to a json object that LDAVis requires as its input. That function is pre-loaded in the file `functions.R` within the repository.
```{r LDAVis, results="hide"}
#Create Json for LDAVis
library(LDAvis)
source('./functions.R')
json <- topicmodels_json_ldavis(lda,vdfm,dtm)
new.order <- RJSONIO::fromJSON(json)$topic.order
# change open.browser = TRUE to automatically open result in browser
serVis(json, out.dir = 'unccResearch', open.browser = F)
```
Let's view the topics.
```{r}
term <- terms(lda, 10)
# Topic #'s reordered!!
term <- term[,new.order]
colnames(term) <- paste("Topic",1:K)
term
```
Like topics are probability distribution of words, in LDA documents are probability distributions of topics.
Accordingly, we can rank the documents (papers) by how much they rank for each topic. In other words,
First, let's extract the document-topic probability matrix.
```{r Doc-Topic Matrix}
# to get topic probabilities per document
postlist <- posterior(lda)
probtopics <- data.frame(postlist$topics)
probtopics <- probtopics[,new.order]
colnames(probtopics) <- paste("Topic",1:K)
```
Next, let's find the most representative document for Topic 1.
```{r Representative Docs}
filter.topic <- "Topic 1"
row <- order(-probtopics[,filter.topic])[1]
dataset$Abstract[row]
```
## Session Info
```{r}
sessionInfo()
```