-
Notifications
You must be signed in to change notification settings - Fork 4
/
index.Rmd
110 lines (92 loc) · 3.31 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
<!-- badges go here -->
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
[![Travis build status](https://travis-ci.org/news-r/gensimr.svg?branch=master)](https://travis-ci.org/news-r/gensimr)
[![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/JohnCoene)
[![code-size](https://img.shields.io/github/languages/code-size/news-r/gensimr.svg)](https://github.com/news-r/gensimr)
[![activity](https://img.shields.io/github/last-commit/news-r/gensimr.svg)](https://github.com/news-r/gensimr)
[![Coveralls test coverage](https://coveralls.io/repos/github/news-r/gensimr/badge.svg)](https://coveralls.io/r/news-r/gensimr?branch=master)
[![Codecov test coverage](https://codecov.io/gh/news-r/gensimr/branch/master/graph/badge.svg)](https://codecov.io/gh/news-r/gensimr?branch=master)
<!-- badges: end -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
warning = FALSE,
collapse = TRUE,
comment = "#>"
)
par(bg = '#f9f7f1')
library(htmltools)
reticulate::use_virtualenv("./env")
```
```{r, echo=FALSE}
br()
br()
div(
class = "row",
div(
class = "col-md-4",
img(
src = "logo.png",
class = "img-responsive responsive-img"
)
),
div(
class = "col-md-8",
p(
"Topic Modeling for Humans with gensim.",
br(),
"Large scale efficient topic modeling in R and Python."
),
br(),
p(
tags$a(
tags$i(class = "fa fa-level-down-alt"),
class = "btn btn-primary",
href = "articles/installation.html",
style = "margin-bottom: 5px;",
"Installation"
),
tags$a(
tags$i(class = "fa fa-terminal"),
class = "btn btn-default",
href = "reference",
style = "margin-bottom: 5px;",
"Reference"
)
)
)
)
```
## Example
Below we build a very basic Latent Dirichlet Allocation model aiming for 2 latent dimensions using example data.
```{r}
library(gensimr)
# example corpus
data("corpus", package = "gensimr")
# preprocess documents
texts <- prepare_documents(corpus)
dictionary <- corpora_dictionary(texts)
corpus <- doc2bow(dictionary, texts)
# create tf-idf model
tfidf <- model_tfidf(corpus)
tfidf_corpus <- wrap(tfidf, corpus)
# latent similarity index
lda <- model_lda(tfidf_corpus, id2word = dictionary, num_topics = 2L)
topics <- lda$print_topics() # get topics
```
Objects returned by the package are not automatically converted to R data structures, use `reticulate::py_to_r` as shown below to convert them.
```{r}
reticulate::py_to_r(topics) # convert to R format
```
We can then use our model to transform our corpus and then the document topic matrix.
```r
corpus_wrapped <- wrap(lda, corpus)
doc_topics <- get_docs_topics(corpus_wrapped)
plot(doc_topics$dimension_1_y, doc_topics$dimension_2_y)
```
```{r, echo = FALSE}
corpus_wrapped <- wrap(lda, corpus)
doc_topics <- get_docs_topics(corpus_wrapped)
par(bg = '#f4f1e6')
plot(doc_topics$dimension_1_y, doc_topics$dimension_2_y)
```
The plot correctly identifies two topics/clusters. As stated in table 2 from [this paper](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf), the example corpus (`data(corpus)`) essentially has two classes of documents. First five are about human-computer interaction and the other four are about graphs.