---
title: "StarWarsDataAnalysis"
author: "Sridhar Varanasi"
date: "May 11, 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
```
```{r load libraries}
library(ggplot2)
library(dplyr)
library(tidytext)
library(tidyr)
library(tm)
library(wordcloud)
library(wordcloud2)
library(reshape2)
```
In this data analysis project I will be analysing the dialogues of characters from Star Wars episodes 4, 5 and 6. Each dataset is a text file with two columns: the character name and the dialogue spoken by that character. To start things off, I will load the data from the text files into three separate data frames.
```{r load data}
# each file ships with a header row, so the read.table defaults pick up
# the character and dialogue columns
ep4 <- read.table('SW_EpisodeIV.txt')
ep5 <- read.table('SW_EpisodeV.txt')
ep6 <- read.table('SW_EpisodeVI.txt')
```
I will first analyze the episode 4 data.
To create a word cloud of the words in episode 4, I will follow the steps below:
- Create a corpus from the data frame column on which we want to build the word cloud
- Clean the corpus by applying the following functions from base R and the tm package:
  - removePunctuation(): remove punctuation
  - stripWhitespace(): remove excess whitespace
  - tolower(): make all characters lowercase
  - removeWords(): remove common English stop words ("I", "she'll", "the", etc.)
  - removeNumbers(): remove numbers
- Once the corpus has been cleaned, convert it to a term-document matrix, which will be used for the analysis
- Remove sparse terms from the term-document matrix to keep it a manageable size
- Convert the term-document matrix to a regular R matrix
- Convert that matrix to a data frame with columns for the word and its frequency
- Create a word cloud from the data frame built above
To carry out these steps I will create two functions:
**cleancorpus**, which bundles all the functions used to clean the corpus, and
**createterms**, which takes the column on which we intend to build the word cloud and returns a data frame containing each word and its frequency.
Finally, we can use the output of the **createterms** function to create the word cloud.
To begin with, I will create the **cleancorpus** function.
```{r corpus clean function}
# vector of stop words: the standard English list plus a few contractions
# that only appear after punctuation removal (e.g. "can't" becomes "cant")
stop_words <- c(stopwords("english"),
                c("thats","weve","hes","theres","ive","will","can","cant","dont",
                  "youve","youre","youll","theyre","whats","didnt","us"))
# function that applies all the cleaning steps to a corpus
cleancorpus <- function(corpus){
  # remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # remove excess whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # convert all text to lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # remove stop words
  corpus <- tm_map(corpus, removeWords, stop_words)
  # remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  # return the cleaned corpus
  return(corpus)
}
```
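To see what the cleaning does, here is a quick sanity check on a tiny made-up corpus (the two example sentences are purely illustrative):
```{r cleancorpus sanity check}
# toy corpus: two short invented lines to illustrate the cleaning steps
toy <- VCorpus(VectorSource(c("I can't believe it's 1977!", "The FORCE, the Force...")))
# inspect the cleaned text: punctuation, numbers, case and stop words are gone
sapply(cleancorpus(toy), as.character)
```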
Creating the **createterms** function:
```{r createterms}
# createterms takes a text column as input and returns a data frame
# with each term and its frequency as output
createterms <- function(text){
  # create a corpus from the text column
  df_corp <- VCorpus(VectorSource(text))
  # clean the corpus using the cleancorpus function
  df_corp_clean <- cleancorpus(df_corp)
  # create a term-document matrix
  df_tdm <- TermDocumentMatrix(df_corp_clean)
  # remove sparse terms from the term-document matrix
  df_tdm <- removeSparseTerms(df_tdm, sparse = 0.99)
  # convert the term-document matrix to a regular R matrix
  df_m <- as.matrix(df_tdm)
  # convert the matrix to a data frame with terms and term frequencies
  word_freq <- sort(rowSums(df_m), decreasing = TRUE)
  df <- data.frame(word = names(word_freq), freq = word_freq)
  # return the data frame
  return(df)
}
```
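Before building the word cloud, a quick look at the top of the data frame that **createterms** returns:
```{r createterms preview}
# the ten most frequent terms in the episode 4 dialogue
head(createterms(ep4$dialogue), 10)
```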
Generating the word cloud for episode 4 using the **createterms** function.
```{r creating wordcloud episode 4}
wordcloud2(createterms(ep4$dialogue),size=0.6, shape = 'star')
```
Now let's look at the number of dialogues spoken by each character in episode 4 of Star Wars.
```{r}
#look at the structure of the dataframe
str(ep4)
```
Both columns are of the factor data type, so I will convert them to the character data type.
```{r}
#converting columns to character
ep4$character <- as.character(ep4$character)
ep4$dialogue <- as.character(ep4$dialogue)
#checking if the conversion has been applied successfully.
str(ep4)
```
Now I will find the number of dialogues for each character. Before doing that, I want to check the number of unique characters in episode 4.
```{r}
#number of unique characters in star wars episode 4
length(unique(ep4$character))
```
There are 60 unique characters in episode 4. Now let's find the characters with the most dialogues.
```{r}
ep4%>%
group_by(character)%>%
count()%>%
ungroup()%>%
arrange(desc(n))%>%
top_n(15,n)%>%
ggplot(aes(reorder(character,n),n,fill=character))+
geom_col(show.legend = F)+
coord_flip()+
labs(x="Number of dialogues" , y="Character Name" , title ="Number of dialogues for each character in Episode 4")
```
I will now repeat the same steps for the episode 5 data.
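Since the per-episode steps are identical, they could also be wrapped in a small helper like the sketch below (shown for reference only; the explicit pipelines are kept in the chunks that follow, and the name plot_dialogue_counts is my own):
```{r dialogue count helper, eval=FALSE}
# sketch of a reusable helper for the per-episode dialogue-count plot
plot_dialogue_counts <- function(df, episode_label){
  df %>%
    count(character) %>%
    top_n(15, n) %>%
    ggplot(aes(reorder(character, n), n, fill = character)) +
    geom_col(show.legend = FALSE) +
    coord_flip() +
    labs(x = "Character Name", y = "Number of dialogues",
         title = paste("Number of dialogues for each character in", episode_label))
}
# example usage: plot_dialogue_counts(ep5, "Episode 5")
```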
```{r creating wordcloud episode 5}
wordcloud2(createterms(ep5$dialogue),size=0.3, shape = 'star')
```
Now let's look at the number of dialogues spoken by each character in episode 5 of Star Wars.
```{r}
#look at the structure of the dataframe
str(ep5)
```
Both columns are of the factor data type, so I will convert them to the character data type.
```{r}
#converting columns to character
ep5$character <- as.character(ep5$character)
ep5$dialogue <- as.character(ep5$dialogue)
#checking if the conversion has been applied successfully.
str(ep5)
```
Now I will find the number of dialogues for each character. Before doing that, I want to check the number of unique characters in episode 5.
```{r}
#number of unique characters in star wars episode 5
length(unique(ep5$character))
```
There are 49 unique characters in episode 5. Now let's find the characters with the most dialogues.
```{r}
ep5%>%
group_by(character)%>%
count()%>%
ungroup()%>%
arrange(desc(n))%>%
top_n(15,n)%>%
ggplot(aes(reorder(character,n),n,fill=character))+
geom_col(show.legend = F)+
coord_flip()+
labs(x="Number of dialogues" , y="Character Name" , title ="Number of dialogues for each character in Episode 5")
```
Creating a word cloud and doing the same text analysis for episode 6.
```{r creating wordcloud episode 6}
wordcloud2(createterms(ep6$dialogue),size=0.4, shape = 'star')
```
Now let's look at the number of dialogues spoken by each character in episode 6 of Star Wars.
```{r}
#look at the structure of the dataframe
str(ep6)
```
Both columns are of the factor data type, so I will convert them to the character data type.
```{r}
#converting columns to character
ep6$character <- as.character(ep6$character)
ep6$dialogue <- as.character(ep6$dialogue)
#checking if the conversion has been applied successfully.
str(ep6)
```
Now I will find the number of dialogues for each character. Before doing that, I want to check the number of unique characters in episode 6.
```{r}
#number of unique characters in star wars episode 6
length(unique(ep6$character))
```
There are 53 unique characters in episode 6. Now let's find the characters with the most dialogues.
```{r}
ep6%>%
group_by(character)%>%
count()%>%
ungroup()%>%
arrange(desc(n))%>%
top_n(15,n)%>%
ggplot(aes(reorder(character,n),n,fill=character))+
geom_col(show.legend = F)+
coord_flip()+
labs(x="Number of dialogues" , y="Character Name" , title ="Number of dialogues for each character in Episode 6")
```
Now I want to do some analysis on the words present in the text.
```{r}
# adding an episode identifier to each data frame
ep4$episode <- 'ep4'
ep5$episode <- 'ep5'
ep6$episode <- 'ep6'
# combining all three data frames into one
ep_full <- rbind(ep4, ep5, ep6)
```
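A quick check that the combined data frame keeps all the lines from the three episodes:
```{r}
# number of dialogue lines contributed by each episode
table(ep_full$episode)
```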
Now, to do the text analysis, I will break each dialogue into one token per word using the **unnest_tokens** function from the tidytext package.
```{r}
ep_full_tokens <- ep_full %>%
  unnest_tokens(word, dialogue)
```
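By default **unnest_tokens** also lowercases the tokens and strips punctuation, so each row of **ep_full_tokens** holds a single clean word:
```{r}
# one row per word token
head(ep_full_tokens)
```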
Now I will remove the stop words from the tokens with an anti_join between **ep_full_tokens** and a stop-word data frame (built below from the **stop_words** vector). Then I will join the result with the **bing** lexicon to get the sentiment of each word.
Since the **bing** lexicon classifies each word as simply positive or negative, I will use it to create a comparison cloud, which helps in understanding the most commonly used positive and negative words.
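For reference, the **bing** lexicon is just a two-column table mapping each word to a sentiment label:
```{r}
# the bing lexicon assigns a binary positive/negative sentiment to each word
head(get_sentiments("bing"))
```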
```{r}
# turning the stop_words vector defined earlier into a data frame for anti_join
stopword <- data.frame(word = stop_words)
```
```{r}
ep_full_tokens %>%
  anti_join(stopword, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#FF4C4C", "#99FF99"), max.words = 100)
```
Let's see which episodes have more positive and negative words.
```{r}
ep_bing <- ep_full_tokens %>%
  anti_join(stopword, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(episode, sentiment, sort = TRUE)
```
Converting the data frame to wide format to get the ratio of positive to negative words in each episode.
```{r}
ep_bing_wide <- dcast(ep_bing, episode ~ sentiment, value.var = "n")
```
Creating a variable with positive to negative word ratio.
```{r}
ep_bing_wide %>%
  mutate(positivetonegativeratio = positive/negative)
```
It seems that as the episodes progress, the ratio of positive to negative words increases.
Now I want to analyze the positive-to-negative word ratio for each character across all the episodes.
```{r}
character_bing <- ep_full_tokens %>%
  anti_join(stopword, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(character, sentiment, sort = TRUE) %>%
  # fill = -1 gives characters with no words of a given sentiment a negative
  # ratio, so the filter below keeps only characters with both word types
  dcast(character ~ sentiment, value.var = "n", fill = -1) %>%
  mutate(positivetonegativeratio = positive/negative) %>%
  filter(positivetonegativeratio > 0) %>%
  arrange(desc(positivetonegativeratio))
# display the result
character_bing
```
Creature and Trooper are the characters with the highest positive-to-negative word ratios, whereas Ackbar and Mon Mothma have spoken more negative words than positive ones.
Next I want to visualize how the balance of positive and negative words changes throughout each of the episodes.
```{r}
# creating a line-number variable for each line within each episode
tidy_episode <- ep_full %>%
  group_by(episode) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, dialogue)
```
```{r}
# creating an index variable that splits each episode into subsegments of 10 lines
tidy_episode_sentiment <- tidy_episode %>%
  mutate(index = linenumber %/% 10)
```
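Integer division with %/% puts every run of 10 consecutive lines into the same index, for example:
```{r}
# lines 1-9 map to index 0, lines 10-19 to index 1, and so on
(1:25) %/% 10
```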
```{r}
# plotting the sentiment flow through each of the Star Wars episodes
tidy_episode_sentiment %>%
  anti_join(stopword, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(episode, index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  # net sentiment = positive word count minus negative word count per subsegment
  mutate(net_sentiment = positive - negative) %>%
  ggplot(aes(index, net_sentiment, fill = episode)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~episode, ncol = 2, scales = "free_x") +
  labs(x = "Subsegments of the episodes", y = "Net sentiment",
       title = "Sentiment flow throughout each episode")
```