---
title: K-Pop Data Analysis
author:
- Christine P. Chai
date: \today
header-includes:
- \usepackage{fontspec} # use fontspec package
- \usepackage{xeCJK} # use xeCJK package
- \setCJKmainfont{標楷體} # font for Windows (Chinese and Japanese)
- \setCJKmonofont{標楷體} # font for Windows (Chinese and Japanese)
- \renewcommand{\and}{\\}
output:
  pdf_document:
    latex_engine: xelatex
    extra_dependencies: float
    number_sections: true
    citation_package: natbib
bibliography: references.bib
biblio-style: apalike
link-citations: true
---
\renewcommand{\cite}{\citep}
```{r latex-cite-command, include=FALSE}
# %\let\cite\citep
# % from \citep to \cite to cite in author style, e.g. [Mule, 2008]
# % \bibliographystyle{plainnat}
# %\citep: citation in parentheses, e.g. [Mule, 2008]
# %\citet: citation as author, e.g. Mule [2008]
# %\cite: citation as author, \citet by default
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})
knitr::opts_chunk$set(fig.width=6, fig.align = 'center', fig.pos = "H", out.extra = "")
```
Starting in 2024.
Test citation \cite{chai2024statistical}
# Executive Summary
Write something here
# Disclaimer {.unnumbered}
The opinions and views expressed in this manuscript are those of the author, and do not necessarily state or reflect those of any institution or government entity.
# Introduction
How the author got interested in K-Pop music (Korean popular music):
Tzuyu (Chou Tzu-Yu, 周子瑜)^[<https://en.wikipedia.org/wiki/Tzuyu>]
(a lot more content here)
\textcolor{red}{Important: Write about the K-Pop scandal revealed in 2019 and later.}
## Read in the Idol School Dataset
Idol School (偶像學校) (2017)
Motivation: One of the contestants, Snowbaby (蔡瑞雪),^[Snowbaby's YouTube channel: <https://www.youtube.com/@snowbaby>] is originally from Taiwan.
\textcolor{red}{Need to write the data description}
Wikipedia data: <https://en.wikipedia.org/wiki/List_of_Idol_School_contestants>
```{r options-setup, include=FALSE}
# Prevent tibble columns from truncating
options(dplyr.width = Inf)
```
```{r idol-school-data}
library(readxl)
idol_school = read_excel("UNFINISHED_Idol_School_Dataset.xlsx",
                         sheet="Idol_School_Dataset")
# Date of birth (DOB) should be date only, not a full timestamp.
idol_school$DOB = as.Date(idol_school$DOB)
columns_to_show = c("Name_Chn", "Name_Eng", "DOB",
                    "Vocal", "Dance", "Physical",
                    "Overall", "Ability_Rank")
idol_school[1:10, columns_to_show]
```
## Idol School: Exploratory Data Analysis
What changes did we make from the Wikipedia data?
Our presumption: In each category, no two contestants should have the same score.
\textcolor{red}{Physical: We found two 3.5's and two 1.2's after sorting the scores.}
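The uniqueness check can be sketched as follows. This is a minimal sketch, not necessarily the procedure used for this manuscript: `find_tied_scores` is a hypothetical helper, and the example vector holds made-up values rather than real contestant scores.

```{r tied-scores-sketch}
# Hypothetical helper: return each score that appears more than once.
find_tied_scores = function(scores) {
  scores = scores[!is.na(scores)]
  sort(unique(scores[duplicated(scores)]), decreasing = TRUE)
}

# Made-up scores for illustration only (not real contestant data).
example_scores = c(8.1, 6.2, 3.5, 3.5, 2.4, 1.2, 1.2)
find_tied_scores(example_scores)
```

On the real data, the same helper could be applied to one category at a time, e.g., `find_tied_scores(idol_school$Physical)`.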
The two 3.5 scores belong to adjacent cells in the Wikipedia data.
Physical testing contains a group exercise and an individual exercise.
In the video clip, Park Ji Won (朴池原) and her partner finished as first runners-up in the group exercise,^[Screenshot of the group physical exercise: <https://bit.ly/4a7QT9m>] so we were surprised that her physical score was listed as only 3.5. According to the video's score table for contestants ranked 11th to 20th,^[<https://bit.ly/400KUhH>] Ji Won's physical score should be 6.2.

The Wikipedia table is also inconsistent in the overall score, i.e., the average across the three categories.
Ji Won's vocal score was 7.9, and her dance score was 5. These scores seem reasonable for Ji Won, because she is known as a performer for excellent singing and decent dancing.^[Park Ji Won is the main vocalist in Fromis 9. <https://bit.ly/402yCFI>] Therefore, we assume both scores are correct.

- If the physical score had really been 3.5, then Ji Won's overall score would have been 5.47, dropping her from 13th place to 18th.
- If the overall score of 6.37 is correct, then Ji Won's physical score should be 6.2.

The second scenario is more likely.
The score table in the video clip supports this scenario, because it lists Ji Won's physical score as 6.2.
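The arithmetic behind the two scenarios can be reproduced with a short sketch. It assumes that the overall score is the simple average of the three category scores rounded to two decimal places; the rounding rule and the variable names are our assumptions.

```{r overall-score-check}
# Category scores for Park Ji Won as stated in the text.
vocal = 7.9
dance = 5

# Scenario 1: the physical score is 3.5; compute the implied overall score.
overall_scenario_1 = round(mean(c(vocal, dance, 3.5)), 2)

# Scenario 2: the overall score is 6.37; solve for the implied physical score.
physical_scenario_2 = round(3 * 6.37 - vocal - dance, 1)

c(overall = overall_scenario_1, physical = physical_scenario_2)
```

The first value is the overall score implied by a physical score of 3.5, and the second is the physical score implied by an overall score of 6.37.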
The two 1.2 scores are more difficult to check against the underlying values, especially because they belong to two lower-ranked contestants.^[Physical scores of all contestants in Idol School: <https://bit.ly/3DRNK0Z>]
With the help of Google Translate,^[<https://translate.google.com/>] which can translate Korean text in an image into English, we finally discovered that the physical score of Michelle White (懷特·米雪兒) should be 1.3, not 1.2.
Idol School (2017) videos with subtitles in Simplified Chinese: <https://www.bilibili.com/video/BV1554y1C7wj/>

Screenshots saved: <https://github.com/star1327p/K-Pop-Dataset/tree/main/Idol_School_Rating_Screenshots>
\textcolor{red}{Check for the mean and median of each category score}
```{r idol-school-eda}
vocal_sorted = sort(idol_school$Vocal, decreasing = TRUE)
dance_sorted = sort(idol_school$Dance, decreasing = TRUE)
physical_sorted = sort(idol_school$Physical, decreasing = TRUE)
# UNFINISHED HERE
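# Combine the sorted scores into a data.frame, one column per category.
# This assumes no missing scores, so that the three vectors have equal length.
sorted_scores = data.frame(vocal_sorted, dance_sorted, physical_sorted)
head(sorted_scores, 10)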
```
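The mean and median of each category score could be checked along the following lines. This is a minimal sketch on made-up numbers, not real contestant data; only the column names `Vocal`, `Dance`, and `Physical` come from the dataset.

```{r category-summary-sketch}
# Made-up scores for illustration only (not real contestant data).
example_scores = data.frame(
  Vocal    = c(7.9, 6.0, 4.5, 3.0),
  Dance    = c(5.0, 7.5, 6.5, 2.0),
  Physical = c(6.2, 3.5, 1.3, 5.0)
)

# Mean and median of each category, one row per statistic.
category_summary = rbind(
  mean   = sapply(example_scores, mean, na.rm = TRUE),
  median = sapply(example_scores, median, na.rm = TRUE)
)
round(category_summary, 2)
```

For the real data, `example_scores` would be replaced with `idol_school[, c("Vocal", "Dance", "Physical")]`.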
## Idol School: Additional Resources
Students who were eliminated from the show: <https://www.ptt.cc/bbs/fromis_9/M.1555819461.A.C73.html>

Someone else used random forests to predict the final ranking: <https://shavid.pixnet.net/blog/post/331691281>
## Read in the Produce 48 Dataset
Produce 48 dataset (2018)
```{r produce-48-data}
produce_48_data = read_excel("UNFINISHED_Idol_School_Dataset.xlsx",
                             sheet="Produce_48_Dataset")
# Date of birth (DOB) should be date only, not a full timestamp.
produce_48_data$DOB = as.Date(produce_48_data$DOB)
# UNFINISHED HERE:
# Decide on which columns and rows to show here.
columns_to_show = c("Name_Chn", "Name_Eng", "DOB",
                    "First_Eval", "Second_Eval", "Final_Rank")
produce_48_data[1:20, columns_to_show]
```
# Tentative Placeholders
Write something here
## Test for Non-English Characters
CJK = Chinese, Japanese, Korean
Chinese example
RStudio有辦法打中文嗎?
```{r print-Chinese}
print("大家好,很高興能認識你們!")
```
Japanese example
思い出にするにはまだ早すぎる
```{r print-Japanese}
print("みやわき さくら")
print("宮脇 咲良")
```
This template does not support Korean characters yet.
## R Markdown Narrative
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example in Figure \ref{fig:pressure}:
```{r pressure, fig.cap="Test Plot", echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
# Acknowledgments {.unnumbered}
Write something here
\addcontentsline{toc}{section}{References}