---
title: K-Pop Data Analysis
author:
- Christine P. Chai
- cpchai21@gmail.com
date: \today
header-includes:
  - \usepackage{fontspec} # use fontspec package
  - \usepackage{xeCJK}    # use xeCJK package
  - \setCJKmainfont{標楷體} # font for Windows (Chinese and Japanese)
  - \setCJKmonofont{標楷體} # font for Windows (Chinese and Japanese)
  - \renewcommand{\and}{\\}
output: 
  pdf_document:
    latex_engine: xelatex
    extra_dependencies: float
    number_sections: true
    citation_package: natbib
bibliography: references.bib
biblio-style: apalike
link-citations: true    
---

\renewcommand{\cite}{\citep}

```{r latex-cite-command, include=FALSE}
# natbib citation commands (kept here for reference):
#   \citep: citation in parentheses, e.g. [Mule, 2008]
#   \citet: citation as author, e.g. Mule [2008]
#   \cite:  citation as author (\citet by default); redefined as \citep above
#
# Disabled alternatives:
#   \let\cite\citep
#   \bibliographystyle{plainnat}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

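# Wrap a chunk in a LaTeX font-size command when its `size` option is not
# "normalsize", then switch back to \normalsize after the chunk.
def.chunk.hook <- knitr::knit_hooks$get("chunk")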
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})

knitr::opts_chunk$set(fig.width=6, fig.align = 'center', fig.pos = "H", out.extra = "")
```

This project started in 2024.  

Test citation \cite{chai2024statistical}

# Executive Summary

Write something here

# Disclaimer {.unnumbered}

The opinions and views expressed in this manuscript are those of the author, and do not necessarily state or reflect those of any institution or government entity.

# Introduction

How the author got interested in K-Pop music (Korean popular music):  

Tzuyu (Chou Tzu-Yu, 周子瑜)^[<https://en.wikipedia.org/wiki/Tzuyu>]  

(a lot more content here)  

\textcolor{red}{Important: Write about the K-Pop scandal revealed in 2019 and later.}  


## Read in the Idol School Dataset

Idol School (偶像學校) (2017)

Motivation: One of the contestants, Snowbaby (蔡瑞雪),^[Snowbaby's YouTube channel: <https://www.youtube.com/@snowbaby>] is originally from Taiwan.  

\textcolor{red}{Need to write the data description}  

Wikipedia data: <https://en.wikipedia.org/wiki/List_of_Idol_School_contestants>

```{r options-setup, include=FALSE}
# Prevent tibble columns from truncating
options(dplyr.width = Inf)
```

```{r idol-school-data}
library(readxl)
idol_school = read_excel("UNFINISHED_Idol_School_Dataset.xlsx",
                         sheet="Idol_School_Dataset")

# Date of birth (DOB) should be date only, not a full timestamp.
idol_school$DOB = as.Date(idol_school$DOB)

columns_to_show = c("Name_Chn", "Name_Eng", "DOB", 
                    "Vocal", "Dance", "Physical", 
                    "Overall", "Ability_Rank")

idol_school[1:10, columns_to_show]
```
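
The date conversion in the chunk above can be illustrated with a small sketch. It assumes that `read_excel()` returns a date cell as a timestamp (a `POSIXct` value); the timestamp below is a made-up example, not a value from the dataset.

```r
# Hypothetical timestamp for illustration only (not from the dataset).
dob_timestamp = as.POSIXct("2000-01-31 00:00:00", tz = "UTC")

# as.Date() drops the time of day and keeps only the calendar date.
dob_date = as.Date(dob_timestamp)
class(dob_date)
format(dob_date)
```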

## Idol School: Exploratory Data Analysis

What changes did we make from the Wikipedia data?  

Our presumption: In each category, no two contestants should have the same score.  

\textcolor{red}{Physical: We found two 3.5's and two 1.2's after sorting the scores.}  

The two 3.5 scores belong to adjacent cells in the Wikipedia data.   
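
The uniqueness check described above can be sketched as follows. The vector `scores` is a made-up example rather than the Idol School data; in this document, the same calls would presumably be applied to a column such as `idol_school$Physical`.

```r
# Hypothetical scores for illustration only (not the real dataset).
scores = c(9.1, 3.5, 6.2, 3.5, 1.2, 4.8, 1.2)

# Scores that occur more than once, in increasing order.
repeated_scores = sort(unique(scores[duplicated(scores)]))
repeated_scores

# Positions of the entries that share a repeated score.
repeated_positions = which(scores %in% repeated_scores)
repeated_positions
```

If `repeated_scores` is empty, the presumption holds for that category.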

The physical test consists of a group exercise and an individual exercise.  

In the video clip, Park Ji Won (朴池原) and her partner finished as first runners-up in the group exercise,^[Screenshot of the group physical exercise: <https://bit.ly/4a7QT9m>] so we were surprised that Ji Won's physical score was only 3.5. According to the video's score table for contestants ranked 11th to 20th,^[<https://bit.ly/400KUhH>] Ji Won's physical score should be 6.2.   

The Wikipedia table is also inconsistent in the overall score, i.e., the average across the three categories: Ji Won's listed overall score of 6.37 does not match a physical score of 3.5.  

Ji Won's vocal score was 7.9, and her dance score was 5. These scores seem reasonable for Ji Won, because she is known as a performer for excellent singing and decent dancing.^[Park Ji Won is the main vocalist in Fromis 9. <https://bit.ly/402yCFI>] Therefore, we assume that both scores are correct.  

- If the physical score were really 3.5, then Ji Won's overall score would be 5.47, dropping her from 13th place to 18th place.  

- If the overall score of 6.37 is correct, then Ji Won's physical score must be 6.2.  

The second scenario is more likely, and it agrees with the evidence we found in the video clip.  
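
The arithmetic behind the two scenarios can be checked with a short sketch. It assumes that the overall score is the unweighted mean of the three category scores, rounded to two decimal places; the variable names are illustrative only.

```r
# Scores quoted in the text above.
vocal = 7.9
dance = 5
overall_listed = 6.37

# Scenario 1: the physical score is 3.5, so recompute the overall average.
overall_scenario_1 = round(mean(c(vocal, dance, 3.5)), 2)
overall_scenario_1  # 5.47

# Scenario 2: the listed overall score is correct, so solve for physical.
physical_scenario_2 = round(3 * overall_listed - vocal - dance, 1)
physical_scenario_2  # 6.2
```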


The two 1.2 scores were more difficult to verify, especially because they belong to two lower-ranked contestants.^[Physical scores of all contestants in Idol School: <https://bit.ly/3DRNK0Z>]  

With the help of Google Translate,^[<https://translate.google.com/>] which can translate Korean text in an image into English text, we finally discovered that the physical score of Michelle White (懷特·米雪兒) should be 1.3, not 1.2.  


Idol School (2017): Videos with subtitles in Simplified Chinese  
<https://www.bilibili.com/video/BV1554y1C7wj/>

Screenshots saved:  
<https://github.com/star1327p/K-Pop-Dataset/tree/main/Idol_School_Rating_Screenshots>

\textcolor{red}{Check for the mean and median of each category score}  

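```{r idol-school-eda}
# Sort each category's scores from highest to lowest.
# na.last = TRUE keeps any missing scores, so the vectors stay the same length.
vocal_sorted = sort(idol_school$Vocal, decreasing = TRUE, na.last = TRUE)
dance_sorted = sort(idol_school$Dance, decreasing = TRUE, na.last = TRUE)
physical_sorted = sort(idol_school$Physical, decreasing = TRUE, na.last = TRUE)

# Combine the sorted scores into a data frame. Each column is sorted
# independently, so a row does not describe a single contestant.
sorted_scores = data.frame(Vocal = vocal_sorted,
                           Dance = dance_sorted,
                           Physical = physical_sorted)
head(sorted_scores, 10)
```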
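
One possible way to check the mean and median of each category score is sketched below. The data frame `toy_scores` is a made-up example, not the Idol School data; the same `sapply()` call could presumably be applied to `idol_school[, c("Vocal", "Dance", "Physical")]`.

```r
# Hypothetical scores for illustration only (not the real dataset).
toy_scores = data.frame(
  Vocal    = c(7.9, 6.0, 8.5, 4.2),
  Dance    = c(5.0, 7.1, 6.3, 3.8),
  Physical = c(6.2, 1.3, 9.0, 5.5)
)

# Mean and median of each category; na.rm = TRUE skips missing scores.
category_summary = sapply(toy_scores, function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
})
round(category_summary, 2)
```

The result is a matrix with one column per category and one row per statistic, which makes it easy to compare the categories side by side.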

## Idol School: Additional Resources

Students who were eliminated from the show:  
<https://www.ptt.cc/bbs/fromis_9/M.1555819461.A.C73.html>

Someone else used random forests to predict the final ranking:  
<https://shavid.pixnet.net/blog/post/331691281>


## Read in the Produce 48 Dataset

Produce 48 dataset (2018)

```{r produce-48-data}
produce_48_data = read_excel("UNFINISHED_Idol_School_Dataset.xlsx",
                             sheet="Produce_48_Dataset")

# Date of birth (DOB) should be date only, not a full timestamp.
produce_48_data$DOB = as.Date(produce_48_data$DOB)

# UNFINISHED HERE:
# Decide on which columns and rows to show here.

columns_to_show = c("Name_Chn", "Name_Eng", "DOB", 
                    "First_Eval", "Second_Eval", "Final_Rank")

produce_48_data[1:20, columns_to_show]
```

# Tentative Placeholders

Write something here

## Test for Non-English Characters

CJK = Chinese, Japanese, Korean

Chinese example

RStudio有辦法打中文嗎？

```{r print-Chinese}
print("大家好，很高興能認識你們！")
```

Japanese example

思い出にするにはまだ早すぎる

```{r print-Japanese}
print("みやわき さくら")
print("宮脇 咲良")
```

This template does not support Korean characters yet.

## R Markdown Narrative

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r cars}
summary(cars)
```

## Including Plots

You can also embed plots, for example in Figure \ref{fig:pressure}:

```{r pressure, fig.cap="Test Plot", echo=FALSE}
plot(pressure)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

# Acknowledgments {.unnumbered}

Write something here

\addcontentsline{toc}{section}{References}