---
title: 'Report: Deezer Kaggle Inclass Competition 2017'
author: "Christian Kregelin, Dennis Uckel, Max Philipp, Pranav Pandya and Vera Weidmann"
date: "14. April - 31. May 2017"
output:
html_document:
toc: true
toc_float: true
toc_depth: 4
---
## 1. Introduction
This project is based on the Data Science Game 2017 hosted on Kaggle, a worldwide competition between universities. This year the data was provided by the music streaming service Deezer. Deezer offers a recommendation feature called Flow, which suggests songs the user might like to listen to. The algorithm behind "Flow" uses collaborative filtering to provide the user with the right music at the right time. While the algorithm detects the user's current listening taste by analysing the particular streaming session in which songs are listened to or skipped, the first song is the most difficult one to predict. Therefore, the goal of this challenge was to predict how likely it is that the users of the test dataset will listen to the first track suggested by "Flow".
At the end of the project we finished 66th out of 145 teams, with 45 submissions and a score of **0.63860**. This report reflects our whole project journey, including our ideas, failures and achievements.
The report therefore starts with a short introduction to the dataset. Subsequently, our feature engineering approaches are explained and the prediction models we used are presented.
The libraries we used for the project are the following:
```{r Initialization, warning=FALSE, message=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr,
xgboost,
Matrix,
data.table,
caret,
splitstackshape,
ggplot2,
psychometric,
RColorBrewer,
jsonlite,
httr,
corrplot,
gridExtra,
formattable,
qdap,
tm,
textcat,
rvest,
stringr
)
```
## 2. Data Analysis & Challenges
This chapter illustrates an exploratory analysis of the competition dataset.
As mentioned before, the goal of this challenge was to predict whether the first recommended song will be listened to or not. Not listened means that the song was skipped within the first 30 seconds, in which case it is assumed that the user does not like the song.
The test dataset contains the first recommended "Flow" track for several different users. The train dataset was generated from each user's listening history over one month, based on songs which were listened to in "Flow" as well as songs played via searches or saved playlists. Each row represents one listened or not-listened song.
Data:
```{r}
Deezer <- read.csv("/home/Deezer/10_Basic_Dataset/train.csv")
Deezer_test <- read.csv("/home/Deezer/10_Basic_Dataset/test.csv")
tmp <- Deezer_test[,-1]
tmp$is_listened <- 0
all <- rbind(Deezer,tmp)
rm(tmp)
```
The train data contains `r round(dim(Deezer)[1]/1000000,2)` million rows and `r dim(Deezer)[2]` features, while the test set just contains `r round(dim(Deezer_test)[1]/1000,2)` thousand rows.
The dataset provides the following columns, which we grouped into user-specific, song-specific and device-specific features. These features were analyzed, NAs detected and properly handled. As all features are stored as numeric, we convert them into the right data types later on.
User specific:
* user_id - anonymized and unique id of the user
* user_gender - gender of the user
* user_age - age of the user
Song specific:
* media_id - identifier of the song listened to by the user
* album_id - identifier of the song's album
* artist_id - identifier of the song's artist
* genre_id - identifier of the song's genre
* media_duration - duration of the song in seconds
* context_type - type of context in which the song was listened to: playlist, album, ...
* release_date - release date of the song in the format YYYYMMDD
Device specific:
* platform_name - type of operation system
* platform_family - type of device
Other features:
* ts_listen - timestamp of the listening event in UNIX time (time since 1970-01-01 00:00, usually in seconds, sometimes in milliseconds)
* listen_type - whether the song was listened to in "Flow" or not
Response variable:
* is_listened - 1 if the track was listened to for more than 30 seconds, 0 otherwise
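The type conversion mentioned above can be illustrated with a minimal sketch (the helper `to_factors` and the toy data are our own illustration, not the project's actual code):

```r
# Hypothetical helper: convert id-like numeric columns to factors.
to_factors <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.factor)
  df
}

# Toy stand-in for two of the columns listed above
toy <- data.frame(user_gender = c(0, 1, 0), genre_id = c(0, 7, 0))
toy <- to_factors(toy, c("user_gender", "genre_id"))
str(toy)  # both columns are now factors
```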
#### 2(a) User Specific Features
The data contains `r length(unique(Deezer$user_id))` different users. Some of them occur thousands of times, while nearly 1,500 users occur just once (with one listened or not-listened song). No NAs were found.
```{r, warning=FALSE}
user_distribution <- Deezer %>%
group_by(user_id) %>%
summarise(n=n()) %>%
arrange (-n)
myColors <- rep(brewer.pal(7,"Blues")[5:6],50)
ggplot(user_distribution, aes(x=n))+
geom_histogram(bins=100, fill=myColors,color="black")+
scale_x_continuous(breaks=seq(0,2000,50),limits=c(0,2000))+
labs(subtitle="in the train set",
y="number of users",
x="number of songs",
title="Distribution of Observations per User",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("observations_distribution.png",plot=gg6, width = 8, height=5, units="in")
```
The data reflects the listening behavior of male and female users. However, we do not know which gender the values 0 and 1 correspond to. The plot below illustrates the number of observations per gender.
```{r}
myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=user_gender))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="number of users",
x="Gender",
title="Distribution of Users Gender",
caption="Source: Deezer train data")+
theme_light()
```
The users' ages range from `r min(Deezer$user_age)` to `r max(Deezer$user_age)` with a median of `r round(median(Deezer$user_age))`. The following boxplot illustrates the age distribution of the unique users. Accordingly, 50% of the users are 24 years old or younger. The age feature does not contain any NAs in the Deezer dataset.
```{r}
sum(is.na(Deezer$user_age))
age <- Deezer %>%
group_by(user_id) %>% summarise(age=max(user_age))
boxplot(age$age)
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],6),"#4292C6")
ggplot(age, aes(x=age))+stat_bin(binwidth=1, fill=myColors, color="black") + scale_y_continuous(limits=c(0,2200),breaks=seq(0,2200,100))+
stat_bin(binwidth=1, geom="text", aes(label=..count..), vjust=-1.5)+ scale_x_continuous(breaks=seq(18,30,1))+
labs(y="number of users",
x="age",
title="Distribution of Age among Users",
caption="Source: Deezer train data")+
theme_light()
#ggsave("Age_distribution.png",plot=gg7, width = 8, height=5, units="in")
```
#### 2(b) Song Specific Features
The dataset contains `r length(unique(Deezer$media_id))` different media_ids. One media_id represents one specific song from one artist on one specific album. At first, we thought that one media_id reflects one unique song. However, checking the provided extra information (extra_infos.json), which links each media_id to a song title, an album title and an artist name, shows that one unique song (e.g. "Everybody" by the Backstreet Boys) can have more than one media_id. This is caused by the album: whenever a song also appears on another album (e.g. Fetenkult - Best of the 90's), a new media_id is generated.
One example is covered by the following data table:
```{r echo = FALSE, message=FALSE, warning=FALSE}
load("/home/Deezer/deezer_report/extra.rda")
```
```{r}
extra_example <- extra %>%
filter(sng_title=="Everybody (Backstreet's Back)" & art_name=="Backstreet Boys")
data.table(extra_example)
```
Furthermore, we can count `r length(unique(Deezer$album_id))` different albums and `r length(unique(Deezer$artist_id))` different artists in the dataset. No NAs were identified.
```{r}
sum(is.na(Deezer$media_id))
sum(is.na(Deezer$album_id))
sum(is.na(Deezer$artist_id))
```
According to the extra information from the json file, we know that media_id and album_id are strongly correlated. The correlation coefficient in this case is `r cor(Deezer$media_id,Deezer$album_id)`.
The following histograms underline this correlation. Furthermore, some media_ids (and therefore some albums and artists) are recommended quite often across all users.
```{r}
par(mfrow=c(1,3))
hist(Deezer$media_id)
hist(Deezer$album_id)
hist(Deezer$artist_id)
```
Likewise, `r length(unique(Deezer$genre_id))` different genres exist in the dataset. The histogram clearly shows that one specific genre with id = 0 occurs 3 million times, while the second most frequent one occurs only around 1 million times.
This would mean that one specific genre is recommended the most. However, further investigations are carried out in the feature engineering chapter.
```{r}
sum(is.na(Deezer$genre_id))
number_genre <- Deezer %>%
group_by(genre_id) %>%
summarize(n=n()) %>%
arrange(-n)
head(number_genre)
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],25))
ggplot(number_genre, aes(genre_id)) +
geom_histogram(bins=50,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,750),breaks=seq(0,750,100))+
labs(y="number of observations",
x="genre_id",
title="Distribution of Genre_id",
caption="Source: Deezer train data")+
theme_light()
```
We were also provided with information about the media duration. A quick look at the summary reveals a couple of outliers.
```{r}
summary(Deezer$media_duration)
```
For this visualisation, we filtered out values below the 5% and above the 95% quantile, keeping all values between `r quantile(Deezer$media_duration,0.05)` and `r quantile(Deezer$media_duration,0.95)`.
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))
ggplot(Deezer%>%filter(media_duration>=quantile(media_duration,0.05) & media_duration<=quantile(media_duration,0.95)), aes(media_duration)) +
geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="Number of Observations",
x="Media Duration in Seconds",
title="Distribution of Mediaduration",
subtitle="For 90% of the Data",
caption="Source: Deezer train data")+
theme_light()
```
The column context_type reflects the context in which the song was listened to, for example whether it was played from a playlist or an album.
```{r}
sum(is.na(Deezer$context_type))
#hist(Deezer$context_type)
head(table(Deezer$context_type))
```
The release date of a song has the format YYYYMMDD. By extracting the year, the most common release years can be analyzed more easily. It turns out that songs from recent years occur most often in our dataset.
```{r}
sum(is.na(Deezer$release_date))
releaseDate = as.Date(as.character(Deezer$release_date), format="%Y%m%d")
splitrdt <- data.frame(ryear = as.numeric(format(releaseDate, format = "%Y")))
Deezer = cbind(Deezer, splitrdt)
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))
ggplot(Deezer%>%filter(ryear>=quantile(ryear,0.05) & ryear<=quantile(ryear,0.95)), aes(ryear)) +
geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="Number of Observations",
x="Release Year of a song",
title="Distribution of Release Year",
subtitle="for 90% of the data",
caption="Source: Deezer train data")+
theme_light()
```
#### 2(c) Device Specific Features
The dataset also contains information about the device and the operating system the user used to listen to the music. Both features take 3 values (1:3). While device "1" and operating system "1" appear most of the time, we cannot say which specific device or system they are. Furthermore, these two columns are correlated (r=`r round(cor(Deezer$platform_family, Deezer$platform_name),2)`).
```{r}
sum(is.na(Deezer$platform_family))
sum(is.na(Deezer$platform_name))
par(mfrow=c(1,2))
myColors <- brewer.pal(7,"Blues")[5:7]
ggplot(Deezer,aes(x=platform_family))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Observations",
x="Used Device",
title="Distribution of Platform",
caption="Source: Deezer train data")+
theme_light()
ggplot(Deezer,aes(x=platform_name))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Observations",
x="Used Operating System",
title="Distribution of Platform",
caption="Source: Deezer train data")+
theme_light()
```
#### 2(d) Other Features, incl. Response Variable
The ts_listen feature contains the time at which a specific song was recommended to a user. It is in UNIX format and will be modified later on. The third chapter on feature engineering covers some visualizations of the timestamp. No NAs were detected.
```{r}
sum(is.na(Deezer$ts_listen))
```
The listen_type feature indicates whether the song was listened to in "Flow" or not. This is important because the training data includes a user's whole listening history, not just their behavior on recommendations. Songs not recommended by "Flow" could be songs the user searched for or saved in specific playlists. Thereby, `r round(table(Deezer$listen_type)[1]/1000000,2)` million rows do not represent recommended songs, while `r round(table(Deezer$listen_type)[2]/1000000,2)` million were suggested by the recommendation engine "Flow". The test data contains only songs suggested by "Flow".
```{r}
myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=listen_type))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="number of observations",
x="listening type",
title="Distribution of Listening Type",
caption="Source: Deezer train data")+
theme_light()
```
```{r}
table(Deezer_test$listen_type)
sum(is.na(Deezer$listen_type))
```
Interestingly, we detected one single observation in the test set with listen_type == 0.
Finally, the most important column, is_listened, is the response variable in the modeling part.
```{r}
myColors <- rep(brewer.pal(7,"Blues")[5:6],2)
ggplot(Deezer,aes(x=is_listened))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Number of Observations",
x="Listening Type",
title="Distribution of Is Listened versus Listening Type",
caption="Source: Deezer train data")+
theme_light()+facet_wrap(~listen_type)
```
Here, we can see that songs which were not suggested (e.g. searched for or listened to in a playlist) have a higher chance of being listened to (`r mean(Deezer$is_listened[Deezer$listen_type==0])`) than songs which were suggested by "Flow" (`r mean(Deezer$is_listened[Deezer$listen_type==1])`).
#### 2(e) Summary of Main Challenges
At first glimpse, the dataset looks like tall data. However, most of the features are factors. As mentioned in the previous chapter, the data contains `r length(unique(Deezer$user_id))` different users, `r length(unique(Deezer$genre_id))` genres, `r length(unique(Deezer$artist_id))` artists, `r length(unique(Deezer$album_id))` albums and `r length(unique(Deezer$media_id))` unique songs.
Therefore, the main challenge of this project was to create a prediction model with this number of factor levels. In addition, the insufficient data in the genre_id column and the very granular timestamp and release date were features which needed improvement.
The following correlation plot visualizes the collinearity between the variables. It is easy to see that song and release date, as well as operating system and device, are highly correlated. All other variables show only a slight dependency.
```{r}
corDeezer <- cor(Deezer)
#png("correlationMatrix.png")
corDeezerplot <- corrplot(corDeezer, method = "ellipse", type = "full")
#dev.off()
```
## 3. Feature Engineering
In this chapter we describe our feature engineering, starting with simple transformations and calculations and moving on to more advanced ideas. We try to demonstrate everything we achieved and our journey to the point of greatest success.
#### 3(a) Timestamp Conversion
Converting the timestamp is the most basic feature engineering step we performed. We extract the hour of the day as well as the weekday from the timestamp. This procedure is so simple that we did it on the fly, as you will see at the end of the feature engineering section. Nevertheless, the process is described in the following.
The idea behind this is the assumption that the hour of the day and/or the day of the week contribute patterns to the prediction, e.g. if users regularly go to the gym on Tuesdays and Thursdays at 8pm, or if users tend to listen to slower songs in the evening and faster songs on the way to work. Such patterns cannot be found with a time precision of seconds.
```{r}
transforming_timestamp <- function(x){
timestamp = as.POSIXct(x$ts_listen, origin="1970-01-01")
splitdt <- data.frame(
hh = as.numeric(format(timestamp, format = "%H")), #24hours format
wd = as.numeric(format(timestamp, format = "%w"))) #weekday
x = cbind(x, splitdt) #cbind the extracted time to data
return(x)
}
transforming_timestamp(Deezer)[1:10,12:17]
```
```{r}
tmp <- transforming_timestamp(Deezer)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)
ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,600000),breaks=seq(0,600000,50000))+
scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
x="hour of day",
title="Distribution of Listening Hour",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("hh_distribution_train.png",plot=gg9, width = 8, height=5, units="in")
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")
ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,1300000),breaks=seq(0,1300000,100000))+
scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
x="weekday",
title="Distribution of Listening Weekday",
caption="Source: Deezer train data")+
theme_light()
#ggsave("wd_distribution_train.png",plot=gg10, width = 8, height=5, units="in")
```
```{r}
tmp <- transforming_timestamp(Deezer_test)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)
ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,1800),breaks=seq(0,1800,100))+
scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
x="hour of day",
title="Distribution of Listening Hour",
subtitle="for test data",
caption="Source: Deezer test data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("hh_distribution_test.png",plot=gg11, width = 8, height=5, units="in")
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")
ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,5000),breaks=seq(0,5000,250))+
scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
x="weekday",
title="Distribution of Listening Weekday",
subtitle="for test data",
caption="Source: Deezer test data")+
theme_light()
#ggsave("wd_distribution_test.png",plot=gg12, width = 8, height=5, units="in")
```
#### 3(b) Beats per Minute
Beats per minute (bpm) is a simple numeric feature which we could query from the Deezer API. We thought it might help to differentiate within genres, for example "soft rock" tending to be slower (lower bpm) than its parent genre "rock". It was not our intention to create subgenres in any way, but just to feed this extra information into the model.
For querying the API we used the `httr` package, and `jsonlite` to handle the response. To show how the code works, we limit the number of queries to 50 instead of the nearly 450,000 media_ids.
```{r}
## Load necessary packages
library(jsonlite)
library(httr)
uniquetracks <- as.data.frame(unique(all$media_id)[1:50]) #getting all unique media_ids
uniquetracks$api <- paste("https://api.deezer.com/track/",uniquetracks[,1], sep="")
uniquetracks$bpm <- 0
for (i in 1:length(uniquetracks[,1])) { ## for all tracks
this.raw.result <- GET(url = as.character(uniquetracks[i,2])) ## get the infos
this.result <- fromJSON(rawToChar(this.raw.result$content)) ## turn it into a readable format
uniquetracks$bpm[i] <- ifelse(is.null(this.result$bpm),"NA",this.result$bpm) ## get the BPM
#message(as.character(i), appendLF = FALSE) ## print the iteration to see if the code is still working
Sys.sleep(time = 0.05) ## cap the speed so the 50 per 5 seconds are not violated
}
head(uniquetracks,10)
#save(tracks, file = "tracks_BPM.rda")
```
After fetching the information from the API, we need to clean it. Because the bpm is received as a character, we can set all cells where the string is "NA" to 0 without using is.na(). Transforming them to numeric afterwards leaves no missing data but 30,000 zeros (only 171 of which were introduced by setting "NA" to 0).
```{r warning=FALSE}
load("/home/Deezer/30_Wrangled_Data/Archiv/tracks_BPM.rda")
tracks <- tracks[,c(2,3)]
tracks$track_bpm[tracks$track_bpm=="NA"] <- "0"
tracks$track_bpm <- as.numeric(tracks$track_bpm)
myColors <- rep(brewer.pal(7,"Blues")[5:6],20)
ggplot(tracks, aes(x=track_bpm))+stat_bin(bins=40, fill=myColors, color="black")+
scale_y_continuous(limits=c(0,46000),breaks=seq(0,45000,5000))+
scale_x_continuous(breaks=seq(0,240,10))+
labs(title="Distribution of Beats per Minute",
y="observations",
x="beats per minute",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
```
To fill the zeros, we will take the mean bpm of the album the media is on, or of its genre if the album information is not available. Sadly, at this point we did not yet have information about the genre, so in order to keep things chronologically correct and not confusing, we will move to the genre section now and come back to this afterwards.
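The imputation idea can be sketched as follows (toy data only; `impute_bpm` and the column names are illustrative assumptions, not the project's actual code):

```r
library(dplyr)

# Toy stand-in for the joined track table; 0 marks a missing bpm.
tracks_toy <- data.frame(
  album_id  = c(1, 1, 2, 2, 3, 4),
  genre_id  = c(10, 10, 10, 10, 20, 20),
  track_bpm = c(120, 0, 100, 140, 0, 130)
)

impute_bpm <- function(df) {
  df %>%
    group_by(album_id) %>%
    mutate(alb_mean = mean(track_bpm[track_bpm > 0])) %>%   # album mean, NaN if none
    group_by(genre_id) %>%
    mutate(gen_mean = mean(track_bpm[track_bpm > 0])) %>%   # genre mean as fallback
    ungroup() %>%
    mutate(track_bpm = ifelse(track_bpm > 0, track_bpm,
                              ifelse(is.finite(alb_mean), alb_mean, gen_mean))) %>%
    select(-alb_mean, -gen_mean)
}

impute_bpm(tracks_toy)
```

In this toy example the second track of album 1 gets the album mean (120), while the track on album 3 has no album-level bpm and falls back to the mean of genre 20 (130).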
#### 3(c) Discarded: EchoNest, Singer's Gender, Users Age, User's Language vs. Song Language
To increase the available information we looked for additional sources, such as API providers, and found EchoNest and an R interface (https://github.com/mukul13/rechonest). EchoNest provides a lot of information about songs, albums, artists, etc. Unfortunately, EchoNest was bought by Deezer's market competitor Spotify. Furthermore, we found a hint in the DSG forum that we may use the Deezer API but are not allowed to use other sources.
In a brainstorming session we thought about additional factors which influence whether we want to listen to a song or not. One aspect is the timbre of a voice or the gender of an artist, e.g. in combination with a genre. Since we were not allowed to use EchoNest and this information is not available in Deezer's API, we discarded this idea.
Thinking of one's parents' music taste, one could claim that older people tend to listen to older songs; one explanation could be that those users tend to listen to songs from their youth. The question was whether user_age and the year of release_date correlate. Unfortunately, user_age and release_date were not significant in our first models.
#### 3(d) Exploring New Features with Language Detection
The idea of this approach was to further explore the "extra info" data and check whether language detection can help identify new features. At first we cleaned the text, then used the "textcat" package to identify the language of the song title, album title and artist name. The next step was to include logical interactions between these variables: for example, if the song language and the album language are exactly the same, we group them; otherwise we consider it a variation. Below is the step-by-step procedure used to extract these language features.
```{r eval=FALSE}
# Add new features: join json file with media description (categorical variables)
library(jsonlite)
library(httr)
library(qdap)
library(tm)
extra = stream_in(file("/home/Deezer/60_data_other_models/extra_infos.json"))
extra = extra[1:100,]
str(extra)
##################################################
## add language feats (detect lang)
##################################################
library("textcat")
library("rvest")
library("stringr")
#text cleaning (in order to assign unique numeric value)
extra[] = lapply(extra, tolower)
extra[] = lapply(extra, removePunctuation)
extra[] = lapply(extra, stripWhitespace)
extra$songLang = textcat(extra$sng_title)
extra$albLang = textcat(extra$alb_title)
extra$artistLang = textcat(extra$art_name)
#if song lang and alb lang are exactly the same then group them, else consider it a variation
extra$langSngAlb = ifelse(extra$songLang == extra$albLang, extra$albLang, "variation")
extra$langAlbArt = ifelse(extra$albLang == extra$artistLang, extra$artistLang, "variation")
extra$langSngArt = ifelse(extra$songLang == extra$artistLang, extra$artistLang, "variation")
extra$langReg = ifelse(extra$songLang == "french" | extra$songLang == "german" | extra$songLang == "spanish" |
extra$songLang == "swedish" | extra$songLang == "italian" |
extra$songLang == "polish", "TopEuLang", "EnglishOrOther")
str(extra)
extra$media_id = as.numeric(extra$media_id)
# save(extra,file="extra_feats11.rda")
#add to train_test dataset
#train_test = left_join(train_test, extra, by = "media_id")
#str(train_test)
```
We checked the importance of these language features through the feature importance matrix in various XGBoost (label encoded), LightGBM and H2O models, but their contribution to accuracy was relatively low. The illustration of feature importance is included at the end of the XGBoost NUM and LightGBM model scripts.
* [LightGBM](https://github.com/pranavpandya84/deezer_report/blob/master/Models/LightGBM.rmd) (scroll down to the bottom for the feature importance matrix)
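A minimal sketch of how such a feature importance matrix is obtained with XGBoost (toy data only; in the project the full feature matrix including the language features was used instead):

```r
library(xgboost)

set.seed(1)
# Toy feature matrix; "hh" is constructed to drive the label.
X <- matrix(rnorm(500 * 4), ncol = 4,
            dimnames = list(NULL, c("songLang", "albLang", "hh", "wd")))
y <- rbinom(500, 1, plogis(2 * X[, "hh"]))

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 20)

# Gain/Cover/Frequency per feature, sorted by Gain
xgb.importance(model = bst)
```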
#### 3(e) Genre_id
In a first approach we tried to get enhanced information about the genres by querying the API for the genre_ids.
As seen in the visualization part, we were confronted with a vast amount of genre_id == 0, and it would be nice to know which genre it is.
```{r}
uniquegenres <- unique(all$genre_id) #getting all unique genre_ids
uniquegenres <- as.data.frame(paste("https://api.deezer.com/genre/",uniquegenres, sep="")) #paste the ids into the needed API-url to access the informations
uniquegenres$name <- "" #initialise a empty column
uniquegenres$genre_id <- unique(all$genre_id)
```
At this point we are able to query the API using the following loop. To show the results, we set the sample size to 50.
```{r}
#library(httr)
#library(jsonlite)
for (i in 1:50){ #first 50 genres
this.raw.result <- GET(url = as.character(uniquegenres[i,1])) # get data
this.result <- fromJSON(rawToChar(this.raw.result$content)) # turn data from unix to char and from json into a variable
uniquegenres$name[i] <- ifelse(is.null(this.result$name),"NA",this.result$name) # fill NA where NULL
Sys.sleep(time = 0.1) # System Sleep time to not overload the APIs capacity of 50 requests every 5 seconds
}
head(uniquegenres[,c(2,3)],10)
```
As we can see, the genre names are mainly NAs. Looking at row number 21, we see that genre_id == 0 refers to "all". `r uniquegenres[21,c(2,3)]`
Now that we know that we cannot and should not rely on the genre_id, we need to get the genre information from another source. Luckily, we have the album_id, and the API provides genre information for each album individually, which is not related to the genre_id in the first place.
**Genre Information from album_id**
```{r}
albums <- unique(all$album_id) #getting all album ids
albums <- paste("https://api.deezer.com/album/",albums, sep="") #get them in the right format for query the API
albums <- as.data.frame(albums)
albums$album_id <- unique(all$album_id)
#initiating some columns
albums$alb.genre <- ""
albums$alb.genre.id <- ""
```
In the next chunk we receive the genre information for the first 50 albums.
```{r}
for (i in 1:50){ #for 50 albums. you can replace 50 by length(albums$albums) to loop over all
this.raw.result <- GET(url = as.character(albums[i,1])) #get the infos
this.result <- fromJSON(rawToChar(this.raw.result$content)) #turn it into a readable format
albums$alb.genre[i] <- ifelse(is.null(this.result$genres$data[2]),"NA",this.result$genres$data[2]) #getting the written genre
albums$alb.genre.id[i] <- ifelse(is.null(this.result$genre_id),"NA",this.result$genre_id) #getting the genre_id from the album
#message(as.character(i), appendLF = FALSE) #print the iteration to see if the code is still working
Sys.sleep(time = 0.05) #cap the speed so the 50 per 5 seconds are not violated
}
formattable(head(albums,50))
```
So far we have created a new DF called "albums" with 4 columns: API query, album_id, genre and genre_id,
where genre is the written name of the genre and genre_id the corresponding ID.
We can see that some albums have more than one genre and some have NULL.
Looking back at the data, it would be quite easy to extract the lists of genres from the cells of the DF. But when we started the project, we were not that fond of performing transformations on lists of lists. And since this report demonstrates what we have done rather than what we should have done, here is our original approach.
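For completeness, the list-column route we avoided is short in modern R. A minimal sketch on made-up data (the real cells held the structures returned by fromJSON):

```{r}
# Toy stand-in for albums$alb.genre: each cell is a character vector of genres.
albums_toy <- data.frame(album_id = c(1L, 2L, 3L))
albums_toy$alb.genre <- list(c("Rock", "Alternative"), "Pop", character(0))

# Keep only the first listed genre per album; NA when the list is empty.
albums_toy$main_genre <- vapply(
  albums_toy$alb.genre,
  function(g) if (length(g) > 0) g[[1]] else NA_character_,
  character(1)
)
albums_toy$main_genre
```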
We used gsub to get rid of all symbols.
```{r}
albums[,'alb.genre2'] <- gsub("c\\(", "" , albums[,'alb.genre']) #remove "c("
albums[,'alb.genre2'] <- gsub("\\)", "" , albums[,'alb.genre2']) #remove ")"
albums[,'alb.genre2'] <- gsub("\"", "" , albums[,'alb.genre2']) #remove """
albums[,'alb.genre2'] <- gsub(" ", "" , albums[,'alb.genre2']) #remove " "
albums <- splitstackshape::cSplit(albums,splitCols = "alb.genre2",direction = "wide")
formattable(head(albums[,c(2,5,6,7)],10))
```
At this stage we decided to keep only the very first of the listed genres for each album, since it makes the most sense that the main genre is named first. We still have NULLs and NAs.
Our first approach to tackling the NULLs and NAs was to get each album's corresponding artist id. Then, with the artist information, we were able to look up the artist's most played genre to fill in missing data. For example, if artist 1 had 4 albums in the genre "rock", one could assume that a 5th album whose genre is missing is "rock" as well.
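On made-up ids, the artist-majority lookup described above boils down to a few lines of base R (a sketch, not the dplyr pipeline we actually ran):

```{r}
# Toy listening data: artist 7's known albums are mostly "Rock".
plays <- data.frame(
  artist_id = c(7, 7, 7, 7, 9),
  genre     = c("Rock", "Rock", "Rock", "Jazz", "Pop"),
  stringsAsFactors = FALSE
)

# Majority genre per artist - the lookup table we later merge onto the NA albums.
lookup <- sapply(split(plays$genre, plays$artist_id),
                 function(g) names(which.max(table(g))))

# A 5th album by artist 7 with missing genre would be filled as:
unname(lookup["7"])
```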
```{r, message=FALSE, warning=FALSE}
albums$artist <- all$artist_id[match(albums$album_id, all$album_id)] #joining artist information; match() gives one lookup position per album (a %in% logical index would be recycled and misalign)
#save(albums,file="bums_art.rda")
#load("bums_art.rda")
```
At this point we need to load data which was saved during our original run.
```{r}
load("/home/Deezer/30_Wrangled_Data/bums_art.rda")
formattable(head(albums,10))
```
Only with this data can you see what splitstackshape::cSplit really did: it cut the genres out of the list of lists and put each of them into a new column. Therefore using just the very first of these new columns is what we want, since it holds the first listed genre.
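What cSplit did here can be reproduced in base R on a toy vector; this sketch is only meant to make the wide split transparent, it is not the code we ran:

```{r}
# Base-R equivalent of splitstackshape::cSplit(..., direction = "wide"):
# split a comma-separated genre string into one column per entry.
genres <- c("Rock,Alternative", "Pop", "Jazz,Blues,Soul")
parts  <- strsplit(genres, ",", fixed = TRUE)
width  <- max(lengths(parts))
wide   <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, width - length(p))),
                   character(width)))
colnames(wide) <- paste0("alb.genre2_", sprintf("%02d", seq_len(width)))
wide # the first column, alb.genre2_01, is the main genre we keep
```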
To get going, we created a new DF which contains all the missing data (NAs and NULLs).
```{r}
Na.albums <- albums %>% filter(alb.genre.id=="NA" | alb.genre.id==-1) #get all NAs and NULLs into a separate DF
Na.albums <- as.data.frame(Na.albums[,c(1,20)])
colnames(Na.albums) <- c("album_id","artist_id")
albums2 <- albums[!albums$album_id %in% Na.albums$album_id,] #get clean data into a new DF
```
As we can see, we have `r nrow(Na.albums)` rows in Na.albums, which means that 20521 albums out of 151605 do not have a genre after querying the API.
```{r}
all2 <- all[,c(4,13)] #subset all to contain only album_id and artist_id
all2 <- left_join(all2,albums2[,c(1,4)],by="album_id") #joining our new genre
#creating artists lookup table
loup.artists <- all2 %>%
group_by(artist_id,alb.genre2_01) %>%
summarise(n=n()) %>% #how often a genre appears for an artist; note: listened/played songs matter more than the overall genres of released albums
filter(n==max(n))
Na.albums <- merge(Na.albums,loup.artists[,-3],by="artist_id") #reduced from 20,000 NA albums to 999
albums2 <- albums2[,c(20,1,4)]
albums2 <- bind_rows(albums2, Na.albums[!is.na(Na.albums$alb.genre2_01),-3]) #adding albums with new information to album2
Na.albums <- Na.albums[is.na(Na.albums$alb.genre2_01),] #999 albums still without genre
```
This implementation was a big success: we were able to make a valid guess at the genre_id by using the artist's main genre, reducing the number of NAs and NULLs by more than 95%.
To fill the final 999 genres, we used the bpm we had collected earlier to identify the genres; but, as mentioned, we also need what we know about the genre so far to adequately fill our gaps in bpm knowledge.
First we get all songs where our bpm is 0; we then fill in the mean bpm of the album the song is on, and if we don't have this information, we fill in the mean bpm of the genre.
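On toy data, this two-stage fill (album mean first, genre mean as fallback) can be sketched in base R; the chunk below does the same thing with merges:

```{r}
# Toy tracks: bpm == 0 marks a missing value.
toy <- data.frame(
  album = c("a", "a", "a", "b", "b"),
  genre = c("Rock", "Rock", "Rock", "Rock", "Rock"),
  bpm   = c(120, 124, 0, 0, 0),
  stringsAsFactors = FALSE
)

known     <- toy$bpm > 0
album_avg <- sapply(split(toy$bpm[known], toy$album[known]), mean)
genre_avg <- sapply(split(toy$bpm[known], toy$genre[known]), mean)

# Stage 1: album mean where the album has known tracks; stage 2: genre mean.
fill <- album_avg[toy$album]
fill[is.na(fill)] <- genre_avg[toy$genre[is.na(fill)]]
toy$bpm[!known] <- fill[!known]
toy$bpm
```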
Note: make sure you have run 2(b) already.
```{r warning=FALSE}
#bpm.data <- all[!duplicated(all$media_id),c("media_id","album_id")]
bpm.data <- unique(all[,c("media_id","album_id")]) #get all media
bpm.data<- merge(bpm.data,tracks,by="media_id") #join bpm
bpm.data<- merge(bpm.data,albums2[,c(-1,-4)],by="album_id", all.x=TRUE) #join genre_id we know so far
#creation of album and genre lookup tables for the average bpm
bpm.merge.album <- bpm.data %>% filter(track_bpm>0) %>% group_by(album_id) %>% summarise(track_bpm=mean(track_bpm))
bpm.merge.genre <- bpm.data %>% filter(track_bpm>0) %>% group_by(alb.genre2_01) %>% summarise(track_bpm=mean(track_bpm))
#fill in album_mean_bpm where it is known
bpmclean2 <- bpm.data[bpm.data$track_bpm==0,]
bpmclean2 <- merge(bpmclean2[,c(1,2,4)],bpm.merge.album, by="album_id",all.x=TRUE)
bpmclean3 <- bpmclean2[is.na(bpmclean2$track_bpm),] #where we dont have bpm data from the album
bpmclean3 <- merge(bpmclean3[,c(-4)],bpm.merge.genre, by="alb.genre2_01",all.x=TRUE) #join the information from the average_genre_lookup table
```
One very resilient little fellow amused us: among all 7.5 million songs in our data, there is only one "HörbuchaufDeutsch" (audiobook in German). Since it is the only one of its kind and an album by itself, we do not have any information on it, but we set the bpm to 80, which is on the lower end of the scale.
```{r}
bpmclean3[is.na(bpmclean3$track_bpm),]
bpmclean3[is.na(bpmclean3$track_bpm),4] <- 80
```
Now we only had to clean everything and bind it back into a single bpm lookup table, which we save as tracks_cleaned.rda.
```{r}
bpm1 <- bpm.data %>% filter(track_bpm>0)
bpm2 <- bpmclean2 %>% filter(!is.na(track_bpm))
bpm2 <- bpm2[,c(1,2,4,3)]
bpm3 <- bpmclean3[,c(2,3,4,1)]
tracks_clean <- rbind(bpm1,bpm2,bpm3)
tracks_clean <- tracks_clean[!duplicated(tracks_clean$media_id),]
tracks_clean2 <- tracks_clean[,c(2,3)]
#save(tracks_clean, file="data_for_genre_loop.rda")
#save(tracks_clean2,file="tracks_cleaned.rda")
```
```{r, message=FALSE, warning=FALSE}
load("/home/Deezer/30_Wrangled_Data/tracks_cleaned.rda")
myColors <- rep(brewer.pal(7,"Blues")[5:6],10)
ggplot(tracks_clean2, aes(x=track_bpm))+stat_bin(bins=20, fill=myColors, color="black")+
scale_y_continuous(limits=c(0,80000),breaks=seq(0,80000,5000))+
scale_x_continuous(breaks=seq(50,220,10),limits=c(50,220))+
labs(title="Distribution of Beats per Minute",
y="observations",
x="beats per minute",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("BPM_distribution.png",plot=gg8, width = 8, height=5, units="in")
```
To fill the missing genre_ids for our 999 observations, we fit a glm to predict the genre from the song's bpm, since we had already exhausted all other possibilities. Judging by how the genre_id is distributed across the known cases, a simple majority vote would suggest that the missing ones are "pop".
```{r, fig.width=8,fig.height=5}
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
library(treemap)
#png(filename="genre_treemap.png",width=16, height=9, units = "in",res=72)
treemap(DeezerNew_train_0525 %>% group_by(alb.genre2_01) %>% summarise(n=log(n())), index="alb.genre2_01",vSize = "n",title = "Logarithmic Distribution of Genres", palette = "GnBu")
#dev.off()
```
Note: this code will run for some minutes (glm for 47 models). For the sake of creating our html report, we set the chunks to `eval=FALSE` to prevent them from running. You can skip the following two chunks without missing too much information.
```{r eval=FALSE}
test.loop <- tracks_clean %>% filter(is.na(alb.genre2_01)) %>% select(album_id,track_bpm)
results <- test.loop
#loop and fit
for (i in levels(tracks_clean$alb.genre2_01)){
tracks_clean$y <- as.factor(as.numeric(tracks_clean$alb.genre2_01==i))
train.loop <- tracks_clean %>% filter(!is.na(alb.genre2_01)) %>% select(track_bpm,y)
fit <- glm(y~track_bpm,data=train.loop, family=binomial(link = "logit"))
results[,i] <- predict(fit,test.loop, type="response")
}
results <- results[,-2]
```
At this point we have made a prediction for every missing album and can show that all of them are considered to be "pop".
```{r eval=FALSE}
albums.finish <- melt(results,id.vars="album_id")
albums.finish <- albums.finish %>% group_by(album_id,variable) %>% summarise(value=sum(value))
albums.finish <- albums.finish %>% group_by(album_id) %>% filter(value==max(value))
table(albums.finish$variable)
```
The table shows that for all 20521 songs from our 999 missing albums, the looped glm model suggested that all of the songs are to be considered "pop", which also matches the majority vote.
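Since those chunks do not run in the report, here is a self-contained toy version of the one-vs-rest glm idea; the data and genre names are invented for illustration:

```{r}
set.seed(1)
# Toy data: two genres whose bpm distributions differ.
df <- data.frame(
  bpm   = c(rnorm(50, 100, 15), rnorm(50, 160, 15)),
  genre = rep(c("Pop", "Techno"), each = 50),
  stringsAsFactors = FALSE
)

# One binomial glm per genre (one-vs-rest), as in the loop above.
preds <- sapply(unique(df$genre), function(g) {
  fit <- glm((genre == g) ~ bpm, data = df, family = binomial(link = "logit"))
  predict(fit, data.frame(bpm = c(95, 165)), type = "response")
})
preds # rows: bpm 95 and 165; columns: predicted probability of each genre
```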
```{r warning=FALSE}
Na.albums$alb.genre2_01 <- "Pop"
albums2 <- bind_rows(albums2[,c(2,3)], Na.albums[,c(2,3)]) #adding albums with new information to album2
albums2 <- albums2[!duplicated(albums2$album_id),]
#save(albums2,file="album_genre_clean.rda")
```
#### 3(f) User Clusters by Genre
```{r}
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda") #we load a newer version of the data at this stage, but we only build on what has been worked out so far.
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)
```
One of our first thoughts about this project was: "the genre is the most important indicator of whether you like a song or not."
At this point in time we already had trustworthy information about the genre, so we could engineer further features.
In a first attempt we generated a clustering based on genre listening history. Sadly we cannot deliver the original code, because the Zeno server crashed while running it and the rmd file, unsaved at that point, was lost. Luckily, we had saved the output of the code in an rda.
Here is an attempt to recreate the code to give you an impression:
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and each genre
summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))
profs <- profs[,-3] %>%
dcast(user_id ~ alb.genre2_01, value.var="p") #dcast is the inverse of melt, creating a new column for each value pair
profs[is.na(profs)] <- 0
profs[1:10,c(1,3,13,17,29,41)]
```
This chunk gives us the opportunity to look at how genres are distributed for every individual user, e.g. user 1 listened 11% "Alternative", 12% "Dance", 24% "Electro" and so on.
We used this information to create clusters of similar users, choosing hclust over kmeans because we did not know how many clusters we wanted to end up with and did not want to rerun the code multiple times.
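This flexibility is what cutree() provides: one hclust fit can be cut at any number of clusters, while kmeans must be refit for every new k. A minimal sketch on artificial, well-separated data:

```{r}
set.seed(42)
# Two clearly separated groups of toy "genre profiles"
m <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
hc <- hclust(dist(m), method = "complete")

# One fit, many cuts - no refitting needed when k changes.
cutree(hc, k = 2)
cutree(hc, k = 5)
```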
The code looked somewhat like this; note that creating the distance matrix will take around 2 minutes on the zeno server and around 5-10 minutes on a regular machine, since it consists of 198 million elements.
```{r}
d <- dist(profs[,-1])
```
We want to use this journey back to our beginnings to show what the different clustering methods look like. In the end, we used "complete" linkage (which means that the maximum distance between members is used to link two clusters) and set the cut-off threshold to 35 clusters.
```{r}
c.tree.comp = hclust(d,method="complete")
w.tree.comp = hclust(d,method="ward.D2")
s.tree.comp = hclust(d,method="single")
```
```{r fig.width=8}
par(mfrow=c(1,3))
plot(c.tree.comp)
plot(w.tree.comp)
plot(s.tree.comp)
labs.comp = cutree(c.tree.comp,k=35)
#lookup <-as.data.frame(profs$user_id)
#lookup$profile_id <- labs.comp
#colnames(lookup) <- c("user_id","profile_id")
```
This new feature, added to our previously best model, increased our score by 4%.
We did not stop at this point, since some aspects are worrying: we do not take into account the number of observations or how often a genre was played; we only look at how often a user has listened to a genre. We kept developing the idea, which is worth a chapter of its own.
#### 3(g) User Behavior Index
```{r}
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)
```
Sometimes a simple idea can lead to a cascade of thoughts and breakthroughs, as happened in our case.
The thought was a simple one: "When we look at the distribution of genres for each unique user and cluster them, don't we actually ignore the preferences of the individual user?" What happens, for example, to a user who has multiple favorite genres? Imagine a user who listens to only 5 genres in an equally distributed fashion - each of those genres has a score of 0.2 in our previous approach. Now imagine another user who listens to two genres, one 80% and one 20%. At the moment we claim that the genres with a score of 0.2 are equally important, even though that is clearly not the case: for user 1, every genre is a favorite, while for user 2 the second genre is not.
How do we bypass this problem? By scaling our previous results per user. That means we take a user's favorite genre (the max() of his scores), set it to 1, and scale all other genres accordingly. With this transformation we are able to compare different users.
Going back to the example from above: for user 1, each of his/her 5 genres now has a score of 1, while for user 2 only the first genre is 1 and the second is scaled accordingly.
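On the two hypothetical users from above (rows padded with zeros for unlistened genres, as in the real data), the rescaling works out as:

```{r}
# Genre rows include zeros for unlistened genres, so min(x) == 0 as in our data.
u1 <- c(.2, .2, .2, .2, .2, 0, 0) # five equally loved genres
u2 <- c(.8, .2, 0, 0, 0, 0, 0)    # one dominant and one minor genre

scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

scale01(u1) # every listened genre becomes 1: all five are favorites
scale01(u2) # 1 and 0.25: the second genre is clearly not the favorite
```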
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and each genre
summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))
profs <- profs[,-3] %>%
dcast(user_id ~ alb.genre2_01, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x))))) #using min(x) instead of 0 for convenience, since for all observations min(x)==0
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything
#creating a lookup table for the final models; better not run it again, it takes ages
#genre_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_scaled")
#save(genre_scaled.molten,file="30_05_genre_scaled_lookup.rda")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(profs)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,0.65,0.05))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="relative frequency", title="Relative Genre Frequency",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("relative_genre.png",plot=gg1, width = 8, height=5, units="in")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(scaled)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,1,0.05))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="scaled relative frequency",title="Scaled Relative Genre Frequency",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("scaled_relative_genre.png",plot=gg2, width = 8, height=5, units="in")
```
As mentioned, one idea can lead to a cascade of ideas; here is our second thought: "What if the Deezer Flow algorithm suggested a genre very often, but it was not listened to often - yet still often enough to appear as a favorite genre for this user?" To get a sense of this possibility, we created a new table with the percentage of how often a genre is listened to when played, for every user.
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and genre
summarise(c = sum(is_listened),n=n(),p=c/n) #sum(is_listened) again, but this time n() is equal to count() and p as the quotient of c/n
profs <- profs[,-c(3,4)] %>%
dcast(user_id ~ alb.genre2_01, value.var="p") #long to wide format
profs[is.na(profs)] <- 0 #NAs are equal to 0
listened <- profs[,-1]
listened$user_id <- profs$user_id
#genre_listened.molten <- melt(listened, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_listened")
#save(genre_listened.molten,file="30_05_genre_listened_lookup.rda")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(listened)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,1,0.05),limits = c(0,1))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="average is_listened",title="Average is_listened per Genre",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("average_is_listened.png",plot=gg3, width = 8, height=5, units="in")
```
Combining the scaled genre preferences with the listening history is exactly what we had in mind. As one of the values decreases, the overall trustworthiness of the data goes down. As long as both values are high, or both extremely low, we can be pretty sure that a user has listened or not listened to the song, respectively.
One aspect we have not considered yet is the sample size. Up to this point we cannot differentiate between a user who has listened to one song of a genre exactly once (scaled value and listening behavior are both 1) and someone who has heard a genre hundreds of times.
To compensate for this in some way, we extracted this information as well:
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and genre
summarise(n=n()) %>% #count
dcast(user_id ~ alb.genre2_01, value.var="n") #long to wide
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id
#genre_counts.molten <- melt(counts, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_counts")
#save(genre_counts.molten,file="30_05_genre_counts_lookup.rda")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(counts)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+scale_y_continuous(breaks=seq(0,90,5),limits = c(0,90))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="Count",title="Count per Genre",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("count.png",plot=gg4, width = 8, height=5, units="in")
```
Test whether columns and user_ids are arranged identically across the three tables.
```{r}
sum(names(listened) != names(scaled)) #should be 0
sum(names(listened) != names(counts))
sum(scaled$user_id!=listened$user_id)
sum(scaled$user_id!=counts$user_id)
```
In our last step we combined the three different values. To show our thought process throughout the report, we will show how the combination changed over some testing and iterative feedback rounds.
To demonstrate what drove the changes, we create a small set of example data, where:
```{r}
g <- c(1,.8,.6,.3) #Genre-Distribution
l <- c(.25,.7,.8,.95) #Listening-History
c <- c(200,68,45,19) #Count
```
Our first approach was `g*l*sqrt(c)`,
using c as a weight to get higher index values when we have more data available. We never intended to keep the value inside a 0:1 boundary.
```{r}
g*l*sqrt(c)
```
As we can see, there is not enough emphasis on the last case. The last genre has a 95% listening rate over 19 observations; if this genre were played again, we could be pretty sure the user is going to listen to it.
Additionally, case 1 still has the second highest value, which should indicate a high likelihood of "is_listened" for a future song of this genre. But the listening value is only 0.25, which indicates that it is rather unlikely that the user is going to listen to the genre next time.
In a first step, we exchanged the sqrt() for a log(x+1) function, which gives relatively less weight as the count grows; the +1 inside the log is needed because log(0) would otherwise be -Inf.
```{r}
par(mfrow=c(1,2))
plot(log(1:1000))
plot(sqrt(1:1000))
```
```{r}
g*l*log(c+1)
```
We can see that the values are closer together now; still, the log is a strong enough penalty for low-count cases.
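The combination at this stage can be wrapped into one small function; this is a sketch of the formula as it stands at this point in the report, not its final form:

```{r}
# g: scaled genre preference, l: listening ratio, c: observation count.
# log1p(c) is log(c+1), damping the count weight compared to sqrt(c).
behavior_index <- function(g, l, c) g * l * log1p(c)

g <- c(1, .8, .6, .3)    # genre distribution
l <- c(.25, .7, .8, .95) # listening history
c <- c(200, 68, 45, 19)  # count
behavior_index(g, l, c)  # case 2 scores highest, case 4 catches up relative to case 1
```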