---
title: 'Report: Deezer Kaggle Inclass Competition 2017'
author: "Christian Kregelin, Dennis Uckel, Max Philipp, Pranav Pandya and Vera Weidmann"
date: "14. April - 31. May 2017"
output:
html_document:
toc: true
toc_float: true
toc_depth: 4
---
## 1. Introduction
This project is based on the Data Science Game 2017 hosted on Kaggle, a worldwide competition between universities. This year the data was provided by the music streaming service Deezer. Deezer offers a recommendation feature called Flow, which suggests songs the user might like to listen to. The algorithm behind "Flow" uses collaborative filtering to provide the user with the right music at the right time. While the algorithm detects the user's current listening taste by analysing the particular streaming session in which songs are listened to or skipped, the first song is the most difficult one to predict. Therefore, the goal of this challenge was to predict how likely it is that the users of the test dataset will listen to the first track suggested by "Flow".
At the end of the project we finished 66th out of 145 teams, with 45 submissions and a score of **0.63860**. This report reflects our whole project journey, including our ideas, failures and achievements.
The report therefore starts with a short introduction to the dataset. Subsequently, our feature engineering approaches are explained and the prediction models we used are presented.
The libraries we used for the project are the following:
```{r Initialization, warning=FALSE, message=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr,
xgboost,
Matrix,
data.table,
caret,
splitstackshape,
ggplot2,
psychometric,
RColorBrewer,
jsonlite,
httr,
corrplot,
gridExtra,
formattable,
qdap,
tm,
textcat,
rvest,
stringr
)
```
## 2. Data Analysis & Challenges
This chapter illustrates an exploratory analysis of the competition dataset.
As mentioned before, the goal of this challenge was to predict whether the first recommended song will be listened to or not. Not listened means that the song was skipped within the first 30 seconds, in which case it is assumed that the user does not like the song.
The test dataset contains the first recommended "Flow" track for several different users. The train dataset was generated from each user's listening history over one month, based on songs which were listened to in "Flow" as well as songs played via searches or saved playlists. Each row represents one listened or not-listened song.
Data:
```{r}
Deezer <- read.csv("/home/Deezer/10_Basic_Dataset/train.csv")
Deezer_test <- read.csv("/home/Deezer/10_Basic_Dataset/test.csv")
tmp <- Deezer_test[,-1]
tmp$is_listened <- 0
all <- rbind(Deezer,tmp)
rm(tmp)
```
The train data contains `r round(dim(Deezer)[1]/1000000,2)` million rows and `r dim(Deezer)[2]` features, while the test set just contains `r round(dim(Deezer_test)[1]/1000,2)` thousand rows.
The dataset provides the following columns, which we grouped into user-specific, song-specific and device-specific features. These features were analyzed, NAs detected and properly handled. As all features are stored as numeric, we convert them into the right data types later on.
User specific:
* user_id - anonymized and unique id of the user
* user_gender - gender of the user
* user_age - age of the user
Song specific:
* media_id - identifier of the song listened to by the user
* album_id - identifier of the song's album
* artist_id - identifier of the song's artist
* genre_id - identifier of the song's genre
* media_duration - duration of the song in seconds
* context_type - type of context in which the song was listened to: playlist, album, ...
* release_date - release date of the song in the format YYYYMMDD
Device specific:
* platform_name - type of operation system
* platform_family - type of device
Other features:
* ts_listen - timestamp of the listening event in UNIX time (time since 1970-01-01 00:00, usually in seconds, sometimes in milliseconds)
* listen_type - whether the song was listened to in "Flow" or not
Response variable:
* is_listened - 1 if the track was listened to for more than 30 seconds, 0 otherwise
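The type conversion mentioned above can be illustrated with a minimal sketch (the helper `to_factors` and the toy data are our own illustration, not the project's actual code):

```r
# Hypothetical helper: convert id-like numeric columns to factors.
to_factors <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.factor)
  df
}

# Toy stand-in for two of the columns listed above
toy <- data.frame(user_gender = c(0, 1, 0), genre_id = c(0, 7, 0))
toy <- to_factors(toy, c("user_gender", "genre_id"))
str(toy)  # both columns are now factors
```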
#### 2(a) User Specific Features
The data contains `r length(unique(Deezer$user_id))` different users. Some of them occur thousands of times, while nearly 1,500 users occur just once (with one listened or not-listened song). No NAs were found.
```{r, warning=FALSE}
user_distribution <- Deezer %>%
group_by(user_id) %>%
summarise(n=n()) %>%
arrange (-n)
myColors <- rep(brewer.pal(7,"Blues")[5:6],50)
ggplot(user_distribution, aes(x=n))+
geom_histogram(bins=100, fill=myColors,color="black")+
scale_x_continuous(breaks=seq(0,2000,50),limits=c(0,2000))+
labs(subtitle="in the train set",
y="number of users",
x="number of songs",
title="Distribution of Observations per User",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("observations_distribution.png",plot=gg6, width = 8, height=5, units="in")
```
The data reflects the listening behavior of male and female users. However, we do not know which gender the values 0 and 1 correspond to. The plot below illustrates the number of observations per gender.
```{r}
myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=user_gender))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="number of users",
x="Gender",
title="Distribution of Users Gender",
caption="Source: Deezer train data")+
theme_light()
```
The users' ages range from `r min(Deezer$user_age)` to `r max(Deezer$user_age)` with a median of `r round(median(Deezer$user_age))`. The following boxplot illustrates the age distribution of the unique users. Accordingly, 50% of the users are 24 years old or younger. The age feature does not contain any NAs in the Deezer dataset.
```{r}
sum(is.na(Deezer$user_age))
age <- Deezer %>%
group_by(user_id) %>% summarise(age=max(user_age))
boxplot(age$age)
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],6),"#4292C6")
ggplot(age, aes(x=age))+stat_bin(binwidth=1, fill=myColors, color="black") + scale_y_continuous(limits=c(0,2200),breaks=seq(0,2200,100))+
stat_bin(binwidth=1, geom="text", aes(label=..count..), vjust=-1.5)+ scale_x_continuous(breaks=seq(18,30,1))+
labs(y="number of users",
x="age",
title="Distribution of Age among Users",
caption="Source: Deezer train data")+
theme_light()
#ggsave("Age_distribution.png",plot=gg7, width = 8, height=5, units="in")
```
#### 2(b) Song Specific Features
The dataset contains `r length(unique(Deezer$media_id))` different media_ids. One media_id represents one specific song from one artist on one specific album. At first, we thought that one media_id reflects one unique song. However, checking the provided extra information (extra_infos.json), which links each media_id to a song title, an album title and an artist name, shows that one unique song (e.g. "Everybody" by the Backstreet Boys) can have more than one media_id. This is caused by the album: whenever a song also appears on another album (e.g. Fetenkult - Best of the 90's), a new media_id is generated.
One example is covered by the following data table:
```{r echo = FALSE, message=FALSE, warning=FALSE}
load("/home/Deezer/deezer_report/extra.rda")
```
```{r}
extra_example <- extra %>%
filter(sng_title=="Everybody (Backstreet's Back)" & art_name=="Backstreet Boys")
data.table(extra_example)
```
Furthermore, we can count `r length(unique(Deezer$album_id))` different albums and `r length(unique(Deezer$artist_id))` different artists in the dataset. No NAs were identified.
```{r}
sum(is.na(Deezer$media_id))
sum(is.na(Deezer$album_id))
sum(is.na(Deezer$artist_id))
```
According to the extra information from the json file, we know that media_id and album_id are strongly correlated. The correlation coefficient in this case is `r cor(Deezer$media_id,Deezer$album_id)`.
The following histograms underline this correlation. Furthermore, some media_ids (and therefore some albums and artists) are recommended quite often across all users.
```{r}
par(mfrow=c(1,3))
hist(Deezer$media_id)
hist(Deezer$album_id)
hist(Deezer$artist_id)
```
Likewise, `r length(unique(Deezer$genre_id))` different genres exist in the dataset. The histogram clearly shows that one specific genre with id = 0 occurs 3 million times, while the second most frequent one occurs only around 1 million times.
This would mean that one specific genre is recommended the most. However, further investigations are carried out in the feature engineering chapter.
```{r}
sum(is.na(Deezer$genre_id))
number_genre <- Deezer %>%
group_by(genre_id) %>%
summarize(n=n()) %>%
arrange(-n)
head(number_genre)
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],25))
ggplot(number_genre, aes(genre_id)) +
geom_histogram(bins=50,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,750),breaks=seq(0,750,100))+
labs(y="number of observations",
x="genre_id",
title="Distribution of Genre_id",
caption="Source: Deezer train data")+
theme_light()
```
We were also provided with information about the media duration. A quick look at the summary reveals a couple of outliers.
```{r}
summary(Deezer$media_duration)
```
For this visualisation, we filtered out values below the 5% and above the 95% quantile, keeping all values between `r quantile(Deezer$media_duration,0.05)` and `r quantile(Deezer$media_duration,0.95)`.
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))
ggplot(Deezer%>%filter(media_duration>=quantile(media_duration,0.05) & media_duration<=quantile(media_duration,0.95)), aes(media_duration)) +
geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="Number of Observations",
x="Media Duration in Seconds",
title="Distribution of Mediaduration",
subtitle="For 90% of the Data",
caption="Source: Deezer train data")+
theme_light()
```
The column context_type reflects the context in which the song was listened to, for example whether it was played from a playlist or an album.
```{r}
sum(is.na(Deezer$context_type))
#hist(Deezer$context_type)
head(table(Deezer$context_type))
```
The release date of a song has the format YYYYMMDD. By extracting the year, the most common release years can be analyzed more easily. It turns out that songs from recent years occur most often in our dataset.
```{r}
sum(is.na(Deezer$release_date))
releaseDate = as.Date(as.character(Deezer$release_date), format="%Y%m%d")
splitrdt <- data.frame(ryear = as.numeric(format(releaseDate, format = "%Y")))
Deezer = cbind(Deezer, splitrdt)
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))
ggplot(Deezer%>%filter(ryear>=quantile(ryear,0.05) & ryear<=quantile(ryear,0.95)), aes(ryear)) +
geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="Number of Observations",
x="Release Year of a song",
title="Distribution of Release Year",
subtitle="for 90% of the data",
caption="Source: Deezer train data")+
theme_light()
```
#### 2(c) Device Specific Features
The dataset also contains information about the device and the operating system the user used to listen to the music. Both features take 3 values (1:3). While device "1" and operating system "1" appear most of the time, we cannot say which specific device or system they are. Furthermore, these two columns are correlated (r=`r round(cor(Deezer$platform_family, Deezer$platform_name),2)`).
```{r}
sum(is.na(Deezer$platform_family))
sum(is.na(Deezer$platform_name))
par(mfrow=c(1,2))
myColors <- brewer.pal(7,"Blues")[5:7]
ggplot(Deezer,aes(x=platform_family))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Observations",
x="Used Device",
title="Distribution of Platform",
caption="Source: Deezer train data")+
theme_light()
ggplot(Deezer,aes(x=platform_name))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Observations",
x="Used Operating System",
title="Distribution of Platform",
caption="Source: Deezer train data")+
theme_light()
```
#### 2(d) Other Features, incl. Response Variable
The ts_listen feature contains the time at which a specific song was recommended to a user. It is in UNIX format and will be modified later on. The third chapter on feature engineering covers some visualizations of the timestamp. No NAs were detected.
```{r}
sum(is.na(Deezer$ts_listen))
```
The listen_type feature indicates whether the song was listened to in "Flow" or not. This is important because the training data includes a user's whole listening history, not just their behavior on recommendations. Songs not recommended by "Flow" could be songs the user searched for or saved in specific playlists. Thereby, `r round(table(Deezer$listen_type)[1]/1000000,2)` million rows do not represent recommended songs, while `r round(table(Deezer$listen_type)[2]/1000000,2)` million were suggested by the recommendation engine "Flow". The test data contains only songs suggested by "Flow".
```{r}
myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=listen_type))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="number of observations",
x="listening type",
title="Distribution of Listening Type",
caption="Source: Deezer train data")+
theme_light()
```
```{r}
table(Deezer_test$listen_type)
sum(is.na(Deezer$listen_type))
```
Interestingly, we detected one single observation in the test set with listen_type == 0.
Finally, the most important column, is_listened, is the response variable in the modeling part.
```{r}
myColors <- rep(brewer.pal(7,"Blues")[5:6],2)
ggplot(Deezer,aes(x=is_listened))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Number of Observations",
x="Listening Type",
title="Distribution of Is Listened versus Listening Type",
caption="Source: Deezer train data")+
theme_light()+facet_wrap(~listen_type)
```
Here, we can see that songs which were not suggested (e.g. searched for or listened to in a playlist) have a higher chance of being listened to (`r mean(Deezer$is_listened[Deezer$listen_type==0])`) than songs which were suggested by "Flow" (`r mean(Deezer$is_listened[Deezer$listen_type==1])`).
#### 2(e) Summary of Main Challenges
At first glimpse, the dataset looks like tall data. However, most of the features are factors. As mentioned in the previous chapter, the data contains `r length(unique(Deezer$user_id))` different users, `r length(unique(Deezer$genre_id))` genres, `r length(unique(Deezer$artist_id))` artists, `r length(unique(Deezer$album_id))` albums and `r length(unique(Deezer$media_id))` unique songs.
Therefore, the main challenge of this project was to create a prediction model with this number of factor levels. In addition, the insufficient data in the genre_id column and the very granular timestamp and release date were features which needed improvement.
The following correlation plot visualizes the collinearity between the variables. It is easy to see that song and release date, as well as operating system and device, are highly correlated. All other variables show only a slight dependency.
```{r}
corDeezer <- cor(Deezer)
#png("correlationMatrix.png")
corDeezerplot <- corrplot(corDeezer, method = "ellipse", type = "full")
#dev.off()
```
## 3. Feature Engineering
In this chapter we describe our feature engineering, starting with simple transformations and calculations and moving on to more advanced ideas. We try to demonstrate everything we achieved and our journey to the point of greatest success.
#### 3(a) Timestamp Conversion
Converting the timestamp is the most basic feature engineering step we performed. We extract the hour of the day as well as the weekday from the timestamp. This procedure is so simple that we did it on the fly, as you will see at the end of the feature engineering section. Nevertheless, the process is described in the following.
The idea behind this is the assumption that the hour of the day and/or the day of the week contribute patterns to the prediction, e.g. if users regularly go to the gym on Tuesdays and Thursdays at 8pm, or if users tend to listen to slower songs in the evening and faster songs on the way to work. Such patterns cannot be found with a time precision of seconds.
```{r}
transforming_timestamp <- function(x){
timestamp = as.POSIXct(x$ts_listen, origin="1970-01-01")
splitdt <- data.frame(
hh = as.numeric(format(timestamp, format = "%H")), #24hours format
wd = as.numeric(format(timestamp, format = "%w"))) #weekday
x = cbind(x, splitdt) #cbind the extracted time to data
return(x)
}
transforming_timestamp(Deezer)[1:10,12:17]
```
```{r}
tmp <- transforming_timestamp(Deezer)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)
ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,600000),breaks=seq(0,600000,50000))+
scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
x="hour of day",
title="Distribution of Listening Hour",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("hh_distribution_train.png",plot=gg9, width = 8, height=5, units="in")
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")
ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,1300000),breaks=seq(0,1300000,100000))+
scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
x="weekday",
title="Distribution of Listening Weekday",
caption="Source: Deezer train data")+
theme_light()
#ggsave("wd_distribution_train.png",plot=gg10, width = 8, height=5, units="in")
```
```{r}
tmp <- transforming_timestamp(Deezer_test)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)
ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,1800),breaks=seq(0,1800,100))+
scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
x="hour of day",
title="Distribution of Listening Hour",
subtitle="for test data",
caption="Source: Deezer test data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("hh_distribution_test.png",plot=gg11, width = 8, height=5, units="in")
```
```{r}
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")
ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,5000),breaks=seq(0,5000,250))+
scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
x="weekday",
title="Distribution of Listening Weekday",
subtitle="for test data",
caption="Source: Deezer test data")+
theme_light()
#ggsave("wd_distribution_test.png",plot=gg12, width = 8, height=5, units="in")
```
#### 3(b) Beats per Minute
Beats per minute (bpm) is a simple numeric feature which we could query from the Deezer API. We thought it might help to differentiate within genres, for example "soft rock" tending to be slower (lower bpm) than its parent genre "rock". It was not our intention to create subgenres in any way, but just to feed this extra information into the model.
For querying the API we used the `httr` package, and `jsonlite` to handle the response. To show how the code works, we limit the number of queries to 50 instead of the nearly 450,000 media_ids.
```{r}
## Load necessary packages
library(jsonlite)
library(httr)
uniquetracks <- as.data.frame(unique(all$media_id)[1:50]) #getting all unique media_ids
uniquetracks$api <- paste("https://api.deezer.com/track/",uniquetracks[,1], sep="")
uniquetracks$bpm <- 0
for (i in 1:length(uniquetracks[,1])) { ## for all tracks
this.raw.result <- GET(url = as.character(uniquetracks[i,2])) ## get the infos
this.result <- fromJSON(rawToChar(this.raw.result$content)) ## turn it into a readable format
uniquetracks$bpm[i] <- ifelse(is.null(this.result$bpm),"NA",this.result$bpm) ## get the BPM
#message(as.character(i), appendLF = FALSE) ## print the iteration to see if the code is still working
Sys.sleep(time = 0.05) ## cap the speed so the 50 per 5 seconds are not violated
}
head(uniquetracks,10)
#save(tracks, file = "tracks_BPM.rda")
```
After fetching the information from the API, we need to clean it. Because the bpm is received as a character, we can set all cells where the string is "NA" to 0 without using is.na(). Transforming them to numeric afterwards leaves no missing data but 30,000 zeros (only 171 of which were introduced by setting "NA" to 0).
```{r warning=FALSE}
load("/home/Deezer/30_Wrangled_Data/Archiv/tracks_BPM.rda")
tracks <- tracks[,c(2,3)]
tracks$track_bpm[tracks$track_bpm=="NA"] <- "0"
tracks$track_bpm <- as.numeric(tracks$track_bpm)
myColors <- rep(brewer.pal(7,"Blues")[5:6],20)
ggplot(tracks, aes(x=track_bpm))+stat_bin(bins=40, fill=myColors, color="black")+
scale_y_continuous(limits=c(0,46000),breaks=seq(0,45000,5000))+
scale_x_continuous(breaks=seq(0,240,10))+
labs(title="Distribution of Beats per Minute",
y="observations",
x="beats per minute",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
```
To fill the zeros, we will take the mean bpm of the album the media is on, or of its genre if the album information is not available. Sadly, at this point we did not yet have information about the genre, so in order to keep things chronologically correct and not confusing, we will move to the genre section now and come back to this afterwards.
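The imputation idea can be sketched as follows (toy data only; `impute_bpm` and the column names are illustrative assumptions, not the project's actual code):

```r
library(dplyr)

# Toy stand-in for the joined track table; 0 marks a missing bpm.
tracks_toy <- data.frame(
  album_id  = c(1, 1, 2, 2, 3, 4),
  genre_id  = c(10, 10, 10, 10, 20, 20),
  track_bpm = c(120, 0, 100, 140, 0, 130)
)

impute_bpm <- function(df) {
  df %>%
    group_by(album_id) %>%
    mutate(alb_mean = mean(track_bpm[track_bpm > 0])) %>%   # album mean, NaN if none
    group_by(genre_id) %>%
    mutate(gen_mean = mean(track_bpm[track_bpm > 0])) %>%   # genre mean as fallback
    ungroup() %>%
    mutate(track_bpm = ifelse(track_bpm > 0, track_bpm,
                              ifelse(is.finite(alb_mean), alb_mean, gen_mean))) %>%
    select(-alb_mean, -gen_mean)
}

impute_bpm(tracks_toy)
```

In this toy example the second track of album 1 gets the album mean (120), while the track on album 3 has no album-level bpm and falls back to the mean of genre 20 (130).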
#### 3(c) Discarded: EchoNest, Singer's Gender, Users Age, User's Language vs. Song Language
To increase the available information we looked for additional sources, such as API providers, and found EchoNest and an R interface (https://github.com/mukul13/rechonest). EchoNest provides a lot of information about songs, albums, artists, etc. Unfortunately, EchoNest was bought by Deezer's market competitor Spotify. Furthermore, we found a hint in the DSG forum that we may use the Deezer API but are not allowed to use other sources.
In a brainstorming session we thought about additional factors which influence whether we want to listen to a song or not. One aspect is the timbre of a voice or the gender of an artist, e.g. in combination with a genre. Since we were not allowed to use EchoNest and this information is not available in Deezer's API, we discarded this idea.
Thinking of one's parents' music taste, one could claim that older people tend to listen to older songs; one explanation could be that those users tend to listen to songs from their youth. The question was whether user_age and the year of release_date correlate. Unfortunately, user_age and release_date were not significant in our first models.
#### 3(d) Exploring New Features with Language Detection
The idea of this approach was to further explore the "extra info" data and check whether language detection can help identify new features. At first we cleaned the text, then used the "textcat" package to identify the language of the song title, album title and artist name. The next step was to include logical interactions between these variables: for example, if the song language and the album language are exactly the same, we group them; otherwise we consider it a variation. Below is the step-by-step procedure used to extract these language features.
```{r eval=FALSE}
# Add new features: join json file with media description (categorical variables)
library(jsonlite)
library(httr)
library(qdap)
library(tm)
extra = stream_in(file("/home/Deezer/60_data_other_models/extra_infos.json"))
extra = extra[1:100,]
str(extra)
##################################################
## add language feats (detect lang)
##################################################
library("textcat")
library("rvest")
library("stringr")
#text cleaning (in order to assign unique numeric value)
extra[] = lapply(extra, tolower)
extra[] = lapply(extra, removePunctuation)
extra[] = lapply(extra, stripWhitespace)
extra$songLang = textcat(extra$sng_title)
extra$albLang = textcat(extra$alb_title)
extra$artistLang = textcat(extra$art_name)
#if song lang and alb lang are exactly the same then group them, else consider it a variation
extra$langSngAlb = ifelse(extra$songLang == extra$albLang, extra$albLang, "variation")
extra$langAlbArt = ifelse(extra$albLang == extra$artistLang, extra$artistLang, "variation")
extra$langSngArt = ifelse(extra$songLang == extra$artistLang, extra$artistLang, "variation")
extra$langReg = ifelse(extra$songLang == "french" | extra$songLang == "german" | extra$songLang == "spanish" |
extra$songLang == "swedish" | extra$songLang == "italian" |
extra$songLang == "polish", "TopEuLang", "EnglishOrOther")
str(extra)
extra$media_id = as.numeric(extra$media_id)
# save(extra,file="extra_feats11.rda")
#add to train_test dataset
#train_test = left_join(train_test, extra, by = "media_id")
#str(train_test)
```
We checked the importance of these language features through the feature importance matrix in various XGBoost (label encoded), LightGBM and H2O models, but their contribution to accuracy was relatively low. The illustration of feature importance is included at the end of the XGBoost NUM and LightGBM model scripts.
* [LightGBM](https://github.com/pranavpandya84/deezer_report/blob/master/Models/LightGBM.rmd) (scroll down to the bottom for the feature importance matrix)
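A minimal sketch of how such a feature importance matrix is obtained with XGBoost (toy data only; in the project the full feature matrix including the language features was used instead):

```r
library(xgboost)

set.seed(1)
# Toy feature matrix; "hh" is constructed to drive the label.
X <- matrix(rnorm(500 * 4), ncol = 4,
            dimnames = list(NULL, c("songLang", "albLang", "hh", "wd")))
y <- rbinom(500, 1, plogis(2 * X[, "hh"]))

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 20)

# Gain/Cover/Frequency per feature, sorted by Gain
xgb.importance(model = bst)
```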
#### 3(e) Genre_id
In a first approach we tried to get enhanced information about the genres by querying the API for the genre_ids.
As seen in the visualization part, we were confronted with a vast amount of genre_id == 0, and it would be nice to know which genre it is.
```{r}
uniquegenres <- unique(all$genre_id) #getting all unique genre_ids
uniquegenres <- as.data.frame(paste("https://api.deezer.com/genre/",uniquegenres, sep="")) #paste the ids into the needed API-url to access the informations
uniquegenres$name <- "" #initialise a empty column
uniquegenres$genre_id <- unique(all$genre_id)
```
At this point we are able to query the API using the following loop. To show the results, we set the sample size to 50.
```{r}
#library(httr)
#library(jsonlite)
for (i in 1:50){ #first 50 genres
this.raw.result <- GET(url = as.character(uniquegenres[i,1])) # get data
this.result <- fromJSON(rawToChar(this.raw.result$content)) # turn data from unix to char and from json into a variable
uniquegenres$name[i] <- ifelse(is.null(this.result$name),"NA",this.result$name) # fill NA where NULL
Sys.sleep(time = 0.1) # System Sleep time to not overload the APIs capacity of 50 requests every 5 seconds
}
head(uniquegenres[,c(2,3)],10)
```
As we can see, the genre names are mainly NAs. Looking at row number 21, we see that genre_id == 0 refers to "all". `r uniquegenres[21,c(2,3)]`
Now that we know that we cannot and should not rely on the genre_id, we need to get the genre information from another source. Luckily, we have the album_id, and the API provides genre information for each album individually, which is not related to the genre_id in the first place.
**Genre Information from album_id**
```{r}
albums <- unique(all$album_id) #getting all album ids
albums <- paste("https://api.deezer.com/album/",albums, sep="") #get them in the right format for query the API
albums <- as.data.frame(albums)
albums$album_id <- unique(all$album_id)
#initiating some columns
albums$alb.genre <- ""
albums$alb.genre.id <- ""
```
In the next chunk we receive the genre information for the first 50 albums.
```{r}
for (i in 1:50){ #for 50 albums. you can replace 50 by length(albums$albums) to loop over all
this.raw.result <- GET(url = as.character(albums[i,1])) #get the infos
this.result <- fromJSON(rawToChar(this.raw.result$content)) #turn it into a readable format
albums$alb.genre[i] <- ifelse(is.null(this.result$genres$data[2]),"NA",this.result$genres$data[2]) #getting the written genre
albums$alb.genre.id[i] <- ifelse(is.null(this.result$genre_id),"NA",this.result$genre_id) #getting the genre_id from the album
#message(as.character(i), appendLF = FALSE) #print the iteration to see if the code is still working
Sys.sleep(time = 0.05) #cap the speed so the 50 per 5 seconds are not violated
}
formattable(head(albums,50))
```
So far we have created a new DF called "albums" with 4 columns: API query, album_id, genre and genre_id,
where genre is the written name of the genre and genre_id the corresponding ID.
We can see that some albums have more than one genre and some have NULL.
Looking back at the data, it would be quite easy to extract the lists of genres from the cells of the DF. But when we started the project, we were not that fond of performing transformations on lists of lists. And since this report demonstrates what we have done rather than what we should have done, here is our original approach.
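For completeness, the list-column route we avoided is short in modern R. A minimal sketch on made-up data (the real cells held the structures returned by fromJSON):

```{r}
# Toy stand-in for albums$alb.genre: each cell is a character vector of genres.
albums_toy <- data.frame(album_id = c(1L, 2L, 3L))
albums_toy$alb.genre <- list(c("Rock", "Alternative"), "Pop", character(0))

# Keep only the first listed genre per album; NA when the list is empty.
albums_toy$main_genre <- vapply(
  albums_toy$alb.genre,
  function(g) if (length(g) > 0) g[[1]] else NA_character_,
  character(1)
)
albums_toy$main_genre
```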
We used gsub to get rid of all symbols.
```{r}
albums[,'alb.genre2'] <- gsub("c\\(", "" , albums[,'alb.genre']) #remove "c("
albums[,'alb.genre2'] <- gsub("\\)", "" , albums[,'alb.genre2']) #remove ")"
albums[,'alb.genre2'] <- gsub("\"", "" , albums[,'alb.genre2']) #remove """
albums[,'alb.genre2'] <- gsub(" ", "" , albums[,'alb.genre2']) #remove " "
albums <- splitstackshape::cSplit(albums,splitCols = "alb.genre2",direction = "wide")
formattable(head(albums[,c(2,5,6,7)],10))
```
At this stage we decided to keep only the very first of the listed genres for each album, since it makes the most sense that the main genre is named first. We still have NULLs and NAs.
Our first approach to tackling the NULLs and NAs was to get each album's corresponding artist id. Then, with the artist information, we were able to look up the artist's most played genre to fill in missing data. For example, if artist 1 had 4 albums in the genre "rock", one could assume that a 5th album whose genre is missing is "rock" as well.
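On made-up ids, the artist-majority lookup described above boils down to a few lines of base R (a sketch, not the dplyr pipeline we actually ran):

```{r}
# Toy listening data: artist 7's known albums are mostly "Rock".
plays <- data.frame(
  artist_id = c(7, 7, 7, 7, 9),
  genre     = c("Rock", "Rock", "Rock", "Jazz", "Pop"),
  stringsAsFactors = FALSE
)

# Majority genre per artist - the lookup table we later merge onto the NA albums.
lookup <- sapply(split(plays$genre, plays$artist_id),
                 function(g) names(which.max(table(g))))

# A 5th album by artist 7 with missing genre would be filled as:
unname(lookup["7"])
```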
```{r, message=FALSE, warning=FALSE}
albums$artist <- all$artist_id[match(albums$album_id, all$album_id)] #joining artist information; match() gives one lookup position per album (a %in% logical index would be recycled and misalign)
#save(albums,file="bums_art.rda")
#load("bums_art.rda")
```
At this point we need to load data which was saved during our original run.
```{r}
load("/home/Deezer/30_Wrangled_Data/bums_art.rda")
formattable(head(albums,10))
```
Only with this data can you see what splitstackshape::cSplit really did: it cut the genres out of the list of lists and put each of them into a new column. Therefore using just the very first of these new columns is what we want, since it holds the first listed genre.
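What cSplit did here can be reproduced in base R on a toy vector; this sketch is only meant to make the wide split transparent, it is not the code we ran:

```{r}
# Base-R equivalent of splitstackshape::cSplit(..., direction = "wide"):
# split a comma-separated genre string into one column per entry.
genres <- c("Rock,Alternative", "Pop", "Jazz,Blues,Soul")
parts  <- strsplit(genres, ",", fixed = TRUE)
width  <- max(lengths(parts))
wide   <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, width - length(p))),
                   character(width)))
colnames(wide) <- paste0("alb.genre2_", sprintf("%02d", seq_len(width)))
wide # the first column, alb.genre2_01, is the main genre we keep
```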
To get going, we created a new DF which contains all the missing data (NAs and NULLs).
```{r}
Na.albums <- albums %>% filter(alb.genre.id=="NA" | alb.genre.id==-1) #get all NAs and NULLs into a separate DF
Na.albums <- as.data.frame(Na.albums[,c(1,20)])
colnames(Na.albums) <- c("album_id","artist_id")
albums2 <- albums[!albums$album_id %in% Na.albums$album_id,] #get clean data into a new DF
```
As we can see, we have `r nrow(Na.albums)` rows in Na.albums, which means that 20521 albums out of 151605 do not have a genre after querying the API.
```{r}
all2 <- all[,c(4,13)] #subset all to contain only album_id and artist_id
all2 <- left_join(all2,albums2[,c(1,4)],by="album_id") #joining our new genre
#creating artists lookup table
loup.artists <- all2 %>%
group_by(artist_id,alb.genre2_01) %>%
summarise(n=n()) %>% #how often a genre appears for an artist; note: listened/played songs matter more than the overall genres of released albums
filter(n==max(n))
Na.albums <- merge(Na.albums,loup.artists[,-3],by="artist_id") #reduced from 20,000 NA albums to 999
albums2 <- albums2[,c(20,1,4)]
albums2 <- bind_rows(albums2, Na.albums[!is.na(Na.albums$alb.genre2_01),-3]) #adding albums with new information to album2
Na.albums <- Na.albums[is.na(Na.albums$alb.genre2_01),] #999 albums still without genre
```
This implementation was a big success: we were able to make a valid guess at the genre_id by using the artist's main genre, reducing the number of NAs and NULLs by more than 95%.
To fill the final 999 genres, we used the bpm we had collected earlier to identify the genres; but, as mentioned, we also need what we know about the genre so far to adequately fill our gaps in bpm knowledge.
First we get all songs where our bpm is 0; we then fill in the mean bpm of the album the song is on, and if we don't have this information, we fill in the mean bpm of the genre.
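On toy data, this two-stage fill (album mean first, genre mean as fallback) can be sketched in base R; the chunk below does the same thing with merges:

```{r}
# Toy tracks: bpm == 0 marks a missing value.
toy <- data.frame(
  album = c("a", "a", "a", "b", "b"),
  genre = c("Rock", "Rock", "Rock", "Rock", "Rock"),
  bpm   = c(120, 124, 0, 0, 0),
  stringsAsFactors = FALSE
)

known     <- toy$bpm > 0
album_avg <- sapply(split(toy$bpm[known], toy$album[known]), mean)
genre_avg <- sapply(split(toy$bpm[known], toy$genre[known]), mean)

# Stage 1: album mean where the album has known tracks; stage 2: genre mean.
fill <- album_avg[toy$album]
fill[is.na(fill)] <- genre_avg[toy$genre[is.na(fill)]]
toy$bpm[!known] <- fill[!known]
toy$bpm
```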
Note: make sure you have run 2(b) already.
```{r warning=FALSE}
#bpm.data <- all[!duplicated(all$media_id),c("media_id","album_id")]
bpm.data <- unique(all[,c("media_id","album_id")]) #get all media
bpm.data<- merge(bpm.data,tracks,by="media_id") #join bpm
bpm.data<- merge(bpm.data,albums2[,c(-1,-4)],by="album_id", all.x=TRUE) #join genre_id we know so far
#creation of album and genre lookup tables for the average bpm
bpm.merge.album <- bpm.data %>% filter(track_bpm>0) %>% group_by(album_id) %>% summarise(track_bpm=mean(track_bpm))
bpm.merge.genre <- bpm.data %>% filter(track_bpm>0) %>% group_by(alb.genre2_01) %>% summarise(track_bpm=mean(track_bpm))
#fill in album_mean_bpm where it is known
bpmclean2 <- bpm.data[bpm.data$track_bpm==0,]
bpmclean2 <- merge(bpmclean2[,c(1,2,4)],bpm.merge.album, by="album_id",all.x=TRUE)
bpmclean3 <- bpmclean2[is.na(bpmclean2$track_bpm),] #where we dont have bpm data from the album
bpmclean3 <- merge(bpmclean3[,c(-4)],bpm.merge.genre, by="alb.genre2_01",all.x=TRUE) #join the information from the average_genre_lookup table
```
One very resilient little fellow amused us: among all 7.5 million songs in our data, there is only one "HörbuchaufDeutsch" (audiobook in German). Since it is the only one of its kind and an album by itself, we do not have any information on it, but we set the bpm to 80, which is on the lower end of the scale.
```{r}
bpmclean3[is.na(bpmclean3$track_bpm),]
bpmclean3[is.na(bpmclean3$track_bpm),4] <- 80
```
Now we only had to clean everything and bind it back into a single bpm lookup table, which we save as tracks_cleaned.rda.
```{r}
bpm1 <- bpm.data %>% filter(track_bpm>0)
bpm2 <- bpmclean2 %>% filter(!is.na(track_bpm))
bpm2 <- bpm2[,c(1,2,4,3)]
bpm3 <- bpmclean3[,c(2,3,4,1)]
tracks_clean <- rbind(bpm1,bpm2,bpm3)
tracks_clean <- tracks_clean[!duplicated(tracks_clean$media_id),]
tracks_clean2 <- tracks_clean[,c(2,3)]
#save(tracks_clean, file="data_for_genre_loop.rda")
#save(tracks_clean2,file="tracks_cleaned.rda")
```
```{r, message=FALSE, warning=FALSE}
load("/home/Deezer/30_Wrangled_Data/tracks_cleaned.rda")
myColors <- rep(brewer.pal(7,"Blues")[5:6],10)
ggplot(tracks_clean2, aes(x=track_bpm))+stat_bin(bins=20, fill=myColors, color="black")+
scale_y_continuous(limits=c(0,80000),breaks=seq(0,80000,5000))+
scale_x_continuous(breaks=seq(50,220,10),limits=c(50,220))+
labs(title="Distribution of Beats per Minute",
y="observations",
x="beats per minute",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("BPM_distribution.png",plot=gg8, width = 8, height=5, units="in")
```
To fill the missing genre_ids for our 999 observations, we fit a glm to predict the genre from the song's bpm, since we had already exhausted all other possibilities. Judging by how the genre_id is distributed across the known cases, a simple majority vote would suggest that the missing ones are "pop".
```{r, fig.width=8,fig.height=5}
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
library(treemap)
#png(filename="genre_treemap.png",width=16, height=9, units = "in",res=72)
treemap(DeezerNew_train_0525 %>% group_by(alb.genre2_01) %>% summarise(n=log(n())), index="alb.genre2_01",vSize = "n",title = "Logarithmic Distribution of Genres", palette = "GnBu")
#dev.off()
```
Note: this code will run for some minutes (glm for 47 models). For the sake of creating our html report, we set the chunks to `eval=FALSE` to prevent them from running. You can skip the following two chunks without missing too much information.
```{r eval=FALSE}
test.loop <- tracks_clean %>% filter(is.na(alb.genre2_01)) %>% select(album_id,track_bpm)
results <- test.loop
#loop and fit
for (i in levels(tracks_clean$alb.genre2_01)){
tracks_clean$y <- as.factor(as.numeric(tracks_clean$alb.genre2_01==i))
train.loop <- tracks_clean %>% filter(!is.na(alb.genre2_01)) %>% select(track_bpm,y)
fit <- glm(y~track_bpm,data=train.loop, family=binomial(link = "logit"))
results[,i] <- predict(fit,test.loop, type="response")
}
results <- results[,-2]
```
At this point we have made a prediction for every missing album and can show that all of them are considered to be "pop".
```{r eval=FALSE}
albums.finish <- melt(results,id.vars="album_id")
albums.finish <- albums.finish %>% group_by(album_id,variable) %>% summarise(value=sum(value))
albums.finish <- albums.finish %>% group_by(album_id) %>% filter(value==max(value))
table(albums.finish$variable)
```
The table shows that for all 20521 songs from our 999 missing albums, the looped glm model suggested that all of the songs are to be considered "pop", which also matches the majority vote.
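Since those chunks do not run in the report, here is a self-contained toy version of the one-vs-rest glm idea; the data and genre names are invented for illustration:

```{r}
set.seed(1)
# Toy data: two genres whose bpm distributions differ.
df <- data.frame(
  bpm   = c(rnorm(50, 100, 15), rnorm(50, 160, 15)),
  genre = rep(c("Pop", "Techno"), each = 50),
  stringsAsFactors = FALSE
)

# One binomial glm per genre (one-vs-rest), as in the loop above.
preds <- sapply(unique(df$genre), function(g) {
  fit <- glm((genre == g) ~ bpm, data = df, family = binomial(link = "logit"))
  predict(fit, data.frame(bpm = c(95, 165)), type = "response")
})
preds # rows: bpm 95 and 165; columns: predicted probability of each genre
```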
```{r warning=FALSE}
Na.albums$alb.genre2_01 <- "Pop"
albums2 <- bind_rows(albums2[,c(2,3)], Na.albums[,c(2,3)]) #adding albums with new information to album2
albums2 <- albums2[!duplicated(albums2$album_id),]
#save(albums2,file="album_genre_clean.rda")
```
#### 3(f) User Clusters by Genre
```{r}
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda") #we load a newer version of the data at this stage, but we only build on what has been worked out so far.
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)
```
One of our first thoughts about this project was: "the genre is the most important indicator of whether you like a song or not."
At this point in time we already had trustworthy information about the genre, so we could engineer further features.
In a first attempt we generated a clustering based on genre listening history. Sadly we cannot deliver the original code, because the Zeno server crashed while running it and the rmd file, unsaved at that point, was lost. Luckily, we had saved the output of the code in an rda.
Here is an attempt to recreate the code to give you an impression:
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and each genre
summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))
profs <- profs[,-3] %>%
dcast(user_id ~ alb.genre2_01, value.var="p") #dcast is the inverse of melt, creating a new column for each value pair
profs[is.na(profs)] <- 0
profs[1:10,c(1,3,13,17,29,41)]
```
This chunk gives us the opportunity to look at how genres are distributed for every individual user, e.g. user 1 listened 11% "Alternative", 12% "Dance", 24% "Electro" and so on.
We used this information to create clusters of similar users, choosing hclust over kmeans because we did not know how many clusters we wanted to end up with and did not want to rerun the code multiple times.
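This flexibility is what cutree() provides: one hclust fit can be cut at any number of clusters, while kmeans must be refit for every new k. A minimal sketch on artificial, well-separated data:

```{r}
set.seed(42)
# Two clearly separated groups of toy "genre profiles"
m <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
hc <- hclust(dist(m), method = "complete")

# One fit, many cuts - no refitting needed when k changes.
cutree(hc, k = 2)
cutree(hc, k = 5)
```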
The code looked somewhat like this; note that creating the distance matrix will take around 2 minutes on the zeno server and around 5-10 minutes on a regular machine, since it consists of 198 million elements.
```{r}
d <- dist(profs[,-1])
```
We want to use this journey back to our beginnings to show what the different clustering methods look like. In the end, we used "complete" linkage (which means that the maximum distance between members is used to link two clusters) and set the cut-off threshold to 35 clusters.
```{r}
c.tree.comp = hclust(d,method="complete")
w.tree.comp = hclust(d,method="ward.D2")
s.tree.comp = hclust(d,method="single")
```
```{r fig.width=8}
par(mfrow=c(1,3))
plot(c.tree.comp)
plot(w.tree.comp)
plot(s.tree.comp)
labs.comp = cutree(c.tree.comp,k=35)
#lookup <-as.data.frame(profs$user_id)
#lookup$profile_id <- labs.comp
#colnames(lookup) <- c("user_id","profile_id")
```
This new feature, added to our previously best model, increased our score by 4%.
We did not stop at this point, since some aspects are worrying: we do not take into account the number of observations or how often a genre was played; we only look at how often a user has listened to a genre. We kept developing the idea, which is worth a chapter of its own.
#### 3(g) User Behavior Index
```{r}
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)
```
Sometimes a simple idea can lead to a cascade of thoughts and breakthroughs, as happened in our case.
The thought was a simple one: "When we look at the distribution of genres for each unique user and cluster them, don't we actually ignore the preferences of the individual user?" What happens, for example, to a user who has multiple favorite genres? Imagine a user who listens to only 5 genres in an equally distributed fashion - each of those genres has a score of 0.2 in our previous approach. Now imagine another user who listens to two genres, one 80% and one 20%. At the moment we claim that the genres with a score of 0.2 are equally important, even though that is clearly not the case: for user 1, every genre is a favorite, while for user 2 the second genre is not.
How do we bypass this problem? By scaling our previous results per user. That means we take a user's favorite genre (the max() of his scores), set it to 1, and scale all other genres accordingly. With this transformation we are able to compare different users.
Going back to the example from above: for user 1, each of his/her 5 genres now has a score of 1, while for user 2 only the first genre is 1 and the second is scaled accordingly.
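On the two hypothetical users from above (rows padded with zeros for unlistened genres, as in the real data), the rescaling works out as:

```{r}
# Genre rows include zeros for unlistened genres, so min(x) == 0 as in our data.
u1 <- c(.2, .2, .2, .2, .2, 0, 0) # five equally loved genres
u2 <- c(.8, .2, 0, 0, 0, 0, 0)    # one dominant and one minor genre

scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

scale01(u1) # every listened genre becomes 1: all five are favorites
scale01(u2) # 1 and 0.25: the second genre is clearly not the favorite
```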
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and each genre
summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))
profs <- profs[,-3] %>%
dcast(user_id ~ alb.genre2_01, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x))))) #using min(x) instead of 0 for convenience, since for all observations min(x)==0
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything
#creating a lookup table for the final models; better not run it again, it takes ages
#genre_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_scaled")
#save(genre_scaled.molten,file="30_05_genre_scaled_lookup.rda")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(profs)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,0.65,0.05))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="relative frequency", title="Relative Genre Frequency",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("relative_genre.png",plot=gg1, width = 8, height=5, units="in")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(scaled)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,1,0.05))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="scaled relative frequency",title="Scaled Relative Genre Frequency",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("scaled_relative_genre.png",plot=gg2, width = 8, height=5, units="in")
```
As mentioned, one idea can lead to a cascade of ideas; here is our second thought: "What if the Deezer Flow algorithm suggested a genre very often, but it was not listened to often - yet still often enough to appear as a favorite genre for this user?" To get a sense of this possibility, we created a new table with the percentage of how often a genre is listened to when played, for every user.
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and genre
summarise(c = sum(is_listened),n=n(),p=c/n) #sum(is_listened) again, but this time n() is equal to count() and p as the quotient of c/n
profs <- profs[,-c(3,4)] %>%
dcast(user_id ~ alb.genre2_01, value.var="p") #long to wide format
profs[is.na(profs)] <- 0 #NAs are equal to 0
listened <- profs[,-1]
listened$user_id <- profs$user_id
#genre_listened.molten <- melt(listened, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_listened")
#save(genre_listened.molten,file="30_05_genre_listened_lookup.rda")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(listened)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,1,0.05),limits = c(0,1))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="average is_listened",title="Average is_listened per Genre",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("average_is_listened.png",plot=gg3, width = 8, height=5, units="in")
```
Combining the scaled genre preferences with the listening history is exactly what we had in mind. As one of the values decreases, the overall trustworthiness of the data goes down. As long as both values are high, or both extremely low, we can be pretty sure that a user has listened or not listened to the song, respectively.
One aspect we have not considered yet is the sample size. Up to this point we cannot differentiate between a user who has listened to one song of a genre exactly once (scaled value and listening behavior are both 1) and someone who has heard a genre hundreds of times.
To compensate for this in some way, we extracted this information as well:
```{r}
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and genre
summarise(n=n()) %>% #count
dcast(user_id ~ alb.genre2_01, value.var="n") #long to wide
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id
#genre_counts.molten <- melt(counts, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_counts")
#save(genre_counts.molten,file="30_05_genre_counts_lookup.rda")
```
```{r fig.width=8,fig.height=5, message=FALSE}
tmp <- melt(counts)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+scale_y_continuous(breaks=seq(0,90,5),limits = c(0,90))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="Count",title="Count per Genre",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("count.png",plot=gg4, width = 8, height=5, units="in")
```
Test whether columns and user_ids are arranged identically across the three tables.
```{r}
sum(names(listened) != names(scaled)) #should be 0
sum(names(listened) != names(counts))
sum(scaled$user_id!=listened$user_id)
sum(scaled$user_id!=counts$user_id)
```
In our last step we combined the three different values. To show our thought process throughout the report, we will show how the combination changed over some testing and iterative feedback rounds.
To demonstrate what drove the changes, we create a small set of example data, where:
```{r}
g <- c(1,.8,.6,.3) #Genre-Distribution
l <- c(.25,.7,.8,.95) #Listening-History
c <- c(200,68,45,19) #Count
```
Our first approach was `g*l*sqrt(c)`,
using c as a weight to get higher index values when we have more data available. We never intended to keep the value inside a 0:1 boundary.
```{r}
g*l*sqrt(c)
```
As we can see, there is not enough emphasis on the last case. The last genre has a 95% listening rate over 19 observations; if this genre were played again, we could be pretty sure the user is going to listen to it.
Additionally, case 1 still has the second highest value, which should indicate a high likelihood of "is_listened" for a future song of this genre. But the listening value is only 0.25, which indicates that it is rather unlikely that the user is going to listen to the genre next time.
In a first step, we exchanged the sqrt() for a log(x+1) function, which gives relatively less weight as the count grows; the +1 inside the log is needed because log(0) would otherwise be -Inf.
```{r}
par(mfrow=c(1,2))
plot(log(1:1000))
plot(sqrt(1:1000))
```
```{r}
g*l*log(c+1)
```
We can see that the values are closer together now; still, the log is a strong enough penalty for low-count cases.
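The combination at this stage can be wrapped into one small function; this is a sketch of the formula as it stands at this point in the report, not its final form:

```{r}
# g: scaled genre preference, l: listening ratio, c: observation count.
# log1p(c) is log(c+1), damping the count weight compared to sqrt(c).
behavior_index <- function(g, l, c) g * l * log1p(c)

g <- c(1, .8, .6, .3)    # genre distribution
l <- c(.25, .7, .8, .95) # listening history
c <- c(200, 68, 45, 19)  # count
behavior_index(g, l, c)  # case 2 scores highest, case 4 catches up relative to case 1
```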