Sparkify with Apache Spark MLlib & Data Science

Manipulate large and realistic datasets with Spark to engineer relevant features for predicting churn. Use Spark MLlib to build machine learning models with large datasets.

https://medium.com/@atif.zafar/7-critical-habits-of-music-service-subscribers-like-spotify-a126bd3201d2

Purpose

Predicting churn rates is a challenging and common problem that data scientists and analysts regularly encounter in any customer-facing business.

Additionally, the ability to efficiently manipulate large datasets with Spark is one of the highest-demand skills in the field of data.

Essential Skills

Load large datasets into Spark and manipulate them using Spark SQL and Spark DataFrames
Use the machine learning APIs within Spark ML to build and tune models
Apply data science techniques to the Spark dataset

Overview

Sparkify is a cloud music streaming service just like Spotify or Pandora.

We work on the Sparkify data team. Millions of users play songs on the service using either the free or the premium tier. Users can upgrade, downgrade, or cancel their service, so keeping them engaged matters. As the data scientist on this project, my task is to predict which users will cancel the service (churn) before it happens, to prevent millions of dollars in losses.

Strategy

We are provided with the user log of the service.

The user log contains demographic information, user activities, timestamps, whether the user is on the paid or free tier, and so on.

I have built a modeling workflow on this log to identify customers who are highly likely to stop using our service (customer churn).

We can use the findings to launch an A/B test and build a recommendation engine that keeps these users more engaged. We can also pass these users to the marketing team, who can reach out with emails about their favorite songs and offer them subscription discounts.

I have computed F1 scores, which combine precision and recall, to measure model performance.

Using the F1 score as a guide helps us predict more precisely which users are likely to churn, which in turn supports Sparkify's growth.
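
For reference, the binary F1 score is the harmonic mean of precision and recall. The plain-Python sketch below shows the arithmetic; the function name and the confusion counts are hypothetical, and an evaluator in Spark may average per-class scores, so its reported number need not match this binary form exactly.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Binary F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 6 churners caught, 2 false alarms, 4 churners missed
score = f1_from_counts(tp=6, fp=2, fn=4)
```

With these counts, precision is 0.75 and recall is 0.6, so a model cannot score well by favoring only one of the two.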

Results

Random Forest metrics:
Accuracy: 0.7755102040816326
F1 score: 0.6774571897724607

Conclusion

Length of service is the most important factor in predicting churn. A subscription downgrade also indicates whether the customer will churn. If a customer plays a lot of artists and adds a lot of songs to their playlist, that is a good sign. Lastly, the number of songs played per session is also vital in predicting churn.

Data

Customer user log

Packages

PySpark
Pandas
Seaborn
Matplotlib

Machine Learning Models

Support Vector Machines
Random Forest Classifier
Gradient Boosted Trees
Logistic Regression

Methodology

Data Pipeline & ETL
Exploratory Data Analysis
Customer Churn
Feature Engineering
ML Modeling
Evaluating Models
Feature importance analysis

