Churn Prediction Using Spark

Background

Sparkify is a fictional music streaming service created by Udacity. With Sparkify, users can listen to music for free (with ads between songs) or subscribe to the platform at a flat rate. Users can also upgrade, downgrade, or cancel their subscriptions.

The task is to predict which users are likely to leave so that they can be offered a discount before they cancel their subscription and leave the platform.

Project Steps:

Exploratory Data Analysis (EDA)
Feature engineering
Model building
Evaluation
Results
Conclusion
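
As a rough illustration of the feature engineering step, the sketch below turns raw event records into one row per user with a churn label and two simple counts. It is written in plain Python for readability, whereas the project itself works with Spark. The field names (`userId`, `page`), the "Cancellation Confirmation" page value used to mark churn, and the two features are assumptions made for this example and are not taken from the notebook.

```python
from collections import defaultdict

# Hypothetical event records; field names and page values are assumptions.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Thumbs Up"},
]


def build_user_features(events, churn_page="Cancellation Confirmation"):
    """Aggregate events into one feature row per user, with a churn label."""
    users = defaultdict(
        lambda: {"songs_played": 0, "total_events": 0, "churn": 0}
    )
    for event in events:
        row = users[event["userId"]]
        row["total_events"] += 1
        if event["page"] == "NextSong":
            row["songs_played"] += 1
        if event["page"] == churn_page:
            # The user reached the (assumed) cancellation page.
            row["churn"] = 1
    return dict(users)


features = build_user_features(events)
```

Here user "1" ends up labelled as churned and user "2" does not. On a large dataset the same grouping and counting would be expressed as distributed aggregations rather than a Python loop.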

Installation

Required Python libraries:

PySpark
Pandas
Matplotlib
Seaborn

Project Motivation

This is a capstone project done in conjunction with Udacity as part of the requirements for the Data Science Nanodegree program.

The goal of this project is to learn how to manipulate large and realistic datasets with Spark to engineer relevant features for predicting churn. I want to learn how to use Spark MLlib to build machine learning models with large datasets, far beyond what could be done with non-distributed technologies like scikit-learn.

Files Description

mini_sparkify_event_data.json: the data used, a small subset (128MB) of the full dataset (12GB)
Sparkify.ipynb: the notebook containing all steps involved in the project
gbt.model: the GBT model
svm.model: the SVM model
logistic.model: the logistic regression model

Results

After carrying out EDA to understand the different levels of the dataset and how they affected churn, we created features that were then used to build three machine learning models: logistic regression, Support Vector Machine (SVM), and Gradient Boosted Tree (GBT).

Among the three models, the GBT performed best overall: its F1 score was comparable to those of the other two models and its accuracy was by far the highest at about 99%, although it took longer to train.

The models produced the following results:

Logistic regression: best F1 score 0.7641, accuracy of the best model 83.88%
Support Vector Machine: best F1 score 0.7534, accuracy of the best model 83.69%
Gradient Boosted Tree: best F1 score 0.7621, accuracy of the best model 99.05%

Overall, the GBT was the best model.
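
Accuracy and F1 measure different things, so the two figures reported for a model need not move together, especially when churners are a small share of users. The sketch below computes both from confusion-matrix counts for a binary classifier. The counts are made-up numbers for illustration and are not the project's results, and this is the F1 of the positive class only; a library evaluator may average F1 across classes differently.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 100 users, 10 of whom churned.
tp, fp, fn, tn = 5, 1, 5, 89

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

With these counts the accuracy is 0.94 while the F1 score is only 0.625, because half of the churners are missed even though most non-churners are classified correctly.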

Blog

More detailed findings and analysis can be found in the blog post on Medium.

Acknowledgements

The data used in this project was provided by Udacity as part of the Data Science Nanodegree program.
