The objective of this project is to identify potential telecom churners at TNB Telco Inc. (a hypothetical company), enabling the company to take proactive measures to retain these customers. Using the Telco Customer Churn dataset (available on Kaggle), various machine learning models have been evaluated to achieve this goal. You can view the detailed report on
The primary goal is to identify possible telecom churners so that the company can implement strategies to retain these customers. While the implementation of retention strategies is outside the scope of this project, the insights provided can greatly inform decision-making.
This project utilizes the Kaggle Telco Customer Churn dataset, which contains comprehensive information about telecom customers, including their usage patterns, payment methods, and service preferences.
This project leverages a range of powerful frameworks and tools to ensure cutting-edge performance and efficiency. Here are the key technologies used:
- **Plotly**: Interactive data visualization library that brings your data to life.
- **Featuretools**: Automated feature engineering for creating meaningful features from raw data.
- **LightGBM**: Gradient boosting framework that uses tree-based learning algorithms.
- **Optuna**: Hyperparameter optimization framework to enhance model performance.
- **MLflow**: Platform for managing the end-to-end machine learning lifecycle.
- **DagsHub**: Collaborative data science platform for versioning and managing datasets and models.
- **Sphinx**: Documentation generator for creating beautiful project docs.
- **DVC**: Data version control system for managing data and model versions.
- **Scikit-learn**: Machine learning library for Python providing simple and efficient tools.
- **TensorFlow**: Open-source platform for machine learning and artificial intelligence.
- **CatBoost**: Gradient boosting library that handles categorical features efficiently.
- **Seaborn**: Statistical data visualization library built on top of Matplotlib.
- **Keras**: High-level neural networks API, written in Python and capable of running on top of TensorFlow.
Click here for more details on the Methodology
To ensure a thorough analysis and implementation, I explored multiple models and techniques rather than settling on the first workable approach. Below is a brief look at the methodologies that drove the project's results:
Several machine learning models were tested to predict customer churn, including LightGBM (LGB), XGBoost (XGB), CatBoost (Cat), and Artificial Neural Networks (ANN). After thorough comparison, LightGBM and ANN outperformed the rest, offering the best balance of accuracy and interpretability.
Featuretools was used for automatic feature construction, which proved to be highly effective. The top 15 features were mostly generated by Featuretools, highlighting the benefits of automated feature engineering.
I believe that sophisticated feature engineering is a key to improving model accuracy.
Missing values were handled using median imputation for numerical data, while categorical features received a special "missing" category. This was achieved using scikit-learn's ColumnTransformer and SimpleImputer classes.
To balance recall and precision, I used a custom weighted recall metric:
- Weighted Recall = 0.65 * Recall + 0.35 * F1 Score.
The model achieved a recall of 0.80 and a precision of 0.54. Emphasizing the recall ensures that the model captures as many churners as possible, which is crucial for customer retention strategies.
I've optimized the metrics based on the project’s goals, ensuring I’m providing real-world, actionable insights.
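The custom metric above is simple to implement as a plain function; the helper below is an illustrative implementation of the stated formula, not the project's exact code.

```python
# Weighted recall = 0.65 * recall + 0.35 * F1, as described above.
from sklearn.metrics import recall_score, f1_score

def weighted_recall(y_true, y_pred, recall_weight=0.65):
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return recall_weight * recall + (1 - recall_weight) * f1

# Toy example: 4 actual churners, 3 of them caught, 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(round(weighted_recall(y_true, y_pred), 3))
```

Because F1 already folds in precision, this blend rewards catching churners first while still penalizing models that flag everyone as a churner.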
- **Charges**: Higher churn among month-to-month customers is linked to charges; customers paying more are more likely to churn.
- **Senior Citizens**: Senior citizens churn at a notably higher rate, roughly double that of younger customers. They tend to be more cautious with their finances, leading them to reconsider non-essential services more frequently.
- **Automatic Payment Method**: Customers using automatic payment methods have a lower churn rate. The convenience of automatic payments reduces the likelihood of reconsidering their commitment to the service.
- **Fiber Optic Service**: Fiber optic customers churn at a higher rate, suggesting potential issues with reliability, speed, or customer support. Addressing these issues could help reduce churn and increase customer satisfaction.

Follow the steps below to get the project up and running and uncover the secrets behind its success.
Before you dive into the data, let’s get your environment set up:
- **Clone the Project:**

  ```shell
  git clone https://github.com/d-sutariya/customer_churn_prediction.git
  ```

- **Create and Activate a Virtual Environment:**

  ```shell
  python -m venv env
  # On Unix:
  source env/bin/activate
  # On Windows:
  env\Scripts\activate
  ```

- **Install Dependencies:**

  ```shell
  pip install -r requirements.txt
  ```

- **Run the Setup Script:**

  ```shell
  cd customer_churn_prediction
  python src/config/setup_project.py
  ```
Now you're ready to transform raw data into valuable insights that could change business operations.
Transform raw customer data into a training-ready dataset with the following command:
- **Run the Transformation Script:**

  ```shell
  python src/data/make_dataset.py --input_file_path data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv
  ```
Curious to see the process in action? Explore my Jupyter notebooks for an in-depth look!
Here’s where the real magic happens. My Jupyter notebooks offer deep insights into customer churn predictions. Dive into them to see innovative approaches and results:
- **Explore the Notebook:**
These notebooks are not just scripts—they are a window into the detailed thought process behind every step.
Head over to the src/ directory to find core production scripts designed for efficiency and scalability:
- **ETL Pipeline Script:**
- **Data Pipeline Configuration:**
- **Hyperparameter Optimization:**
Imagine these scripts as part of your production pipeline. They are designed to be efficient and scalable.
The journey doesn’t end with deployment. The post_deployment/ directory includes scripts for:
- Transforming new data.
- Periodically retraining the model.
Check the scripts here:
These scripts ensure your operations team stays ahead of potential issues and maintains model accuracy over time.
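As a rough illustration of the retraining idea, the skeleton below refits a model on historical data combined with newly collected labeled data. The model choice and synthetic data are placeholders; the project's real logic lives in `post_deployment/`.

```python
# Hedged sketch of periodic retraining: fold new labeled data into the
# training set and refit. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

def retrain(model, X_old, y_old, X_new, y_new):
    """Refit the model on the combined historical and new data."""
    X = np.vstack([X_old, X_new])
    y = np.concatenate([y_old, y_new])
    model.fit(X, y)
    return model

# Historical batch and a smaller freshly labeled batch.
X_old, y_old = make_classification(n_samples=300, random_state=0)
X_new, y_new = make_classification(n_samples=50, random_state=1)

model = retrain(LogisticRegression(max_iter=1000), X_old, y_old, X_new, y_new)
print(model.score(X_new, y_new))
```

A production version would typically gate the refreshed model behind a metric check (e.g. the weighted recall above) before replacing the deployed one.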
I encourage you to explore this project thoroughly. From cutting-edge data transformations to production-ready pipelines, every piece has been crafted to address real-world problems.
As you delve into the materials, I hope you see the value and potential of this project and how it could fit into your business.
```
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third-party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- Final, canonical datasets ready for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- Project documentation.
│
├── models             <- Trained and serialized models.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── post_deployment    <- Scripts related to post-deployment activities.
│
├── reports            <- Feature transformation definitions, predictions, and mlflow runs.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
│
├── setup.py           <- Makes project pip installable (pip install -e .) so src can be imported.
│
├── src                <- Source code for use in this project.
│   ├── config         <- Script for setting up the project locally.
│   ├── data           <- Scripts to download or generate data.
│   │   ├── make_dataset.py
│   │   └── data_utils.py <- Data processing utilities.
│   ├── features       <- Scripts to turn raw data into features for modeling.
│   │   └── generate_and_transform_features.py <- Generate and transform features using Featuretools.
│   ├── models         <- Scripts to train models and use them for predictions.
│   │   ├── predict_model.py
│   │   └── train_model.py
│   ├── optimization   <- Scripts related to model optimization.
│   │   ├── ensemble_utils.py <- Utilities for ensembling models.
│   │   ├── model_optimization.py <- Manual model optimization.
│   │   └── tuning_and_tracking.py <- Hyperparameter tuning and tracking using MLflow and DagsHub.
│   ├── pipeline       <- DVC pipeline for data cleaning to model predictions.
│   │   └── dvc.yaml   <- Full pipeline configuration.
│
└── tox.ini            <- Tox file with settings for running tests and managing environments.
```
Feel free to reach out with any questions or feedback. I look forward to your thoughts!