This project leverages machine learning to predict customer churn by analyzing customer data. The goal is to identify customers likely to discontinue services, enabling businesses to take proactive measures to improve customer retention.
- Loaded Data:
  - Imported the dataset `customer_churn_data.csv` using Pandas:

        import pandas as pd
        df = pd.read_csv("customer_churn_data.csv")

- Inspected Dataset:
  - Used `df.info()`, `df.shape`, and `df.head()` to examine data structure, dimensions, and the first few records.
- Handled Missing Values:
  - Replaced missing values in the `InternetService` column with empty spaces.
- Checked for Duplicates:
  - Counted duplicate rows with `df.duplicated().sum()`. This call only reports the count; removing duplicates takes a separate `df.drop_duplicates()` call.
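
The missing-value and duplicate steps above can be sketched on a small made-up frame. This is a minimal illustration rather than the project's exact code: the toy rows are invented, and `fillna("")` and `drop_duplicates()` are assumed to be the calls behind the description.

```python
import pandas as pd

# Toy data (invented for illustration); the real project reads customer_churn_data.csv
df = pd.DataFrame(
    {
        "CustomerID": [1, 2, 2, 3],
        "InternetService": ["DSL", None, None, "Fiber Optic"],
        "MonthlyCharges": [50.0, 70.5, 70.5, 99.9],
    }
)

# Replace missing InternetService values with an empty string (assumed approach)
df["InternetService"] = df["InternetService"].fillna("")

# Count fully duplicated rows, then drop them
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, df.shape)
```

Counting first and dropping second keeps a record of how many rows were removed, which is useful when reporting on data quality.
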
- Calculated descriptive statistics using `df.describe()` to summarize the central tendency, dispersion, and shape of the dataset's distribution.
- Visualized churn distribution using pie charts to identify the proportion of churned and non-churned customers: `df['Churn'].value_counts().plot(kind='pie')`
- Analyzed `MonthlyCharges` and `Tenure` using histograms to understand their distributions and detect any skewed patterns: `plt.hist(df['MonthlyCharges'])`
- Explored relationships between `ContractType` and `MonthlyCharges` using bar plots, showing how charges vary by contract type: `df.groupby('ContractType')['MonthlyCharges'].mean().plot(kind="bar")`
- Calculated correlations between numerical columns to identify relationships between variables: `numeric_col_data.corr()`
- Churn distribution: Pie chart shows the proportion of churned vs. retained customers.
- Monthly charges/tenure distributions: Histograms reveal the distribution of these numerical features.
- Contract type vs. charges: Bar plot illustrates the relationship between contract type and average monthly charges.
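
The correlation step refers to `numeric_col_data`, which is not defined in the snippets shown here. One plausible way to build it, assuming it holds only the numeric columns of `df`, is `select_dtypes`. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy data (invented); column names follow the ones used in this README
df = pd.DataFrame(
    {
        "Age": [25, 40, 33, 58],
        "Tenure": [2, 24, 12, 60],
        "MonthlyCharges": [30.0, 75.0, 55.0, 120.0],
        "ContractType": ["Month-to-Month", "One-Year", "Month-to-Month", "Two-Year"],
    }
)

# Keep only numeric columns so text columns such as ContractType are excluded
numeric_col_data = df.select_dtypes(include="number")

# Pairwise Pearson correlation matrix between the numeric columns
corr_matrix = numeric_col_data.corr()
print(corr_matrix.round(2))
```

Filtering the columns first avoids errors that recent pandas versions raise when `corr()` meets non-numeric data.
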
- Converted Categorical Variables:
  - Transformed `Gender` and `Churn` columns into numerical values:

        df['Gender'] = df['Gender'].apply(lambda x: 1 if x == "Female" else 0)
        df['Churn'] = df['Churn'].apply(lambda x: 1 if x == "Yes" else 0)

- Feature Selection:
  - Selected relevant features: `Age`, `Gender`, `Tenure`, `MonthlyCharges`.
- Data Splitting:
  - Split dataset into training and testing sets:

        from sklearn.model_selection import train_test_split
        X = df[['Age', 'Gender', 'Tenure', 'MonthlyCharges']]
        y = df['Churn']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- Feature Scaling:
  - Standardized numerical features using `StandardScaler`:

        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

## Model Training
- Logistic Regression:

      from sklearn.linear_model import LogisticRegression
      log_model = LogisticRegression()
      log_model.fit(X_train, y_train)

- K-Nearest Neighbors:

      from sklearn.neighbors import KNeighborsClassifier
      knn_model = KNeighborsClassifier()  # using best parameters from GridSearchCV
      knn_model.fit(X_train, y_train)

- Support Vector Machine:

      from sklearn.svm import SVC
      svm_model = SVC()  # using best parameters from GridSearchCV
      svm_model.fit(X_train, y_train)

- Decision Tree:

      from sklearn.tree import DecisionTreeClassifier
      dt_model = DecisionTreeClassifier()  # using best parameters from GridSearchCV
      dt_model.fit(X_train, y_train)

- Random Forest:

      from sklearn.ensemble import RandomForestClassifier
      rf_model = RandomForestClassifier()  # using best parameters from GridSearchCV
      rf_model.fit(X_train, y_train)

- Performed hyperparameter tuning for KNN, SVM, Decision Tree, and Random Forest using `GridSearchCV` to optimize model performance.
- Evaluated model performance using accuracy score, where `model` stands for each fitted model in turn:

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(y_test, model.predict(X_test))

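
The tuning step is described but not shown. A minimal sketch of how `GridSearchCV` could be applied to the SVM follows. The parameter grid is hypothetical (the values the project actually searched are not given), and synthetic data from `make_classification` stands in for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four churn features (Age, Gender, Tenure, MonthlyCharges)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hypothetical search grid; not taken from the original project
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validated search over the grid, scored by accuracy
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# best_estimator_ is refit on the full training split by default (refit=True)
svm_model = grid.best_estimator_
accuracy = accuracy_score(y_test, svm_model.predict(X_test))
print(grid.best_params_, round(accuracy, 3))
```

The same pattern applies to KNN, Decision Tree, and Random Forest by swapping the estimator and the grid.
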
- Best Model:
  - Selected Support Vector Machine (SVC) as it achieved the highest accuracy.
- Save Model:
  - Stored the trained model for future predictions (`svm_model` is the SVM fitted above):

        import joblib
        joblib.dump(svm_model, "model.pkl")

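
The Streamlit app calls `scaler.transform` before predicting, so the fitted scaler has to be available at prediction time as well as the model. A minimal sketch of saving and reloading both with `joblib` follows. The filename `scaler.pkl`, the temporary directory, and the tiny training set are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny invented training set with columns: Age, Gender, Tenure, MonthlyCharges
X_train = np.array(
    [[25, 1, 2, 90.0], [40, 0, 36, 45.0], [33, 1, 5, 110.0], [58, 0, 60, 35.0]]
)
y_train = np.array([1, 0, 1, 0])

# Fit the scaler and the model on the scaled training data
scaler = StandardScaler()
svm_model = SVC().fit(scaler.fit_transform(X_train), y_train)

# Save both objects; "scaler.pkl" is an assumed filename
out_dir = tempfile.mkdtemp()
joblib.dump(svm_model, os.path.join(out_dir, "model.pkl"))
joblib.dump(scaler, os.path.join(out_dir, "scaler.pkl"))

# Reload as an app would, then predict for one new customer
model = joblib.load(os.path.join(out_dir, "model.pkl"))
loaded_scaler = joblib.load(os.path.join(out_dir, "scaler.pkl"))
new_customer = [[30, 1, 3, 100.0]]
prediction = model.predict(loaded_scaler.transform(new_customer))[0]
print(prediction)
```

Reusing the saved scaler means new inputs are scaled with the training-set mean and variance, which is what the model saw during fitting.
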
A Streamlit app was developed to allow users to input customer details and predict churn likelihood:

    import joblib
    import streamlit as st

    # Load the trained model saved earlier. The fitted scaler must be saved and
    # loaded the same way; "scaler.pkl" is an assumed filename.
    model = joblib.load("model.pkl")
    scaler = joblib.load("scaler.pkl")

    # Input fields
    age = st.number_input('Age', min_value=18, max_value=90, value=40, step=1)
    tenure = st.number_input('Tenure', min_value=0, max_value=130, value=10)
    monthlycharge = st.number_input('Monthly Charges', min_value=30, max_value=150, value=50)
    gender = st.selectbox('Gender', ['Male', 'Female'])

    # Button to calculate prediction
    calculate = st.button('Calculate')
    st.divider()

    # Prepare input for model, using the same encoding and feature order as in training
    gender_num = 1 if gender == 'Female' else 0
    X = [[age, gender_num, tenure, monthlycharge]]

    if calculate:
        st.balloons()
        # Scale input and make prediction
        X_scaled = scaler.transform(X)
        prediction = model.predict(X_scaled)[0]
        # Display results
        result = "Yes" if prediction == 1 else "No"
        st.write(f"Prediction: {result}")
    else:
        st.write("Please fill in the values and click 'Calculate'")

## Result
- If the prediction output is 1, the app displays "Prediction: Yes", indicating that the customer is likely to churn.
- If the prediction output is 0, the app displays "Prediction: No", indicating that the customer is not likely to churn.
- Influential Factors: Monthly charges, contract type, and tenure significantly affect customer churn.
- Best Model: Support Vector Machine (SVC) outperformed other models in accuracy.
- Established Workflow: A complete pipeline was developed for preprocessing, training, evaluation, and deployment of the churn prediction model.