The objective of this study is to utilize Machine Learning methods to perform customer segmentation. Customer segmentation involves dividing a market into distinct groups of customers who exhibit similar characteristics. By leveraging customer segmentation, companies can effectively identify and address unmet customer needs, gaining a competitive edge through the development of highly appealing products and services.
├── Image
│ ├── 2.1 Original dataset.jpg <- original dataset used in the README
│ ├── 2.2 Statistics.jpg <- statistics used in the README
│ ├── 2.3 Income segmentation.jpg <- Income segmentation used in the README
│ ├── 3.1 Visual inspection.jpg <- visual inspection to decide K used in the README
│ ├── 3.2 Elbow Method.jpg <- Elbow Method to decide K used in the README
│ ├── 3.3 Silhouette Score.jpg <- Silhouette Score to decide K used in the README
│ ├── 4.1 K-means.jpg <- visual inspection of K-means method used in the README
│ ├── 4.2 Hierarchical Clustering.jpg <- visual inspection of Hierarchical Clustering used in the README
│ ├── 5.1 Features of clusters.jpg <- features of clusters used in the README
│
├── Code_Mall_Customer_Segmenation.ipynb <- python code
├── Data_Mall_Customers.csv <- dataset
├── LICENSE.txt <- license
Clustering, an unsupervised machine learning technique, is employed for customer segmentation. Clustering aims to discover inherent groups or clusters within data, without prior knowledge of their existence. The following highlights the advantages and disadvantages of utilizing clustering for customer segmentation.
Advantages of clustering:
- Facilitates the identification of unexpected or unknown customer groups.
- Provides flexibility and can be applied to diverse datasets.
- Reduces the necessity for extensive expertise in understanding the relationship between customer demographics and behaviors.
- Offers quick and scalable analysis, even with large datasets.
Disadvantages of clustering:
- Generated customer groups may lack interpretability and clarity.
- If the data does not incorporate customer behavior information, such as purchase history or service usage, the practical utilization of identified clusters might be challenging.
By considering these factors, businesses can make informed decisions when leveraging clustering techniques for customer segmentation, ensuring meaningful and actionable insights that drive strategic success.
This project uses the Mall Customer Segmentation Data dataset hosted on Kaggle (https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python). Exploratory data analysis (EDA) examines a data set and summarizes its main characteristics, often through data visualization. Its purpose here is to understand the data and to guide the analysis that follows.
- Original Dataset
Name | Modeling Role | Data Type | Description |
---|---|---|---|
CustomerID | feature | int64 | Unique ID assigned to the customer |
Gender | feature | Object | Gender of the customer |
Age | feature | int64 | Age of the customer |
Annual Income (k$) | feature | int64 | Annual Income of the customer |
Spending Score | feature | int64 | Score assigned by the mall based on customer behavior and spending nature |
- Statistics
- Income Segmentation
The data are standardized in two steps, applied to each column:
- First, subtract the column mean from each data point. This centers the data around 0, which makes it easier to inspect and interpret, although centering is not strictly required for clustering.
- Second, divide each centered value by the column's standard deviation, so that every feature has unit variance and contributes on the same scale to the distance calculations.
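
The two steps above can be sketched with NumPy as follows. This is a minimal illustration, not the notebook's actual code: the `standardize` helper and the toy values are hypothetical stand-ins for the income and spending columns.

```python
import numpy as np

def standardize(X):
    """Center each column at 0 and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)   # step 1: per-column mean to subtract
    std = X.std(axis=0)     # step 2: per-column (population) standard deviation
    return (X - mean) / std

# Hypothetical toy rows standing in for [Annual Income (k$), Spending Score]
toy = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [19, 40]])
scaled = standardize(toy)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

scikit-learn's `StandardScaler` performs the same transformation and also uses the population standard deviation, so it can be swapped in when the scaling parameters need to be reused on new data.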
- Simple Visual Inspection to Choose the Optimal K
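
Besides visual inspection, the image list above mentions the Elbow Method and the Silhouette Score as ways to decide K. A rough sketch of both with scikit-learn is shown here. The synthetic `make_blobs` data, the range of K values, and the variable names are assumptions made for illustration; the notebook would use the scaled customer features instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features (assumption)
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.5, random_state=42)

inertias = {}     # within-cluster sum of squares, used for the elbow plot
silhouettes = {}  # mean silhouette coefficient, higher is better

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

# The elbow is read off a plot of inertias against k;
# the silhouette criterion simply picks the k with the highest score.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

The two criteria do not always agree, so it is worth checking both against the scatter plot before fixing K.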
- K-means
- Hierarchical Clustering
The five clusters can be characterized as follows:

- Cluster 0: Medium Income, Medium Spending Score
- Cluster 1: High Income, Low Spending Score
- Cluster 2: Low Income, Low Spending Score
- Cluster 3: Low Income, High Spending Score
- Cluster 4: High Income, High Spending Score

The advantage of machine learning-based clustering is its ability to expedite the segmentation process and discover patterns without requiring extensive domain knowledge. There are various methods available for ML clustering, such as K-means, K-medians, and hierarchical clustering, each with its own strengths and limitations. K-means requires the number of clusters (K) to be fixed in advance, whereas hierarchical clustering builds a tree of merges that can be cut at different levels to yield different numbers of clusters. It is crucial to compare the results of these techniques objectively.
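
One way to make such a comparison concrete is to run both methods with the same K and score the results with the same metrics. The sketch below is only an illustration under stated assumptions: it uses synthetic data in place of the scaled customer features, Ward linkage for the hierarchical model, and the silhouette score and adjusted Rand index as metrics. None of these choices is confirmed as what the notebook does.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the scaled customer features (assumption)
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.5, random_state=42)
k = 5  # same number of clusters for both methods

kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# Internal quality of each partition (range -1 to 1, higher is better)
scores = {
    "kmeans": silhouette_score(X, kmeans_labels),
    "hierarchical": silhouette_score(X, hier_labels),
}

# Agreement between the two partitions (1.0 means identical groupings)
agreement = adjusted_rand_score(kmeans_labels, hier_labels)

print(scores, agreement)
```

A high agreement value suggests the segments are a stable property of the data rather than an artifact of one algorithm.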
Moreover, the choice of the number of segments (K) should align with business requirements. Fewer segments can provide a simplified and interpretable understanding of customers, while more segments allow for finer-grained customer segmentation. However, the clusters, regardless of their quality, hold no significance if they are not actionable for the business. Non-actionability can arise in two ways:
- The clusters lack business rationale.
- The number of clusters is excessively large.

In essence, machine learning techniques must consider business value and strategy to ensure that the insights derived are meaningful and actionable. By doing so, these insights become invaluable and can drive impactful business decisions.

- Baig, M. R., Govindan, G., & Shrimali, V. R. (2021). Data Science for Marketing Analytics: A practical guide to forming a killer marketing strategy through data analysis with Python (2nd ed.), Chapter 3: Unsupervised Learning and Customer Segmentation (pp. 113-159). Packt Publishing.
- Sagar, A. (2019, August 24). Customer Segmentation Using K Means Clustering. https://towardsdatascience.com/customer-segmentation-using-k-means-clustering-d33964f238c3.