Using Machine Learning to Identify Customer Segments
Project Overview
Fortunately, in data science you do not always need labels, because we have unsupervised learning. It is generally less accurate than supervised learning, but when labels are unavailable it comes to the rescue. Clustering, an unsupervised technique, is used by companies to identify their customers and group them into segments; those segments can then be used for targeted campaigns and customised solutions for each segment.
Problem Statement
Launching marketing campaigns without a target audience is a waste of money: you spend more and get less out of the campaign. To identify your target audience, you have to study and understand your current customers, and to do that you have to group them into segments based on each customer's characteristics. In this project I build segments out of the customers so the mail-order company knows whom to target in its marketing campaigns.
Metrics
My metric in this project is the elbow method, which is used to identify an appropriate number of clusters in the data.
Data Exploration
Fortunately, this project comes with plenty of data. First, Udacity_AZDIAS_Subset.csv contains demographic data representing the general German population. Udacity_CUSTOMERS_Subset.csv contains the same kind of demographic data, but for the customers of a specific mail-order company. Data_Dictionary.md describes the data above, and AZDIAS_Feature_Summary.csv summarizes the columns of the demographic data.
In the two main demographic files, each record represents an individual and carries multiple attributes of that individual, such as neighbourhood and building type, which is useful information for segmenting these people.
Data Visualization
Data Preprocessing
- In order to capture missing values, I converted them all to NaNs. I then visualized how many NaNs each feature has; if a column is almost empty, with most of its values missing, that column should be dropped.
Investigating patterns in the amount of missing data in each column
Looking at the number of missing values per column, six unusual columns stand out with very high counts of missing values, which would hurt the model during training. I therefore decided to remove these six columns, each with more than 200,000 NaN values:
- AGER_TYP
- GEBURTSJAHR
- TITEL_KZ
- ALTER_HH
- KK_KUNDENTYP
- KBA05_BAUMAX
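The column-dropping step above can be sketched as follows. This is a minimal illustration with a toy DataFrame, not the original notebook code; in the real analysis the cutoff was 200,000 NaNs.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing):
    """Drop every column whose NaN count exceeds max_missing."""
    nan_counts = df.isnull().sum()
    to_drop = nan_counts[nan_counts > max_missing].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Toy example: column "b" has 3 NaNs and is dropped with max_missing=2.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, np.nan, np.nan, 4.0]})
cleaned, dropped = drop_sparse_columns(toy, max_missing=2)
```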
After visualizing the missing values in each row, I found around 93k rows with missing values, or roughly 10% of the data, and I set the threshold at 30 missing values per row.
After dividing the data into two subsets, one for rows with more than 30 missing values and a second for rows below the threshold, I visualized the distributions of non-missing features in the data with many missing values versus the data with few or no missing values. I selected 5 features to visualize.
After the analysis, I noticed that the distribution of data with many missing values differs from that of data with few or no missing values, meaning those rows could be special and should be kept.
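The row-splitting step described above can be sketched like this (a toy illustration; the real threshold was 30 missing values per row):

```python
import numpy as np
import pandas as pd

def split_by_row_missing(df, threshold):
    """Split rows into a 'few missing' subset (<= threshold NaNs per row)
    and a 'many missing' subset (> threshold NaNs per row)."""
    row_nans = df.isnull().sum(axis=1)
    return df[row_nans <= threshold], df[row_nans > threshold]

# Toy example with threshold=1: the middle row (2 NaNs) lands in `many`.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.nan, 6.0]})
few, many = split_by_row_missing(toy, threshold=1)
```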
- Re-Encoding and selecting Features
Since the unsupervised learning techniques to be used require numerically encoded data, I had to make a few encoding changes:
- Re-encode binary categorical features.
- Remove multi-level categorical features from the dataset.
- Re-engineer mixed-type features by constructing new variables.
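The first two steps can be sketched as below. The column names and value mapping here are illustrative stand-ins chosen for the example; in the real project the binary maps and the list of multi-level categoricals come from the data dictionary.

```python
import pandas as pd

def reencode(df, binary_maps, multilevel_cols):
    """Map non-numeric binary categoricals to 0/1 and drop multi-level
    categoricals that would be hard to encode numerically."""
    df = df.copy()
    for col, mapping in binary_maps.items():
        df[col] = df[col].map(mapping)
    return df.drop(columns=multilevel_cols)

# Hypothetical example: encode a two-level region flag, drop a
# multi-level categorical column.
toy = pd.DataFrame({"region": ["W", "O", "W"], "citytype": [1, 3, 5]})
encoded = reencode(toy, {"region": {"W": 1, "O": 0}}, ["citytype"])
```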
Feature Transformation
Feature Scaling
I performed feature scaling: I used an imputer to deal with NaNs by replacing them with the column median, then applied a standard scaler to standardize the values.
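With scikit-learn, the impute-then-scale step looks roughly like this (a sketch on a tiny array rather than the actual dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, np.nan], [2.0, 3.0], [3.0, 5.0]])
X_imputed = SimpleImputer(strategy="median").fit_transform(X)  # NaN -> column median
X_scaled = StandardScaler().fit_transform(X_imputed)           # zero mean, unit variance
```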
Dimensionality Reduction
On the scaled data, I applied PCA in order to find the vectors of maximal variance in the data. From this analysis, I found that 95% of the variability appeared in the first 50 components, so I kept only the top 50 components.
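The "pick the smallest number of components covering 95% of the variance" step can be sketched like this; the random matrix stands in for the scaled demographic data, where the answer came out to 50.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for the scaled data

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.95) + 1)  # smallest k covering 95%

pca_final = PCA(n_components=n_components).fit(X)
X_reduced = pca_final.transform(X)
```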
Interpreting Principal Components
I mapped each weight to its corresponding feature, then sorted the features by weight. The most interesting features for each principal component are those at the beginning and end of the sorted list, as they represent the strongest positive and negative correlations with the component, respectively.
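That weight-sorting step can be sketched as follows; the fitted PCA on random data is only a stand-in so the helper has something to work on.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_weights(pca, feature_names, component, n=3):
    """Return the n most positive and n most negative feature weights
    of one principal component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    ordered = weights.sort_values()
    return ordered.tail(n)[::-1], ordered.head(n)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))  # stand-in for the scaled data
pca = PCA(n_components=3).fit(X)
pos, neg = top_weights(pca, [f"f{i}" for i in range(6)], component=0)
```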
Component 1
Positive
- CAMEO_INTL_2015_Wealth (financial status of households)
- EWDICHTE (Density of households / square kilometer)
- ORTSGR_KLS9 (Size of community)
- FINANZ_SPARER (Financial typology)
- PLZ8_ANTG3 (Number of 6–10 family houses in the PLZ8 region).
Negative
- KBA05_GBZ (Number of buildings in the microcell)
- PLZ8_ANTG1 (Number of 1–2 family houses in the PLZ8 region)
- MOBI_REGIO (Movement patterns)
- FINANZ_MINIMALIST (Financial typology)
- KBA05_ANTG1 (Number of 1–2 family houses in the microcell)
We can conclude that component 1 is correlated with the financial status of a household, in addition to house size and number of family members. We can also assume that component 1 correlates negatively with financial saving.
Component 2
Positive
- ALTERSKATEGORIE_GROB (Estimated age based on given name)
- SEMIO_LUST (Personality typology)
- RETOURTYP_BK_S (Return type)
- FINANZ_VORSORGER (Financial typology)
- SEMIO_ERL (Personality typology)
Negative
- SEMIO_PFLICHT (Personality typology)
- SEMIO_REL (Personality typology)
- PRAEGENDE_JUGENDJAHRE_decade (Dominating decade of person’s youth)
- FINANZ_UNAUFFAELLIGER (Financial typology)
- FINANZ_SPARER (Financial typology)
For component 2, we can see that age, personality traits and financial behaviors are correlated with the component; financial behaviors such as purchase return type and financial preparedness have a positive relationship with it.
Component 3
Positive
- SEMIO_VERT (Personality typology)
- SEMIO_KULT (Personality typology)
- SEMIO_FAM (Personality typology)
- PLZ8_ANTG4 (Number of 10+ family houses in the PLZ8 region)
- SEMIO_SOZ (Personality typology)
Negative
- SEMIO_ERL (Personality typology)
- SEMIO_DOM (Personality typology)
- ANREDE_KZ (Gender)
- SEMIO_KRIT (Personality typology)
- SEMIO_KAEM (Personality typology)
For component 3, we can see that personality traits have a big effect, as they appear on both the positive and negative sides of the graph.
Implementation
K-means is an unsupervised machine learning algorithm that places a chosen number of centroids and allocates every data point to the nearest one, keeping the within-cluster distances as small as possible. To determine the optimal number of clusters into which the data may be grouped, I used the elbow method. However, the analysis above showed no clear elbow, so I decided to go with 20 clusters, since the score kept decreasing steadily with no obvious bend.
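The elbow-method loop described above can be sketched as below; the random matrix stands in for the PCA-transformed data, and in practice the scores would be plotted against k to look for the bend.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))  # stand-in for the PCA-transformed data

scores = []
ks = range(1, 8)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores.append(km.inertia_)  # within-cluster sum of squared distances
# Plot ks vs. scores and look for the bend ("elbow"); when there is no
# clear bend, the cut-off is a judgment call, as with k = 20 here.
```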
Refinement
I applied all the previous steps to the customer dataset to compare the results with the general population. I wanted to find which clusters are overrepresented and which are underrepresented in the customer dataset compared to the general population.
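The over-/under-representation comparison amounts to comparing cluster shares between the two datasets, which can be sketched with toy label lists like this:

```python
import pandas as pd

def cluster_proportions(labels, n_clusters):
    """Share of points falling into each cluster (0..n_clusters-1)."""
    counts = pd.Series(labels).value_counts()
    counts = counts.reindex(range(n_clusters), fill_value=0)
    return counts / counts.sum()

# Toy labels: cluster 0 is overrepresented among "customers".
population = cluster_proportions([0, 0, 1, 1, 2, 2], n_clusters=3)
customers = cluster_proportions([0, 0, 0, 0, 1, 2], n_clusters=3)
difference = customers - population  # > 0 means overrepresented
```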
Results
From the analysis above, we can see that cluster 1 is overrepresented in the customer data; the main characteristics of the mail-order company's customers are as follows:
- Estimated age based on given name analysis is from 30–45 years old (ALTERSKATEGORIE_GROB = 2.4).
- Gender is male (ANREDE_KZ = 1.6)
- A person whose minimalist habits are high (FINANZ_MINIMALIST = 2.9)
Also from the previous analysis, we can see that cluster 8 is underrepresented in the customer data; the main characteristics of the individuals in this cluster are as follows:
- Estimated age based on given name analysis is younger than 30 years old (ALTERSKATEGORIE_GROB = 1.8).
- Gender is male (ANREDE_KZ = 1.9)
Justification
I used the elbow method; however, the analysis showed no clear elbow, so I decided to go with 20 clusters, since the score kept decreasing steadily with no obvious bend.
Reflection
In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used by Arvato to direct marketing campaigns towards audiences that will have the highest expected rate of returns.
Future Improvement
I found this project to be interesting as it contains real data in which it has many areas that can be further explored. Additionally, I found data exploration and wrangling to be quite challenging but reflective of real-life data science applications.
The clustering might become clearer if Arvato provided additional columns and records; there might be hidden patterns that could be explored with new data.