Mastering Customer
Segmentation with LLM
Unlock advanced customer segmentation techniques using LLMs,
and improve your clustering models with advanced techniques
Table of Contents
· Intro
· Data
· Method 1: Kmeans
· Method 2: K-Prototype
· Method 3: LLM + Kmeans
· Conclusion
Intro
A customer segmentation project can be approached in multiple
ways. In this article I will teach you advanced techniques, not only to
define the clusters, but also to analyze the results. This post is intended
for data scientists who want several tools for addressing
clustering problems and who want to be one step closer to being senior
data scientists.
What will we see in this article?
Let’s see 3 methods to approach this type of project:
 Kmeans
 K-Prototype
 LLM + Kmeans
As a small preview, here is a comparison of the 2D
representations (PCA) of the different models created:
Graphic comparison of the three methods (Image by Author).
You will also learn dimensionality reduction techniques such as:
 PCA
 t-SNE
 MCA
Some of the results look like this:
Graphical comparison of the three dimensionality reduction methods (Image by
Author).
A very important clarification is that this is not an end-to-end
project, because we have skipped one of the most important
parts of this type of project: the exploratory data analysis
(EDA) and variable selection phase.
Data
The original data used in this project comes from a public
Kaggle dataset: Banking Dataset — Marketing Targets. Each row in this
dataset contains information about a company’s customers. Some fields
are numerical and others are categorical; we will see that this
expands the possible ways to approach the problem.
We will keep only the first 8 columns. Our dataset looks like
this:
Let’s see a brief description of the columns of our dataset:
 age (numeric)
 job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
 marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
 education (categorical: "unknown", "secondary", "primary", "tertiary")
 default: has credit in default? (binary: "yes", "no")
 balance: average yearly balance, in euros (numeric)
 housing: has housing loan? (binary: "yes", "no")
 loan: has personal loan? (binary: "yes", "no")
For the project, I’ve used the training dataset provided by Kaggle. In the
project repository, you can find the “data” folder, where a
compressed file of the dataset used in the project is stored.
Inside the compressed file you will find two CSV files.
One is the training dataset provided by Kaggle (train.csv), and
the other is the dataset after performing the embedding
(embedding_train.csv), which we will explain later on.
To further clarify how the project is structured, the project tree is
shown:
clustering_llm
├─ data
│ ├─ data.rar
├─ img
├─ embedding.ipynb
├─ embedding_creation.py
├─ kmeans.ipynb
├─ kprototypes.ipynb
├─ README.md
└─ requirements.txt
Method 1: Kmeans
This is the most common method and the one you will surely know.
Anyway, we are going to study it because I will show advanced
analysis techniques in these cases. The Jupyter notebook where you
will find the complete procedure is called kmeans.ipynb
Preprocessing
The following preprocessing of the variables is carried out:
1. Convert the categorical variables into numeric ones.
We apply a OneHotEncoder to the nominal
variables and an OrdinalEncoder to the ordinal feature
(education).
2. Try to ensure that the numerical variables have a roughly
Gaussian distribution. For this we apply a
PowerTransformer (see the sketch after this list).
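Below is a minimal sketch of how this preprocessing could be wired together. It assumes df contains the 8 columns described above; the column groupings, the ordering of the education levels, and the use of pd.get_dummies (in place of sklearn's OneHotEncoder) are assumptions for brevity, not the article's exact code.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, PowerTransformer

# Hypothetical column groupings based on the dataset description above
nominal_cols = ["job", "marital", "default", "housing", "loan"]
education_order = [["unknown", "primary", "secondary", "tertiary"]]  # assumed ordering

# One-hot encode the nominal columns
data_preprocessed = pd.get_dummies(df, columns=nominal_cols, dtype=int)
# Ordinal-encode education
data_preprocessed["education"] = OrdinalEncoder(categories=education_order).fit_transform(df[["education"]]).ravel()
# Push the numeric columns towards a Gaussian shape
data_preprocessed[["age", "balance"]] = PowerTransformer().fit_transform(df[["age", "balance"]])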
Outliers
It is crucial that there are as few outliers as possible in our data, since
Kmeans is very sensitive to them. We could apply the typical approach of
detecting outliers with the z-score, but in this post I will show you a much
more advanced and cool method.
Well, what is this method? We will use the Python Outlier
Detection (PyOD) library. This library is focused on detecting
outliers in different scenarios. To be more specific, we will use
the ECOD method (“empirical cumulative distribution
functions for outlier detection”).
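A minimal sketch of the ECOD step, assuming data_preprocessed is the numeric dataframe produced by the preprocessing sketch above (the variable names are assumptions):

from pyod.models.ecod import ECOD

clf = ECOD()
clf.fit(data_preprocessed)
# predict() returns 1 for outliers and 0 for inliers
outlier_labels = clf.predict(data_preprocessed)

data_no_outliers = data_preprocessed[outlier_labels == 0]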
Modeling
One of the disadvantages of the Kmeans algorithm is that you
must choose the number of clusters you want to use. In this case, in
order to obtain that number, we will use the Elbow Method. It consists of
calculating the distortion between the points of a cluster
and its centroid. The objective is clear: to obtain the least possible
distortion. We use the following code:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2,10))

visualizer.fit(data_no_outliers)  # Fit the data to the visualizer
visualizer.show()
Output:
Elbow score for different numbers of clusters (Image by Author).
We see that from k=5 the distortion does not vary drastically. Ideally,
the behavior from k=5 onwards would be almost flat. This rarely
happens, so other methods can be applied to be more confident about
the optimal number of clusters. To be sure, we can perform a
Silhouette analysis, sketched below.
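The original code for this step is not reproduced in this text; the following is a hedged sketch of how the silhouette score per k and the per-cluster silhouette plot can be produced with yellowbrick:

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# Silhouette score for a range of candidate k values
km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2, 10), metric="silhouette")
visualizer.fit(data_no_outliers)
visualizer.show()

# Detailed silhouette plot for a single candidate, e.g. k=5
km5 = KMeans(n_clusters=5, init="k-means++", random_state=0, n_init="auto")
sil_viz = SilhouetteVisualizer(km5, colors="yellowbrick")
sil_viz.fit(data_no_outliers)
sil_viz.show()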
It can be seen that the highest silhouette score is obtained with
n_clusters=9, but the variation in the score is quite small
compared with the other values, so on its own this result does
not give us much information. On the other hand, the
per-cluster silhouette visualization gives us more
information:
Graphic representation of the silhouette method for different numbers of
clusters (Image by Author).
Since understanding these representations in depth is not the goal of this
post, I will conclude that there is no very clear decision as
to which number is best. After viewing the previous representations,
we can choose K=5 or K=6. This is because for these
clusters the silhouette score is above the average value and there
is no imbalance in cluster size. Furthermore, in some situations, the
marketing department may be interested in having the smallest
number of clusters/types of customers (this may or may not be the
case).
Finally we can create our Kmeans model with K=5.
km = KMeans(n_clusters=5,
            init='k-means++',
            n_init=10,
            max_iter=100,
            random_state=42)

clusters_predict = km.fit_predict(data_no_outliers)

"""
clusters_predict -> array([4, 2, 0, ..., 3, 4, 3])
np.unique(clusters_predict) -> array([0, 1, 2, 3, 4])
"""
Evaluation
The way of evaluating Kmeans models is somewhat more open than
for other models. We can use
 metrics
 visualizations
 interpretation (something very important for companies).
As the metrics show, we do not have an excessively good
model. The Davies-Bouldin score tells us that the separation between
clusters is quite small.
This may be due to several factors, but keep in mind that the fuel
of a model is its data; if the data does not have sufficient predictive
power, you cannot expect to achieve exceptional results.
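The metric values referenced above can be computed with scikit-learn; a minimal sketch, using the variable names from the earlier code:

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

print(f"Davies-Bouldin score: {davies_bouldin_score(data_no_outliers, clusters_predict)}")
print(f"Calinski-Harabasz score: {calinski_harabasz_score(data_no_outliers, clusters_predict)}")
print(f"Silhouette score: {silhouette_score(data_no_outliers, clusters_predict)}")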
For visualizations, we can use a dimensionality reduction
method such as PCA. For this we are going to use
the Prince library, focused on exploratory analysis and
dimensionality reduction. If you prefer, you can use Sklearn’s PCA;
they are practically identical.
First we calculate the principal components in 3D, and then we
make the representation. These are the two functions that
perform those steps (sketched below):
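The exact helper functions are not shown in this text; the following is a hedged sketch of what they could look like, mirroring the MCA helpers shown later. plot_pca_3d is the name used elsewhere in the article, while get_pca_3d and the Plotly-based plotting are assumptions, and data_no_outliers is assumed to be a pandas DataFrame.

import pandas as pd
import plotly.express as px
from prince import PCA

def get_pca_3d(df, predict):
    # Reduce to 3 principal components with Prince and attach cluster labels
    pca = PCA(n_components=3, n_iter=100, random_state=101)
    pca_3d_df = pd.DataFrame(pca.fit_transform(df))
    pca_3d_df.columns = ["comp1", "comp2", "comp3"]
    pca_3d_df["cluster"] = predict
    return pca, pca_3d_df

def plot_pca_3d(df, title="PCA Space", opacity=1, width_line=0.1):
    # 3D scatter colored by cluster
    fig = px.scatter_3d(
        df, x="comp1", y="comp2", z="comp3",
        color=df["cluster"].astype(str),
        title=title, opacity=opacity,
    )
    fig.update_traces(marker={"size": 4, "line": {"width": width_line}})
    fig.show()

pca_3d, df_pca_3d = get_pca_3d(data_no_outliers, clusters_predict)
plot_pca_3d(df_pca_3d, title="PCA Space", opacity=1, width_line=0.1)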
Output:
PCA space and the clusters created by the model (Image by Author).
It can be seen that the clusters have almost no separation between
them and there is no clear division. This is in accordance with the
information provided by the metrics.
Something to keep in mind, and that very few people
do keep in mind, is the variability captured by the
PCA components.
Each component explains a certain share of the information
contained in the data. If the accumulated variability of the 3
main components adds up to around 80%, we can say
that it is acceptable and that the representations are reliable. If
the value is lower, we have to take the visualizations with a grain of
salt, since we are missing a lot of information that is contained in
the remaining components.
The next question is obvious: what is the variability
of the PCA we just ran?
The answer is the following:
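With Prince, this value can be read directly from the fitted model returned above; a short sketch (the exact attribute name may vary between Prince versions):

# Percentage of variance per component and the cumulative percentage
print(pca_3d.eigenvalues_summary)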
As can be seen, we have 27.98% variability with the first 3
components, something insufficient to draw informed conclusions.
When we apply the PCA method, since it is a linear algorithm, it is
not capable of capturing more complex relationships. Luckily there
is a method called t-SNE, which is capable of capturing these
complex non-linear relationships. This can help us visualize the clusters,
since with the previous method we have not had much success.
If you try it on your computer, keep in mind that it has a higher
computational cost. For this reason, I sampled my original dataset,
and it still took me about 5 minutes to get the result. The code is as
follows:
from sklearn.manifold import TSNE

# Sample half of the data to keep the runtime manageable
sampling_data = data_no_outliers.sample(frac=0.5, replace=True, random_state=1)
sampling_clusters = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values

df_tsne_3d = TSNE(
    n_components=3,
    learning_rate=500,
    init='random',
    perplexity=200,
    n_iter=5000).fit_transform(sampling_data)

df_tsne_3d = pd.DataFrame(df_tsne_3d, columns=["comp1", "comp2", "comp3"])
df_tsne_3d["cluster"] = sampling_clusters

plot_pca_3d(df_tsne_3d, title="t-SNE Space", opacity=1, width_line=0.1)
As a result, I got the following image. It shows a clearer separation
between clusters but unfortunately, we still don’t have good results.
t-SNE space and the clusters created by the model (Image by Author).
In fact, we can compare the reduction carried out by the PCA and
by the t-SNE, in 2 dimensions. The improvement is clear using
the second method.
Different results for different dimensionality reduction methods and clusters
defined by the model (Image by Author).
Finally, let’s explore a little how the model works, in which features
are the most important and what are the main characteristics of the
clusters.
To see the importance of each of the variables we will use a typical
“trick” in this type of situation. We are going to create a
classification model where “X” contains the inputs of the Kmeans
model and “y” the clusters predicted by it.
The chosen model is an LGBMClassifier. This model is quite
powerful and works well with both categorical and numerical variables.
Having the new model trained, using the SHAP library, we can
obtain the importance of each of the features in the prediction. The
code is:
import lightgbm as lgb
import shap

# We create the LGBMClassifier model and train it
clf_km = lgb.LGBMClassifier(colsample_bytree=0.8)
clf_km.fit(X=data_no_outliers, y=clusters_predict)

# SHAP values
explainer_km = shap.TreeExplainer(clf_km)
shap_values_km = explainer_km.shap_values(data_no_outliers)
shap.summary_plot(shap_values_km, data_no_outliers, plot_type="bar", plot_size=(15, 10))
Output:
The importance of the variables in the model (Image by Author).
It can be seen that the age feature has the greatest predictive power. It
can also be seen that cluster number 3 (green) is mainly
differentiated by the balance variable.
Finally, we must analyze the characteristics of the clusters. This part
of the study is decisive for the business. For this, we
obtain the mean (for the numerical variables) and the most
frequent value (for the categorical variables) of each feature of the
dataset for each of the clusters:
df_no_outliers = df[df.outliers == 0]
df_no_outliers["cluster"] = clusters_predict

df_no_outliers.groupby('cluster').agg(
    {
        'job': lambda x: x.value_counts().index[0],
        'marital': lambda x: x.value_counts().index[0],
        'education': lambda x: x.value_counts().index[0],
        'housing': lambda x: x.value_counts().index[0],
        'loan': lambda x: x.value_counts().index[0],
        'contact': lambda x: x.value_counts().index[0],
        'age': 'mean',
        'balance': 'mean',
        'default': lambda x: x.value_counts().index[0],
    }
).reset_index()
Output:
We see that the clusters with job=blue-collar do not show much
differentiation in their characteristics except for the age
feature. This is not desirable, since it is difficult to
differentiate the clients of each of these clusters. In
the job=management case, we obtain better differentiation.
After carrying out the analysis in different ways, the results converge on
the same conclusion: “We need to improve the results”.
Method 2: K-Prototype
If we look back at our original dataset, we see that we have categorical
and numerical variables. Unfortunately, the Kmeans algorithm
provided by Sklearn does not accept categorical variables, which forces
the original dataset to be modified and drastically altered.
Luckily, you’ve come across me and my post. But above all, thanks
to Zhexue Huang and his article Extensions to the k-
Means Algorithm for Clustering Large Data Sets with
Categorical Values, there is an algorithm that accepts categorical
variables for clustering. This algorithm is called K-Prototype, and the
library that provides it is kmodes.
The procedure is the same as in the previous case. In order not to
make this article eternal, let’s go to the most interesting parts. But
remember that you can access the Jupyter notebook here.
Preprocessing
Because we have numerical variables, we must make certain
modifications to them. It is always recommended that all numerical
variables be on similar scales, with distributions as close as
possible to Gaussian. The dataset that we will use to create the
models is created as follows:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

pipe = Pipeline([('scaler', PowerTransformer())])
df_aux = pd.DataFrame(pipe.fit_transform(df_no_outliers[["age", "balance"]]), columns=["age", "balance"])

df_no_outliers_norm = df_no_outliers.copy()

# Replace age and balance columns by preprocessed values
df_no_outliers_norm = df_no_outliers_norm.drop(["age", "balance"], axis=1)
df_no_outliers_norm["age"] = df_aux["age"].values
df_no_outliers_norm["balance"] = df_aux["balance"].values
df_no_outliers_norm
Outliers
Because the method I have presented for outlier
detection (ECOD) only accepts numerical variables, we must perform
the same transformation as for the Kmeans method. We then
apply the outlier detection model, which tells us which
rows to eliminate, leaving the dataset that we will use as input
for the K-Prototype model (a sketch follows):
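A hedged sketch of this step, assuming data_preprocessed is the numerically encoded version of the same rows (as in Method 1) and df is the original mixed-type dataframe; the variable names are assumptions:

from pyod.models.ecod import ECOD

clf = ECOD()
clf.fit(data_preprocessed)
# Keep only the rows that ECOD labels as inliers (0)
inlier_mask = clf.predict(data_preprocessed) == 0

df_no_outliers = df[inlier_mask].reset_index(drop=True)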
Modeling
To create the model, we first need to obtain the
optimal k. For that we use the Elbow Method and this piece of
code:
# Choose optimal K using Elbow method
from kmodes.kprototypes import KPrototypes
from plotnine import *
import plotnine
cost = []
range_ = range(2, 15)
for cluster in range_:
    kprototype = KPrototypes(n_jobs=-1, n_clusters=cluster, init='Huang', random_state=0)
    kprototype.fit_predict(df_no_outliers, categorical=categorical_columns_index)
    cost.append(kprototype.cost_)
    print('Cluster initiation: {}'.format(cluster))
# Converting the results into a dataframe and plotting them
df_cost = pd.DataFrame({'Cluster':range_, 'Cost':cost})
# Data viz
plotnine.options.figure_size = (8, 4.8)
(
    ggplot(data=df_cost) +
    geom_line(aes(x='Cluster', y='Cost')) +
    geom_point(aes(x='Cluster', y='Cost')) +
    geom_label(aes(x='Cluster', y='Cost', label='Cluster'),
               size=10, nudge_y=1000) +
    labs(title='Optimal number of cluster with Elbow Method') +
    xlab('Number of Clusters k') +
    ylab('Cost') +
    theme_minimal()
)
Output:
Elbow score for different numbers of clusters (Image by Author).
We can see that the best option is K=5.
Be careful, since this algorithm takes considerably longer than those
normally used. For the previous graph, 86 minutes were needed,
something to keep in mind.
Well, we are now clear about the number of clusters, we just have to
create the model:
# We get the index of categorical columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_columns = df_no_outliers_norm.select_dtypes(exclude=numerics).columns
print(categorical_columns)
categorical_columns_index = [df_no_outliers_norm.columns.get_loc(col) for col in categorical_columns]

# Create the model
cluster_num = 5
kprototype = KPrototypes(n_jobs=-1, n_clusters=cluster_num, init='Huang', random_state=0)
kprototype.fit(df_no_outliers_norm, categorical=categorical_columns_index)
clusters = kprototype.predict(df_no_outliers_norm, categorical=categorical_columns_index)

print(clusters)  # -> array([3, 1, 1, ..., 1, 1, 2], dtype=uint16)
We already have our model and its predictions; we just need to
evaluate it.
Evaluation
As we have seen before, we can apply several visualizations to get
an intuitive idea of how good our model is. Unfortunately, the PCA
method and t-SNE do not accept categorical variables. But don’t
worry: the Prince library contains the MCA (Multiple
Correspondence Analysis) method, which does accept a mixed
dataset. In fact, I encourage you to visit the GitHub of this library; it
has several super useful methods for different situations, see the
following image:
The different methods of dimensionality reduction by type of case (Image by
Author and Prince Documentation).
Well, the plan is to apply an MCA to reduce the dimensionality and be
able to make graphical representations. For this we use the following
code:
from prince import MCA

def get_MCA_3d(df, predict):
    mca = MCA(n_components=3, n_iter=100, random_state=101)
    mca_3d_df = mca.fit_transform(df)
    mca_3d_df.columns = ["comp1", "comp2", "comp3"]
    mca_3d_df["cluster"] = predict
    return mca, mca_3d_df

def get_MCA_2d(df, predict):
    mca = MCA(n_components=2, n_iter=100, random_state=101)
    mca_2d_df = mca.fit_transform(df)
    mca_2d_df.columns = ["comp1", "comp2"]
    mca_2d_df["cluster"] = predict
    return mca, mca_2d_df

"-------------------------------------------------------------------"

mca_3d, mca_3d_df = get_MCA_3d(df_no_outliers_norm, clusters)
Remember that if you want to follow each step 100%, you
can take a look at the Jupyter notebook.
The dataframe named mca_3d_df contains this information:
Let’s make a plot using the reduction provided by the MCA method:
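The plotting helper itself is not shown in this text; a minimal Plotly sketch of the 3D MCA plot could look like this (the styling choices are assumptions):

import plotly.express as px

fig = px.scatter_3d(
    mca_3d_df, x="comp1", y="comp2", z="comp3",
    color=mca_3d_df["cluster"].astype(str),
    title="MCA Space", opacity=1,
)
fig.update_traces(marker={"size": 4, "line": {"width": 0.1}})
fig.show()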
MCA space and the clusters created by the model (Image by Author)
Wow, it doesn’t look very good… It is not possible to differentiate the
clusters from each other. We can say then that the model is not good
enough, right?
I hope you said something like:
“Hey Damian, don’t go so fast!! Have you looked at the
variability of the 3 components provided by the MCA?”
Indeed, we must see if the variability of the first 3 components is
sufficient to be able to draw conclusions. The MCA method allows us
to obtain these values in a very simple way:
mca_3d.eigenvalues_summary
Aha, here we have something interesting. With our data we obtain
basically zero variability.
In other words, we cannot draw clear conclusions about
our model from the dimensionality reduction
provided by MCA.
By showing these results I try to give an example of what happens in
real data projects. Good results are not always obtained, but a good
data scientist knows how to recognize the causes.
We have one last option to visually determine whether the model created
by the K-Prototype method is suitable or not. This path is simple:
1. Apply PCA to the dataset that was preprocessed to
transform the categorical variables into numerical
ones.
2. Obtain the PCA components.
3. Make a representation using the PCA components as
the axes, coloring the points by the K-Prototype
predictions.
Note that the components provided by the PCA will be the same as
for Method 1: Kmeans, since it is the same dataframe.
Let’s see what we get (sketched below)…
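A hedged sketch of these steps, reusing the numerically encoded data and the PCA helpers sketched in Method 1 (those helper names are assumptions), and assuming the same rows were kept in both methods so that the K-Prototype labels align:

# Same PCA components as in Method 1, colored by the K-Prototype clusters
pca_3d_kproto, df_pca_3d_kproto = get_pca_3d(data_no_outliers, clusters)
plot_pca_3d(df_pca_3d_kproto, title="PCA Space", opacity=1, width_line=0.1)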
PCA space and the clusters created by the model (Image by Author).
It doesn’t look bad; in fact, it bears a certain resemblance to what was
obtained with Kmeans.
Finally we obtain the average value of the clusters and the
importance of each of the variables:
The importance of the variables in the model. The table represents the most
frequent value of each of the clusters (Image by Author).
The variables with the greatest weight are the numerical ones,
and it can be seen that the combination of these two features is almost
sufficient to differentiate each cluster.
In short, it can be said that the results are similar to those obtained
with Kmeans.
Method 3: LLM + Kmeans
This combination can be quite powerful and improve the results
obtained. Let’s get to the point!
LLMs cannot understand written text directly; we need to
transform the input for this type of model. For
this, Sentence Embedding is carried out. It consists of
transforming the text into numerical vectors. The following image
can clarify the idea:
Concept of embedding and similarity (Image by Author).
This encoding is done intelligently; that is, phrases that have a
similar meaning will have more similar vectors. See the following
image:
Concept of embedding and similarity (Image by Author).
Sentence embedding is carried out by so-called transformers,
algorithms specialized in this encoding. Typically you can choose the
size of the numerical vector produced by the encoding. And
here is one of the key points:
Thanks to the large dimension of the vector created
by embedding, small variations in the data can be
seen with greater precision.
Therefore, if we provide our Kmeans model with this
information-rich input, it will return better predictions. This is the
idea we are pursuing, and these are the steps:
1. Transform our original dataset through Sentence
embedding
2. Create a Kmeans model
3. Evaluate it
Well, the first step is to encode the information through sentence
embedding. The idea is to take the information of each
client and unify it into a text that contains all of their characteristics. This
part takes a lot of computing time. That’s why I created a script that
does this job, called embedding_creation.py. This script collects the
values contained in the training dataset and creates a new dataset
built from the embedding. This is the script code:
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra
from sentence_transformers import SentenceTransformer
df = pd.read_csv("data/train.csv", sep = ";")
# -------------------- First Step --------------------
def compile_text(x):
    text = f"""Age: {x['age']},
housing load: {x['housing']},
Job: {x['job']},
Marital: {x['marital']},
Education: {x['education']},
Default: {x['default']},
Balance: {x['balance']},
Personal loan: {x['loan']},
contact: {x['contact']}
"""
    return text
sentences = df.apply(lambda x: compile_text(x), axis=1).tolist()
# -------------------- Second Step --------------------
model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
output = model.encode(sentences=sentences,
show_progress_bar=True,
normalize_embeddings=True)
df_embedding = pd.DataFrame(output)
df_embedding
Since it is quite important that this step is understood, let’s go point by
point:
 Step 1: A text is created for each row, containing
the complete customer/row information. We also store the texts
in a Python list for later use. See the following image, which
illustrates this.
Graphic description of the first step (Image by Author).
 Step 2: This is when the call to the transformer is made.
For this we are going to use a model stored
on HuggingFace. This model is specifically trained to
perform embeddings at the sentence level, unlike BERT,
which is focused on encoding at the level of
tokens and words. To call the model you only have to
give the repository address, which in this case
is “sentence-transformers/paraphrase-MiniLM-
L6-v2”. The numerical vector that is returned for
each text is normalized, since the Kmeans model is
sensitive to the scale of its inputs. The vectors created
have a length of 384, so we create a
dataframe with the same number of columns. See the
following image:
Graphic description of the second step (Image by Author).
Finally, we obtain the dataframe resulting from the embedding, which will
be the input of our Kmeans model.
This step has been one of the most interesting and important, since
we have created the input for the Kmeans model.
The creation and evaluation procedure is similar to the one shown
above. In order not to make the post excessively long, only the
results of each point will be shown. Don’t worry, all the code is
contained in the Jupyter notebook called embedding.ipynb, so you
can reproduce the results for yourself.
In addition, the dataset resulting from applying the sentence
embedding has been saved in a CSV file. This CSV file is
called embedding_train.csv. In the Jupyter notebook you will
see that we access that dataset and create our model based on it.
# Normal Dataset
df = pd.read_csv("data/train.csv", sep = ";")
df = df.iloc[:, 0:8]
# Embedding Dataset
df_embedding = pd.read_csv("data/embedding_train.csv", sep = ",")
Preprocessing
We can consider the embedding itself to be the preprocessing step.
Outliers
We apply the method already presented to detect outliers, ECOD.
We create a dataset that does not contain these types of points.
df_embedding_no_out.shape -> (40690, 384)
df_embedding_with_out.shape -> (45211, 384)
Modeling
First we must find out what the optimal number of clusters is. For
this we use Elbow Method.
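A minimal sketch of this step, mirroring the Elbow code used in Method 1 but applied to the embedding dataset:

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2, 10))
visualizer.fit(df_embedding_no_out)
visualizer.show()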
Elbow score for different numbers of clusters (Image by Author).
After viewing the graph, we choose k=5 as our number of clusters.
n_clusters = 5
clusters = KMeans(n_clusters=n_clusters, init = "k-means++").fit(df_embedding_no_out)
print(clusters.inertia_)
clusters_predict = clusters.predict(df_embedding_no_out)
Evaluation
Having created our Kmeans model with k=5, we can
obtain some metrics like these:
Davies bouldin score: 1.8095386826791042
Calinski Score: 6419.447089002081
Silhouette Score: 0.20360442824114108
We see that the values are really similar to those obtained in the
previous case. Let’s study the representations obtained with a PCA
analysis:
PCA space and the clusters created by the model (Image by Author).
It can be seen that the clusters are much better differentiated than
with the traditional method. This is good news. Let us remember
that it is important to take into account the variability contained in
the first 3 components of our PCA analysis. From experience, I can
say that when it is around 50% (3D PCA) more or less clear
conclusions can be drawn.
PCA space and the clusters created by the model. The variability of the first 3
components of the PCA is also shown (Image by Author).
We see that the cumulative variability of the first 3 components is
40.44%, which is acceptable but not ideal.
One way to visually see how compact the clusters are is by
modifying the opacity of the points in the 3D representation. This
means that when the points agglomerate in a certain region of space, a
dark spot can be observed. In order to show what I mean,
I include the following gif:
plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=0.2, width_line = 0.1)
PCA space and the clusters created by the model (Image by Author).
As can be seen, there are several regions in space where points of
the same cluster agglomerate. This indicates that they are well
differentiated from the other points and that the model can
recognize them quite well.
Even so, it can be seen that various clusters cannot be differentiated
well (e.g. clusters 1 and 3). For this reason, we carry out a t-SNE
analysis which, remember, is a method that reduces
dimensionality while capturing complex non-linear
relationships.
t-SNE space and the clusters created by the model (Image by Author).
A noticeable improvement is seen. The clusters do not overlap each
other and there is a clear differentiation between points. The
improvement obtained using the second dimensionality reduction
method is notable. Let’s see a 2D comparison:
Different results for different dimensionality reduction methods and clusters
defined by the model (Image by Author).
Again, it can be seen that the clusters in the t-SNE are more
separated and better differentiated than with the PCA. Furthermore,
the difference between the two methods in terms of quality is
smaller than when using the traditional Kmeans method.
To understand which variables our Kmeans model relies on, we do
the same move as before: we create a classification model
(LGBMClassifier) and analyze the importance of the features.
The importance of the variables in the model (Image by Author).
We see that this model relies above all on the “marital” and
“job” variables. On the other hand, we see that there are variables
that do not provide much information. In a real case, a new version
of the model should be created without these low-information
variables.
The Kmeans + Embedding model is more optimal since it
needs fewer variables to be able to give good predictions.
Good news!
We finish with the part that is most revealing and important.
Managers and the business are not interested in
PCA, t-SNE or embedding. What they want is to be
able to know what the main traits are, in this case, of
their clients.
To do this, we create a table with information about the
predominant profiles found in each of the clusters (a sketch of
how this table is built follows):
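A hedged sketch of how the profile table can be built: the cluster labels from the embedding-based Kmeans are attached back to the original dataframe and aggregated, exactly as in Method 1. Here emb_inlier_mask is a hypothetical boolean mask of the rows kept after ECOD on the embeddings, and the column list is an assumption.

# Attach the embedding-based cluster labels back to the original rows
df_no_out = df[emb_inlier_mask].copy()
df_no_out["cluster"] = clusters_predict

profile_table = df_no_out.groupby("cluster").agg(
    {
        "job": lambda x: x.value_counts().index[0],
        "marital": lambda x: x.value_counts().index[0],
        "education": lambda x: x.value_counts().index[0],
        "age": "mean",
        "balance": "mean",
    }
).reset_index()
print(profile_table)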
Something very curious happens: there are 3 clusters where the most
frequent job is “management”. In them we find
a very peculiar pattern: single managers are younger,
married managers are older, and divorced managers are the
oldest. On the other hand, the balance behaves differently: single
people have a higher average balance than divorced people, and
married people have the highest average balance. What was said can be
summarized in the following image:
Different customer profiles defined by the model (Image by Author).
This revelation is in line with reality and social aspects. It also
reveals very specific customer profiles. This is the magic of data
science.
Conclusion
The conclusion is clear:
(Image by Author)
You need to have different tools because, in a real project, not all
strategies work and you must have resources with which to add value. It
is clear that the model created with the help of the LLM stands
out.
More Related Content

PDF
Higgs Boson Challenge
PPT
notes as .ppt
PPTX
Overfitting & Underfitting
PPTX
07 learning
PDF
13_Data Preprocessing in Python.pptx (1).pdf
PDF
House Price Estimation as a Function Fitting Problem with using ANN Approach
PDF
3 Data scientist associate - Case GoalZone - Fitness class attendance study.pdf
PPTX
Telecom Churn Analysis
Higgs Boson Challenge
notes as .ppt
Overfitting & Underfitting
07 learning
13_Data Preprocessing in Python.pptx (1).pdf
House Price Estimation as a Function Fitting Problem with using ANN Approach
3 Data scientist associate - Case GoalZone - Fitness class attendance study.pdf
Telecom Churn Analysis

Similar to Mastering Customer Segmentation with LLM.pdf (20)

PDF
Predictive modeling
PPTX
Machine Learning basics
PDF
Explore ml day 2
PPTX
ML-Lec-18-NEW Dimensionality Reduction-PCA (1).pptx
PDF
ML.pdf
PPTX
Machine Learning.pptx
PPTX
Qt unit i
PDF
Bank loan purchase modeling
PDF
ML-Unit-4.pdf
PDF
IRJET- Machine Learning: Survey, Types and Challenges
PPTX
Regularization_BY_MOHAMED_ESSAM.pptx
DOCX
Essentials of machine learning algorithms
PDF
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
PPTX
Challenges-and-Consideration-in-Programming-Logic-and-Design...pptx
PDF
A tour of the top 10 algorithms for machine learning newbies
PPTX
End-to-End Machine Learning Project
PDF
The Validity of CNN to Time-Series Forecasting Problem
PDF
Logistic Regression Classifier - Conceptual Guide
PDF
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
PDF
Machine Learning Guide maXbox Starter62
Predictive modeling
Machine Learning basics
Explore ml day 2
ML-Lec-18-NEW Dimensionality Reduction-PCA (1).pptx
ML.pdf
Machine Learning.pptx
Qt unit i
Bank loan purchase modeling
ML-Unit-4.pdf
IRJET- Machine Learning: Survey, Types and Challenges
Regularization_BY_MOHAMED_ESSAM.pptx
Essentials of machine learning algorithms
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
Challenges-and-Consideration-in-Programming-Logic-and-Design...pptx
A tour of the top 10 algorithms for machine learning newbies
End-to-End Machine Learning Project
The Validity of CNN to Time-Series Forecasting Problem
Logistic Regression Classifier - Conceptual Guide
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
Machine Learning Guide maXbox Starter62
Ad

Recently uploaded (20)

PDF
20250805_A. Stotz All Weather Strategy - Performance review July 2025.pdf
PDF
Stem Cell Market Report | Trends, Growth & Forecast 2025-2034
PDF
COST SHEET- Tender and Quotation unit 2.pdf
PPTX
HR Introduction Slide (1).pptx on hr intro
PDF
Roadmap Map-digital Banking feature MB,IB,AB
PPTX
5 Stages of group development guide.pptx
PPT
Data mining for business intelligence ch04 sharda
PPTX
Amazon (Business Studies) management studies
PPTX
Belch_12e_PPT_Ch18_Accessible_university.pptx
PDF
A Brief Introduction About Julia Allison
DOCX
Business Management - unit 1 and 2
PDF
BsN 7th Sem Course GridNNNNNNNN CCN.pdf
PDF
Unit 1 Cost Accounting - Cost sheet
PDF
Nidhal Samdaie CV - International Business Consultant
PDF
Elevate Cleaning Efficiency Using Tallfly Hair Remover Roller Factory Expertise
PDF
Reconciliation AND MEMORANDUM RECONCILATION
PDF
Deliverable file - Regulatory guideline analysis.pdf
PDF
Business model innovation report 2022.pdf
PPTX
AI-assistance in Knowledge Collection and Curation supporting Safe and Sustai...
DOCX
unit 1 COST ACCOUNTING AND COST SHEET
20250805_A. Stotz All Weather Strategy - Performance review July 2025.pdf
Stem Cell Market Report | Trends, Growth & Forecast 2025-2034
COST SHEET- Tender and Quotation unit 2.pdf
HR Introduction Slide (1).pptx on hr intro
Roadmap Map-digital Banking feature MB,IB,AB
5 Stages of group development guide.pptx
Data mining for business intelligence ch04 sharda
Amazon (Business Studies) management studies
Belch_12e_PPT_Ch18_Accessible_university.pptx
A Brief Introduction About Julia Allison
Business Management - unit 1 and 2
BsN 7th Sem Course GridNNNNNNNN CCN.pdf
Unit 1 Cost Accounting - Cost sheet
Nidhal Samdaie CV - International Business Consultant
Elevate Cleaning Efficiency Using Tallfly Hair Remover Roller Factory Expertise
Reconciliation AND MEMORANDUM RECONCILATION
Deliverable file - Regulatory guideline analysis.pdf
Business model innovation report 2022.pdf
AI-assistance in Knowledge Collection and Curation supporting Safe and Sustai...
unit 1 COST ACCOUNTING AND COST SHEET
Ad

Mastering Customer Segmentation with LLM.pdf

  • 1. Mastering Customer Segmentation with LLM Unlock advanced customer segmentation techniques using LLMs, and improve your clustering models with advanced techniques ContentTable · Intro · Data · Method 1: Kmeans · Method 2: K-Prototype · Method 3: LLM + Kmeans · Conclusion Intro A customer segmentation project can be approached in multiple ways. In this article I will teach you advanced techniques, not only to define the clusters, but to analyze the result. This post is intended for those data scientists who want to have several tools to address clustering problems and be one step closer to being seniors DS. What will we see in this article? Let’s see 3 methods to approach this type of project:  Kmeans
  • 2.  K-Prototype  LLM + Kmeans As a small preview I will show the following comparison of the 2D representation (PCA) of the different models created: Graphic comparison of the three methods (Image by Author). You will also learn dimensionality reduction techniques such as:  PCA
  • 3.  t-SNE  MCA Some of the results being these: Graphical comparison of the three dimensionality reduction methods (Image by Author).
  • 4. A very important clarification is that this is not an end-to-end project. This is because we have skipped one of the most important parts in this type of project: The exploratory data analysis (EDA) phase or the selection of variables. Data The original data used in this project is from a public Kaggle: Banking Dataset — Marketing Targets. Each row in this data set contains information about a company’s customers. Some fields are numerical and others are categorical, we will see that this expands the possible ways to approach the problem. We will only be left with the first 8 columns. Our dataset looks like this: Let’s see a brief description of the columns of our dataset:
  • 5.  age (numeric)  job : type of job (categorical: “admin.” ,”unknown”,”unemployed”, ”management”, ”housemaid”, ”entrepreneur”, ”student”, “blue-collar”, ”self-employed”, ”retired”, ”technician”, ”services”)  marital : marital status (categorical: “married”,”divorced”,”single”; note: “divorced” means divorced or widowed)  education (categorical: “unknown”,”secondary”,”primary”,”tertiary”)  default: has credit in default? (binary: “yes”,”no”)  balance: average yearly balance, in euros (numeric)  housing: has housing loan? (binary: “yes”,”no”)  loan: has personal loan? (binary: “yes”,”no”) For the project, I’ve utilized the training dataset by Kaggle. In the project repository, you can locate the “data” folder where a compressed file of the dataset used in the project is stored. Additionally, you will find two CSV files inside of the compressed file. One is the training dataset provided by Kaggle (train.csv), and the other is the dataset after performing an embedding (embedding_train.csv), which we will explain further later on.
  • 6. To further clarify how the project is structured, the project tree is shown: clustering_llm ├─ data │ ├─ data.rar ├─ img ├─ embedding.ipynb ├─ embedding_creation.py ├─ kmeans.ipynb ├─ kprototypes.ipynb ├─ README.md └─ requirements.txt Method 1: Kmeans This is the most common method and the one you will surely know. Anyway, we are going to study it because I will show advanced analysis techniques in these cases. The Jupyter notebook where you will find the complete procedure is called kmeans.ipynb Preprocessed A preprocessing of the variables is carried out: 1. It consists of converting categorical variables into numeric ones. We are going to apply Onehot encoder for the nominal variables and a OrdinalEncoder for the ordinals features (education). 2. We try to ensure that the numerical variables have a Gaussian distribution. For them we will apply a PowerTransformer.
  • 7. Outliers It is crucial that there are as few outliers in our data as Kmeans is very sensitive to this. We can apply the typical method of choosing outliers using the z score, but in this post I will show you a much more advanced and cool method. Well, what is this method? Well, we will use the Python Outlier Detection (PyOD) library. This library is focused on detecting outliers for different cases. To be more specific we will use the ECOD method (“empirical cumulative distribution functions for outlier detection”). Modeling One of the disadvantages of using the Kmeans algorithm is that you must choose the number of clusters you want to use. In this case, in order to obtain that data, we will use Elbow Method. It consists of calculating the distortion that exists between the points of a cluster and its centroid. The objective is clear, to obtain the least possible distortion. In this case we use the following code: from yellowbrick.cluster import KElbowVisualizer # Instantiate the clustering model and visualizer km = KMeans(init="k-means++", random_state=0, n_init="auto") visualizer = KElbowVisualizer(km, k=(2,10)) visualizer.fit(data_no_outliers) # Fit the data to the visualizer visualizer.show() Output:
  • 8. Elbow score for different numbers of clusters (Image by Author). We see that from k=5, the distortion does not vary drastically. It is true that the ideal is that the behavior starting from k= 5 would be almost flat. This rarely happens and other methods can be applied to be sure of the most optimal number of clusters. To be sure, we could perform a Silhoutte visualization. It can be seen that the highest silhouette score is obtained with n_cluster=9, but it is also true that the variation in the score is quite small if we compare it with the other scores. At the moment the previous result does not provide us with much information. On the other hand, the previous code creates the Silhouette visualization, which gives us more information:
  • 10. Graphic representation of the silhouette method for different numbers of clusters (Image by Author). Since understanding these representations well is not the goal of this post, I will conclude that there seems to be no very clear decision as to which number is best. After viewing the previous representations, we can choose K=5 or K= 6. This is because for the different clusters, their Silhouette score is above the average value and there is no imbalance in cluster size. Furthermore, in some situations, the marketing department may be interested in having the smallest number of clusters/types of customers (This may or may not be the case). Finally we can create our Kmeans model with K=5. km = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=100, random_state=42) clusters_predict = km.fit_predict(data_no_outliers) """ clusters_predict -> array([4, 2, 0, ..., 3, 4, 3]) np.unique(clusters_predict) -> array([0, 1, 2, 3, 4]) """ Evaluation The way of evaluating kmeans models is somewhat more open than for other models. We can use  metrics
  • 11.  visualizations  interpretation (Something very important for companies). As far as can be shown, we do not have an excessively good model. Davies’ score is telling us that the distance between clusters is quite small. This may be due to several factors, but keep in mind that the energy of a model is the data; if the data does not have sufficient predictive power, you cannot expect to achieve exceptional results. For visualizations, we can use the method to reduce dimensionality, PCA. For them we are going to use the Prince library, focused on exploratory analysis and dimensionality reduction. If you prefer, you can use Sklearn’s PCA, they are identical. First we will calculate the principal components in 3D, and then we will make the representation. These are the two functions performed by the previous steps Output:
  • 12. PCA space and the clusters created by the model (Image by Author). It can be seen that the clusters have almost no separation between them and there is no clear division. This is in accordance with the information provided by the metrics. Something to keep in mind and that very few people keep in mind is the PCA and the variability of the eigenvectors. Let’s say that each field contains a certain amount of information, and this adds its bit of information. If the accumulated sum of the 3 main components adds up to around 80% variability, we can say
  • 13. that it is acceptable, obtaining good results in the representations. If the value is lower, we have to take the visualizations with a grain of salt since we are missing a lot of information that is contained in other eigenvectors. The next question is obvious: What is the variability of the PCA executed? The answer is the following: As can be seen, we have 27.98% variability with the first 3 components, something insufficient to draw informed conclusions. When we apply the PCA method, since it is a linear algorithm, it is not capable of capturing more complex relationships. Luckily there is a method called t-SNE, which is capable of capturing these complex polynomial relationships. This can help us visualize, since with the previous method we have not had much success. If you try it on your computers, keep in mind that it has a higher computational cost. For this reason, I sampled my original dataset and it still took me about 5 minutes to get the result. The code is as follows:
  • 14. from sklearn.manifold import TSNE sampling_data = data_no_outliers.sample(frac=0.5, replace=True, random_state=1) sampling_clusters = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values df_tsne_3d = TSNE( n_components=3, learning_rate=500, init='random', perplexity=200, n_iter = 5000).fit_transform(sampling_data) df_tsne_3d = pd.DataFrame(df_tsne_3d, columns=["comp1", "comp2",'comp3']) df_tsne_3d["cluster"] = sampling_clusters plot_pca_3d(df_tsne_3d, title = "PCA Space", opacity=1, width_line = 0.1) As a result, I got the following image. It shows a clearer separation between clusters but unfortunately, we still don’t have good results. t-SNE space and the clusters created by the model (Image by Author).
  • 15. In fact, we can compare the reduction carried out by the PCA and by the t-SNE, in 2 dimensions. The improvement is clear using the second method. Different results for different dimensionality reduction methods and clusters defined by the model (Image by Author). Finally, let’s explore a little how the model works, in which features are the most important and what are the main characteristics of the clusters. To see the importance of each of the variables we will use a typical “trick” in this type of situation. We are going to create a classification model where the “X” is the inputs of the Kmeans model, and the “y” is the clusters predicted by the Kmeans model. The chosen model is an LGBMClassifier. This model is quite powerful and works well having categorical and numerical variables. Having the new model trained, using the SHAP library, we can obtain the importance of each of the features in the prediction. The code is:
  • 16. import lightgbm as lgb import shap # We create the LGBMClassifier model and train it clf_km = lgb.LGBMClassifier(colsample_by_tree=0.8) clf_km.fit(X=data_no_outliers, y=clusters_predict) #SHAP values explainer_km = shap.TreeExplainer(clf_km) shap_values_km = explainer_km.shap_values(data_no_outliers) shap.summary_plot(shap_values_km, data_no_outliers, plot_type="bar", plot_size=(15, 10)) Output: The importance of the variables in the model (Image by Author). It can be seen that feature age has the greatest predictive power. It can also be seen that cluster number 3 (green) is mainly differentiated by the balance variable.
  • 17. Finally we must analyze the characteristics of the clusters. This part of the study is what is decisive for the business. For them we are going to obtain the means (for the numerical variables) and the most frequent value (categorical variables) of each of the features of the dataset for each of the clusters: df_no_outliers = df[df.outliers == 0] df_no_outliers["cluster"] = clusters_predict df_no_outliers.groupby('cluster').agg( { 'job': lambda x: x.value_counts().index[0], 'marital': lambda x: x.value_counts().index[0], 'education': lambda x: x.value_counts().index[0], 'housing': lambda x: x.value_counts().index[0], 'loan': lambda x: x.value_counts().index[0], 'contact': lambda x: x.value_counts().index[0], 'age':'mean', 'balance': 'mean', 'default': lambda x: x.value_counts().index[0], } ).reset_index() Output: We see that the clusters with job=blue-collar do not have a great differentiation between their characteristics, except by the age feature. This is something that is not desirable since it is difficult to differentiate the clients of each of the clusters. In the job=management case, we obtain better differentiation.
  • 18. After carrying out the analysis in different ways, they converge on the same conclusion: “We need to improve the results”. Method 2: K-Prototype If we remember our original dataset, we see that we have categorical and numerical variables. Unfortunately, the Kmeans algorithm provided by Skelearn does not accept categorical variables, forcing the original dataset to be modified and drastically altered. Luckily, you’ve taken with me and my post. But above all, thanks to ZHEXUE HUANG and his article Extensions to the k- Means Algorithm for Clustering Large Data Sets with Categorical Values, there is an algorithm that accepts categorical variables for clustering. This algorithm is called K-Prototype. The bookstore that provides it is Prince. The procedure is the same as in the previous case. In order not to make this article eternal, let’s go to the most interesting parts. But remember that you can access the Jupyter notebook here. Preprocessed Because we have numerical variables, we must make certain modifications to them. It is always recommended that all numerical variables be on similar scales and with distributions as close as possible to Gaussian ones. The dataset that we will use to create the models is created as follows: pipe = Pipeline([('scaler', PowerTransformer())]) df_aux = pd.DataFrame(pipe_fit.fit_transform(df_no_outliers[["age", "balance"]] ), columns = ["age",
  • 19. "balance"]) df_no_outliers_norm = df_no_outliers.copy() # Replace age and balance columns by preprocessed values df_no_outliers_norm = df_no_outliers_norm.drop(["age", "balance"], axis = 1) df_no_outliers_norm["age"] = df_aux["age"].values df_no_outliers_norm["balance"] = df_aux["balance"].values df_no_outliers_norm Outliers Because the method that I have presented for outlier detection (ECOD) only accepts numerical variables, the same transformation must be performed as for the kmeans method. We apply the outlier detection model that will provide us with which rows to eliminate, finally leaving the dataset that we will use as input for the K-Prototype model:
  • 20. Modeling We create the model and to do this we first need to obtain the optimal k. To do this we use the Elbow Method and this piece of code: # Choose optimal K using Elbow method from kmodes.kprototypes import KPrototypes from plotnine import * import plotnine cost = [] range_ = range(2, 15) for cluster in range_: kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster, init = 'Huang', random_state = 0) kprototype.fit_predict(df_no_outliers, categorical = categorical_columns_index) cost.append(kprototype.cost_) print('Cluster initiation: {}'.format(cluster)) # Converting the results into a dataframe and plotting them df_cost = pd.DataFrame({'Cluster':range_, 'Cost':cost}) # Data viz plotnine.options.figure_size = (8, 4.8) ( ggplot(data = df_cost)+ geom_line(aes(x = 'Cluster', y = 'Cost'))+ geom_point(aes(x = 'Cluster', y = 'Cost'))+
  • 21. geom_label(aes(x = 'Cluster', y = 'Cost', label = 'Cluster'), size = 10, nudge_y = 1000) + labs(title = 'Optimal number of cluster with Elbow Method')+ xlab('Number of Clusters k')+ ylab('Cost')+ theme_minimal() ) Output: Elbow score for different numbers of clusters (Image by Author). We can see that the best option is K=5. Be careful, since this algorithm takes a little longer than those normally used. For the previous graph, 86 minutes were needed, something to keep in mind.
  • 22. Well, we are now clear about the number of clusters, we just have to create the model: # We get the index of categorical columns numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64'] categorical_columns = df_no_outliers_norm.select_dtypes(exclude=numerics).columns print(categorical_columns) categorical_columns_index = [df_no_outliers_norm.columns.get_loc(col) for col in categorical_columns] # Create the model cluster_num = 5 kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster_num, init = 'Huang', random_state = 0) kprototype.fit(df_no_outliers_norm, categorical = categorical_columns_index) clusters = kprototype.predict(df_no_outliers , categorical = categorical_columns_index) print(clusters) " -> array([3, 1, 1, ..., 1, 1, 2], dtype=uint16)" We already have our model and its predictions, we just need to evaluate it. Evaluation As we have seen before we can apply several visualizations to obtain an intuitive idea of how good our model is. Unfortunately the PCA method and t-SNE do not admit categorical variables. But don’t worry, since the Prince library contains the MCA (Multiple correspondence analysis) method and it does accept a mixed dataset. In fact, I encourage you to visit the Github of this library, it has several super useful methods for different situations, see the following image:
  • 23. The different methods of dimensionality reduction by type of case (Image by Author and Prince Documentation). Well, the plan is to apply a MCA to reduce the dimensionality and be able to make graphical representations. For this we use the following code: from prince import MCA def get_MCA_3d(df, predict): mca = MCA(n_components =3, n_iter = 100, random_state = 101) mca_3d_df = mca.fit_transform(df) mca_3d_df.columns = ["comp1", "comp2", "comp3"] mca_3d_df["cluster"] = predict return mca, mca_3d_df def get_MCA_2d(df, predict): mca = MCA(n_components =2, n_iter = 100, random_state = 101) mca_2d_df = mca.fit_transform(df) mca_2d_df.columns = ["comp1", "comp2"] mca_2d_df["cluster"] = predict return mca, mca_2d_df
  • 24. "-------------------------------------------------------------------" mca_3d, mca_3d_df = get_MCA_3d(df_no_outliers_norm, clusters) Remember that if you want to follow each step 100%, you can take a look at Jupyter notebook. The dataset named mca_3d_df contains that information: Let’s make a plot using the reduction provided by the MCA method:
  • 25. MCA space and the clusters created by the model (Image by Author) Wow, it doesn’t look very good… It is not possible to differentiate the clusters from each other. We can say then that the model is not good enough, right? I hope you said something like: “Hey Damian, don’t go so fast!! Have you looked at the variability of the 3 components provided by the MCA?” Indeed, we must see if the variability of the first 3 components is sufficient to be able to draw conclusions. The MCA method allows us to obtain these values in a very simple way:
  • 26. mca_3d.eigenvalues_summary Aha, here we have something interesting. Due to our data we obtain basically zero variability. In other words, we cannot draw clear conclusions from our model with the information provided by the dimensionality reduction provided by MCA. By showing these results I try to give an example of what happens in real data projects. Good results are not always obtained, but a good data scientist knows how to recognize the causes. We have one last option to visually determine if the model created by the K-Prototype method is suitable or not. This path is simple: 1. This is applying PCA to the dataset to which preprocessing has been performed to transform the categorical variables into numerical ones. 2. Obtain the components of the PCA
  • 27. 3. Make a representation using the PCA components such as the axes and the color of the points to predict the K- Prototype model. Note that the components provided by the PCA will be the same as for method 1: Kmeans, since it is the same dataframe. Let’s see what we get… PCA space and the clusters created by the model (Image by Author). It doesn’t look bad, in fact it has a certain resemblance to what has been obtained in Kmeans.
  • 28. Finally we obtain the average value of the clusters and the importance of each of the variables: The importance of the variables in the model. The table represents the most frequent value of each of the clusters (Image by Author). The variables with the greatest weight are the numerical ones, notably seeing that the confinement of these two features is almost sufficient to differentiate each cluster.
  • 29. In short, it can be said that results similar to those of Kmeans have been obtained. Method 3: LLM + Kmeans This combination can be quite powerful and improve the results obtained. Let’s get to the point! LLMs cannot understand written text directly, we need to transform the input of this type of models. For this, Sentence Embedding is carried out. It consists of transforming the text into numerical vectors. The following image can clarify the idea: Concept of embedding and similarity (Image by Author). This coding is done intelligently, that is, phrases that contain a similar meaning will have a more similar vector. See the following image:
  • 30. Concept of embedding and similarity (Image by Author). Sentence embedding is carried out by so-called transforms, algorithms specialized in this coding. Typically you can choose what the size of the numerical vector coming from this encoding is. And here is one of the key points:
Thanks to the high dimension of the vector created by the embedding, small variations in the data can be captured with greater precision. Therefore, if we feed our Kmeans model with this information-rich input, it will return better predictions. This is the idea we are pursuing, and these are its steps:

1. Transform our original dataset through Sentence embedding
2. Create a Kmeans model
3. Evaluate it

Well, the first step is to encode the information through Sentence embedding. The intention is to take the information of each client and unify it into a text that contains all their characteristics. This part takes a lot of computing time, which is why I created a script that does this job, called embedding_creation.py. This script collects the values contained in the training dataset and creates a new dataset produced by the embedding. This is the script code:

import pandas as pd  # dataframe manipulation
import numpy as np   # linear algebra
from sentence_transformers import SentenceTransformer

df = pd.read_csv("data/train.csv", sep=";")

# -------------------- First Step --------------------
def compile_text(x):
    # Unify each customer's attributes into a single descriptive text
    text = f"""Age: {x['age']}, housing load: {x['housing']},
Job: {x['job']}, Marital: {x['marital']},
Education: {x['education']}, Default: {x['default']},
Balance: {x['balance']}, Personal loan: {x['loan']},
contact: {x['contact']}"""
    return text

sentences = df.apply(lambda x: compile_text(x), axis=1).tolist()

# -------------------- Second Step --------------------
model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
output = model.encode(sentences=sentences,
                      show_progress_bar=True,
                      normalize_embeddings=True)

df_embedding = pd.DataFrame(output)
df_embedding

Since it is quite important that this step is understood, let's go point by point:

Step 1: A text is created for each row, containing the complete customer information. We also store these texts in a Python list for later use. See the following image, which exemplifies it.
Graphic description of the first step (Image by Author).

Step 2: This is when the call to the transformer is made. For this we use a model stored on Hugging Face, specifically trained to perform embedding at the sentence level, unlike BERT, which focuses on encoding at the level of tokens and words. To call the model you only have to give the repository address, which in this case is "sentence-transformers/paraphrase-MiniLM-L6-v2". The numerical vector returned for each text is normalized, since the Kmeans model is sensitive to the scale of its inputs. The vectors created have a length of 384, so we create a dataframe with the same number of columns. See the following image:

Graphic description of the second step (Image by Author).

Finally, we obtain the dataframe from the embedding, which will be the input of our Kmeans model.
This step has been one of the most interesting and important, since we have created the input for the Kmeans model we are about to build. The creation and evaluation procedure is similar to the one shown above. In order not to make the post excessively long, only the results of each point will be shown. Don't worry, all the code is contained in the Jupyter notebook called embedding.ipynb, so you can reproduce the results yourself. In addition, the dataset resulting from applying the Sentence embedding has been saved in a CSV file called embedding_train.csv. In the Jupyter notebook you will see that we access that dataset and create our model based on it.

# Normal Dataset
df = pd.read_csv("data/train.csv", sep=";")
df = df.iloc[:, 0:8]

# Embedding Dataset
df_embedding = pd.read_csv("data/embedding_train.csv", sep=",")

Preprocessed
We could consider the embedding itself as the preprocessing.

Outliers

We apply the method already presented to detect outliers, ECOD, and create a dataset that does not contain these points:

df_embedding_no_out.shape   -> (40690, 384)
df_embedding_with_out.shape -> (45211, 384)

Modeling

First we must find out the optimal number of clusters. For this we use the Elbow Method.

Elbow score for different numbers of clusters (Image by Author).

After viewing the graph, we choose k=5 as our number of clusters.
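For reference, both of the steps above (the ECOD filtering and the elbow check) could be reproduced roughly as in the following sketch. It uses the pyod library for ECOD and plain Kmeans inertias for the elbow curve, and assumes df_embedding is the embedding dataframe loaded earlier; the exact code in the notebook may differ:

import matplotlib.pyplot as plt
from pyod.models.ecod import ECOD
from sklearn.cluster import KMeans

# Outlier removal with ECOD (labels_ == 1 marks an outlier after fitting)
clf = ECOD()
clf.fit(df_embedding)
df_embedding_no_out = df_embedding[clf.labels_ == 0]

# Elbow method: inertia for several values of k
inertias = []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(df_embedding_no_out)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()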
n_clusters = 5
clusters = KMeans(n_clusters=n_clusters, init="k-means++").fit(df_embedding_no_out)
print(clusters.inertia_)
clusters_predict = clusters.predict(df_embedding_no_out)

Evaluation

Having created our Kmeans model with k=5, we can obtain some metrics like these:

Davies-Bouldin score: 1.8095386826791042
Calinski-Harabasz score: 6419.447089002081
Silhouette score: 0.20360442824114108

We see that the values are very similar to those obtained in the previous case. Let's study the representations obtained with a PCA analysis:
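All three metrics are available in scikit-learn. A minimal sketch, assuming df_embedding_no_out and clusters_predict from the code above (the sampling of the silhouette score is my own shortcut, not necessarily what the notebook does):

from sklearn.metrics import (davies_bouldin_score,
                             calinski_harabasz_score,
                             silhouette_score)

print("Davies-Bouldin:", davies_bouldin_score(df_embedding_no_out, clusters_predict))
print("Calinski-Harabasz:", calinski_harabasz_score(df_embedding_no_out, clusters_predict))
# silhouette_score is quadratic in the number of points; sampling keeps it tractable
print("Silhouette:", silhouette_score(df_embedding_no_out, clusters_predict,
                                      sample_size=10000, random_state=42))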
PCA space and the clusters created by the model (Image by Author).

It can be seen that the clusters are much better differentiated than with the traditional method. This is good news. Let us remember that it is important to take into account the variability captured by the first 3 components of our PCA analysis. From experience, I can say that when it is around 50% (3D PCA), more or less clear conclusions can be drawn.
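Checking that variability is straightforward with scikit-learn; a small sketch, again assuming df_embedding_no_out from above (variable names here are mine):

from sklearn.decomposition import PCA

pca_3d = PCA(n_components=3)
components_3d = pca_3d.fit_transform(df_embedding_no_out)

# Share of the total variance captured by each of the first 3 components
print(pca_3d.explained_variance_ratio_)
print("Cumulative:", pca_3d.explained_variance_ratio_.sum())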
PCA space and the clusters created by the model. The variability of the first 3 components of the PCA is also shown (Image by Author).

We see that the cumulative variability of the 3 components is 40.44%, which is acceptable but not ideal.

One way to visually judge how compact the clusters are is to modify the opacity of the points in the 3D representation: when points agglomerate in a certain region of space, a dark spot can be observed. To illustrate what I mean, I show the following gif:

plot_pca_3d(df_pca_3d, title="PCA Space", opacity=0.2, width_line=0.1)

PCA space and the clusters created by the model (Image by Author).

As can be seen, there are several regions of space where points of the same cluster agglomerate. This indicates that they are well differentiated from the other points and that the model recognizes them quite well.
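plot_pca_3d is a helper function from the project repository, so its exact implementation is not shown here. A rough equivalent with plotly, assuming df_pca_3d is a dataframe holding the 3 PCA components with columns PC1/PC2/PC3 (column names are my assumption) and clusters_predict the cluster labels, might look like this:

import plotly.express as px

df_plot = df_pca_3d.copy()
df_plot["cluster"] = clusters_predict.astype(str)  # categorical colors

# Low opacity so dense regions of a cluster show up as dark spots
fig = px.scatter_3d(df_plot, x="PC1", y="PC2", z="PC3",
                    color="cluster", opacity=0.2)
fig.update_traces(marker=dict(size=2, line=dict(width=0.1)))
fig.show()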
Even so, several clusters cannot be differentiated well (e.g., clusters 1 and 3). For this reason, we carry out a t-SNE analysis, which, as we recall, is a dimensionality reduction method that takes complex non-linear relationships into account.

t-SNE space and the clusters created by the model (Image by Author).

A noticeable improvement is seen. The clusters do not overlap and there is a clear separation between points. The improvement obtained with this second dimensionality reduction method is notable.
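A minimal t-SNE sketch follows. Since t-SNE is expensive on ~40,000 points, this version works on a random sample; the sampling and parameter choices are mine, not necessarily those used in the notebook:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Work on a random subset to keep t-SNE runtime reasonable
rng = np.random.default_rng(42)
idx = rng.choice(len(df_embedding_no_out), size=5000, replace=False)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_2d = tsne.fit_transform(df_embedding_no_out.values[idx])

plt.scatter(tsne_2d[:, 0], tsne_2d[:, 1],
            c=np.asarray(clusters_predict)[idx], s=2, cmap="tab10")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()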
Let's see a 2D comparison:

Different results for different dimensionality reduction methods and the clusters defined by the model (Image by Author).

Again, it can be seen that the clusters in the t-SNE projection are more separated and better differentiated than in the PCA one. Furthermore, the gap in quality between the two methods is smaller than when using the traditional Kmeans approach.

To understand which variables our Kmeans model relies on, we do the same move as before: we create a classification model (LGBMClassifier) and analyze the importance of the features. A rough sketch of this step is shown below.
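This is only a sketch of the idea, not the exact code from the notebook; it reuses df (the original 8-column dataframe), clf (the ECOD detector) and clusters_predict from the sketches above, and trains LightGBM to predict the cluster label from the original, human-readable features:

import pandas as pd
from lightgbm import LGBMClassifier

# Keep only the rows that survived the ECOD outlier filter
X = df[clf.labels_ == 0].copy()
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category")  # LightGBM handles categoricals natively

clf_lgbm = LGBMClassifier(n_estimators=200, random_state=42)
clf_lgbm.fit(X, clusters_predict)

importances = pd.Series(clf_lgbm.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))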
The importance of the variables in the model (Image by Author).

We see that this model relies above all on the "marital" and "job" variables. On the other hand, there are variables that do not provide much information. In a real case, a new version of the model should be created without these low-information variables. The Kmeans + Embedding model is more efficient, since it needs fewer variables to give good predictions. Good news!

We finish with the part that is most revealing and important. Managers and the business are not interested in PCA, t-SNE or embeddings. What they want is to know the main traits of their clients, in this case.
To do this, we create a table with information about the predominant profiles found in each of the clusters:

Something very curious happens: there are 3 clusters in which the most frequent job is "management". In them we find a very peculiar pattern: single managers are the youngest, married managers are older, and divorced managers are the oldest. On the other hand, the balance behaves differently: single managers have a higher average balance than divorced ones, and married managers also show a higher average balance.
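One way to build such a profile table, assuming X (the original features for the kept rows) and clusters_predict as in the earlier sketches: numeric columns are summarized with the mean and categorical ones with the most frequent value. This is my own sketch of the idea, not the notebook's exact code:

import pandas as pd

profile = X.copy()
profile["cluster"] = clusters_predict

numeric_cols = [c for c in profile.select_dtypes(include="number").columns
                if c != "cluster"]
categorical_cols = [c for c in profile.columns
                    if c not in numeric_cols and c != "cluster"]

# Mean for numeric features, mode for categorical ones, per cluster
summary = profile.groupby("cluster").agg(
    {**{c: "mean" for c in numeric_cols},
     **{c: lambda s: s.mode().iloc[0] for c in categorical_cols}}
)
print(summary)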
What was said above can be summarized in the following image:

Different customer profiles defined by the model (Image by Author).

This finding is in line with reality and with social patterns, and it reveals very specific customer profiles. This is the magic of data science.

Conclusion

The conclusion is clear:
(Image by Author)

You need to master different tools because, in a real project, not all strategies work and you must have the resources to add value. It is clear that the model created with the help of the LLM embeddings stands out.