Data Analysis Course
Cluster Analysis
Venkat Reddy
Contents
• What is the need of Segmentation
• Introduction to Segmentation & Cluster analysis
• Applications of Cluster Analysis
• Types of Clusters
• K-Means clustering
DataAnalysisCourse
VenkatReddy
2
What is the need of segmentation?
Problem:
• 10,000 customers: we know their age, city, income, employment status, and designation
• You have to sell 100 Blackberry phones (each costs $1,000) to the people in this group, and you have at most 7 days
• If you give a demo to each individual, 10,000 demos will take more than a year. How do you sell the maximum number of phones with the minimum number of demos?
What is the need of segmentation?
Solution
• Divide the whole population into two groups: employed / unemployed
• Further divide the employed population into two groups: high / low salary
• Further divide the high-salary group into high / low designation
10,000 customers
    Unemployed: 3,000
    Employed: 7,000
        Low salary: 5,000
        High salary: 2,000
            Low designation: 1,800
            High designation: 200
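The funnel above can be checked with a small Python sketch (the deck's own demos use SAS; Python is used here only for illustration). The nested-dictionary layout and the helper name `size` are assumptions; only the segment sizes come from the slide.

```python
# Segment sizes are taken from the slide; the nested layout is an assumption.
segments = {
    "Unemployed": 3000,
    "Employed": {
        "Low salary": 5000,
        "High salary": {
            "Low designation": 1800,
            "High designation": 200,
        },
    },
}


def size(node):
    """Total number of customers under a node of the segmentation tree."""
    if isinstance(node, int):
        return node
    return sum(size(child) for child in node.values())


total = size(segments)
target = segments["Employed"]["High salary"]["High designation"]
print(f"{total} customers in total, {target} in the target segment")
print(f"Demos are needed for {target / total:.0%} of the population")
```

Giving demos only to the high-designation segment reduces the work from 10,000 demos to 200, which is the point the slide makes.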
Segmentation and Cluster Analysis
• A cluster is a group of similar objects (cases, points, observations, examples, members, customers, patients, locations, etc.)
• Cluster analysis finds groups of cases/observations/objects in the population such that the objects are
  • Homogeneous within the group (high intra-class similarity)
  • Heterogeneous between the groups (low inter-class similarity)
[Figure: clusters of points. Intra-cluster distances are minimized; inter-cluster distances are maximized.]
Applications of Cluster Analysis
• Market Segmentation: Grouping people (with the willingness,
purchasing power, and the authority to buy) according to their
similarity in several dimensions related to a product under
consideration.
• Sales Segmentation: Clustering can tell you what types of customers
buy what products
• Credit Risk: Segmentation of customers based on their credit history
• Operations: Segmenting high performers and deciding promotions based on a
person's performance
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost.
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Geographical: Identification of areas of similar land use in an earth
observation database.
Types of Clusters
• Partitional (non-hierarchical) clustering: a division of objects into non-overlapping subsets (clusters) such that each object is in exactly one cluster
  • Non-hierarchical methods divide a dataset of N objects into M clusters
  • K-means clustering, a non-hierarchical technique, is the most commonly used one in business analytics
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
  • Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains
  • The CHAID tree is the most widely used in business analytics
Cluster Analysis - Example
            Maths  Science  GK   Apt
Student-1     94     82     87    89
Student-2     46     67     33    72
Student-3     98     97     93   100
Student-4     14      5      7    24
Student-5     86     97     95    95
Student-6     34     32     75    66
Student-7     69     44     59    55
Student-8     85     90     96    89
Student-9     24     26     15    22

The same students sorted by Maths score:
            Maths  Science  GK   Apt
Student-4     14      5      7    24
Student-9     24     26     15    22
Student-6     34     32     75    66
Student-2     46     67     33    72
Student-7     69     44     59    55
Student-8     85     90     96    89
Student-5     86     97     95    95
Student-1     94     82     87    89
Student-3     98     97     93   100

Resulting clusters (student numbers): {4, 9, 6}, {2, 7}, {8, 5, 1, 3}
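As a rough check on the grouping above, the following Python sketch (the deck itself uses SAS) compares the average Euclidean distance between students in the same cluster with the average distance between students in different clusters. The scores and the three clusters come from the slide; using Euclidean distance over all four subjects is an assumption made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# (Maths, Science, GK, Apt) per student number, copied from the slide.
scores = {
    1: (94, 82, 87, 89), 2: (46, 67, 33, 72), 3: (98, 97, 93, 100),
    4: (14, 5, 7, 24),   5: (86, 97, 95, 95), 6: (34, 32, 75, 66),
    7: (69, 44, 59, 55), 8: (85, 90, 96, 89), 9: (24, 26, 15, 22),
}
clusters = [{4, 9, 6}, {2, 7}, {8, 5, 1, 3}]
label = {s: c for c, members in enumerate(clusters) for s in members}

within, between = [], []
for a, b in combinations(scores, 2):
    d = dist(scores[a], scores[b])
    # Same cluster label: intra-cluster pair; otherwise inter-cluster pair.
    (within if label[a] == label[b] else between).append(d)

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(f"mean intra-cluster distance: {mean_within:.1f}")
print(f"mean inter-cluster distance: {mean_between:.1f}")
```

A good clustering should show a smaller intra-cluster average than inter-cluster average, which is the homogeneity/heterogeneity idea stated earlier in the deck.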
Building Clusters
1. Select a distance measure
2. Select a clustering algorithm
3. Define the distance between two clusters
4. Determine the number of clusters
5. Validate the analysis
• The aim is to build clusters, i.e. divide the whole population into groups of similar objects
• What is similarity/dissimilarity?
• How do you define the distance between two clusters?
Dissimilarity & Similarity
          Weight
Cust1       68
Cust2       72
Cust3      100
Which two customers are similar?

          Weight   Age
Cust1       68      25
Cust2       72      70
Cust3      100      28
Which two customers are similar now?

          Weight   Age   Income
Cust1       68      25   60,000
Cust2       72      70    9,000
Cust3      100      28   62,000
Which two customers are similar in this case?
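One way to answer the three questions above is to find the closest pair of customers as each variable is added. This is a minimal Python sketch (the deck uses SAS), assuming unscaled Euclidean distance; the helper name `closest_pair` is hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# (Weight, Age, Income) per customer, copied from the slide.
customers = {
    "Cust1": (68, 25, 60000),
    "Cust2": (72, 70, 9000),
    "Cust3": (100, 28, 62000),
}


def closest_pair(n_vars):
    """Closest pair of customers using only the first n_vars variables."""
    return min(
        combinations(customers, 2),
        key=lambda p: dist(customers[p[0]][:n_vars], customers[p[1]][:n_vars]),
    )


for n, names in enumerate(["Weight", "Weight, Age", "Weight, Age, Income"], start=1):
    print(f"{names}: closest pair is {closest_pair(n)}")
```

Because Income is measured in much larger units than Weight or Age, it dominates the unscaled distance once it is included.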
Quantify dissimilarity: distance measures
• To measure similarity between two observations, a distance measure is needed. With a single variable, similarity is straightforward
  • Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases
• Multiple variables require an aggregate distance measure
  • With many characteristics (e.g. income, age, consumption habits, family composition, owning a car, education level, job…), it becomes more difficult to define similarity with a single value
• The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.
Examples of distances
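Euclidean distance:
\[ D_{ij} = \sqrt{\sum_{k=1}^{n} \left( x_{ki} - x_{kj} \right)^{2}} \]
City-block (Manhattan) distance:
\[ D_{ij} = \sum_{k=1}^{n} \left| x_{ki} - x_{kj} \right| \]
where \(D_{ij}\) is the distance between cases i and j, and \(x_{kj}\) is the value of variable \(x_k\) for case j.
[Figure: two points A and B, drawn once for each of the two distances]
Other distance measures: Chebychev, Minkowski, Mahalanobis, maximum distance, cosine similarity, simple correlation between observations, etc.
Data matrix (one row per case, one column per variable; here n is the number of cases and p the number of variables):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix:
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]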
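The Euclidean and city-block formulas translate directly into code. The sketch below is in Python rather than the deck's SAS; the points `A` and `B` are hypothetical values chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def euclidean(a, b):
    """Square root of the summed squared differences over all variables."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences over all variables (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical points: they differ by 3 on one axis and 4 on the other.
A = (1, 2)
B = (4, 6)
print(euclidean(A, B))  # 5.0, the straight-line distance
print(manhattan(A, B))  # 7, the distance along the axes
```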
Calculating the distance
          Weight
Cust1       68
Cust2       72
Cust3      100
• Cust1 vs Cust2: |68 - 72| = 4
• Cust2 vs Cust3: |72 - 100| = 28
• Cust3 vs Cust1: |100 - 68| = 32

          Weight   Age
Cust1       68      25
Cust2       72      70
Cust3      100      28
• Cust1 vs Cust2: sqrt((68-72)^2 + (25-70)^2) = 45.18
• Cust2 vs Cust3: sqrt((72-100)^2 + (70-28)^2) = 50.48
• Cust3 vs Cust1: sqrt((100-68)^2 + (28-25)^2) = 32.14
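The two-variable distances can be reproduced by building the lower-triangular dissimilarity matrix for the three customers. This is a Python sketch for illustration (the deck's demo uses SAS `proc distance`); the variable names `names`, `data` and `matrix` are assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+

names = ["Cust1", "Cust2", "Cust3"]
data = [(68, 25), (72, 70), (100, 28)]  # (Weight, Age) from the slide

# Row i holds the distances from customer i to customers 0..i,
# so the diagonal is 0 and only the lower triangle is stored.
matrix = [
    [round(dist(data[i], data[j]), 2) for j in range(i + 1)]
    for i in range(len(data))
]

for name, row in zip(names, matrix):
    print(name, row)
```

Rounded to two decimals, the Euclidean distances are 45.18 (Cust1 vs Cust2), 50.48 (Cust2 vs Cust3) and 32.14 (Cust3 vs Cust1).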
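Demo: Calculation of distance
/* Pairwise Euclidean distances between observations; NOSTD suppresses standardization */
proc distance data=cust_data out=Dist method=Euclid nostd;
    var interval(Credit_score Expenses);  /* both variables are treated as interval-scaled */
run;
/* Print the resulting distance matrix */
proc print data=Dist;
run;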
Lab: Distance Calculation
proc distance data=cust_data out=Count_Dist method=Euclid nostd;
    var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate);
run;
proc print data=Count_Dist;
run;
Clustering algorithms
• K-means clustering algorithm
• Fuzzy c-means clustering algorithm
• Hierarchical clustering algorithm
• Gaussian (EM) clustering algorithm
• Quality Threshold (QT) clustering algorithm
• MST-based clustering algorithm
• Density-based clustering algorithm
• Kernel k-means clustering algorithm
K-Means Clustering – Algorithm
1. Fix the number of clusters, k
2. Provide an initial set of k "seeds" (aggregation centres), either
   a. the first k elements, or
   b. other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, assign all units to the nearest cluster seed
4. Compute the new seeds as the means of the clusters
5. Go back to step 3 until no reclassification is necessary
Or simply:
Initialize k cluster centers
Do
    Assignment step: assign each data point to its closest cluster center
    Re-estimation step: re-compute the cluster centers
While (there are still changes in the cluster centers)
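The loop above can be sketched in plain Python (the deck's own demo uses SAS `proc fastclus`, which this does not reproduce). The sketch assumes Euclidean distance and uses the first k points as seeds, one of the seeding options listed in step 2. The function name `k_means`, the `max_iter` cap and the sample points are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, max_iter=100):
    """Plain k-means; the seeds are simply the first k points."""
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no reclassification, so stop
            break
        labels = new_labels
        # Re-estimation step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical 2-D data: two well-separated groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Stopping when no point changes cluster is equivalent to stopping when the centers stop changing, because the centers are computed only from the assignments.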
K-Means clustering
[Figure sequence: one run of k-means illustrated step by step]
1. Start with the overall population
2. Fix the number of clusters
3. Calculate the distance of each case from all cluster centers
4. Assign each case to the nearest cluster
5. Recalculate the cluster centers
6. Reassign cases after the cluster centers change
7. Continue until there is no significant change between two iterations
K-Means clustering in action
• Dividing the data into 10 clusters using K-Means
[Figure: data divided into 10 clusters, with the callout "Distance metric will decide cluster for these points"]
K-Means Clustering SAS Demo
• A supermarket wants to send promotional coupons to 100 families
• The idea is to identify 100 customers with medium income and low recent spend
/* k-means with at most 5 clusters and at most 20 iterations; cluster assignments are written to clustr_out */
proc fastclus data=sup_market radius=0 replace=full
              maxclusters=5 maxiter=20 distance out=clustr_out;
    id cust_id;
    var age family_size income spend visit_Other_shops;
run;
Lab: K-Means Clustering
• Download the contact center agents data
• The performance data contains
  • Average handling time
  • Average number of calls
  • CSAT
  • Resolution score
• Identify the top 10 agents for promotion based on the criteria below
  • High CSAT
  • High resolution score
  • Low average handling time
  • High number of calls
SAS Code Options
• The RADIUS= option establishes the minimum distance criterion for
selecting new seeds. No observation is considered as a new seed unless its
minimum distance to previous seeds exceeds the value given by the
RADIUS= option. The default value is 0.
• The MAXCLUSTERS= option specifies the maximum number of clusters
allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
• The REPLACE= option specifies how seed replacement is performed.
  • FULL: requests default seed replacement.
  • PART: requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds.
  • NONE: suppresses seed replacement.
  • RANDOM: selects a simple pseudo-random sample of complete observations as initial cluster seeds.
SAS Code & Options
• The MAXITER= option specifies the maximum number of iterations for
recomputing cluster seeds. When the value of the MAXITER= option is greater
than 0, each observation is assigned to the nearest seed, and the seeds are
recomputed as the means of the clusters.
• The LIST option lists all observations, giving the value of the ID variable (if
any), the number of the cluster to which the observation is assigned, and
the distance between the observation and the final cluster seed.
• The DISTANCE option computes distances between the cluster means.
• The ID variable, which can be character or numeric, identifies observations
on the output when you specify the LIST option.
• The VAR statement lists the numeric variables to be used in the cluster
analysis. If you omit the VAR statement, all numeric variables not listed in
other statements are used.
Distance between Clusters
In the formulas below, tip is an element of cluster Ki and tjq is an element of cluster Kj.
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min d(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max d(tip, tjq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg d(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where a medoid is a chosen, centrally located object in the cluster
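The first four definitions can be sketched as small Python functions (the deck uses SAS; this is illustration only). Euclidean distance between elements is assumed, the function names are hypothetical, and the medoid variant is left out because it depends on how the medoid is chosen. The two clusters are made-up points on a line.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def pair_distances(ki, kj):
    """All distances between an element of ki and an element of kj."""
    return [dist(p, q) for p, q in product(ki, kj)]


def single_link(ki, kj):
    return min(pair_distances(ki, kj))


def complete_link(ki, kj):
    return max(pair_distances(ki, kj))


def average_link(ki, kj):
    d = pair_distances(ki, kj)
    return sum(d) / len(d)


def centroid(cluster):
    """Coordinate-wise mean of the cluster's points."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_link(ki, kj):
    return dist(centroid(ki), centroid(kj))


# Hypothetical clusters on a line, so the distances are easy to verify by hand.
ki = [(0, 0), (2, 0)]
kj = [(5, 0), (9, 0)]
print(single_link(ki, kj))    # 3.0 (from 2 to 5)
print(complete_link(ki, kj))  # 9.0 (from 0 to 9)
print(average_link(ki, kj))   # 6.0 (mean of 5, 9, 3, 7)
print(centroid_link(ki, kj))  # 6.0 (from 1 to 7)
```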
SAS output interpretation
• RMSSTD: pooled standard deviation of all the variables forming the cluster (variation within a cluster). Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible
• SPRSQ: semipartial R-squared measures the loss of homogeneity due to combining two groups or clusters to form a new group or cluster (the error incurred by combining two groups)
• Thus, the SPRSQ value should be small, implying that we are merging two homogeneous groups
SAS output interpretation
• RSQ (R-squared) measures the extent to which groups or clusters differ from each other (variance between the clusters)
• When you have just one cluster, the RSQ value is, intuitively, zero. Thus, the RSQ value should be high.
• Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged.
• So, Centroid Distance is a measure of the homogeneity of merged clusters, and the value should be small.
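To make the RSQ intuition concrete, the Python sketch below computes R-squared with the usual definition, one minus the ratio of within-cluster to total sum of squares. Whether SAS reports exactly this quantity in every procedure is not confirmed here, so treat it as an assumption. The helper names and the four sample points are hypothetical.

```python
def sum_sq(points):
    """Sum of squared deviations of the points from their mean, over all variables."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, mean))


def r_squared(clusters):
    """1 - (pooled within-cluster sum of squares / total sum of squares)."""
    all_points = [p for c in clusters for p in c]
    within = sum(sum_sq(c) for c in clusters)
    return 1 - within / sum_sq(all_points)


# Hypothetical data: two tight groups that lie far apart.
low = [(1, 1), (2, 2)]
high = [(9, 9), (10, 10)]
print(r_squared([low + high]))  # one cluster: nothing is explained
print(r_squared([low, high]))   # two clusters: most of the variation is explained
```

With a single cluster the within-cluster sum of squares equals the total, so R-squared is zero, matching the slide. Splitting the data into its two natural groups pushes the value close to one.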
Distance Calculation on standardized data
Raw data:
          Weight   Income
Cust1       68     60,000
Cust2       72      9,000
Cust3      100     62,000
Average     80     43,667
Stdev       14     24,527

Standardized data:
          Weight   Income
Cust1     -0.84     0.67
Cust2     -0.56    -1.41
Cust3      1.40     0.75
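The standardized table can be reproduced with z-scores, as in the Python sketch below (the deck uses SAS). The slide's Stdev row matches the population standard deviation, so `statistics.pstdev` is used; that choice is inferred from the numbers rather than stated in the deck. The helper names `standardize` and `closest_pair` are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+
from statistics import mean, pstdev

# (Weight, Income) per customer, copied from the slide.
customers = {"Cust1": (68, 60000), "Cust2": (72, 9000), "Cust3": (100, 62000)}


def standardize(rows):
    """Z-score each column: subtract the mean, divide by the population std. dev."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(r, mu, sd)) for r in rows]


z = dict(zip(customers, standardize(list(customers.values()))))
for name, row in z.items():
    print(name, [round(v, 2) for v in row])


def closest_pair(table):
    """Pair of customers with the smallest Euclidean distance."""
    return min(combinations(table, 2), key=lambda p: dist(table[p[0]], table[p[1]]))


print("raw data:", closest_pair(customers))
print("standardized:", closest_pair(z))
```

On the raw data Income dominates the distance, so the closest pair follows income alone. After standardization both variables contribute on a comparable scale, and the closest pair can change.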
