Clustering
Cluster Analysis
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
 For Outlier Analysis
Clustering: Rich Applications and
Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or support other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
Clustering Quality
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
Measure the Quality of Clustering
 Dissimilarity/similarity metric: similarity is expressed in terms of a
distance function d(i, j), which is typically a metric
 There is a separate “quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector
variables.
 Weights should be associated with different variables based on
applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective.
Requirements of Clustering in Data
Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Types of Data
Clustering Data Structures
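 Data matrix
 Two modes
 Object by variable structure (n objects × p variables)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

 Dissimilarity matrix
 One mode
 Object by object structure (n × n)

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$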
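As a rough illustration of how the two structures relate, the sketch below derives a lower-triangular dissimilarity matrix from a small data matrix. The function names, the sample values, and the choice of Euclidean distance as d(i, j) are assumptions made for this example only; any dissimilarity function could be plugged in.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length rows (illustrative choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(data, dist):
    """Return the lower triangle of the n-by-n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i), so the last entry of each row
    is the diagonal value d(i, i) = 0.
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]

# Hypothetical data matrix: n = 3 objects, p = 2 variables.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data, euclidean)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i), which matches the one-mode layout of the dissimilarity matrix.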
Types of data in cluster analysis
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
Interval-scaled variables
 Continuous measurements
 Standardize data
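 Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
 Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
 Using mean absolute deviation is more robust to outliers than using standard
deviation, because the deviations from the mean are not squared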
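The standardization step can be sketched in a few lines of plain Python. This is a minimal example, assuming one variable f is given as a list of numbers; the function name `standardize` and the sample values are made up for illustration.

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation.

    m is the mean, s is the mean absolute deviation, and each
    measurement x is mapped to z = (x - m) / s.
    """
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]

# Hypothetical measurements of a single interval-scaled variable.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)  # m = 5.0 and s = 2.0, so z = [-1.5, -0.5, 0.5, 1.5]
```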
Similarity and Dissimilarity Between
Objects
 Distances are normally used to measure the similarity or
dissimilarity between two data objects
 Some popular ones include: Minkowski distance
$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional
data objects, and q is a positive integer
 If q = 1, d is Manhattan distance
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
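A small sketch of the Minkowski distance, assuming objects are given as equal-length lists of numbers. The function name `minkowski` and the two sample objects are illustrative, not taken from the slides.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

# Hypothetical 2-dimensional objects.
obj_i = [1.0, 2.0]
obj_j = [4.0, 6.0]
print(minkowski(obj_i, obj_j, 1))  # Manhattan: |1-4| + |2-6| = 7
print(minkowski(obj_i, obj_j, 2))  # Euclidean: sqrt(9 + 16) = 5
```

Setting q = 1 gives the Manhattan distance, and q = 2 gives the Euclidean distance, so a single function covers both special cases.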
Similarity and Dissimilarity Between
Objects (Cont.)
 If q = 2, d is Euclidean distance
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
 Weighted Euclidean distance
 Properties of Euclidean and Manhattan distance
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
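The sketch below does two things. It implements a weighted Euclidean distance, assuming the common form in which each squared difference is multiplied by a weight w_f (the slide names the measure but does not show its formula). It also checks the four listed properties numerically on a handful of points. The helper names and sample points are hypothetical, and a numeric check on a few points illustrates the properties rather than proving them.

```python
import itertools
import math

def weighted_euclidean(i, j, weights):
    """Assumed form: sqrt(sum over f of w_f * (x_if - x_jf)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, i, j)))

def satisfies_metric_properties(points, dist, tol=1e-12):
    """Check the four listed properties on every triple drawn from points."""
    for a, b, c in itertools.product(points, repeat=3):
        if dist(a, b) < 0:                              # d(i,j) >= 0
            return False
        if dist(a, a) != 0:                             # d(i,i) = 0
            return False
        if abs(dist(a, b) - dist(b, a)) > tol:          # d(i,j) = d(j,i)
            return False
        if dist(a, b) > dist(a, c) + dist(c, b) + tol:  # triangle inequality
            return False
    return True

def euclidean(a, b):
    return weighted_euclidean(a, b, [1.0] * len(a))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

points = [[0.0, 0.0], [3.0, 4.0], [1.0, 7.0]]
print(satisfies_metric_properties(points, euclidean))
print(satisfies_metric_properties(points, manhattan))
print(weighted_euclidean([0.0, 0.0], [3.0, 4.0], [4.0, 0.0]))  # sqrt(4 * 9) = 6
```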
Binary Variables
 A contingency table for binary data

                    Object j
                    1        0        sum
  Object i   1      a        b        a + b
             0      c        d        c + d
             sum    a + c    b + d    p

 Distance measure for symmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
 Distance measure for asymmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c}$$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
$$\mathrm{sim}(i,j) = 1 - d(i,j) = \frac{a}{a + b + c}$$
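The contingency counts and the three measures can be sketched as follows, assuming each object is a list of 0/1 values. The function names and sample vectors are illustrative. Returning a distance of 0 when a + b + c = 0 is a convention chosen here, since the asymmetric formula is undefined in that case.

```python
def contingency(i, j):
    """Return (a, b, c, d): counts of the (1,1), (1,0), (0,1), (0,0) pairs."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d

def symmetric_distance(i, j):
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)

def asymmetric_distance(i, j):
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0  # convention assumed here: no positive values on either side
    return (b + c) / (a + b + c)

def jaccard_similarity(i, j):
    return 1.0 - asymmetric_distance(i, j)

# Hypothetical binary objects with p = 5 variables.
u = [1, 0, 1, 0, 0]
v = [1, 1, 0, 0, 0]
print(contingency(u, v))          # (1, 1, 1, 2)
print(symmetric_distance(u, v))   # (1 + 1) / 5
print(asymmetric_distance(u, v))  # (1 + 1) / 3
print(jaccard_similarity(u, v))   # 1 / 3
```

The asymmetric measure ignores d, the count of joint absences, which is why it is larger than the symmetric one for the same pair.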
Example
 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Using the asymmetric binary distance on the six asymmetric attributes (gender is left out):
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
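The three distances in this example can be reproduced with a short script. It encodes Y and P as 1 and N as 0, leaves out Gender because it is symmetric, and applies the asymmetric distance (b + c) / (a + b + c). The dictionary layout and function names are choices made for this sketch.

```python
# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
patients = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}

def encode(values):
    """Map Y and P to 1 and N to 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]

def asymmetric_distance(i, j):
    """(b + c) / (a + b + c); assumes at least one attribute is 1 in i or j."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(i, j) if x != y)  # b + c
    return mismatches / (a + mismatches)

encoded = {name: encode(values) for name, values in patients.items()}
pairs = [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]
results = {
    pair: round(asymmetric_distance(encoded[pair[0]], encoded[pair[1]]), 2)
    for pair in pairs
}
for pair, dist in results.items():
    print(pair, dist)
```

Jim and Mary come out as the least similar pair, and Jack and Mary as the most similar.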
Categorical or Nominal Variables
 A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
$$d(i,j) = \frac{p - m}{p}$$
 m: # of matches, p: total # of variables
 Method 2: use a large number of binary variables
 creating a new binary variable for each of the M nominal states
 Using similarity measures of binary variables
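Both methods can be sketched briefly. The first function computes the simple matching distance (p - m) / p. The second turns one nominal value into M binary variables, one per state. The function names and the sample objects are invented for the illustration.

```python
def simple_matching_distance(i, j):
    """Method 1: d(i, j) = (p - m) / p, with m matches out of p variables."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal variables.
obj_a = ["red", "small", "round"]
obj_b = ["red", "large", "round"]
print(simple_matching_distance(obj_a, obj_b))  # 2 matches out of 3

colors = ["red", "yellow", "blue", "green"]
print(one_hot("blue", colors))  # [0, 0, 1, 0]
```

After the one-hot step, the binary measures from the previous slides apply to the new variables.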
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace $x_{if}$ by its rank
$$r_{if} \in \{1, \ldots, M_f\}$$
 map the range of each variable onto [0, 1] by replacing the i-th object in
the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
 compute the dissimilarity using methods for interval-scaled
variables
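A minimal sketch of the rank mapping, assuming the ordered list of states for the variable is known in advance and contains at least two states. The function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), with ranks r in 1..M."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

# Hypothetical ordinal variable with M = 3 states, lowest first.
states = ["bronze", "silver", "gold"]
z = ordinal_to_unit_interval(["gold", "bronze", "silver"], states)
print(z)  # [1.0, 0.0, 0.5]
```

The resulting values lie in [0, 1] and can be fed to any interval-scaled distance such as Manhattan or Euclidean.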
Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale, such
as $Ae^{Bt}$ or $Ae^{-Bt}$
 Methods:
 treat them like interval-scaled variables (not a good choice, because the scale can be distorted)
 apply logarithmic transformation
$$y_{if} = \log(x_{if})$$
 treat them as continuous ordinal data and treat their rank as interval-scaled
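To see why the logarithmic transformation helps, the sketch below generates values that grow like A·e^(Bt) and shows that their logarithms are evenly spaced. The constants A = 2 and B = 1.5 and the function name are assumptions made for this example.

```python
import math

def log_transform(values):
    """Apply y = log(x) to a ratio-scaled variable (all values must be positive)."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]

# Hypothetical measurements following A * exp(B * t) with A = 2, B = 1.5.
raw = [2.0 * math.exp(1.5 * t) for t in range(4)]
y = log_transform(raw)
steps = [y[k + 1] - y[k] for k in range(len(y) - 1)]
print(raw)    # grows exponentially
print(steps)  # each step is close to B = 1.5
```

After the transformation the values behave like an interval-scaled variable, so the earlier distance measures apply without the largest measurements dominating.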
Variables of Mixed Types
 A database may contain all six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio
 One may use a weighted formula to combine their
effects
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
 f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
 compute ranks $r_{if}$ and
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
 and treat $z_{if}$ as interval-scaled
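The weighted formula can be sketched as below. Two details are assumptions, since the slide does not spell them out. First, the indicator δ is taken to be 0 when a value is missing or when an asymmetric binary variable is 0 for both objects, and 1 otherwise. Second, the normalized distance for an interval variable is taken to be |x_if - x_jf| divided by that variable's range. Ordinal values are expected to be mapped to z_if beforehand and passed as interval values with range 1. All names and sample values are hypothetical.

```python
def mixed_dissimilarity(i, j, kinds, ranges):
    """d(i, j) = sum(delta_f * d_f) / sum(delta_f) over all variables f."""
    numerator = 0.0
    denominator = 0.0
    for x, y, kind, value_range in zip(i, j, kinds, ranges):
        if x is None or y is None:
            continue  # delta = 0 (assumed): a value is missing
        if kind == "asymmetric" and x == 0 and y == 0:
            continue  # delta = 0 (assumed): joint absence is ignored
        if kind in ("symmetric", "asymmetric", "nominal"):
            d_f = 0.0 if x == y else 1.0
        elif kind == "interval":
            d_f = abs(x - y) / value_range  # assumed normalization by range
        else:
            raise ValueError("unknown variable kind: " + kind)
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator

# Hypothetical objects: nominal color, asymmetric test result,
# interval age (range 50), and an ordinal value already mapped to [0, 1].
kinds = ["nominal", "asymmetric", "interval", "interval"]
ranges = [None, None, 50.0, 1.0]
obj_i = ["red", 0, 30.0, 0.5]
obj_j = ["blue", 0, 40.0, 1.0]
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))  # (1 + 0.2 + 0.5) / 3
```

Because every per-variable term lies in [0, 1], the combined dissimilarity also lies in [0, 1].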
Vector Objects
 Cosine measure
$$s(x,y) = \frac{x^t \cdot y}{\|x\| \, \|y\|}, \qquad \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$$
For binary vectors, $x^t \cdot y$ = number of attributes shared by x and y
 Tanimoto coefficient
$$s(x,y) = \frac{x^t \cdot y}{x^t \cdot x + y^t \cdot y - x^t \cdot y}$$
Ratio of attributes shared by x and y to number of
attributes possessed by x or y
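Both vector measures follow directly from the dot product. The sketch below assumes vectors are equal-length lists of numbers with at least one nonzero entry. The function names and the two binary sample vectors are illustrative.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """s(x, y) = x.y / (||x|| * ||y||); assumes neither vector is all zeros."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(x, y):
    """s(x, y) = x.y / (x.x + y.y - x.y); assumes the vectors are not both all zeros."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Hypothetical binary attribute vectors: x has 3 attributes, y has 2, 2 are shared.
x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
print(cosine(x, y))    # 2 / sqrt(3 * 2)
print(tanimoto(x, y))  # 2 shared / 3 possessed by x or y
```

For binary vectors the Tanimoto value equals the Jaccard coefficient a / (a + b + c) from the binary-variable slides.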
