3.1 Clustering

Cluster Analysis
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
 For Outlier Analysis

Clustering: Rich Applications and Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or support other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Web log data to discover groups of similar access patterns

Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
 City planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults

Clustering Quality
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation

Measure the Quality of Clustering
 Dissimilarity/similarity metric: similarity is expressed in terms of a
distance function d(i, j), which is typically a metric
 There is a separate “quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables.
 Weights should be associated with different variables based on
applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective.

Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

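Types of Data
Clustering Data Structures
 Data matrix
 Two modes
 Object by variable structure

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

 Dissimilarity matrix
 One mode
 Object by object structure

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$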
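The relationship between the two structures can be sketched in a few lines of Python: a lower-triangular dissimilarity matrix is built from a data matrix by applying a distance function to every pair of objects. The function names, the sample data, and the choice of Manhattan distance are illustrative assumptions, not part of the slides.

```python
def manhattan(x, y):
    """Sum of absolute coordinate differences (assumed distance for this sketch)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def dissimilarity_matrix(data, dist):
    """Build the lower triangle (including the zero diagonal) of the
    object-by-object matrix from an object-by-variable data matrix."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: 3 objects, 2 variables.
data = [[1.0, 2.0], [2.0, 4.0], [5.0, 2.0]]
D = dissimilarity_matrix(data, manhattan)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i) and d(i,i) = 0.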

Type of data in clustering analysis
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types

Interval-scaled variables
 Continuous measurements
 Standardize data
 Calculate the mean absolute deviation:
$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$
where
$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$
 Calculate the standardized measurement (z-score):
$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$
 Using mean absolute deviation is more robust than using standard
deviation
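As a minimal sketch of the standardization step, the following Python function computes the mean m_f, the mean absolute deviation s_f, and the z-scores for one variable. The function name and the sample values are assumptions made for illustration.

```python
def standardize(values):
    """Return z-scores of one variable, using the mean absolute deviation
    (not the standard deviation) as the scale."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; the z-score is undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of a single variable.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```

Here m_f = 5 and s_f = 2, so the four values map to -1.5, -0.5, 0.5, and 1.5.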

Similarity and Dissimilarity Between Objects
 Distances are normally used to measure the similarity or
dissimilarity between two data objects
 Some popular ones include: Minkowski distance
$$ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} $$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional
data objects, and q is a positive integer
 If q = 1, d is Manhattan distance
$$ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$

Similarity and Dissimilarity Between Objects (Cont.)
 If q = 2, d is Euclidean distance
$$ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$
 Weighted Euclidean distance
 Properties of Euclidean and Manhattan distance
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
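The Minkowski family can be sketched as one Python function in which q = 1 gives the Manhattan distance and q = 2 gives the Euclidean distance. The function name and the sample points are assumptions for illustration.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects for integer q >= 1."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional objects.
i = (1.0, 2.0)
j = (4.0, 6.0)
k = (4.0, 2.0)

manhattan = minkowski(i, j, q=1)  # |1-4| + |2-6|
euclidean = minkowski(i, j, q=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The same points also illustrate the listed properties: swapping the arguments leaves the value unchanged, and the distance from i to j never exceeds the detour through k.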

Binary Variables
 A contingency table for binary data

                   Object j
                   1      0      sum
Object i    1      a      b      a+b
            0      c      d      c+d
            sum    a+c    b+d    p

 Distance measure for symmetric binary variables:
$$ d(i,j) = \frac{b + c}{a + b + c + d} $$
 Distance measure for asymmetric binary variables:
$$ d(i,j) = \frac{b + c}{a + b + c} $$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
$$ sim(i,j) = 1 - d(i,j) = \frac{a}{a + b + c} $$

Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be set to 1, and the value N be set to 0
 Using only the asymmetric attributes, d(i,j) = (b + c)/(a + b + c) gives

$$ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$
$$ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$
$$ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
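The three distances in this example can be reproduced with a short Python sketch. It encodes Y and P as 1 and N as 0 over the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) and leaves Gender out; the function and variable names are assumptions for illustration.

```python
def asymmetric_binary_distance(x, y):
    """d = (b + c) / (a + b + c); negative matches (0, 0) are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # only x is 1
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # only y is 1
    if a + b + c == 0:
        raise ValueError("no attribute is 1 in either object")
    return (b + c) / (a + b + c)


def encode(row):
    """Map Y and P to 1 and N to 0."""
    return [0 if value == "N" else 1 for value in row]


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 from the table.
jack = encode("YNPNNN")
mary = encode("YNPNPN")
jim = encode("YPNNNN")

d_jack_mary = asymmetric_binary_distance(jack, mary)
d_jack_jim = asymmetric_binary_distance(jack, jim)
d_jim_mary = asymmetric_binary_distance(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Jack and Mary come out as the closest pair, Jim and Mary as the most dissimilar.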

Categorical or Nominal Variables
 A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
 m: number of matches, p: total number of variables
$$ d(i,j) = \frac{p - m}{p} $$
 Method 2: use a large number of binary variables
 creating a new binary variable for each of the M nominal states
 using similarity measures of binary variables
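Method 1 can be sketched directly from its definition: count the variables on which the two objects agree and divide the number of mismatches by p. The function name and the sample objects are assumptions for illustration.

```python
def simple_matching_distance(x, y):
    """d = (p - m) / p for two objects described by p nominal variables."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects must share the same non-empty set of variables")
    p = len(x)                                  # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)  # number of matches
    return (p - m) / p


# Hypothetical objects with three nominal variables (color, size, shape).
d = simple_matching_distance(["red", "small", "round"], ["red", "large", "round"])
print(d)
```

Two of the three variables match, so the distance is 1/3.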

Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace $x_{if}$ by its rank
$$ r_{if} \in \{1, \ldots, M_f\} $$
 map the range of each variable onto [0, 1] by replacing the i-th object in
the f-th variable by
$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$
 compute the dissimilarity using methods for interval-scaled
variables
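The rank mapping can be sketched as follows, assuming the ordered list of states for the variable is known in advance. The function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by its rank r in 1..M_f, then map it to
    z = (r - 1) / (M_f - 1), which lies in [0, 1]."""
    M_f = len(ordered_states)
    if M_f < 2:
        raise ValueError("an ordinal variable needs at least two states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M_f - 1) for v in values]


# Hypothetical ordinal variable with M_f = 3 states, lowest first.
z = ordinal_to_unit_interval(
    ["gold", "bronze", "silver"],
    ordered_states=["bronze", "silver", "gold"],
)
print(z)
```

The resulting values can then be passed to any interval-scaled distance, such as the Euclidean or Manhattan distance.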

Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale, such
as $Ae^{Bt}$ or $Ae^{-Bt}$
 Methods:
 treat them like interval-scaled variables (not a good choice, since the scale can be distorted)
 apply logarithmic transformation
$$ y_{if} = \log(x_{if}) $$
 treat them as continuous ordinal data and treat their ranks as interval-scaled

Variables of Mixed Types
 A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio
 One may use a weighted formula to combine their effects
$$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$
 $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$
 f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
 compute ranks $r_{if}$ and
$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$
 and treat $z_{if}$ as interval-scaled
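A minimal Python sketch of the weighted formula follows. Two points are assumptions rather than statements from the slides: interval variables are normalized by dividing the absolute difference by the variable's range (max minus min), and a missing value is represented by `None`. Ordinal and ratio-scaled variables are expected to be converted to z-values in [0, 1] first and then passed as interval variables with range 1. All names are hypothetical.

```python
def mixed_distance(x, y, kinds, ranges):
    """Weighted combination of per-variable distances.

    kinds[f] is one of "nominal", "sym_binary", "asym_binary", "interval".
    ranges[f] is max - min of variable f (used for interval variables only;
    this normalization is an assumption of the sketch).
    """
    numerator = 0.0
    denominator = 0.0
    for f, (u, v, kind) in enumerate(zip(x, y, kinds)):
        if u is None or v is None:
            continue  # delta = 0: a value is missing
        if kind == "asym_binary" and u == 0 and v == 0:
            continue  # delta = 0: negative match on an asymmetric variable
        if kind == "interval":
            d_f = abs(u - v) / ranges[f]
        else:
            d_f = 0.0 if u == v else 1.0
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable contributes to the distance")
    return numerator / denominator


# Hypothetical objects: color, a symmetric flag, an asymmetric test, age.
kinds = ("nominal", "sym_binary", "asym_binary", "interval")
ranges = {3: 50.0}  # assumed range of the age variable
x = ("red", 1, 0, 30.0)
y = ("blue", 1, 0, 40.0)
d = mixed_distance(x, y, kinds, ranges)
print(d)
```

In this example the asymmetric test is skipped because both values are 0, so the result is (1 + 0 + 0.2) / 3 = 0.4.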

Vector Objects
 Cosine measure
$$ s(x, y) = \frac{x^t \cdot y}{\|x\| \, \|y\|}, \qquad \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} $$
For binary vectors, $x^t \cdot y$ = number of attributes shared by x and y
 Tanimoto coefficient
$$ s(x, y) = \frac{x^t \cdot y}{x^t \cdot x + y^t \cdot y - x^t \cdot y} $$
Ratio of attributes shared by x and y to number of
attributes possessed by x or y
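Both vector measures can be sketched with plain Python lists. The function names and the binary sample vectors are assumptions for illustration.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (||x|| ||y||)."""
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(x, y) / norm


def tanimoto(x, y):
    """s(x, y) = (x . y) / (x . x + y . y - x . y)."""
    shared = dot(x, y)
    denominator = dot(x, x) + dot(y, y) - shared
    if denominator == 0:
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return shared / denominator


# Hypothetical binary attribute vectors.
x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
cos_xy = cosine_similarity(x, y)
tan_xy = tanimoto(x, y)
print(cos_xy, tan_xy)
```

For these vectors, x and y share 2 attributes and possess 3 between them, so the Tanimoto coefficient is 2/3.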
