Nov 16th, 2001
Copyright © 2001, Andrew W. Moore
K-means and
Hierarchical
Clustering
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of
these slides. Andrew would be
delighted if you found this source
material useful in giving your own
lectures. Feel free to use these slides
verbatim, or to modify them to fit
your own needs. PowerPoint originals
are available. If you make use of a
significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s
tutorials:
http://www.cs.cmu.edu/~awm/tutorials
Comments and corrections gratefully received.
Some
Data
This could easily be
modeled by a
Gaussian Mixture
(with 5 components)
But let’s look at a
satisfying, friendly
and infinitely popular
alternative…
Lossy Compression
Suppose you transmit the
coordinates of points drawn
randomly from this dataset.
You can install decoding
software at the receiver.
You’re only allowed to send
two bits per point.
It’ll have to be a “lossy
transmission”.
Loss = Sum Squared Error
between decoded coords
and original coords.
What encoder/decoder will
lose the least information?
Idea One
Break into a grid, and decode each bit-pair (00, 01, 10, 11) as the middle of its grid-cell.
Any Better Ideas?
Idea Two
Break into a grid, and decode each bit-pair (00, 01, 10, 11) as the centroid of all data in that grid-cell.
Any Further Ideas?
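A minimal NumPy sketch of Idea Two, assuming 2-D data and a fixed 2×2 grid (the helper names and the fallback for empty cells are illustrative assumptions, not something from the slides):

```python
import numpy as np

def fit_grid_codebook(X):
    """Idea Two: split the bounding box into a 2x2 grid (two bits per point)
    and decode each cell as the centroid of the training data in that cell."""
    mid = (X.min(axis=0) + X.max(axis=0)) / 2.0                    # grid boundaries
    cells = (X[:, 0] > mid[0]).astype(int) * 2 + (X[:, 1] > mid[1]).astype(int)
    codebook = np.array([X[cells == c].mean(axis=0) if np.any(cells == c) else mid
                         for c in range(4)])                       # centroid per cell
    return mid, codebook

def encode(X, mid):
    return (X[:, 0] > mid[0]).astype(int) * 2 + (X[:, 1] > mid[1]).astype(int)

def decode(codes, codebook):
    return codebook[codes]

# Sum-squared reconstruction loss on some 2-D data
X = np.random.randn(500, 2)
mid, codebook = fit_grid_codebook(X)
loss = ((X - decode(encode(X, mid), codebook)) ** 2).sum()
```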
K-means
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat until terminated!
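Below is a compact NumPy sketch of this loop: alternate the assignment step (3) with the centroid-and-jump step (4–5) until no datapoint changes owner. It is an illustration only, not the accelerated system cited on the next slide, and the function name and defaults are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=np.random.default_rng(0)):
    """Plain K-means (Lloyd's algorithm) on an (R, m) data matrix X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]     # step 2: random guess
    owner = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 3: each datapoint finds out which Center it's closest to
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_owner = d2.argmin(axis=1)
        if np.array_equal(new_owner, owner):                    # step 6: terminated
            break
        owner = new_owner
        # steps 4-5: each Center finds the centroid of the points it owns and jumps there
        for j in range(k):
            if np.any(owner == j):
                centers[j] = X[owner == j].mean(axis=0)
    return centers, owner
```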
K-means
Start
Advance apologies: in
Black and White this
example will
deteriorate
Example generated by Dan Pelleg’s super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99). (Available on www.autonlab.org/pap.html)
K-means continues…
(The same assign-then-recenter step repeats over several iterations.)
K-means terminates
K-means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal
clustering?
• How should we start it?
• How could we automatically choose the
number of centers?
….we’ll deal with these questions over the next few slides
Distortion
Given..
• an encoder function: ENCODE : $\Re^m \to [1..k]$
• a decoder function: DECODE : $[1..k] \to \Re^m$
Define…

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)] \right)^2$$
We may as well write DECODE[$j$] = $\mathbf{c}_j$, so

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$
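In code, this distortion is just the summed squared distance from each record to the center that encodes it. A short NumPy sketch (the helper name and the `owner` index array are illustrative assumptions):

```python
import numpy as np

def distortion(X, centers, owner):
    """Sum over all R records of ||x_i - c_ENCODE(x_i)||^2,
    where owner[i] is the index of the center encoding x_i."""
    return ((X - centers[owner]) ** 2).sum()
```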
The Minimal Distortion (1)
What properties must centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ have when distortion is minimized?

$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2$$

(1) $\mathbf{x}_i$ must be encoded by its nearest center ….why?

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$$

..at the minimal distortion. Otherwise distortion could be reduced by replacing ENCODE[$\mathbf{x}_i$] by the nearest center.
The Minimal Distortion (2)
(2) The partial derivative of Distortion with respect to each center location must be zero.
$$\text{Distortion} = \sum_{i=1}^{R} \left( \mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} \right)^2 = \sum_{j=1}^{k} \;\sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial\,\mathbf{c}_j} = \frac{\partial}{\partial\,\mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \left( \mathbf{x}_i - \mathbf{c}_j \right) = 0 \quad \text{(for a minimum)}$$

OwnedBy($\mathbf{c}_j$) = the set of records owned by Center $\mathbf{c}_j$.
Thus, at a minimum:

$$\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$$
At the minimum distortion
What properties must centers c1 , c2 , … , ck have when
distortion is minimized?
(1) xi must be encoded by its nearest center
(2) Each Center must be at the centroid of points it owns.
Improving a suboptimal configuration…
What properties can be changed for centers $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k$ when distortion is not minimized?
(1) Change the encoding so that $\mathbf{x}_i$ is encoded by its nearest center
(2) Set each Center to the centroid of the points it owns.
There’s no point applying either operation twice in succession.
But it can be profitable to alternate.
…And that’s K-means!
Easy to prove this procedure will terminate in a state at which neither operation changes the configuration:
There are only a finite number of ways of partitioning
R records into k groups.
So there are only a finite number of possible
configurations in which all Centers are the centroids of
the points they own.
If the configuration changes on an iteration, it must
have improved the distortion.
So each time the configuration changes it must go to
a configuration it’s never been to before.
So if it tried to go on forever, it would eventually run
out of configurations.
Will we find the optimal configuration?
• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)
Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.
Neat trick:
Place the first center on top of a randomly chosen datapoint.
Place the second center on the datapoint that’s as far away as possible from the first center.
:
Place the j’th center on the datapoint that’s as far away as possible from the closest of Centers 1 through j−1.
:
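A sketch of this neat trick (furthest-point placement) in NumPy; the function name and tie-breaking behaviour are assumptions, and the result can be fed to the `kmeans` sketch above as its starting centers:

```python
import numpy as np

def farthest_point_init(X, k, rng=np.random.default_rng(0)):
    """Place center 1 on a random datapoint; place each later center on the
    datapoint farthest from the closest of the centers chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)
```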
Choosing the number of Centers
• A difficult problem
• Most common approach is to try to find the solution that minimizes the Schwarz Criterion (also related to the BIC)

$$\text{Distortion} + \lambda\,(\#\text{parameters})\log R \;=\; \text{Distortion} + \lambda m k \log R$$

where m = # dimensions, k = # Centers, R = # Records
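Assuming the `kmeans` and `distortion` sketches from earlier, choosing k with this criterion might look like the following; treating λ as a fixed constant of 1.0 is an assumption, not a recommendation:

```python
import numpy as np

def choose_k(X, k_max=10, lam=1.0):
    """Pick the k minimizing Distortion + lambda * m * k * log(R)."""
    R, m = X.shape
    scores = {}
    for k in range(1, k_max + 1):
        centers, owner = kmeans(X, k)
        scores[k] = distortion(X, centers, owner) + lam * m * k * np.log(R)
    return min(scores, key=scores.get)
```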
Common uses of K-means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding
to convert waveforms into one of k categories
(known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
Single Linkage Hierarchical
Clustering
1. Say “Every point is its
own cluster”
2. Find “most similar” pair
of clusters
3. Merge it into a parent
cluster
4. Repeat…until you’ve
merged the whole
dataset into one cluster
You’re left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here); a short code sketch of this merge loop appears after this slide.
How do we define similarity
between clusters?
• Minimum distance between
points in clusters (in which
case we’re simply doing
Euclidean Minimum
Spanning Trees)
• Maximum distance between
points in clusters
• Average distance between
points in clusters
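The whole merge loop fits in a few lines. Below is a naive O(n³) NumPy sketch of single linkage (illustrative assumptions throughout: the function name, the `merges` record, and the brute-force pair search; real implementations use much cleverer data structures). Cutting the last (k−1) merges it records leaves k groups, which is the point made on the next slide.

```python
import numpy as np

def single_linkage(X, target_clusters=1):
    """1. every point is its own cluster; 2. find the "most similar" pair
    (smallest minimum inter-point distance); 3. merge; 4. repeat."""
    clusters = [[i] for i in range(len(X))]
    merges = []
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))   # pairwise distances
    while len(clusters) > target_clusters:
        best = min((D[np.ix_(a, b)].min(), i, j)
                   for i, a in enumerate(clusters)
                   for j, b in enumerate(clusters) if i < j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))      # one step of the dendrogram
        clusters[i] = clusters[i] + clusters[j]        # merge into a parent cluster
        del clusters[j]
    return clusters, merges
```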
Single Linkage Comments
• It’s nice that you get a hierarchy instead
of an amorphous collection of groups
• If you want k groups, just cut the (k-1)
longest links
• There’s no real statistical or information-
theoretic foundation to this. Makes your
lecturer feel a bit queasy.
Also known in the trade as
Hierarchical Agglomerative
Clustering (note the acronym)
What you should know
• All the details of K-means
• The theory behind K-means as an
optimization algorithm
• How K-means can get stuck
• The outline of Hierarchical clustering
• Be able to contrast between which
problems would be relatively well/poorly
suited to K-means vs Gaussian Mixtures
vs Hierarchical clustering