1
K-Means
Class: Algorithmic Methods of Data Mining
Program: M. Sc. Data Science
University: Sapienza University of Rome
Semester: Fall 2015
Lecturer: Carlos Castillo, http://chato.cl/
Sources:
● Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University
Press, May 2014. Example 13.1.
● Evimaria Terzi: Data Mining course at Boston University
http://www.cs.bu.edu/~evimaria/cs565-13.html
2
Boston University Slideshow Title Goes Here
The k-means problem
• consider a set X = {x_1, ..., x_n} of n points in R^d
• assume that the number k is given
• problem:
• find k points c_1, ..., c_k (named centers or means)
so that the cost
cost(c_1, ..., c_k) = \sum_{i=1}^{n} \min_{1 \le j \le k} \| x_i - c_j \|_2^2
is minimized
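A minimal sketch of the objective, assuming the standard k-means cost (the sum, over all points, of the squared Euclidean distance to the nearest center). The function name `kmeans_cost` and the toy data are illustrative choices, not part of the slides.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(dist(x, c) ** 2 for c in centers) for x in points)

# Hypothetical toy data: four points in R^2 and two candidate centers.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at distance 1 from its nearest center, so the cost is 4.0.
print(kmeans_cost(X, C))
```

Using every point as its own center gives cost 0, which is one way to see why the case k=n is easy.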
3
The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the
data is at least 2 (d≥2)
• in practice, a simple iterative algorithm
works quite well
4
The k-means algorithm
• voted among the top-10 algorithms in data mining
• one way of solving the k-means problem
5
K-means algorithm
6
The k-means algorithm
1. randomly (or with another method) pick k
cluster centers {c_1, ..., c_k}
2. for each j, set the cluster X_j to be the set of
points in X that are closest to center c_j
3. for each j, let c_j be the center of cluster X_j
(the mean of the vectors in X_j)
4. repeat (go to step 2) until convergence
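The steps above can be sketched in plain Python as follows. This is an illustrative version, not the lecturer's code: the names `kmeans` and `mean`, the `max_iter` cap, and the rule that an empty cluster keeps its old center are all assumptions made here.

```python
import random
from math import dist

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(coords) / n for coords in zip(*vectors))

def kmeans(points, k, centers=None, max_iter=100, seed=0):
    """Iterate assignment and re-centering until the centers stop changing."""
    rng = random.Random(seed)
    # Step 1: pick k initial centers (here: k distinct input points at random).
    centers = list(centers) if centers is not None else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: dist(x, centers[j]))
            clusters[j].append(x)
        # Step 3: move each center to the mean of its cluster.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        # Step 4: stop at convergence, i.e. when no center moved.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of three points each.
X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(X, 2, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Passing `centers` explicitly makes a run reproducible; leaving it out falls back to random initialization from the data.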
7
Sample execution
8
1-dimensional clustering exercise
Exercise:
● For the data in the figure
● Run k-means with k=2 and initial centroids u1=2, u2=4
(Verify: last centroids are 18 units apart)
● Try with k=3 and initialization 2,3,30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
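The exercise can be checked with a short script. The figure is not reproduced in this text, so the data set below is an ASSUMPTION: it is the nine-point 1-dimensional example recalled from the cited book. The helper name `kmeans_1d` and the tie-breaking rule (ties go to the lower-indexed centroid) are also choices made for this sketch.

```python
def kmeans_1d(data, centroids, max_iter=100):
    """1-D k-means; returns final centroids, clusters, and the centroid history."""
    centroids = list(centroids)
    history = [centroids[:]]
    clusters = []
    for _ in range(max_iter):
        # Assign every value to the closest centroid (first one wins on ties).
        clusters = [[] for _ in centroids]
        for x in data:
            j = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)
        # Recompute each centroid as the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
        history.append(centroids[:])
    return centroids, clusters, history

# ASSUMPTION: points taken to be those shown in the figure (not given in the text).
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]

centroids, clusters, history = kmeans_1d(data, [2, 4])
for step, c in enumerate(history):
    print(step, c)
print("gap between final centroids:", centroids[1] - centroids[0])
```

Under this assumption the run with k=2 ends at centroids 7 and 25, which are 18 units apart, consistent with the check on the slide. Calling `kmeans_1d(data, [2, 3, 30])` covers the k=3 part of the exercise.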
9
Limitations of k-means
● Clusters of different size
● Clusters of different density
● Clusters of non-globular shape
● Sensitive to initialization
10
Limitations of k-means: different sizes
11
Limitations of k-means: different
density
12
Limitations of k-means: non-spherical
shapes
13
Effects of bad initialization
14
k-means algorithm
• finds a local optimum
• often converges quickly,
but not always
• the choice of initial points can have a large
influence on the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem
15
Advanced: k-means initialization
16
Initialization
• random initialization
• random, but repeat many times and take the
best solution
  • helps, but the solution can still be bad
• pick points that are far from each other
• k-means++
  • provable guarantees
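The "repeat many times and take the best solution" strategy can be sketched as below. The function names `lloyd` and `best_of_restarts`, the default of 20 restarts, and the handling of empty clusters are assumptions made for this illustration.

```python
import random
from math import dist

def lloyd(points, centers, max_iter=100):
    """Plain k-means iterations from the given centers; returns (centers, cost)."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)), key=lambda j: dist(x, centers[j]))
            clusters[j].append(x)
        # Mean of each cluster; an empty cluster keeps its old center (assumption).
        updated = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    cost = sum(min(dist(x, c) ** 2 for c in centers) for x in points)
    return centers, cost

def best_of_restarts(points, k, restarts=20, seed=0):
    """Run k-means from several random initializations and keep the cheapest run."""
    rng = random.Random(seed)
    runs = (lloyd(points, rng.sample(points, k)) for _ in range(restarts))
    return min(runs, key=lambda run: run[1])

# Hypothetical data: three small groups of points in the plane.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
       (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
centers, cost = best_of_restarts(pts, 3)
print(centers, cost)
```

Restarts only select the best among the local optima that were found, which is why the slide notes that the solution can still be bad.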
17
k-means++
David Arthur and Sergei Vassilvitskii,
"k-means++: The advantages of careful seeding",
SODA 2007
18
k-means algorithm: random
initialization
19
k-means algorithm: random
initialization
20
k-means algorithm:
initialization with furthest-first
traversal
(figure labels: 1, 2, 3, 4)
21
k-means algorithm:
initialization with furthest-first
traversal
22
but... sensitive to outliers
(figure labels: 1, 2, 3)
23
but... sensitive to outliers
24
Here random may work well
25
k-means++ algorithm
• interpolate between the two methods
• let D(x) be the distance between x and the
nearest center selected so far
• choose the next center with probability proportional to
(D(x))^a = D^a(x)
  a = 0: random initialization
  a = ∞: furthest-first traversal
  a = 2: k-means++
26
k-means++ algorithm
• initialization phase:
• choose the first center uniformly at random
• choose the next center with probability proportional
to D^2(x)
• iteration phase:
• iterate as in the k-means algorithm until
convergence
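The initialization phase can be sketched with one function that takes the exponent a as a parameter, so that a=2 gives k-means++ seeding, a=0 uniform random picks, and a=∞ furthest-first traversal. The function name `seed_centers` is hypothetical, and the sketch assumes the input contains at least k distinct points (otherwise all weights can become zero and `random.choices` raises an error).

```python
import random
from math import dist, inf

def seed_centers(points, k, a=2, seed=0):
    """Pick k initial centers; each next one is drawn with probability
    proportional to D(x)**a, where D(x) is the distance from x to the
    nearest center selected so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        d = [min(dist(x, c) for c in centers) for x in points]
        if a == inf:
            # Limit case: always take the point farthest from the current centers.
            nxt = points[max(range(len(points)), key=d.__getitem__)]
        else:
            # a = 0 gives equal weights (a point may then be picked twice);
            # a = 2 gives the k-means++ weights D(x)^2.
            weights = [di ** a for di in d]
            nxt = rng.choices(points, weights=weights, k=1)[0]
        centers.append(nxt)
    return centers

# Hypothetical 1-D data (as 1-tuples) with one far-away point.
pts = [(0.0,), (1.0,), (10.0,), (11.0,), (100.0,)]
print(seed_centers(pts, 3, a=2))
print(seed_centers(pts, 3, a=inf))
```

The centers returned here would then be handed to the usual k-means iterations for the iteration phase. With a=∞ the far-away point is always selected, which illustrates the sensitivity to outliers shown on the earlier slides.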
27
k-means++ initialization
(figure labels: 1, 2, 3)
28
k-means++ result
29
k-means++ provable guarantee
• the approximation guarantee comes from the initialization
alone: the expected cost after seeding is at most 8(ln k + 2)
times the optimal cost (Arthur and Vassilvitskii, SODA 2007)
• subsequent iterations can only improve the cost
30
Lesson learned
• no reason to use k-means and not k-means++
• k-means++ :
• easy to implement
• provable guarantee
• works well in practice
31
k-means--
● Algorithm 4.1 in [Chawla & Gionis SDM 2013]
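The slide only cites the algorithm, so the following is a hedged sketch of the general idea of k-means with outlier removal, not a transcription of Algorithm 4.1. It assumes that, in every iteration, the points farthest from their nearest center are set aside as outliers and the means are recomputed from the remaining points only. The function name `kmeans_minus_minus` and the parameter `n_outliers` are hypothetical.

```python
from math import dist

def kmeans_minus_minus(points, centers, n_outliers, max_iter=100):
    """k-means iterations that ignore the n_outliers farthest points each round."""
    centers = list(centers)
    outliers = []
    for _ in range(max_iter):
        # Distance of every point to its nearest current center.
        nearest = [min(dist(x, c) for c in centers) for x in points]
        # Set aside the n_outliers points with the largest distance.
        order = sorted(range(len(points)), key=nearest.__getitem__, reverse=True)
        outlier_idx = set(order[:n_outliers])
        outliers = [points[i] for i in sorted(outlier_idx)]
        # Assign the remaining points and recompute means from them only.
        clusters = [[] for _ in centers]
        for i, x in enumerate(points):
            if i not in outlier_idx:
                j = min(range(len(centers)), key=lambda j: dist(x, centers[j]))
                clusters[j].append(x)
        # Assumption: an empty cluster keeps its previous center.
        updated = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, outliers

# Hypothetical 1-D data (as 1-tuples): two groups plus one extreme value.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (100.0,)]
centers, outliers = kmeans_minus_minus(pts, [(0.0,), (10.0,)], n_outliers=1)
print(centers, outliers)
```

In this toy run the extreme value is reported as the outlier and does not pull either mean toward it, which addresses the outlier sensitivity listed among the limitations.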
