Clustering
Machine Learning
Machine Learning Paradigm:
 Observe a set of examples: training data
 Infer something about the process that generated the data
 Use the inference to make predictions about previously unseen data: test data
 Supervised: given a set of feature/label pairs, find a rule that predicts the label associated with a previously unseen input
 Unsupervised: given a set of feature vectors (without labels), group them into “natural clusters”
Clustering: An Optimization Problem
 Why not divide variability by the size of the cluster?
◦ A big, bad cluster is worse than a small, bad one
 Is the optimization problem simply finding a C that minimizes dissimilarity(C)?
◦ No; otherwise we could put each example in its own cluster
 Need a constraint, e.g.,
◦ Minimum distance between clusters
◦ Number of clusters
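The objective on this slide can be sketched in a few lines, assuming the usual definitions (the variability of a cluster is the sum of squared distances from its members to its centroid, and the dissimilarity of a clustering is the sum of the cluster variabilities); the function names and data are illustrative:

```python
# A sketch of the objective, assuming:
# variability(c) = sum of squared distances from each example in c
# to c's centroid; dissimilarity(C) = sum of cluster variabilities.
# Because variability is NOT divided by cluster size, a big spread-out
# cluster hurts the objective more than a small one with the same spread.

def centroid(cluster):
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]

def variability(cluster):
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def dissimilarity(clustering):
    return sum(variability(cluster) for cluster in clustering)

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
sensible = [[points[0], points[1]], [points[2], points[3]]]
trivial = [[p] for p in points]  # each example in its own cluster
print(dissimilarity(sensible))  # 1.0
print(dissimilarity(trivial))   # 0.0 -- why an unconstrained minimum is useless
```

The trivial clustering scores a perfect 0, which is exactly why a constraint on cluster count or separation is needed.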
Hierarchical Clustering:
 Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
 Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster.
 Continue the process until all items are clustered into a single cluster of size N.
What does distance mean?
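The merge loop above can be sketched directly. This is an illustrative agglomerative loop over 1-D items, assuming single-linkage distance as one answer to the distance question (linkage metrics are defined on the next slide); all names are my own:

```python
def single_linkage(c1, c2):
    # distance between the closest pair of members of the two clusters
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(items, target=1):
    # Step 1: assign each of the N items to its own cluster
    clusters = [[x] for x in items]
    # Steps 2-3: repeatedly merge the closest pair of clusters
    while len(clusters) > target:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]],
                                                        clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)  # one fewer cluster than before
    return clusters

# stopping early (target=2) instead of merging all the way to one cluster
print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], target=2))
```

Running all the way to `target=1` reproduces the full hierarchy; stopping early yields a flat clustering at the chosen level.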
Linkage Metrics
 Single-linkage: the distance between two clusters is the shortest distance from any member of one cluster to any member of the other
 Complete-linkage: the distance between two clusters is the greatest distance from any member of one cluster to any member of the other
 Average-linkage: the distance between two clusters is the average distance from any member of one cluster to any member of the other
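The three metrics can be sketched for clusters of 1-D points (helper names are my own; libraries such as SciPy call the same ideas "single", "complete", and "average"):

```python
def member_distances(c1, c2):
    # all member-to-member distances between two clusters of 1-D points
    return [abs(a - b) for a in c1 for b in c2]

def single_linkage(c1, c2):
    return min(member_distances(c1, c2))   # shortest such distance

def complete_linkage(c1, c2):
    return max(member_distances(c1, c2))   # greatest such distance

def average_linkage(c1, c2):
    d = member_distances(c1, c2)
    return sum(d) / len(d)                 # mean of all such distances

a, b = [0.0, 1.0], [4.0, 6.0]
print(single_linkage(a, b))    # 3.0 (from 1.0 to 4.0)
print(complete_linkage(a, b))  # 6.0 (from 0.0 to 6.0)
print(average_linkage(a, b))   # 4.5 (mean of 4, 6, 3, 5)
```

Single-linkage tends to produce long "chained" clusters, while complete-linkage favors compact ones; average-linkage sits between the two.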
Example of Hierarchical Clustering:
Clustering Algorithms:
 Hierarchical clustering
◦ Can select the number of clusters using a dendrogram
◦ Deterministic
◦ Flexible with respect to linkage criteria
◦ Slow: the naïve algorithm is O(n³); O(n²) algorithms exist for some linkage criteria
 K-means: a much faster greedy algorithm
◦ Most useful when you know how many clusters you want
K-means Algorithm:
randomly choose k examples as initial centroids
while true:
    create k clusters by assigning each example
        to the closest centroid
    compute k new centroids by averaging the
        examples in each cluster
    if the centroids don't change:
        break
What is the complexity of one iteration?
O(k*n*d), where k is the number of clusters, n is the number of points, and d is the time required to compute the distance between a pair of points.
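The pseudocode translates almost directly into Python. A minimal sketch, assuming examples are numeric tuples and Euclidean distance; the empty-cluster case, which the pseudocode does not cover, simply keeps the old centroid:

```python
import math
import random

def distance(p, q):
    # Euclidean distance between two numeric tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(examples, k, seed=0):
    rng = random.Random(seed)
    # randomly choose k examples as initial centroids
    centroids = rng.sample(examples, k)
    while True:
        # create k clusters by assigning each example to the closest
        # centroid (k distance computations per example: O(k*n*d) total)
        clusters = [[] for _ in range(k)]
        for e in examples:
            best = min(range(k), key=lambda i: distance(e, centroids[i]))
            clusters[best].append(e)
        # compute k new centroids by averaging the examples in each cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if not cluster:  # empty cluster: keep the old centroid
                new_centroids.append(centroids[i])
                continue
            dims = len(cluster[0])
            new_centroids.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                       for d in range(dims)))
        # if the centroids don't change, stop
        if new_centroids == centroids:
            return clusters
        centroids = new_centroids

points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
clusters = k_means(points, k=2)
print(sorted(map(sorted, clusters)))
```

On well-separated data like this, the loop converges in a handful of iterations regardless of which examples are drawn as initial centroids.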
An Example:
k = 4, initial centroids, followed by iterations 1–5 (scatter plots shown on the slides; figures omitted here)
Issues with k-means:
 Choosing the “wrong” k can lead to strange results
◦ Consider k = 3
 The result can depend upon the initial centroids
◦ Number of iterations
◦ Even the final result
◦ The greedy algorithm can find different local optima
How to Choose K:
 A priori knowledge about the application domain
◦ There are two kinds of people in the world: k = 2
◦ There are five different types of bacteria: k = 5
 Search for a good k
◦ Try different values of k and evaluate the quality of the results
◦ Run hierarchical clustering on a subset of the data
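One way to search for a good k is to run k-means for several candidate values, with a few random restarts each (since the result depends on the initial centroids), and watch how the objective drops. A self-contained 1-D sketch; the names, data, and the restart count are my own:

```python
import random

def k_means_1d(xs, k, rng):
    # compact 1-D k-means, following the pseudocode slide
    centroids = rng.sample(xs, k)
    while True:
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda i: abs(x - centroids[i]))].append(x)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            return clusters
        centroids = new

def dissimilarity(clusters):
    # sum of squared distances of members to their cluster mean
    return sum(sum((x - sum(c) / len(c)) ** 2 for x in c)
               for c in clusters if c)

def objective(xs, k, trials=10, seed=0):
    # best of several random restarts, since greedy k-means
    # can land in different local optima
    rng = random.Random(seed)
    return min(dissimilarity(k_means_1d(xs, k, rng)) for _ in range(trials))

xs = [0.0, 0.1, 0.2, 8.0, 8.1, 8.2]
for k in (1, 2, 3):
    print(k, round(objective(xs, k), 3))
```

On this data the objective collapses from roughly 96 at k = 1 to about 0.04 at k = 2, then barely improves at k = 3: the "elbow" at k = 2 matches the two natural groups.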