Deepak George
Staff Data Scientist
Unsupervised Learning: Clustering
K-Means, Hierarchical Clustering & DBSCAN
➢ Data Science Career
▪ General Electric
▪ Accenture Management Consulting
▪ Mu Sigma
➢ Highlights
▪ 1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
▪ Co-author of Markdown Optimization case published at Harvard Business School
▪ Kaggle Bronze medal – Toxic Comment Classification
▪ Kaggle Bronze medal - Coupon Purchase Prediction (Recommender System)
▪ SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
➢ Education
▪ Indian Institute Of Management Bangalore - Business Analytics & Intelligence
▪ College Of Engineering Trivandrum - Computer Science Engineering
➢ Passion
▪ Deep Learning, Photography, Football
➢ Profile
▪ linkedin.com/in/deepakgeorge7/
▪ https://github.com/deepakiim
Deepak George, IIM Bangalore
About Me
1. Introduction to clustering and unsupervised learning
2. K-Means
3. Divisive and agglomerative clustering (Hierarchical)
4. Density-based clustering (DBSCAN)
5. Recommendations
Agenda
What is Unsupervised Learning?
Supervised Learning:
• Training data is labelled
• Used to predict the label
• Examples: Classification and Regression
Unsupervised Learning:
• Training data is unlabelled
• Used to find patterns in the data
• Examples: Clustering, Dimensionality Reduction, Association Rules
What is Clustering?
What is Norm?
Let p ≥ 1 be a real number. The p-norm (also called the Lp norm) of a vector x = (x1, x2, …, xn) is defined as ‖x‖p = (|x1|^p + |x2|^p + … + |xn|^p)^(1/p).
• A norm measures the magnitude (or size, length) of a vector
• Intuitively, the norm of a vector x measures the distance from the origin to the point x
Geometric Interpretation of L2 Norm
Consider a unit ball containing the origin. The Euclidean norm of a vector is simply the factor by which the ball must be expanded or shrunk in order to fit the given vector exactly.
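As a quick numerical check, the L1 and L2 norms of a small vector can be computed with NumPy (a minimal sketch; the vector values are illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])

# L2 (Euclidean) norm: sqrt(3^2 + 4^2) = 5
l2 = np.linalg.norm(x, ord=2)

# L1 norm: |3| + |4| = 7
l1 = np.linalg.norm(x, ord=1)

print(l2, l1)  # 5.0 7.0
```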
Dissimilarity/Proximity Matrix
Euclidean distance
Dissimilarity Matrix
Weighted Dissimilarity Matrix
Data Matrix (n*p)
Dissimilarity Matrix (n*n)
Distance is inversely proportional to Similarity
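The step from the n*p data matrix to the n*n dissimilarity matrix can be sketched with SciPy (illustrative data; `pdist`/`squareform` compute pairwise Euclidean distances):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix (n*p): 3 observations, 2 features
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Pairwise Euclidean distances, expanded into a symmetric n*n matrix
D = squareform(pdist(X, metric="euclidean"))
print(D)
# Diagonal is 0; D[0, 1] = 5, D[0, 2] = 10
```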
Types of Clustering Algorithms
1. Combinatorial algorithms
2. Mixture modelling
3. Mode seekers
Combinatorial Algorithm
Combinatorial algorithms directly specify a mathematical loss function and attempt to minimize it through some combinatorial optimization algorithm. Since the goal is to assign close points to the same cluster, a natural loss function is the within-cluster point scatter W(C), where C(i) is the encoder we seek, which assigns the ith observation to the kth cluster.
Minimizing the within-cluster point scatter W(C) is equivalent to maximizing the between-cluster point scatter B(C), given that the total point scatter T is constant for any given data.
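Written out explicitly (following The Elements of Statistical Learning, Ch. 14, which the later slides cite), the scatter decomposition is:

```latex
T = \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N} d(x_i, x_{i'}) = W(C) + B(C)

W(C) = \frac{1}{2}\sum_{k=1}^{K}\,\sum_{C(i)=k}\,\sum_{C(i')=k} d(x_i, x_{i'})
\qquad
B(C) = \frac{1}{2}\sum_{k=1}^{K}\,\sum_{C(i)=k}\,\sum_{C(i')\neq k} d(x_i, x_{i'})
```

Since T depends only on the data, minimizing W(C) over encoders C is the same as maximizing B(C).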
K-Means Visual Explanation
Steps: Random seeds → Assign each point to the nearest centroid → Update centroids (repeat until convergence)
It is intended for situations in which all variables are of the quantitative type, and squared Euclidean distance is chosen as the dissimilarity measure.
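The seed–assign–update loop above is implemented in standard libraries; a minimal sketch with scikit-learn on synthetic two-blob data (all data and parameters here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated Gaussian blobs
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# n_init restarts with different random seeds mitigate the
# starting-seed issue illustrated on a later slide
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # roughly (0, 0) and (5, 5)
print(km.inertia_)          # within-cluster sum of squares, i.e. W(C)
```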
K-Means Mathematical Explanation
(for special case of K means)
* The Elements of Statistical Learning
K-Means algorithm animation
[Animation: Kmeans_animation.gif, axes X1 vs X2]
K-Means starting seed position issue animation
[Animation: Kmeans_starting_issue_animation.gif, axes X1 vs X2]
K-Means Clustering
Advantages
• Scales well to large datasets
• Does NOT require ANY assumptions about the data distribution
Disadvantages
• Assumes clusters are spherical
• Assumes clusters are approximately equal in size
• Can only use Euclidean dissimilarity
• Sensitive to the choice of K; choosing the wrong K gives poor clusters
• Doesn't guarantee a global optimum
• Result can depend on the choice of initial seeds
• Works only with continuous data
Hierarchical Clustering
Agglomerative Clustering:
• Bottom-up
• Each object initially forms its own single-element cluster
• At each step, the two most similar clusters are combined into a new, bigger cluster
• Repeat until all points are members of one single big cluster
Divisive Clustering:
• Top-down
• Initially, all objects are assigned to a single cluster
• At each step, the most heterogeneous cluster is divided into two
• Repeat until each object is in its own cluster
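The agglomerative (bottom-up) procedure can be sketched with SciPy; `linkage` builds the full merge tree and `fcluster` cuts it at a chosen number of clusters (synthetic data, illustrative parameters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(4.0, 0.3, size=(20, 2))])

# Bottom-up merging with average linkage and Euclidean dissimilarity
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` plots the dendrogram that the later slides use for choosing K.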
Measuring Dissimilarity between two clusters
Hierarchical Clustering Visual Explanation
Hierarchical Clustering Algorithm
* The Elements of Statistical Learning
Hierarchical Clustering algorithm animation
[Animation: Hierarchical_animation.gif, axes X1 vs X2]
Hierarchical Clustering
Advantages
• No need to choose K before running the algorithm
• The dendrogram gives visual guidance for choosing K
• Can use any dissimilarity measure
• Works on any kind of data, including categorical and mixed
• Does NOT require ANY assumptions about the data distribution
Disadvantages
• Doesn't scale well to large datasets
• Doesn't guarantee a global optimum
Density-based spatial clustering of applications with noise
DBSCAN Parameters:
1. MinPts – minimum number of points required to form a cluster
2. Epsilon – radius of the circle drawn around a point; all points falling inside this circle are treated as that point's neighbours
DBSCAN Fundamentals
• Clusters are considered zones that are sufficiently dense
• Points that lack neighbours, i.e. points that are not in a dense region, do not belong to any cluster and are classified as noise
• DBSCAN can return clusters of any shape
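A minimal DBSCAN sketch with scikit-learn, where `eps` plays the role of Epsilon and `min_samples` the role of MinPts (synthetic data; one far-away point is planted as noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# One dense blob plus a single far-away point that should be flagged as noise
X = np.vstack([rng.normal(0.0, 0.2, size=(30, 2)),
               [[10.0, 10.0]]])

# eps = Epsilon (neighbourhood radius), min_samples = MinPts
db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(db.labels_)  # dense points share one cluster label; noise is labelled -1
```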
DBSCAN Algorithm
Density Reachable
Not Density Reachable
DBSCAN algorithm animation
[Animation: DBSCAN_animation.gif]
DBSCAN Pros & Cons
Advantages
• It can discover any number of clusters
• Clusters of varying shapes and sizes can be obtained
• It can detect and ignore outliers
Disadvantages
• Assumes clusters are of roughly uniform density
• Results are sensitive to the epsilon value:
• Too small a value can eliminate sparse clusters as outliers
• Too large a value can merge distinct dense clusters together, giving incorrect clusters
General recommendations
Profiling
• Identify the unique properties of each cluster and give them appropriate labels
• Identify which feature dominates in which cluster
• Ensure that clusters are well separated and can be explained from a business point of view
Appropriate Dissimilarity Measure
• For mixed data, try Gower distance
Feature Scaling
• Always scale/normalize the features before training the clustering algorithm
Stability Check
• Before clustering, split the data into training and test sets
• Run the same final clustering model on both
• If the clustering is stable, you will get similar metrics on both datasets
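The scaling and stability-check recommendations can be combined into one sketch (illustrative data; silhouette score is used here as the comparison metric, which the slide leaves unspecified):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(6.0, 1.0, size=(100, 3))])

# Split BEFORE clustering, then scale both splits with the training fit
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)
scaler = StandardScaler().fit(X_train)

scores = {}
for name, data in [("train", scaler.transform(X_train)),
                   ("test", scaler.transform(X_test))]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    scores[name] = silhouette_score(data, labels)
    print(name, scores[name])
# If the clustering is stable, the two scores should be close
```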