Deepak George
Staff Data Scientist
Unsupervised Learning: Clustering
K-Means, Hierarchical Clustering & DBSCAN
➢ Data Science Career
▪ General Electric
▪ Accenture Management Consulting
▪ Mu Sigma
➢ Highlights
▪ 1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
▪ Co-author of Markdown Optimization case published at Harvard Business School
▪ Kaggle Bronze medal – Toxic Comment Classification
▪ Kaggle Bronze medal - Coupon Purchase Prediction (Recommender System)
▪ SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
➢ Education
▪ Indian Institute Of Management Bangalore - Business Analytics & Intelligence
▪ College Of Engineering Trivandrum - Computer Science Engineering
➢ Passion
▪ Deep Learning, Photography, Football
➢ Profile
▪ linkedin.com/in/deepakgeorge7/
▪ https://github.com/deepakiim
Deepak George, IIM Bangalore
About Me
1. Introduction to clustering and unsupervised learning
2. K-Means
3. Divisive and agglomerative clustering (Hierarchical)
4. Density-based clustering (DBSCAN)
5. Recommendations
Agenda
What is Unsupervised Learning?
Supervised Learning:
• Training data is labelled
• Used to predict the label
• Examples: Classification and Regression
Unsupervised Learning:
• Training data is unlabelled
• Used to find patterns in the data
• Examples: Clustering, Dimensionality Reduction, Association Rules
What is Clustering?
What is Norm?
Let p ≥ 1 be a real number. The p-norm (also called the Lp norm) of a vector x = (x1, x2, …, xn) is defined as ‖x‖p = (|x1|^p + |x2|^p + … + |xn|^p)^(1/p).
• A norm measures the magnitude (or size, length) of a vector
• Intuitively, the norm of a vector x measures the distance from the origin to the point x
Geometric Interpretation of L2 Norm
Consider a unit ball containing the origin. The Euclidean norm of a vector is simply the factor by which the ball must be expanded or shrunk in order to fit the given vector exactly.
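As a quick numerical check, the L1 and L2 norms of a small vector can be computed with NumPy (a minimal sketch; the vector values are illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])

# L2 (Euclidean) norm: sqrt(3^2 + 4^2) = 5
l2 = np.linalg.norm(x, ord=2)

# L1 norm: |3| + |4| = 7
l1 = np.linalg.norm(x, ord=1)

print(l2, l1)  # 5.0 7.0
```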
Dissimilarity/Proximity Matrix
Euclidean distance
Dissimilarity Matrix
Weighted Dissimilarity Matrix
Data Matrix (n*p)
Dissimilarity Matrix (n*n)
Distance is inversely proportional to Similarity
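The step from the n*p data matrix to the n*n dissimilarity matrix can be sketched with SciPy (illustrative data; `pdist`/`squareform` compute pairwise Euclidean distances):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix (n*p): 3 observations, 2 features
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Pairwise Euclidean distances, expanded into a symmetric n*n matrix
D = squareform(pdist(X, metric="euclidean"))
print(D)
# Diagonal is 0; D[0, 1] = 5, D[0, 2] = 10
```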
Types of Clustering Algorithms
1. Combinatorial algorithms
2. Mixture modelling
3. Mode seekers
Combinatorial Algorithm
Combinatorial algorithms directly specify a mathematical loss function and attempt to minimize it through some combinatorial optimization algorithm. Since the goal is to assign close points to the same cluster, a natural loss function is the within-cluster point scatter W(C), where C(i) is the encoder we seek, which assigns the ith observation to the kth cluster.
Minimizing the within-cluster point scatter W(C) is equivalent to maximizing the between-cluster point scatter B(C), given that the total point scatter T is constant for any given data.
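Written out explicitly (following The Elements of Statistical Learning, Ch. 14, which the later slides cite), the scatter decomposition is:

```latex
T = \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N} d(x_i, x_{i'}) = W(C) + B(C)

W(C) = \frac{1}{2}\sum_{k=1}^{K}\,\sum_{C(i)=k}\,\sum_{C(i')=k} d(x_i, x_{i'})
\qquad
B(C) = \frac{1}{2}\sum_{k=1}^{K}\,\sum_{C(i)=k}\,\sum_{C(i')\neq k} d(x_i, x_{i'})
```

Since T depends only on the data, minimizing W(C) over encoders C is the same as maximizing B(C).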
K-Means Visual Explanation
Steps: Random seeds → Assign each point to the nearest centroid → Update centroids (repeat until convergence)
It is intended for situations in which all variables are of the quantitative type, and squared Euclidean distance is chosen as the dissimilarity measure.
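The seed–assign–update loop above is implemented in standard libraries; a minimal sketch with scikit-learn on synthetic two-blob data (all data and parameters here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated Gaussian blobs
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# n_init restarts with different random seeds mitigate the
# starting-seed issue illustrated on a later slide
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # roughly (0, 0) and (5, 5)
print(km.inertia_)          # within-cluster sum of squares, i.e. W(C)
```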
K-Means Mathematical Explanation
(for special case of K means)
* The Elements of Statistical Learning
K-Means algorithm animation
[Animation: Kmeans_animation.gif, axes X1 vs X2]
K-Means starting seed position issue animation
[Animation: Kmeans_starting_issue_animation.gif, axes X1 vs X2]
K-Means Clustering
Advantages
• Scales well to large datasets
• Does NOT require ANY assumptions about the data distribution
Disadvantages
• Assumes clusters are spherical
• Assumes clusters are approximately equal in size
• Can only use Euclidean dissimilarity
• Sensitive to the choice of K; choosing the wrong K gives poor clusters
• Doesn't guarantee a global optimum
• Result can depend on the choice of initial seeds
• Works only with continuous data
Hierarchical Clustering
Agglomerative Clustering:
• Bottom-up
• Each object initially forms its own single-element cluster
• At each step, the two most similar clusters are combined into a new, bigger cluster
• Repeat until all points are members of one single big cluster
Divisive Clustering:
• Top-down
• Initially, all objects are assigned to a single cluster
• At each step, the most heterogeneous cluster is divided into two
• Repeat until each object is in its own cluster
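The agglomerative (bottom-up) procedure can be sketched with SciPy; `linkage` builds the full merge tree and `fcluster` cuts it at a chosen number of clusters (synthetic data, illustrative parameters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(4.0, 0.3, size=(20, 2))])

# Bottom-up merging with average linkage and Euclidean dissimilarity
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` plots the dendrogram that the later slides use for choosing K.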
Measuring Dissimilarity between two clusters
Hierarchical Clustering Visual Explanation
Hierarchical Clustering Algorithm
* The Elements of Statistical Learning
Hierarchical Clustering algorithm animation
[Animation: Hierarchical_animation.gif, axes X1 vs X2]
Hierarchical Clustering
Advantages
• No need to choose K before running the algorithm
• The dendrogram gives visual guidance for choosing K
• Can use any dissimilarity measure
• Works on any kind of data, including categorical and mixed
• Does NOT require ANY assumptions about the data distribution
Disadvantages
• Doesn't scale well to large datasets
• Doesn't guarantee a global optimum
Density-based spatial clustering of applications with noise
DBSCAN Parameters:
1. MinPts – minimum number of points required to form a cluster
2. Epsilon – radius of the circle drawn around a point; all points falling inside this circle are treated as that point's neighbours
DBSCAN Fundamentals
• Clusters are considered zones that are sufficiently dense
• Points that lack neighbours, i.e. points that are not in a dense region, do not belong to any cluster and are classified as noise
• DBSCAN can return clusters of any shape
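A minimal DBSCAN sketch with scikit-learn, where `eps` plays the role of Epsilon and `min_samples` the role of MinPts (synthetic data; one far-away point is planted as noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# One dense blob plus a single far-away point that should be flagged as noise
X = np.vstack([rng.normal(0.0, 0.2, size=(30, 2)),
               [[10.0, 10.0]]])

# eps = Epsilon (neighbourhood radius), min_samples = MinPts
db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(db.labels_)  # dense points share one cluster label; noise is labelled -1
```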
DBSCAN Algorithm
Density Reachable
Not Density Reachable
DBSCAN algorithm animation
[Animation: DBSCAN_animation.gif]
DBSCAN Pros & Cons
Advantages
• It can discover any number of clusters
• Clusters of varying shapes and sizes can be obtained
• It can detect and ignore outliers
Disadvantages
• Assumes clusters are of roughly uniform density
• Results are sensitive to the epsilon value:
• Too small a value can eliminate sparse clusters as outliers
• Too large a value can merge distinct dense clusters together, giving incorrect clusters
General recommendations
Profiling
• Identify the unique properties of each cluster and give them appropriate labels
• Identify which feature dominates in which cluster
• Ensure that clusters are well separated and can be explained from a business point of view
Appropriate Dissimilarity Measure
• For mixed data, try Gower distance
Feature Scaling
• Always scale/normalize the features before training the clustering algorithm
Stability Check
• Before clustering, split the data into training and test sets
• Run the same final clustering model on both
• If the clustering is stable, you will get similar metrics on both datasets
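The scaling and stability-check recommendations can be combined into one sketch (illustrative data; silhouette score is used here as the comparison metric, which the slide leaves unspecified):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(6.0, 1.0, size=(100, 3))])

# Split BEFORE clustering, then scale both splits with the training fit
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)
scaler = StandardScaler().fit(X_train)

scores = {}
for name, data in [("train", scaler.transform(X_train)),
                   ("test", scaler.transform(X_test))]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    scores[name] = silhouette_score(data, labels)
    print(name, scores[name])
# If the clustering is stable, the two scores should be close
```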