Clustering

Clustering is the task of dividing a population or set
of data points into groups such that the points in the
same group are more similar to one another than to
points in other groups. In simple words, the aim is to
segregate data points with similar traits and assign
them to clusters.

Overview

Let’s understand this with an example. Suppose you
are the head of a rental store and wish to understand
the preferences of your customers to scale up your
business. Is it possible for you to look at the details
of each customer and devise a unique business strategy
for each one of them? Definitely not. What you can do
instead is cluster all of your customers into, say, 10
groups based on their purchasing habits and use a
separate strategy for the customers in each of these
10 groups. This is what we call clustering.

Types of Clustering

Hard Clustering: In hard clustering, each data
point either belongs to a cluster completely or
not at all. For example, in the scenario above each
customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of
putting each data point into a single cluster, a
probability or likelihood of that data point
belonging to each cluster is assigned. For example,
in the scenario above each customer is assigned a
probability of being in each of the 10 clusters of
the store.
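
To make the difference between hard and soft clustering concrete, here is a minimal sketch in Python. The cluster centers, the sample customer, and the function names are hypothetical, and turning distances into probabilities with an exponential weighting is only one possible choice, assumed here for illustration.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_assignment(point, centers):
    """Hard clustering: return the index of the single closest center."""
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))

def soft_assignment(point, centers):
    """Soft clustering: return one membership probability per center."""
    distances = [squared_distance(point, c) for c in centers]
    # Assumption: weight each center by exp(-distance), shifted by the
    # smallest distance for numerical stability, then normalize to sum to 1.
    smallest = min(distances)
    weights = [math.exp(-(d - smallest)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data: two cluster centers and one customer.
centers = [(0.0, 0.0), (4.0, 0.0)]
customer = (1.0, 0.0)

hard_label = hard_assignment(customer, centers)   # a single cluster index
soft_probs = soft_assignment(customer, centers)   # one probability per cluster
```

The hard version returns exactly one group for the customer, while the soft version returns a probability for every group, with the closest group receiving the largest share.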

Types of clustering algorithms

Since the task of clustering is subjective, there are
many ways to achieve it. Every methodology follows a
different set of rules for defining the ‘similarity’
among data points. Common families of models are:

Connectivity models
Centroid models
Distribution models
Density models

K-means clustering
K-means clustering is one of the simplest and most
popular unsupervised machine learning algorithms. ...
In other words, the K-means algorithm identifies k
centroids and then allocates every data point to the
nearest cluster, while keeping the distances between
the points and the centroid of their cluster as small
as possible.
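
The procedure can be sketched in a few lines of plain Python. This is a minimal illustration of the usual assign-then-update loop, not a production implementation; the function names, the sample points, the iteration cap, and the choice to start from k randomly picked data points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initialize the centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

Changing `seed` changes the starting centroids, which on less clearly separated data can lead to different final clusters.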

Hierarchical clustering
Hierarchical clustering, also known as hierarchical
cluster analysis, is an algorithm that groups similar
objects into groups called clusters. The endpoint is a
set of clusters, where each cluster is distinct from
each other cluster, and the objects within each
cluster are broadly similar to each other.
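
As a rough sketch of how such a grouping can be built, the Python below uses the bottom-up (agglomerative) variant: every point starts as its own cluster and the two closest clusters are merged repeatedly. The variant, the single-linkage distance between clusters, the function names, and the sample points are assumptions chosen for illustration.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: that of their closest pair of points."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, num_clusters):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []                       # (merge distance, size of merged cluster)
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        merges.append((d, len(clusters[i])))
    return clusters, merges

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, merges = agglomerative(points, num_clusters=2)
```

The recorded `merges` list holds the information a dendrogram would draw: which merge happened at which distance. No random choice is involved, so repeated runs on the same data give the same result.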

Difference between K Means and Hierarchical
clustering

Hierarchical clustering can’t handle big data well,
but K Means clustering can. This is because the
time complexity of K Means is linear in the number of
data points, i.e. O(n), while that of hierarchical
clustering is at least quadratic, i.e. O(n^2).
In K Means clustering, since we start with a random
choice of clusters, the results produced by running
the algorithm multiple times might differ, whereas
results are reproducible in hierarchical clustering.
K Means is found to work well when the shape of
the clusters is hyperspherical (like a circle in 2D
or a sphere in 3D).
K Means clustering requires prior knowledge of K,
i.e. the number of clusters you want to divide your
data into. In hierarchical clustering, by contrast,
you can stop at whatever number of clusters you find
appropriate by interpreting the dendrogram.
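
The scaling difference can be made tangible by counting distance computations. The sketch below is a back-of-the-envelope illustration only: the function names are hypothetical, and the values k = 10 and 20 iterations are arbitrary assumptions. It compares the point-to-centroid distances K Means evaluates with the pairwise distances a hierarchical method needs just to get started.

```python
def kmeans_distance_count(n, k, iterations):
    """Point-to-centroid distances: n * k per iteration, linear in n."""
    return n * k * iterations

def pairwise_distance_count(n):
    """Distinct pairs among n points: n * (n - 1) / 2, quadratic in n."""
    return n * (n - 1) // 2

# Assumed settings for illustration: k = 10 clusters, 20 iterations.
for n in (1_000, 10_000, 100_000):
    print(n,
          kmeans_distance_count(n, k=10, iterations=20),
          pairwise_distance_count(n))
```

Multiplying n by 10 multiplies the K Means count by 10 but the pairwise count by roughly 100, which is why hierarchical clustering becomes impractical on large datasets first.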

Applications of Clustering

Clustering has a large number of applications spread
across various domains. Some of the most popular
applications of clustering are:

Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection

Topics for next Post

Classification and regression trees (CART)
Neural Networks
