Clustering

Presented By,
Manasi C. Kadam
Sharmishtha P. Alwekar
Ganesh H. Satpute
Deepak D. Ambegaonkar
Rajesh V. Dulhani

Under the guidance of
Prof. G. A. Patil
Mr. Varad Meru

Agenda

 Introduction
 Clustering
 K-means clustering algorithm
 Canopy clustering algorithm
 Complexity Evaluation
 Conclusion
 Future Enhancement
 References

Introduction

 Maintaining large data is a tedious task
 Types of data
1. Structured
2. Unstructured

Introduction to Data Analysis

 Extracting information out of data
 Two types
1. Exploratory or descriptive
2. Confirmatory or inferential

Clustering
(Aka Unsupervised Learning)

 Goal is to discover the natural grouping(s) among objects
 Given n objects, find K groups based on a measure of “similarity”
 Organize data into clusters such that there is
• high intra-cluster similarity
• low inter-cluster similarity
 Ideal cluster: a set of points that is compact and isolated
 E.g. K-means, K-medoids
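
The two similarity criteria above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the (dis)similarity measure and a hypothetical toy dataset; the function names are illustrative and not taken from the presentation. A good clustering should show a small average intra-cluster distance and a large average inter-cluster distance.

```python
from itertools import combinations
from math import dist


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)


# Hypothetical toy data: two compact, well-separated groups.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra={intra:.3f} inter={inter:.3f}")
```

Small distance corresponds to high similarity, so "high intra-cluster similarity" shows up here as a low intra value.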

Problems in clustering

 Clusters can differ in size, shape, and density
 Presence of noise
 A cluster is a subjective entity
 Automation

Clustering Algorithms

 Types of clustering algorithms
1. Hierarchical
2. Partitional
 Hierarchical: recursively finds nested clusters
 Types of hierarchical clustering
1. Agglomerative
2. Divisive
 Partitional: finds all the clusters simultaneously
e.g. K-means

K-means algorithm



K-means Algorithm
(contd.)

 Goal of K-means is to minimize the sum of the
squared error over all K clusters
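
The objective above can be made concrete with a short sketch. This is a minimal version of the standard iterative K-means procedure (assign each point to its nearest centroid, then recompute each centroid as the mean of its members), not the presenters' own implementation. Euclidean distance, random initial centroids drawn from the data, and the toy points are all assumptions made for illustration.

```python
import random
from math import dist


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, labels


# Hypothetical toy data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels, round(sse(points, centroids, labels), 3))
```

Each step can only lower or keep the sum of squared error, which is why the loop terminates, although it may stop at a local rather than global minimum.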

Class Diagram of K-means


Parameters for K-means

 Most critical choice is K
 Typically the algorithm is run for various values of K and the most appropriate output is selected

 Different initializations can lead to different outputs
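
Both points can be handled with a simple sweep: run K-means for several values of K, restart each run from several random initializations, and keep the lowest sum of squared error per K. The sketch below assumes Euclidean distance, seeded random starts, and a hypothetical toy dataset; `kmeans_sse` is an illustrative helper, not a function from the presentation.

```python
import random
from math import dist


def kmeans_sse(points, k, seed, max_iter=100):
    """One K-means run from a seeded random start; returns the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


# Hypothetical toy data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

best_sse = {}
for k in range(1, 5):
    # Several restarts per K guard against a poor initialization.
    best_sse[k] = min(kmeans_sse(points, k, seed) for seed in range(10))
    print(k, round(best_sse[k], 3))
```

SSE tends to keep shrinking as K grows, so the usual heuristic is to look for the K after which the improvement levels off rather than to pick the smallest SSE outright.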

Canopy Clustering

 Traditional clustering algorithms work well when the dataset is large in only one of these ways:
 Large number of clusters
 High feature dimensionality
 Large number of data points
 When a dataset has all three properties at once, computation becomes expensive
 This calls for a new technique: canopy clustering

Canopy Clustering
(contd.)

 Performs clustering in two stages
1. Rough and quick stage
2. Rigorous stage

Canopy Clustering
(contd.)

 Rough and quick stage
 Uses an extremely inexpensive distance measure
 Divides the data into overlapping subsets called “canopies”
 Rigorous stage
 Uses a rigorous and expensive distance metric
 Clustering is applied only within each canopy
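
The two stages can be sketched as follows. Canopy creation here follows the two-threshold scheme described by McCallum et al. (a loose threshold T1 and a tight threshold T2, with T1 > T2). The cheap distance (first coordinate only), the threshold values, the choice to always pick the first remaining point as the next center, and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def cheap_distance(p, q):
    """Inexpensive approximate distance: first coordinate only (assumption)."""
    return abs(p[0] - q[0])


def build_canopies(points, t1, t2, distance=cheap_distance):
    """Stage 1: group point indices into possibly overlapping canopies."""
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # assumption: take the first remaining point
        # Every point within the loose threshold joins this canopy.
        canopies.append([i for i in range(len(points))
                         if distance(points[center], points[i]) <= t1])
        # Points within the tight threshold can no longer start a canopy.
        remaining = [i for i in remaining
                     if distance(points[center], points[i]) > t2]
    return canopies


def candidate_pairs(canopies):
    """Stage 2 input: only pairs sharing a canopy get the expensive metric."""
    return {pair for c in canopies for pair in combinations(c, 2)}


# Hypothetical toy data.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (10.0, 0.0), (11.0, 0.0)]
canopies = build_canopies(points, t1=2.5, t2=1.5)
pairs = candidate_pairs(canopies)
all_pairs = len(points) * (len(points) - 1) // 2
print(canopies, len(pairs), "of", all_pairs, "pairs need the expensive metric")
```

Points that never share a canopy are treated as too far apart to belong together, so the rigorous stage skips them entirely.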

Flowchart of Canopy Clustering


Output of K-means on Mathematica on Same Dataset


Output of K-means on R on Same Dataset


Output of K-means on Microsoft Excel on Same Dataset


Output of Canopy on Excel on Same Dataset


Complexity

 Complexity of K-means is O(nk) per iteration, where n is the number of objects and k is the number of centroids
 With canopies, K-means complexity becomes O(nkf^2/c)
 c is the number of canopies
 f is the average number of canopies that each data point falls into
 As f is a small number and c is comparatively large, the complexity is reduced
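
A quick calculation shows the size of the saving. The sketch below plugs numbers into the two expressions above; the values of n, k, f and c are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kmeans_cost(n, k):
    """Distance computations per iteration for plain K-means: n * k."""
    return n * k


def canopy_kmeans_cost(n, k, f, c):
    """Distance computations per iteration with canopies: n * k * f^2 / c."""
    return n * k * f ** 2 / c


# Hypothetical values, for illustration only.
n, k, f, c = 1_000_000, 1_000, 2, 100
plain = kmeans_cost(n, k)
with_canopies = canopy_kmeans_cost(n, k, f, c)
print(plain, with_canopies, plain / with_canopies)
```

The ratio between the two costs is c / f^2, so with these numbers the canopy version does 25 times fewer distance computations per iteration.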

Conclusion

 Implemented K-means Algorithm
 Verified Result on Mathematica, R
 Implemented Canopy Clustering
 Verified Result on Excel

Future Enhancement

 Learning Hadoop and MapReduce
 Parallelizing K-means with MapReduce and comparing the implementations
 Running all the K-means implementations on a standard dataset

References

 Anil K. Jain, “Data Clustering: 50 Years Beyond K-Means”
 Andrew McCallum et al., “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”
