Presented By,
Manasi C. Kadam
Sharmishtha P. Alwekar
Ganesh H. Satpute
Deepak D. Ambegaonkar
Rajesh V. Dulhani

   Under the guidance of
    Prof. G. A. Patil
    Mr. Varad Meru
Agenda
                    
   Introduction
   Clustering
   K-means clustering algorithm
   Canopy clustering algorithm
   Complexity Evaluation
   Conclusion
   Future Enhancement
   References
Introduction
                   
 Maintaining large volumes of data is a tedious task
 Types of data
     1.   Structured
     2.   Unstructured
Introduction to Data Analysis
             
 Extracting information from data
 Two types
  1. Exploratory or descriptive
  2. Confirmatory or inferential
Clustering
     (a form of unsupervised learning)
                         
 Goal is to discover the natural grouping(s) among
  objects
 Given n objects, find K groups based on a measure of
  “similarity”
 Organize data into clusters such that there is
       • high intra-cluster similarity
       • low inter-cluster similarity
 Ideal cluster: a set of points that is compact and
  isolated
 Examples: K-means, K-medoids, etc.

Problems in Clustering
             
   Clusters can differ in size, shape and density
   Presence of noise
   A cluster is a subjective entity
   Automation
Clustering Algorithms
            
 Types of clustering algorithms
   1. Hierarchical
   2. Partitional
 Hierarchical: recursively finds nested clusters
    Types
     1.   Agglomerative
     2.   Divisive
 Partitional: finds all the clusters simultaneously
     e.g. K-means
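The agglomerative (bottom-up) variant of hierarchical clustering can be sketched as follows. This is only an illustrative sketch, not the presenters' code: it assumes single linkage (cluster distance = distance of the closest pair), which the slides do not specify, and the function name and sample points are hypothetical.

```python
from itertools import combinations
from math import dist

def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        # Find the pair of clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i (i < j)
        del clusters[j]
    return clusters

# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, 3)
```

Stopping at different values of k yields the nested clusters mentioned above: each coarser result is obtained by merging clusters of the finer one.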
K-means algorithm

           
K-means Algorithm (contd.)
             
 Goal of K-means is to minimize the sum of squared
 errors (squared distances between each point and its
 cluster mean) over all K clusters
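The objective above can be illustrated with a minimal sketch of the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). This is not the presenters' implementation; the function names, the choice of initial centroids and the sample points are hypothetical.

```python
from math import dist

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))

def kmeans(points, centroids, max_iter=100):
    """Lloyd's iteration: assign to nearest centroid, then recompute means."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
total_error = sse(points, centroids, labels)
```

Each of the two steps can only lower or preserve the sum of squared errors, which is why the loop settles at a (possibly local) minimum.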
Flowchart
   
Class Diagram of K-means
          
Parameters for K-means
          
 Most critical choice is K
    Typically the algorithm is run for various values of K and
     the most appropriate output is selected

 Different initializations can lead to different outputs
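A rough sketch of the procedure described above: run K-means for several values of K, restart each run from a few different random initializations, and keep the lowest sum of squared errors per K. The helper name, the seeds and the data are hypothetical, and initial centroids are simply sampled from the data points, which is one common choice among many.

```python
import random
from math import dist

def kmeans_sse(points, k, seed, max_iter=100):
    """Run Lloyd's iteration from a seeded random start; return the final SSE."""
    centroids = random.Random(seed).sample(points, k)  # k distinct data points
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(
                tuple(sum(x) / len(members) for x in zip(*members))
                if members
                else centroids[c]  # keep an empty cluster's centroid
            )
        if updated == centroids:
            break
        centroids = updated
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# For every K, try several initializations and keep the best (lowest) SSE.
best_sse = {
    k: min(kmeans_sse(points, k, seed) for seed in range(5))
    for k in range(1, len(points) + 1)
}
```

The error reaches zero when K equals the number of distinct points, so the lowest error alone cannot pick K; one common heuristic is to look for the K after which the error stops dropping sharply.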
Canopy Clustering
             
 Traditional clustering algorithms work well when the
  dataset is large in only one of these ways:
   Large number of clusters
   High feature dimensionality
   Large number of data points
 When a dataset has all three properties at once,
 computation becomes expensive.
 This calls for a new technique: canopy
 clustering
Canopy Clustering
          (contd.)
             
 Performs clustering in two stages
  1. Rough and quick stage
  2. Rigorous stage
Canopy Clustering
         (contd.)
            
 Rough and quick stage
   Uses an extremely inexpensive distance measure
   Divides the data into overlapping subsets called
    “canopies”
 Rigorous stage
   Uses a rigorous and expensive distance metric
   Clustering is applied only within each canopy
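The rough and quick stage can be sketched as below, following the two-threshold procedure described in Ref [2]: a loose threshold decides canopy membership and a tight threshold decides which points stop being candidate centers. The threshold names t1 and t2, the Manhattan distance used as the cheap measure, the deterministic choice of the next center and the sample points are all assumptions made for illustration; this is not the presenters' implementation.

```python
def cheap_distance(a, b):
    """Inexpensive stand-in measure (Manhattan distance); an assumption here."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build_canopies(points, t1, t2):
    """Return (center, members) pairs; t1 is the loose threshold, t2 the tight one."""
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    remaining = list(points)  # points still eligible to become a canopy center
    canopies = []
    while remaining:
        center = remaining[0]  # any remaining point works; first one for determinism
        # Every point within the loose threshold joins this canopy,
        # so a point may belong to several canopies.
        members = [p for p in points if cheap_distance(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [p for p in remaining if cheap_distance(center, p) >= t2]
    return canopies

points = [(0, 0), (1, 0), (3, 0), (4, 0), (10, 10)]
canopies = build_canopies(points, t1=3.5, t2=1.5)
centers = [center for center, _ in canopies]
```

In the rigorous stage, the expensive metric would then be evaluated only between points and centroids that share a canopy, which is where the savings come from.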
Flowchart of Canopy Clustering
        
Source: Ref [2]
Output of K-means in Mathematica on the Same Dataset
            
Output of K-means in R on the Same Dataset
           
Clustering
Output of K-means in Microsoft Excel on the Same Dataset
          
Clustering
Output of Canopy Clustering in Excel on the Same Dataset
            
Complexity
                  
 Complexity of K-means is O(nk) distance computations per
 iteration, where n is the number of objects and k is the
 number of centroids
 With canopies, K-means changes to O(nkf^2/c)
   c is the number of canopies
   f is the average number of canopies that each data point
    falls into
 As f is a very small number and c is comparatively
 large, the complexity is reduced
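A quick arithmetic check of the two estimates above. The figures for n, k, f and c are made up purely for illustration and do not come from the presenters' dataset.

```python
def distance_comparisons(n, k, f=None, c=None):
    """Per-iteration estimate: n*k for plain K-means, n*k*f^2/c with canopies."""
    if f is None or c is None:
        return n * k
    return n * k * f ** 2 / c

n, k = 1_000_000, 1_000  # hypothetical number of objects and centroids
f, c = 2, 500            # hypothetical canopies per point and total canopies

plain = distance_comparisons(n, k)
with_canopies = distance_comparisons(n, k, f, c)
speedup = plain / with_canopies  # algebraically equal to c / f**2
```

The ratio depends only on c / f^2, so the estimate improves as long as the number of canopies grows faster than the square of the overlap.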
Conclusion
                 
 Implemented K-means Algorithm
 Verified Result on Mathematica, R
 Implemented Canopy Clustering
 Verified Result on Excel
Future Enhancement
             
 Learning Hadoop and MapReduce
 Parallelizing K-means with MapReduce and
  comparing the implementations
 Running all the K-means implementations on a standard dataset
References
                   
 [1] Anil K. Jain, “Data Clustering: 50 Years Beyond K-Means”
 [2] Andrew McCallum et al., “Efficient Clustering of
  High Dimensional Data Sets with Application to
  Reference Matching”

Thank You
