Clustering Algorithms: An Introduction
Classification: a method of supervised learning. It learns a rule for predicting the class of an instance from pre-labeled (classified) instances.
Clustering: a method of unsupervised learning. It finds a “natural” grouping of instances in unlabeled data.
Clustering Methods: There are many different methods and algorithms. They may handle numeric and/or symbolic data, and they may be deterministic or probabilistic, exclusive or overlapping, hierarchical or flat, top-down or bottom-up.
Clusters: exclusive vs. overlapping. [Figure: instances a through k grouped into clusters]
Example of Outlier. [Figure: scatter of points marked x, with one isolated point labeled “Outlier”]
Methods of Clustering. Hierarchical (agglomerative): initially, each point is in a cluster by itself; repeatedly combine the two “nearest” clusters into one. Point assignment: maintain a set of clusters; place each point into its “nearest” cluster.
Hierarchical clustering. Bottom-up: start with single-instance clusters; at each step, join the two closest clusters. Design decision: how to measure the distance between clusters, e.g. the distance between the two closest instances in the clusters vs. the distance between the cluster means. Top-down: start with one universal cluster; split it into two clusters; proceed recursively on each subset; this can be very fast. Both methods produce a dendrogram.
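The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: the function names (`single_link`, `mean_link`, `agglomerate`) and the sample points are hypothetical, and the Euclidean distance is assumed. The two distance functions correspond to the design decision named in the slide (closest instances vs. distance between means), and the returned merge history is a flat stand-in for the dendrogram.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_link(a, b):
    """Distance between the two closest instances of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def mean_link(a, b):
    """Distance between the means of clusters a and b."""
    def mean(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))
    return dist(mean(a), mean(b))


def agglomerate(points, cluster_distance=single_link):
    """Bottom-up clustering; returns the list of merges in order."""
    clusters = [[p] for p in points]  # start with single-instance clusters
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters that is currently closest
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # i < j, so delete j first to keep index i valid
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
history = agglomerate(points)
for left, right in history:
    print(left, "+", right)
```

Passing `cluster_distance=mean_link` switches to the distance-between-means variant. Each pass rescans all cluster pairs, so this sketch is only suitable for small inputs.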
Incremental clustering. Heuristic approach (COBWEB/CLASSIT): form a hierarchy of clusters incrementally. Start: the tree consists of an empty root node. Then: add instances one by one, updating the tree appropriately at each stage. To update, find the right leaf for an instance; this may involve restructuring the tree. Base update decisions on category utility.
And in the Non-Euclidean Case? The only “locations” we can talk about are the points themselves, i.e., there is no “average” of two points. Approach 1: clustroid = the point “closest” to the other points. Treat the clustroid as if it were the centroid when computing intercluster distances.
“Closest” Point? Possible meanings: smallest maximum distance to the other points; smallest average distance to the other points; smallest sum of squares of distances to the other points; etc.
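The three meanings of “closest” can be compared with a short Python sketch. The function name `clustroid`, the criterion labels, and the use of Hamming distance on equal-length strings are assumptions chosen for illustration; any distance function that works on the points themselves could be passed in. The cluster needs at least two points.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def clustroid(points, distance, criterion="sum_of_squares"):
    """Return the cluster member that is 'closest' to the other members."""
    def score(i):
        ds = [distance(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "max":
            return max(ds)                 # smallest maximum distance
        if criterion == "average":
            return sum(ds) / len(ds)       # smallest average distance
        if criterion == "sum_of_squares":
            return sum(d * d for d in ds)  # smallest sum of squared distances
        raise ValueError(f"unknown criterion: {criterion}")
    return points[min(range(len(points)), key=score)]


words = ["cat", "cot", "cut", "dog"]
for criterion in ("max", "average", "sum_of_squares"):
    print(criterion, clustroid(words, hamming, criterion))
```

In this small example all three criteria select the same word; on other data they can disagree, which is why the choice is a design decision.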
k-Means Algorithm(s). Assumes a Euclidean space. Start by picking k, the number of clusters. Initialize the clusters by picking one point per cluster. Example: pick one point at random, then k - 1 other points, each as far away as possible from the previous points.
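The initialization example can be sketched as follows. The slide does not say how “as far away as possible from the previous points” is measured; this sketch assumes it means maximizing the distance to the nearest already-chosen point. The name `farthest_first`, the `seed` parameter, and the sample points are hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def farthest_first(points, k, seed=0):
    """Pick k initial points: one at random, the rest far from those chosen."""
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        # Assumption: "far away" = largest distance to the nearest chosen point
        nxt = max(points, key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
print(farthest_first(points, 3))
```

With three well-separated groups, as in the sample data, this picks one starting point from each group whichever point is drawn first.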
Populating Clusters. For each point, place it in the cluster whose current centroid is nearest. After all points are assigned, fix the centroids of the k clusters. Optional: reassign all points to their closest centroid; this sometimes moves points between clusters.
Simple Clustering: K-means. Works with numeric data only. 1. Pick a number (K) of cluster centers (at random). 2. Assign every item to its nearest cluster center (e.g. using Euclidean distance). 3. Move each cluster center to the mean of its assigned items. 4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold).
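The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a production implementation: the function names, the `max_iter` and `seed` parameters, and the sample points are hypothetical, the initial centers are drawn from the data points, and convergence is taken to mean that no assignment changed (a threshold of zero).

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick K centers at random
    assignment = None
    for _ in range(max_iter):
        # step 2: assign every item to its nearest center
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # step 4: stop when nothing changed
            break
        assignment = new_assignment
        # step 3: move each center to the mean of its assigned items
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave a center in place if it has no items
                centers[c] = mean(members)
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Cluster numbering depends on the random start, so only the grouping is meaningful, not which cluster is called 0 or 1.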
K-means clustering summary. Advantages: simple and understandable; items are automatically assigned to clusters. Disadvantages: the number of clusters must be picked beforehand; all items are forced into a cluster; too sensitive to outliers.
K-means variations. K-medoids: instead of the mean, use a representative data point (the medoid) of each cluster; the closely related k-medians variant uses the median. Mean of 1, 3, 5, 7, 9 is 5. Mean of 1, 3, 5, 7, 1009 is 205. Median of 1, 3, 5, 7, 1009 is 5. Median advantage: not affected by extreme values. For large databases, use sampling.
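The arithmetic in this slide can be checked with Python's standard `statistics` module. The `medoid` helper is a hypothetical one-dimensional illustration: it returns the data value with the smallest total distance to all other values.

```python
from statistics import mean, median


def medoid(values):
    """Data value with the smallest total distance to the other values."""
    return min(values, key=lambda v: sum(abs(v - u) for u in values))


clean = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]

print(mean(clean))           # 5
print(mean(with_outlier))    # 205: pulled far from most of the data
print(median(with_outlier))  # 5: unaffected by the extreme value
print(medoid(with_outlier))  # 5: also an actual data point
```

A single extreme value moves the mean from 5 to 205, while the median and the medoid stay at 5.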
Examples of Clustering Applications. Marketing: discover customer groups and use them for targeted marketing and re-organization. Astronomy: find groups of similar stars and galaxies. Earthquake studies: observed earthquake epicenters should be clustered along continental faults. Genomics: find groups of genes with similar expression. And many more.
Clustering Summary. Unsupervised; many approaches. K-means: simple, sometimes useful. K-medoids: less sensitive to outliers. Hierarchical clustering: works for symbolic attributes.
References. This PPT is compiled from: Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6
